[mvapich-discuss] Fortran MPI_Wait() request error
Michael S. Long
mlong at seas.harvard.edu
Fri Oct 6 19:08:58 EDT 2017
Hi DK,
I hope to have a small reproducer working soon. I'll post it to GitHub
when/if it works.
At some point I recall testing with another C compiler (likely GCC), but
I've hit this problem in so many ways that I don't recall exactly. Over
the weekend I'll try with GCC and another version of IFORT.
Thanks. Have a good weekend.
ML
On 10/06/2017 06:37 PM, Panda, Dhabaleswar wrote:
> Hi, Michael,
>
> Thanks for your report here. Sorry to know that you are facing this
> error. Are you seeing this error only with ICC 15.0.0, or with other
> versions of ICC as well? Do you see this issue with GCC? Is it possible
> to get a small reproducer? This will help us debug the issue more quickly.
>
> Thanks,
>
> DK
>
> Sent from my iPhone
>
> On Oct 6, 2017, at 6:14 PM, Michael S. Long <mlong at seas.harvard.edu> wrote:
>
>> Dear MVAPICH-Discuss,
>>
>> We are having a problem associated with MPI_Irecv & MPI_Wait in
>> Fortran90.
>>
>> Version: 2.2b (2.3b was also tested; it does not produce the same
>> explicit error, but it hangs at the same point)
>> Compilers: IFORT & ICC 15.0.0
>>
>> In a loop over one dimension of a 3D array whose data are being
>> broadcast, MPI_Wait() dies for several of the receive requests with
>> the following error:
>>
>>> Fatal error in PMPI_Wait: Other MPI error, error stack:
>>> PMPI_Wait(182)..................: 11MPI_Wait(request=0x23f6fea0,
>>> status=0x1) failed
>>> MPIR_Wait_impl(71)..............:
>>> MPIDI_CH3I_Progress(393)........:
>>> pkt_CTS_handler(321)............:
>>> MPID_nem_lmt_shm_start_send(273):
>>> MPID_nem_delete_shm_region(926).:
>>> MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory -
>>> unlink No such file or directory
>>
>> What we've been able to determine is that, at the call to MPI_Irecv(),
>> the associated MPI_Request is /not/ being allocated (though the call
>> still returns a successful return code). Specifically, the following
>> things happen in various tests:
>>
>> 1) MPI_Request_get_status() will usually segfault when called at any
>> point between the call to MPI_Irecv and MPI_Wait;
>> 2) on the occasional chance that MPI_Request_get_status() doesn't
>> segfault, the resulting value of FLAG is .FALSE.; and
>> 3) querying the count values and buffer sizes for the associated
>> request gives 0 for both. These requests then fail at MPI_Wait().
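>>
>> For reference, the failing pattern boils down to something like the
>> following minimal Fortran sketch (the names, array sizes, and rank
>> roles here are hypothetical illustrations, not taken from our actual
>> code; it assumes at least two ranks):
>>
>>    program irecv_wait_sketch
>>      use mpi
>>      implicit none
>>      integer, parameter :: NX = 8, NY = 8, NZ = 4
>>      real(8) :: buf(NX, NY, NZ)
>>      integer :: reqs(NZ), stat(MPI_STATUS_SIZE), ierr, rank, k
>>      logical :: flag
>>
>>      call MPI_Init(ierr)
>>      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>      if (rank == 1) then
>>        ! One nonblocking receive per level of the third dimension.
>>        do k = 1, NZ
>>          call MPI_Irecv(buf(:,:,k), NX*NY, MPI_DOUBLE_PRECISION, &
>>                         0, k, MPI_COMM_WORLD, reqs(k), ierr)
>>          ! ierr comes back MPI_SUCCESS even when the request is bad;
>>          ! probing the request here is where we usually see the segfault:
>>          call MPI_Request_get_status(reqs(k), flag, stat, ierr)
>>        end do
>>        do k = 1, NZ
>>          call MPI_Wait(reqs(k), stat, ierr)  ! several of these die
>>        end do
>>      else if (rank == 0) then
>>        do k = 1, NZ
>>          call MPI_Send(buf(:,:,k), NX*NY, MPI_DOUBLE_PRECISION, &
>>                        1, k, MPI_COMM_WORLD, ierr)
>>        end do
>>      end if
>>      call MPI_Finalize(ierr)
>>    end program irecv_wait_sketch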
>>
>> All request handles, as seen from Fortran, hold valid-looking values;
>> i.e., there is no NaN or anything like that. This may already be clear
>> from the error message above, since the traceback is able to report a
>> hex value for the handle of the failing request within the C portion.
>> The same program runs to completion with SGI's MPI.
>>
>> Any help would be greatly appreciated. We recognize that some
>> information may be missing; if so, please let me know.
>>
>> Sincerely,
>> Michael Long
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
.............................
Research Associate
Atmospheric Chemistry Modeling Group
School of Engineering and Applied Sciences
Harvard University
Web : http://people.seas.harvard.edu/~mlong/
Email : mlong at seas.harvard.edu
mslong at virginia.edu
-----------------------------