[mvapich-discuss] Fortran MPI_Wait() request error

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Oct 6 18:37:50 EDT 2017


Hi, Michael,

Thanks for your report. Sorry to hear that you are facing this error. Are you seeing it only with ICC 15.0.0, or with other ICC versions as well? Do you also see it with gcc? Would it be possible to get a small reproducer? That would help us debug the issue more quickly.

Thanks,

DK


On Oct 6, 2017, at 6:14 PM, Michael S. Long <mlong at seas.harvard.edu> wrote:

Dear MVAPICH-Discuss,

We are having a problem with MPI_IRecv and MPI_Wait in Fortran 90.

Version: 2.2b (2.3b was also tested; it does not give the same explicit error, but hangs at the same point)
Compiler: IFORT & ICC 15.0.0

In a loop over one dimension in a 3D array across which data are being broadcast, MPI_Wait() for several of the receive requests dies with the following error:

Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(182)..................: 11MPI_Wait(request=0x23f6fea0, status=0x1) failed
MPIR_Wait_impl(71)..............:
MPIDI_CH3I_Progress(393)........:
pkt_CTS_handler(321)............:
MPID_nem_lmt_shm_start_send(273):
MPID_nem_delete_shm_region(926).:
MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory - unlink No such file or directory

What we've been able to determine is that at the call to MPI_IRecv(), the associated MPI_Request is not being allocated, even though the call still returns a successful return code. Specifically, the following things happen in various tests:

1) MPI_Request_Get_Status() usually segfaults when called at any point between the call to MPI_IRecv and the call to MPI_Wait.
2) On the occasions when MPI_Request_Get_Status() doesn't segfault, the resulting value of FLAG is False.
3) Querying the count values and buffer sizes for the associated request gives 0 for both. These requests then fail at MPI_Wait().

All request handles, as seen from Fortran, hold valid values, i.e. there's no NaN or anything like that. This may be evident from the error message above, since the traceback is able to give a hex value for the handle of the failing request within the C portion.
The program will proceed with SGI.
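
The receive pattern and the checks described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the actual application code and not a confirmed reproducer: the program name, the array name buf, the sizes n and nk, the use of rank 0 as the sender, and the use of the loop index as the message tag are all hypothetical. It also assumes the MPI library provides the `mpi` Fortran module.

```fortran
! Minimal sketch; all names and sizes are hypothetical, not from the real code.
program irecv_wait_sketch
  use mpi
  implicit none
  integer, parameter :: n = 4, nk = 3   ! assumed slab size and loop extent
  double precision :: buf(n, n, nk)
  integer :: reqs(nk), status(MPI_STATUS_SIZE)
  integer :: rank, nprocs, k, dest, ierr
  logical :: flag

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) then
     ! Rank 0 sends slab k of the 3D array to every other rank, tagged with k.
     do k = 1, nk
        buf(:, :, k) = dble(k)
        do dest = 1, nprocs - 1
           call MPI_Send(buf(1, 1, k), n*n, MPI_DOUBLE_PRECISION, dest, k, &
                         MPI_COMM_WORLD, ierr)
        end do
     end do
  else
     ! Post one non-blocking receive per slab. Passing the first element of a
     ! contiguous slab avoids any compiler-generated temporary copy.
     do k = 1, nk
        call MPI_Irecv(buf(1, 1, k), n*n, MPI_DOUBLE_PRECISION, 0, k, &
                       MPI_COMM_WORLD, reqs(k), ierr)
     end do
     ! Inspect each request without completing it, then wait on it.
     do k = 1, nk
        call MPI_Request_get_status(reqs(k), flag, status, ierr)
        print *, 'rank', rank, 'slab', k, 'handle', reqs(k), 'complete?', flag
        call MPI_Wait(reqs(k), status, ierr)
     end do
  end if

  call MPI_Finalize(ierr)
end program irecv_wait_sketch
```

Built with the MPI Fortran compiler wrapper and run on two or more ranks, a sketch like this prints the handle value and completion flag for each request before waiting on it, which mirrors the checks listed in 1) to 3).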

Any help would be greatly appreciated. I recognize that some information may be missing; if so, please let me know.

Sincerely,
Michael Long