[mvapich-discuss] Fortran MPI_Wait() request error

Subramoni, Hari subramoni.1 at osu.edu
Fri Oct 6 19:12:35 EDT 2017


Hello,

It looks like you're using the Nemesis channel. Please note that the Nemesis interface has been deprecated in the latest MVAPICH2 release. We recommend using the OFA-IB-CH3 interface for the best performance and scalability. The following section of the MVAPICH2 user guide has more information on how to build MVAPICH2 for the OFA-IB-CH3 interface. If there isn't a particular reason for using the Nemesis channel, could you please switch to the OFA-IB-CH3 channel and let us know if you still face the issue?

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-120004.4

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Panda, Dhabaleswar
Sent: Friday, October 6, 2017 6:38 PM
To: Michael S. Long <mlong at seas.harvard.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Fortran MPI_Wait() request error

Hi, Michael,

Thanks for your report here. Sorry to hear that you are facing this error. Are you seeing this error only with ICC 15.0.0, or with other ICC versions as well? Do you see this issue with GCC? Is it possible to get a small reproducer? This will help us debug the issue more quickly.

Thanks,

DK

Sent from my iPhone

On Oct 6, 2017, at 6:14 PM, Michael S. Long <mlong at seas.harvard.edu> wrote:
Dear MVAPICH-Discuss,

We are having a problem associated with MPI_IRecv & MPI_Wait in Fortran90.

Version: 2.2b (2.3b was also tested; it does not produce the same explicit error, but it hangs at the same point)
Compiler: IFORT & ICC 15.0.0

In a loop over one dimension of a 3D array across which data are being broadcast, MPI_Wait() dies for several of the receive requests with the following error (a minimal sketch of the pattern follows the error output):


Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(182)..................: 11MPI_Wait(request=0x23f6fea0, status=0x1) failed
MPIR_Wait_impl(71)..............:
MPIDI_CH3I_Progress(393)........:
pkt_CTS_handler(321)............:
MPID_nem_lmt_shm_start_send(273):
MPID_nem_delete_shm_region(926).:
MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory - unlink No such file or directory
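
A minimal sketch of the communication pattern, in case it helps as a starting point for a reproducer. The array sizes, message tags, and neighbor ranks below are hypothetical, not taken from the actual application:

  ! Hypothetical sketch only -- sizes, tags, and neighbor ranks are made up.
  program irecv_wait_sketch
    use mpi
    implicit none
    integer, parameter :: NX = 8, NY = 8, NZ = 16
    real(kind=8)       :: rbuf(NX, NY, NZ), sbuf(NX, NY, NZ)
    integer            :: reqs(NZ), stat(MPI_STATUS_SIZE)
    integer            :: ierr, k, rank, nprocs, src, dst

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    src = mod(rank + 1, nprocs)             ! hypothetical neighbor ranks
    dst = mod(rank - 1 + nprocs, nprocs)
    sbuf = real(rank, kind=8)

    ! Post one non-blocking receive per level of the third dimension.
    do k = 1, NZ
       call MPI_Irecv(rbuf(:, :, k), NX*NY, MPI_DOUBLE_PRECISION, &
                      src, k, MPI_COMM_WORLD, reqs(k), ierr)
    end do

    ! Matching sends (blocking sends are enough for a sketch).
    do k = 1, NZ
       call MPI_Send(sbuf(:, :, k), NX*NY, MPI_DOUBLE_PRECISION, &
                     dst, k, MPI_COMM_WORLD, ierr)
    end do

    ! Wait on each receive; this is where the error above is raised.
    do k = 1, NZ
       call MPI_Wait(reqs(k), stat, ierr)
    end do

    call MPI_Finalize(ierr)
  end program irecv_wait_sketch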

What we've been able to determine is that at the call to MPI_IRecv(), the associated MPI_Request is not being allocated (the call still returns a successful return code). Specifically, the following things happen with various tests (a sketch of the status check follows the list):

1) MPI_Request_Get_Status() will usually segfault at any point between the call to MPI_IRecv and MPI_Wait.
2) On the occasions when MPI_Request_Get_Status() does not segfault, the resulting value of FLAG is False; and
3) Querying the count values and buffer sizes for the associated request gives 0 for both. These requests then fail at MPI_Wait().
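
For reference, the status check in (1)-(3) is done roughly as sketched below; the helper and variable names are made up for illustration and are not taken from the application:

  ! Hypothetical helper illustrating the diagnostic in (1)-(3) above.
  subroutine check_request(req)
    use mpi
    implicit none
    integer, intent(in) :: req
    logical :: flag
    integer :: stat(MPI_STATUS_SIZE), nrecvd, ierr

    ! Called between MPI_Irecv and MPI_Wait; in our runs this call
    ! usually segfaults.
    call MPI_Request_get_status(req, flag, stat, ierr)

    if (.not. flag) then
       ! When it does not segfault, FLAG comes back .false. and the
       ! count reported for the request's status is zero.
       call MPI_Get_count(stat, MPI_DOUBLE_PRECISION, nrecvd, ierr)
       print *, 'flag =', flag, ' count =', nrecvd
    end if
  end subroutine check_request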

All request handles as seen in Fortran are valid values, i.e. there's no NaN or anything like that. This may be clear from the error message above, since the traceback is able to give a hex value for the handle of the failing request within the C portion.
The program does proceed when run with SGI MPI.

Any help would be greatly appreciated. I recognize that some information may be missing; if so, please let me know.

Sincerely,
Michael Long
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss