[mvapich-discuss] Fortran MPI_Wait() request error

Michael S. Long mlong at seas.harvard.edu
Fri Oct 6 19:08:58 EDT 2017


Hi DK,

I hope to have a small reproducer working soon. I'll post it to GitHub 
when/if it works.
At some point I recall testing with another C compiler (likely GCC), 
but I've hit this problem in so many ways that I don't recall exactly.
Over the weekend I'll try with GCC and another version of IFORT.
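
In the meantime, a rough sketch of the failing pattern is below. The 
dimensions, names, and the send side are illustrative assumptions, not 
our actual code; what matches the real program is the receive side: one 
MPI_Irecv per level of the third dimension of a 3D array, followed by 
an MPI_Wait on each request.

    program repro_sketch
      ! Sketch only: hypothetical sizes and a plain point-to-point send
      ! side stand in for the real broadcast pattern.
      use mpi
      implicit none
      integer, parameter :: NX = 72, NY = 46, NZ = 47  ! hypothetical
      real(8) :: buf(NX, NY, NZ)
      integer :: reqs(NZ), stat(MPI_STATUS_SIZE)
      integer :: ierr, rank, nprocs, k, dest

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      if (rank == 0) then
         buf = 1.0d0
         do dest = 1, nprocs - 1
            do k = 1, NZ
               ! each (:,:,k) slice is contiguous, so the section can
               ! be passed to MPI directly
               call MPI_Send(buf(:, :, k), NX*NY, MPI_DOUBLE_PRECISION, &
                             dest, k, MPI_COMM_WORLD, ierr)
            end do
         end do
      else
         ! post one non-blocking receive per level of the 3rd dimension
         do k = 1, NZ
            call MPI_Irecv(buf(:, :, k), NX*NY, MPI_DOUBLE_PRECISION, &
                           0, k, MPI_COMM_WORLD, reqs(k), ierr)
         end do
         ! several of these waits die with the PMPI_Wait error quoted
         ! in my original message below
         do k = 1, NZ
            call MPI_Wait(reqs(k), stat, ierr)
         end do
      end if

      call MPI_Finalize(ierr)
    end program repro_sketch

Built with mpif90 and run with, e.g., "mpirun -np 4 ./repro_sketch".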

Thanks. Have a good weekend.

ML

On 10/06/2017 06:37 PM, Panda, Dhabaleswar wrote:
> Hi, Michael,
>
> Thanks for your report. Sorry to hear that you are facing this 
> error. Are you seeing this error only with ICC 15.0.0, or with other 
> versions of ICC as well? Do you see this issue with GCC? Is it 
> possible to get a small reproducer? That would help us debug this 
> issue more quickly.
>
> Thanks,
>
> DK
>
> Sent from my iPhone
>
> On Oct 6, 2017, at 6:14 PM, Michael S. Long <mlong at seas.harvard.edu> wrote:
>
>> Dear MVAPICH-Discuss,
>>
>> We are having a problem associated with MPI_IRecv & MPI_Wait in 
>> Fortran90.
>>
>> MVAPICH version: 2.2b (2.3b was also tested; it does not produce the 
>> same explicit error, but it hangs at the same point)
>> Compiler: IFORT & ICC 15.0.0
>>
>> In a loop over one dimension of a 3D array whose data are being 
>> broadcast, MPI_Wait() on several of the receive requests dies with 
>> the following error:
>>
>>> Fatal error in PMPI_Wait: Other MPI error, error stack:
>>> PMPI_Wait(182)..................: 11MPI_Wait(request=0x23f6fea0, 
>>> status=0x1) failed
>>> MPIR_Wait_impl(71)..............:
>>> MPIDI_CH3I_Progress(393)........:
>>> pkt_CTS_handler(321)............:
>>> MPID_nem_lmt_shm_start_send(273):
>>> MPID_nem_delete_shm_region(926).:
>>> MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory - 
>>> unlink No such file or directory
>>
>> What we've been able to determine is that at the call to MPI_IRecv(), 
>> the associated MPI_Request is /not/ being allocated, even though the 
>> call still returns a successful return code. Specifically, the 
>> following things happen in various tests:
>>
>> 1) MPI_Request_Get_Status() will usually segfault at any point 
>> between the call to MPI_IRecv and MPI_Wait;
>> 2) on the occasions when MPI_Request_Get_Status() does not segfault, 
>> the resulting value of FLAG is false; and
>> 3) querying the count values and buffer sizes for the associated 
>> request gives 0 for both. These requests then fail at MPI_Wait().
>> (See the sketch below for how these checks were made.)
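>>
>> Roughly, the probing between MPI_IRecv and MPI_Wait looked like the 
>> following. This is an illustrative sketch only; reqs(k) stands for 
>> the request returned by the k-th MPI_IRecv, not our actual names:
>>
>>    logical :: flag
>>    integer :: ierr, cnt, stat(MPI_STATUS_SIZE)
>>
>>    ! tests (1)/(2): this call usually segfaults; when it does not,
>>    ! FLAG comes back as .false.
>>    call MPI_Request_get_status(reqs(k), flag, stat, ierr)
>>
>>    ! test (3): the count for the failing requests comes back as 0
>>    call MPI_Get_count(stat, MPI_DOUBLE_PRECISION, cnt, ierr)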
>>
>> All request handles, as seen in Fortran, hold valid-looking values; 
>> i.e., there is no NaN or anything like that. This may be clear from 
>> the error message above, since the traceback is able to give a hex 
>> value for the handle of the failing request within the C portion.
>> The program proceeds past this point when run with SGI's MPI.
>>
>> Any help would be greatly appreciated. We recognize that some 
>> information might be missing; if so, please let me know.
>>
>> Sincerely,
>> Michael Long
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
.............................
Research Associate
Atmospheric Chemistry Modeling Group
School of Engineering and Applied Sciences
Harvard University

Web   : http://people.seas.harvard.edu/~mlong/
Email : mlong at seas.harvard.edu
         mslong at virginia.edu
-----------------------------
