[mvapich-discuss] Error in MPI_Neighbor_alltoallv

Phanisri Pradeep Pratapa ppratapa at gatech.edu
Mon Feb 29 23:40:02 EST 2016


Thanks, Hari.

Regards,

Pradeep

On Mon, Feb 29, 2016 at 11:19 PM, Hari Subramoni <subramoni.1 at osu.edu>
wrote:

> This issue has been resolved through discussions off the group. The fix
> will be available with the upcoming MVAPICH2 release.
>
> Thx,
> Hari.
> On Nov 19, 2015 3:08 PM, "Hari Subramoni" <subramoni.1 at osu.edu> wrote:
>
>> Hello,
>>
>> This is indeed an out-of-memory situation. We are working on a patch for
>> it and will get back to you soon. Do you happen to have a reproducer for
>> the error? Could you also let us know your system configuration and the
>> version of MVAPICH2 you are using?
>>
>> Thx,
>> Hari.
>> On Nov 18, 2015 2:17 PM, "Phanisri Pradeep Pratapa" <ppratapa at gatech.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> I am running a C++ code with MPI 3.0 through mvapich2/2.1.
>>>
>>> I use MPI_Neighbor_alltoallv in my code, and it needs to be called in
>>> every iteration. I have created a periodic Cartesian topology to enable
>>> local communication. I found that this function works correctly for a few
>>> iterations and then fails, giving the following error:
>>>
>>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>> [cli_187]: aborting job:
>>> Fatal error in PMPI_Ineighbor_alltoallv: Other MPI error, error stack:
>>> PMPI_Ineighbor_alltoallv(229).......: MPI_Ineighbor_alltoallv(sendbuf=0x2aaac9fa5a20, sendcounts=0x2aaac81f1ab0, sdispls=0x2aaac81f05d0, sendtype=MPI_DOUBLE, recvbuf=0x2aaac9f96050, recvcounts=0x2aaac81f4470, rdispls=0x2aaac81f2f90, recvtype=MPI_DOUBLE, comm=comm=0x84000006, request=0x7fffff
>>> PMPI_Ineighbor_alltoallv(215).......:
>>> MPIR_Ineighbor_alltoallv_impl(112)..:
>>> MPIR_Ineighbor_alltoallv_default(78):
>>> MPID_Sched_recv(599)................:
>>> MPIDU_Sched_add_entry(425)..........:
>>> Out of memory
>>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>>
>>> This happens only when each processor communicates with all the other
>>> processors (or more, since the topology is periodic) and the total number
>>> of processors is at least 216 (4 nodes). The function works correctly in
>>> all other cases I have tested. The failure occurs with both the blocking
>>> and non-blocking versions. Moreover, it appears in roughly 8 out of 10
>>> runs of the code (with the same inputs, commands, options, etc.); the
>>> other 2 runs complete successfully. I have debugged and run memory checks
>>> and found no memory leaks.
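>>>
>>> For reference, a minimal sketch of the setup described above (a periodic
>>> 3-D Cartesian topology with MPI_Neighbor_alltoallv called every
>>> iteration). The dimensionality, message counts, and iteration count here
>>> are illustrative assumptions, not the original code:
>>>
>>> #include <mpi.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv) {
>>>     MPI_Init(&argc, &argv);
>>>
>>>     int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};  /* periodic in all dims */
>>>     int nprocs;
>>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>     MPI_Dims_create(nprocs, 3, dims);
>>>
>>>     MPI_Comm cart;
>>>     MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
>>>
>>>     /* A 3-D Cartesian topology has 6 neighbors (2 per dimension);
>>>      * 'count' doubles per neighbor is an assumed message size. */
>>>     const int nnbrs = 6, count = 8;
>>>     double *sendbuf = malloc(nnbrs * count * sizeof(double));
>>>     double *recvbuf = malloc(nnbrs * count * sizeof(double));
>>>     int sendcounts[6], recvcounts[6], sdispls[6], rdispls[6];
>>>     for (int i = 0; i < nnbrs; ++i) {
>>>         sendcounts[i] = recvcounts[i] = count;
>>>         sdispls[i] = rdispls[i] = i * count;
>>>         for (int j = 0; j < count; ++j) sendbuf[i * count + j] = i + j;
>>>     }
>>>
>>>     /* The per-iteration call where the reported failure occurs */
>>>     for (int iter = 0; iter < 100; ++iter)
>>>         MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
>>>                                recvbuf, recvcounts, rdispls, MPI_DOUBLE,
>>>                                cart);
>>>
>>>     free(sendbuf); free(recvbuf);
>>>     MPI_Comm_free(&cart);
>>>     MPI_Finalize();
>>>     return 0;
>>> }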
>>>
>>> There was a similar problem I found on this forum which someone else had
>>> experienced, but there seems to be no final response to it:
>>>
>>> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-June/005002.html
>>>
>>> Please let me know if somebody can help.
>>>
>>> Thank you,
>>>
>>> Regards,
>>>
>>> Pradeep
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>

