[mvapich-discuss] Program stuck in MPI Framework Library

Dash, Ambika Ambika.Dash at kla-tencor.com
Wed Jun 27 14:05:03 EDT 2018


Hi,

We use the MPI framework in our software stack, and on one of the nodes the MPI send, receive, and test calls are getting stuck, as shown in the following backtraces.
Could you advise us on how to go about finding the root cause of this issue?

MPI_Isend backtrace

#0  0x00007f10b4fa5773 in MPIDI_CH3I_MRAILI_Cq_poll_ib () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#1  0x00007f10b4fa11cc in MPIDI_CH3I_MRAILI_Waiting_msg () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#2  0x00007f10b4f77227 in MPIDI_CH3I_read_progress () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#3  0x00007f10b4f76e3a in MPIDI_CH3I_Progress_test () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#4  0x00007f10b4f69bb6 in MPID_Isend () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#5  0x00007f10b4ef466a in PMPI_Isend () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#6  0x0000000000476673 in sync_MPI_Isend (buf=0x7ef071b6e020, count=4719952, datatype=1275068673, dest=2, tag=0, comm=1140850688, request=0xba1728 <s_mpiTxRequestList+104>) at CommQueue.cpp:68
#7  0x000000000047876c in DataQueueTx::send (this=0x4753450) at CommQueue.cpp:968
#8  0x000000000047815d in CommQueueTx::send (this=0x46f92a8) at CommQueue.cpp:788
#9  0x00000000004f9980 in CommandManager::sendBuffer (this=0x434e740, dataBufferIndex=0) at CommandManager.cpp:865
#10 0x00000000004f99ba in CommandManager::sendBuffer (this=0x434e740, dataBuffer=0x7ef071b6e010) at CommandManager.cpp:873

MPI_Test backtrace

#0  0x00007f5e06e72773 in MPIDI_CH3I_MRAILI_Cq_poll_ib () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#1  0x00007f5e06e6e1cc in MPIDI_CH3I_MRAILI_Waiting_msg () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#2  0x00007f5e06e44227 in MPIDI_CH3I_read_progress () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#3  0x00007f5e06e43e3a in MPIDI_CH3I_Progress_test () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#4  0x00007f5e06dc9260 in MPIR_Test_impl () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#5  0x00007f5e06dc95ab in PMPI_Test () from /usr/mpi/gcc/mvapich2-2.2a/lib/libmpi.so.12
#6  0x000000000047656f in sync_MPI_Test (mpiRequest=0x41d1dd8, flag=0x7f5e013ba3ec, mpi_status=0x7f5e013ba3f0) at CommQueue.cpp:49
#7  0x0000000000477836 in DataQueueRx::wait (this=0x41d1dd0) at CommQueue.cpp:528
#8  0x0000000000477259 in CommQueueRx::getBuffer (this=0x41eaba8) at CommQueue.cpp:365
#9  0x00000000004f906f in CommandManager::getNextCommand (this=0x3e3d720) at CommandManager.cpp:766
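
To tell whether a call is truly blocked inside the library or whether the application keeps re-entering the progress engine from a polling loop, a time-bounded poll along the following lines could be used. This is only a sketch under assumptions: `pollWithTimeout` is a hypothetical helper, not the actual code in CommQueue.cpp and not part of MVAPICH2, and in a real program the callable would wrap a single MPI_Test call on the pending request.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical diagnostic helper (assumption: not taken from CommQueue.cpp).
// Calls testFn repeatedly until it reports completion or the timeout expires.
// In an MPI program, testFn would wrap one MPI_Test call on the pending
// request and return true once the completion flag is set.
bool pollWithTimeout(const std::function<bool()>& testFn,
                     std::chrono::milliseconds timeout,
                     std::chrono::milliseconds interval)
{
    assert(timeout.count() >= 0 && interval.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (testFn()) {
            return true;   // request completed
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // still pending: caller can log rank, peer, and tag
        }
        std::this_thread::sleep_for(interval);
    }
}
```

If the helper returns false, the request is pending but each individual test call still returns, which points at a message that never arrives (for example, a missing matching send or receive on the peer). If the helper itself never returns, a single call is blocked inside the library.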

Thanks,
AP Dash

