[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination
Dorian Krause
doriankrause at web.de
Thu Aug 13 12:47:16 EDT 2009
Hi,
again these 96 processors ...
My application hangs in a communication step which looks like this:
---------
Group A:
for all neighbors {
MPI_Isend(...);
}
MPI_Waitall(...);
MPI_Barrier();
----
Group B:
while(#messages to receive > 0) {
MPI_Probe(MPI_ANY_SOURCE, &stat);
q = stat.MPI_SOURCE
/* in subfunction: */
MPI_Probe(q, &stat)
q = stat.MPI_COUNT;
MPI_Recv(q, ...);
}
MPI_Barrier();
----
With 96 processes this application hangs. Since I can't debug on
this scale, I used gdb to get backtraces. It turned out that 94
processes are waiting in the barrier, one process is trying to receive
a message (stuck in MPI_Recv), and one other is waiting in
MPI_Waitall(...). This looks fine; however, the ranks do not match:
On the PE with rank 83, I have
#3 0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
datatype=-1946157051, source=40, tag=374, comm=-1006632954, status=0x1)
at recv.c:156
and on PE with rank *12* I have
#3 0x00000000004368f4 in PMPI_Waitall (count=8,
array_of_requests=0x197e6b10, array_of_statuses=0x1)
at waitall.c:191
It seems that rank 40 slipped through the MPI_Waitall even though it was
not supposed to do so ...
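To make the mismatch explicit, here is a minimal sketch in C of the consistency check I am doing by hand on the backtraces. It is not part of the application and does not use MPI: the names rank_state, fill_observed_state and find_orphaned_receivers are made up for illustration, and the state table is simply transcribed from the gdb output above. The assumption is that a receiver blocked in MPI_Recv should find its sender still in MPI_Waitall; a sender that is already somewhere else is suspicious.

```c
#include <assert.h>

/* Hypothetical per-rank state, transcribed by hand from gdb backtraces. */
enum where { IN_BARRIER, IN_RECV, IN_WAITALL };

struct rank_state {
    enum where where;
    int source; /* only meaningful for IN_RECV: the rank being waited on */
};

/* Fill in the situation described in this mail: rank 83 is stuck in
 * MPI_Recv from source 40, rank 12 is in MPI_Waitall, all other ranks
 * are in the barrier. */
void fill_observed_state(struct rank_state *s, int n)
{
    assert(n > 83);
    for (int r = 0; r < n; ++r) {
        s[r].where = IN_BARRIER;
        s[r].source = -1;
    }
    s[83].where = IN_RECV;
    s[83].source = 40;
    s[12].where = IN_WAITALL;
}

/* Report every receiver whose expected sender is no longer in
 * MPI_Waitall. Offending receiver ranks are written to out (at most
 * max_out of them); the return value is the total number found. */
int find_orphaned_receivers(const struct rank_state *s, int n,
                            int *out, int max_out)
{
    int found = 0;
    assert(n >= 0);
    for (int r = 0; r < n; ++r) {
        if (s[r].where != IN_RECV)
            continue;
        int src = s[r].source;
        if (src < 0 || src >= n)
            continue; /* unknown or wildcard source: nothing to check */
        if (s[src].where != IN_WAITALL) {
            if (found < max_out)
                out[found] = r;
            ++found;
        }
    }
    return found;
}
```

Applied to the 96-rank table from the backtraces, this flags exactly one receiver, rank 83, because its source (rank 40) is already in the barrier while the only process still in MPI_Waitall is rank 12.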
Please find attached the output files. There are three processes which
seem not to be in the barrier (2 on compute-0-3 and 1 on compute-0-13),
but the one with the short backtrace on compute-0-3 is also in the
barrier, as I could confirm by hand.
Any hints what might cause this error?
I'm using the trunk version of mvapich2 (checked out yesterday) and the
cluster consists of 14 LS22 blades (Opteron) with 4x DDR InfiniBand. I'm
not quite sure which OFED version it is (it is delivered with the Rocks
distribution and they are typically not very verbose concerning version
numbers ...).
Thanks for your help,
Dorian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gdbout.tar.gz
Type: application/x-gzip
Size: 17140 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090813/417da3f3/gdbout.tar-0001.bin