[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination

Dorian Krause doriankrause at web.de
Thu Aug 13 12:47:16 EDT 2009


Hi,

again these 96 processors ...

My application hangs in a communication step which looks like this:

---------
Group A:

    for all neighbors {
        MPI_Isend(...);
    }
    MPI_Waitall(...);

    MPI_Barrier();
----
Group B:
   
    while (#messages to receive > 0) {
        MPI_Probe(MPI_ANY_SOURCE, &stat);
        q = stat.MPI_SOURCE;
        /* in subfunction: */
        MPI_Probe(q, &stat);
        MPI_Get_count(&stat, ..., &count);
        MPI_Recv(..., count, ..., q, ...);
    }
    MPI_Barrier();
----

For 96 or more processes this application hangs. Since I can't debug at 
this scale, I used gdb to get backtraces. It turned out that 94 
processes are waiting in the barrier, one process is trying to receive 
a message (stuck in MPI_Recv), and one other is waiting in 
MPI_Waitall(...). This would look fine, except that the ranks do not match:

On the PE with rank 83, I have

#3  0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
    datatype=-1946157051, source=40, tag=374, comm=-1006632954, status=0x1)
    at recv.c:156

and on the PE with rank *12* I have

#3  0x00000000004368f4 in PMPI_Waitall (count=8,
    array_of_requests=0x197e6b10, array_of_statuses=0x1)
    at waitall.c:191

It seems that rank 40 slipped through the MPI_Waitall even though it was 
not supposed to do so ...
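
To make the mismatch explicit, here is a small sketch of the check I did by eye on the backtraces. It is not part of my application; the names rank_state, blocked_in, find_rank and count_unmatched_receives are made up for illustration, and the input is assumed to be filled in by hand from the gdb output. It counts receivers whose expected sender is listed but is no longer sitting in MPI_Waitall:

```c
#include <stddef.h>

/* Hypothetical summary of where one rank is blocked, read off its backtrace. */
enum blocked_in { IN_BARRIER, IN_RECV, IN_WAITALL };

struct rank_state {
    int rank;              /* MPI rank of the process */
    enum blocked_in where; /* MPI call the process is stuck in */
    int recv_source;       /* source argument; only used when where == IN_RECV */
};

/* Return the entry recorded for `rank`, or NULL if that rank is not listed. */
const struct rank_state *find_rank(const struct rank_state *s, size_t n, int rank)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i].rank == rank)
            return &s[i];
    }
    return NULL;
}

/* Count ranks blocked in a receive whose sender is listed
 * but is not (or no longer) blocked in MPI_Waitall. */
size_t count_unmatched_receives(const struct rank_state *s, size_t n)
{
    size_t unmatched = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i].where != IN_RECV)
            continue;
        const struct rank_state *sender = find_rank(s, n, s[i].recv_source);
        if (sender != NULL && sender->where != IN_WAITALL)
            unmatched++;
    }
    return unmatched;
}
```

With the three relevant ranks from above (83 in MPI_Recv from source 40, 12 in MPI_Waitall, 40 in the barrier), count_unmatched_receives reports one unmatched receive, namely the one on rank 83.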

Please find the output files attached. Three processes appear not to be 
in the barrier (two on compute-0-3 and one on compute-0-13), but the one 
with the short backtrace on compute-0-3 is in fact also in the barrier, 
as I confirmed by hand.

Any hints what might cause this error?

I'm using the trunk version of mvapich2 (checked out yesterday), and the 
cluster consists of 14 LS22 blades (Opteron) with 4x DDR InfiniBand. I'm 
not quite sure which OFED version it is (it ships with the Rocks 
distribution, which is typically not very verbose concerning version 
numbers ...).

Thanks for your help,
Dorian







-------------- next part --------------
A non-text attachment was scrubbed...
Name: gdbout.tar.gz
Type: application/x-gzip
Size: 17140 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090813/417da3f3/gdbout.tar-0001.bin

