[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Aug 31 23:10:22 EDT 2009


Hi,

We made the MVAPICH2 1.4RC2 release today. We have run your application
with this version and it seems to work fine. Could you double-check
your application with this version?

Thanks,

DK

On Thu, 13 Aug 2009, Dorian Krause wrote:

> Hi,
>
> again these 96 processors ...
>
> My application hangs in a communication step which looks like this:
>
> ---------
> Group A:
>
>     for all neighbors {
>         MPI_Isend(...);
>     }
>     MPI_Waitall(...);
>
>     MPI_Barrier();
> ----
> Group B:
>
>     while (#messages to receive > 0) {
>         MPI_Probe(MPI_ANY_SOURCE, ..., &stat);
>         q = stat.MPI_SOURCE;
>         /* in subfunction: */
>         MPI_Probe(q, ..., &stat);
>         MPI_Get_count(&stat, ..., &count);
>         MPI_Recv(buf, count, ..., q, ...);
>     }
>     MPI_Barrier();
> ----
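>
> A stripped-down, self-contained version of this pattern (with the two
> probes collapsed into one, and placeholder sizes, tags and
> sender/receiver assignment instead of the real application code) looks
> roughly like this:
>
> #include <mpi.h>
> #include <stdlib.h>
>
> /* Sketch only: even ranks send one non-blocking message to the next
>  * odd rank; odd ranks receive via Probe/Get_count/Recv without
>  * knowing the message size in advance. */
> int main(int argc, char **argv)
> {
>     int rank, size;
>     const int TAG = 374;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (rank % 2 == 0 && rank + 1 < size) {
>         /* "Group A": post the Isend(s), then wait for all of them */
>         double buf[202] = { 0.0 };
>         MPI_Request req;
>         MPI_Isend(buf, 202, MPI_DOUBLE, rank + 1, TAG,
>                   MPI_COMM_WORLD, &req);
>         MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
>     } else if (rank % 2 == 1) {
>         /* "Group B": probe for any incoming message, query its size,
>          * allocate a buffer, then receive from the matched source */
>         int nmsg = 1;           /* #messages to receive */
>         while (nmsg > 0) {
>             MPI_Status stat;
>             int count, src;
>             MPI_Probe(MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &stat);
>             src = stat.MPI_SOURCE;
>             MPI_Get_count(&stat, MPI_DOUBLE, &count);
>             double *recvbuf = malloc(count * sizeof(double));
>             MPI_Recv(recvbuf, count, MPI_DOUBLE, src, TAG,
>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             free(recvbuf);
>             nmsg--;
>         }
>     }
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }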
>
> With 96 processes this application hangs. Since I can't debug at
> this scale, I used gdb to get backtraces. It turned out that 94
> processes are waiting in the barrier, one process is trying to receive
> a message (stuck in MPI_Recv), and one other is waiting in
> MPI_Waitall(...). This looks fine; however, the ranks do not match:
>
> On the PE with rank 83, I have
>
> #3  0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
>     datatype=-1946157051, source=40, tag=374, comm=-1006632954, status=0x1)
>     at recv.c:156
>
> and on the PE with rank *12* I have
>
> #3  0x00000000004368f4 in PMPI_Waitall (count=8,
>     array_of_requests=0x197e6b10, array_of_statuses=0x1)
>     at waitall.c:191
>
> It seems that rank 40 slipped through the MPI_Waitall even though it
> was not supposed to do so ...
>
> Please find the output files attached. There are three processes which
> seem not to be in the barrier (two on compute-0-3 and one on
> compute-0-13), but the one with the short backtrace on compute-0-3 is
> actually in the barrier as well, as I could confirm by hand.
>
> Any hints as to what might cause this error?
>
> I'm using the trunk version of MVAPICH2 (checked out yesterday), and
> the cluster consists of 14 LS22 blades (Opteron) with 4x DDR
> InfiniBand. I'm not quite sure which OFED version it is (it is
> delivered with the Rocks distribution, and they are typically not very
> verbose about version numbers ...).
>
> Thanks for your help,
> Dorian
>


