[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Thu Aug 13 14:24:51 EDT 2009


Dorian,
           I have taken a quick look at the set of back-traces. Is it 
possible to give us a copy of the application you are running?
           I noticed that the application is possibly changing the 
topology before it gets inside the MPI layer and hangs. I am also 
guessing that the code snippet you provided is related to what is 
going on inside hgc::comm::Topology::barrier. However, we don't quite 
know how the set "all neighbors" has been set up. If we could run the 
application on our systems here, it would be easier to figure out what 
is going on.

Thanks,
Krishna

Dorian Krause wrote:
> Hi,
>
> again these 96 processors ...
>
> My application hangs in a communication step which looks like this:
>
> ---------
> Group A:
>
>    for all neighbors {
>       MPI_Isend(...);
>    }
>    MPI_Waitall(...);
>
>    MPI_Barrier();
> ----
> Group B:
>    while (#messages to receive > 0) {
>       MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &stat);
>       q = stat.MPI_SOURCE;
>       /* in subfunction: */
>       MPI_Probe(q, MPI_ANY_TAG, comm, &stat);
>       MPI_Get_count(&stat, datatype, &cnt);
>       MPI_Recv(buf, cnt, datatype, q, ...);
>    }
>    MPI_Barrier();
> ----
>
> For 96 processes this application hangs. Since I can't debug at this 
> scale, I used gdb to get backtraces. It turned out that 94 processes 
> are waiting in the barrier, one process is trying to receive a 
> message (stuck in MPI_Recv), and one other is waiting in 
> MPI_Waitall(...). This looks fine, however the ranks do not match:
>
> On the PE with rank 83, I have
>
> #3  0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
>    datatype=-1946157051, source=40, tag=374, comm=-1006632954, 
> status=0x1)
>    at recv.c:156
>
> and on PE with rank *12* I have
>
> #3  0x00000000004368f4 in PMPI_Waitall (count=8,
>    array_of_requests=0x197e6b10, array_of_statuses=0x1)
>    at waitall.c:191
>
> It seems that rank 40 slipped through the MPI_Waitall even though it 
> was not supposed to do so ...
>
> Please find attached the output files. There are three processes 
> which seem not to be in the barrier (two on compute-0-3 and one on 
> compute-0-13, but the one with the short backtrace on compute-0-3 is 
> actually in the barrier, as I could confirm by hand).
>
> Any hints as to what might cause this error?
>
> I'm using the trunk version of mvapich2 (checked out yesterday), and 
> the cluster consists of 14 LS22 blades (Opteron) with 4x DDR 
> InfiniBand. I'm not quite sure which OFED version it is (it is 
> delivered with the Rocks distribution, and they are typically not 
> very verbose about version numbers ...).
>
> Thanks for your help,
> Dorian
>
