[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination
Dhabaleswar Panda
panda at cse.ohio-state.edu
Mon Aug 31 23:10:22 EDT 2009
Hi,
We released MVAPICH2 1.4RC2 today. We ran your application with this
version and it seems to be working fine. Can you double-check your
application with this version?
Thanks,
DK
On Thu, 13 Aug 2009, Dorian Krause wrote:
> Hi,
>
> again these 96 processors ...
>
> My application hangs in a communication step which looks like this:
>
> ---------
> Group A:
>
> for all neighbors {
> MPI_Isend(...);
> }
> MPI_Waitall(...);
>
> MPI_Barrier();
> ----
> Group B:
>
> while(#messages to receive > 0) {
> MPI_Probe(MPI_ANY_SOURCE, ..., &stat);
> q = stat.MPI_SOURCE;
> /* in subfunction: */
> MPI_Probe(q, ..., &stat);
> MPI_Get_count(&stat, ..., &n);
> MPI_Recv(..., n, ..., q, ...);
> }
> MPI_Barrier();
> ----
>
> With 96 processes this application hangs. Since I can't debug at
> this scale, I used gdb to get backtraces. It turned out that 94
> processes are waiting in the barrier, one process is trying to receive
> a message (stuck in MPI_Recv), and one other is waiting in
> MPI_Waitall(...). This looks fine; however, the ranks do not match:
>
> On the PE with rank 83, I have
>
> #3 0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
> datatype=-1946157051, source=40, tag=374, comm=-1006632954, status=0x1)
> at recv.c:156
>
> and on PE with rank *12* I have
>
> #3 0x00000000004368f4 in PMPI_Waitall (count=8,
> array_of_requests=0x197e6b10, array_of_statuses=0x1)
> at waitall.c:191
>
> It seems that rank 40 slipped through the MPI_Waitall even though it was
> not supposed to do so ...
>
> Please find attached the output files. There are three processes which
> seem not to be in the barrier (2 on compute-0-3 and 1 on compute-0-13),
> but the one with the short backtrace on compute-0-3 is also in the
> barrier, as I could confirm by hand.
>
> Any hints what might cause this error?
>
> I'm using the trunk version of mvapich2 (checked out yesterday) and the
> cluster consists of 14 LS22 blades (Opteron) with 4x DDR InfiniBand. I'm
> not quite sure which OFED version it is (it is delivered with the Rocks
> distribution and they are typically not very verbose concerning version
> numbers ...).
>
> Thanks for your help,
> Dorian