[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination
Dorian Krause
doriankrause at web.de
Tue Sep 1 14:41:40 EDT 2009
Hi,
Thanks. Unfortunately, we have downtime due to server-room
maintenance. I will test ASAP ...
Dorian
Dhabaleswar Panda wrote:
> Hi,
>
> We made the MVAPICH2 1.4RC2 release today. We have run your application
> with this version and it seems to work fine. Can you double-check
> your application with this version?
>
> Thanks,
>
> DK
>
> On Thu, 13 Aug 2009, Dorian Krause wrote:
>
>
>> Hi,
>>
>> again these 96 processors ...
>>
>> My application hangs in a communication step which looks like this:
>>
>> ---------
>> Group A:
>>
>> for all neighbors {
>> MPI_Isend(...);
>> }
>> MPI_Waitall(...);
>>
>> MPI_Barrier();
>> ----
>> Group B:
>>
>> while (#messages to receive > 0) {
>>     MPI_Probe(MPI_ANY_SOURCE, tag, comm, &stat);
>>     src = stat.MPI_SOURCE;
>>     /* in subfunction: */
>>     MPI_Probe(src, tag, comm, &stat);
>>     MPI_Get_count(&stat, datatype, &count);
>>     MPI_Recv(buf, count, ...);
>> }
>> MPI_Barrier();
>> ----
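>>
>> A self-contained sketch of the same pattern (a minimal reproduction,
>> not the real application; the tag value, datatype, neighbor list, and
>> message count below are placeholders):
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> /* Senders (group A) post nonblocking sends to all neighbors;
>>  * receivers (group B) probe for any source, query the size, and
>>  * receive. Both groups meet in the barrier at the end. */
>> void exchange(int nneighbors, const int *neighbors, int nrecv,
>>               double *sendbuf, int sendcount, MPI_Comm comm)
>> {
>>     const int tag = 374;                   /* placeholder tag */
>>     if (nneighbors > 0) {                  /* group A */
>>         MPI_Request *req = malloc(nneighbors * sizeof(MPI_Request));
>>         for (int i = 0; i < nneighbors; ++i)
>>             MPI_Isend(sendbuf, sendcount, MPI_DOUBLE, neighbors[i],
>>                       tag, comm, &req[i]);
>>         MPI_Waitall(nneighbors, req, MPI_STATUSES_IGNORE);
>>         free(req);
>>     }
>>     while (nrecv-- > 0) {                  /* group B */
>>         MPI_Status stat;
>>         int src, count;
>>         MPI_Probe(MPI_ANY_SOURCE, tag, comm, &stat);
>>         src = stat.MPI_SOURCE;
>>         /* second probe on the now-known source, as in the subfunction */
>>         MPI_Probe(src, tag, comm, &stat);
>>         MPI_Get_count(&stat, MPI_DOUBLE, &count);
>>         double *buf = malloc(count * sizeof(double));
>>         MPI_Recv(buf, count, MPI_DOUBLE, src, tag, comm, &stat);
>>         free(buf);
>>     }
>>     MPI_Barrier(comm);
>> }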
>>
>> With 96 processes this application hangs. Since I can't debug at
>> this scale, I used gdb to get backtraces. It turned out that 94
>> processes are waiting in the barrier, one process is trying to receive
>> a message (stuck in MPI_Recv), and one other is waiting in
>> MPI_Waitall(...). This would look fine, except that the ranks do not match:
>>
>> On the PE with rank 83, I have
>>
>> #3 0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
>> datatype=-1946157051, source=40, tag=374, comm=-1006632954, status=0x1)
>> at recv.c:156
>>
>> and on PE with rank *12* I have
>>
>> #3 0x00000000004368f4 in PMPI_Waitall (count=8,
>> array_of_requests=0x197e6b10, array_of_statuses=0x1)
>> at waitall.c:191
>>
>> It seems that rank 40 slipped through the MPI_Waitall even though it was
>> not supposed to ...
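>>
>> One way to check that hypothesis (a hypothetical diagnostic, not part
>> of the application) would be to run the Waitall with errors returned
>> instead of fatal and inspect the per-request statuses:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> /* Report any request that did not complete successfully. Assumes
>>  * the requests were created on comm. */
>> static void checked_waitall(int n, MPI_Request *reqs, MPI_Comm comm)
>> {
>>     MPI_Status *stats = malloc(n * sizeof(MPI_Status));
>>     MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
>>     if (MPI_Waitall(n, reqs, stats) != MPI_SUCCESS)  /* MPI_ERR_IN_STATUS */
>>         for (int i = 0; i < n; ++i)
>>             if (stats[i].MPI_ERROR != MPI_SUCCESS)
>>                 fprintf(stderr, "request %d: error %d\n",
>>                         i, stats[i].MPI_ERROR);
>>     free(stats);
>> }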
>>
>> Please find attached the output files. There are three processes which
>> seem not to be in the barrier (2 on compute-0-3 and 1 on compute-0-13,
>> but the one with the short backtrace on compute-0-3 is also in the
>> barrier, as I could confirm by hand).
>>
>> Any hints what might cause this error?
>>
>> I'm using the trunk version of MVAPICH2 (checked out yesterday) and the
>> cluster consists of 14 LS22 blades (Opteron) with 4x DDR InfiniBand. I'm
>> not quite sure which OFED version it is (it is delivered with the Rocks
>> distribution, and they are typically not very verbose about version
>> numbers ...).
>>
>> Thanks for your help,
>> Dorian