[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination

Dorian Krause doriankrause at web.de
Tue Sep 1 14:41:40 EDT 2009


Hi,

thanks. Unfortunately, we are currently down for server-room 
maintenance. I will test as soon as possible ...

Dorian

Dhabaleswar Panda wrote:
> Hi,
>
> We made the MVAPICH2 1.4RC2 release today. We have run your application
> with this version and it seems to work fine. Could you double-check
> your application with this version?
>
> Thanks,
>
> DK
>
> On Thu, 13 Aug 2009, Dorian Krause wrote:
>
>   
>> Hi,
>>
>> Again, these 96 processors ...
>>
>> My application hangs in a communication step which looks like this:
>>
>> ---------
>> Group A:
>>
>>     for all neighbors {
>>         MPI_Isend(...);
>>     }
>>     MPI_Waitall(...);
>>
>>     MPI_Barrier();
>> ----
>> Group B:
>>
>>     while (#messages to receive > 0) {
>>         MPI_Probe(MPI_ANY_SOURCE, tag, comm, &stat);
>>         q = stat.MPI_SOURCE;
>>         /* in a subfunction: */
>>         MPI_Probe(q, tag, comm, &stat);
>>         MPI_Get_count(&stat, datatype, &count);
>>         MPI_Recv(buf, count, datatype, q, ...);
>>     }
>>     MPI_Barrier();
>> ----
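>>
>> For reference, here is a minimal, self-contained sketch of this exchange
>> pattern (the rank split, neighbor count, datatype, and message sizes are
>> made up for illustration; the tag and count just echo the backtrace
>> below; the real application derives all of these differently):
>>
>>     /* Hypothetical sketch of the Group A / Group B exchange above. */
>>     #include <mpi.h>
>>     #include <stdlib.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         int rank, size, tag = 374;        /* tag as seen in the backtrace */
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>         int half = size / 2;              /* assumed split: A = lower half */
>>         if (rank < half) {                /* Group A: post sends, then wait */
>>             double buf[202];              /* size from the backtrace; contents
>>                                              irrelevant for the sketch */
>>             MPI_Request req;
>>             MPI_Isend(buf, 202, MPI_DOUBLE, rank + half, tag,
>>                       MPI_COMM_WORLD, &req);
>>             MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
>>         } else {                          /* Group B: probe, size, receive */
>>             int remaining = 1;            /* illustrative message count */
>>             while (remaining > 0) {
>>                 MPI_Status stat;
>>                 int q, count;
>>                 MPI_Probe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &stat);
>>                 q = stat.MPI_SOURCE;
>>                 /* re-probe the now-known source, as in the subfunction */
>>                 MPI_Probe(q, tag, MPI_COMM_WORLD, &stat);
>>                 MPI_Get_count(&stat, MPI_DOUBLE, &count);
>>                 double *buf = malloc(count * sizeof(double));
>>                 MPI_Recv(buf, count, MPI_DOUBLE, q, tag, MPI_COMM_WORLD,
>>                          MPI_STATUS_IGNORE);
>>                 free(buf);
>>                 remaining--;
>>             }
>>         }
>>         MPI_Barrier(MPI_COMM_WORLD);
>>         MPI_Finalize();
>>         return 0;
>>     }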
>>
>> With 96 processes this application hangs. Since I can't do interactive
>> debugging at this scale, I used gdb only to collect backtraces. It
>> turned out that 94 processes are waiting in the barrier, one process is
>> trying to receive a message (stuck in MPI_Recv), and one other is
>> waiting in MPI_Waitall(...). This would look fine, except that the
>> ranks do not match:
>>
>> On the PE with rank 83, I have
>>
>> #3  0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
>>     datatype=-1946157051, source=40, tag=374, comm=-1006632954, status=0x1)
>>     at recv.c:156
>>
>> and on the PE with rank *12* I have
>>
>> #3  0x00000000004368f4 in PMPI_Waitall (count=8,
>>     array_of_requests=0x197e6b10, array_of_statuses=0x1)
>>     at waitall.c:191
>>
>> It seems that rank 40 slipped through the MPI_Waitall even though it
>> was not supposed to ...
>>
>> Please find the output files attached. Three processes appear not to
>> be in the barrier (two on compute-0-3 and one on compute-0-13), but
>> the one with the short backtrace on compute-0-3 is in fact also in the
>> barrier, as I confirmed by hand.
>>
>> Any hints as to what might cause this error?
>>
>> I'm using the trunk version of mvapich2 (checked out yesterday). The
>> cluster consists of 14 LS22 blades (Opteron) with 4x DDR InfiniBand.
>> I'm not quite sure which OFED version it is (it is delivered with the
>> Rocks distribution, which is typically not very verbose about version
>> numbers ...).
>>
>> Thanks for your help,
>> Dorian