[mvapich-discuss] Re: "MPI_init" MPI jobs hang on startup.

Adam Moody moody20 at llnl.gov
Thu Oct 26 21:48:06 EDT 2006


Hi Lior, Chris, and the MVAPICH team,
In our linpack runs, we are losing messages in the Sendrecv loop of 
comm_exch_addr() on lines 192-201 of src/context/comm_rdma_init.c:

    for(i = 0; i < comm->np; i++) {
        /* Don't send to myself */
        if(i == comm->local_rank) continue;

        MPI_Sendrecv((void*)&send_pkt, sizeof(struct Coll_Addr_Exch),
                MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
                (void*)&(recv_pkt[i]),sizeof(struct Coll_Addr_Exch),
                MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
                MPI_COMM_WORLD, &(statarray[i]));
    }

Runs of 1024 processes or fewer get through MPI_Init().  However, runs of 
2048 tasks or more sometimes hang, which points to a race condition.  In 
the cases where we get hangs, typically MPI ranks 0 through X get through 
MPI_Init, but ranks X+1 through N-1 get stuck.  The value of X varies from 
one run to the next.

Attaching with TotalView during one of these hangs, I could see that ranks 
0-12 made it through but 13-2047 were stuck.  In this case, I could also 
see that ranks 13-995 were stuck waiting for a message from rank 996, 
while ranks 996-2047 were all waiting for a message from rank 12.  So it 
seems that the messages from rank 12 to ranks 996-2047 never made it 
through.  Yet rank 12 apparently sent messages to everyone and received 
messages from everyone, since it made it past MPI_Init().  Given that we 
didn't get any apparent IB errors (like code=12), I'm wondering whether a 
message buffer may have been overwritten before it was processed.  Or was 
a message perhaps delivered to the wrong buffer?

On a side note, it would seem that this startup loop would run much 
faster if processes started communicating in a staggered manner.  It 
appears that everyone starts by sending to rank 0, and then a pipeline 
forms.  Wouldn't it be better to have each rank start with its neighbor 
to the right and progress from there?  The current implementation fires a 
flurry of messages at a single node, while the staggered ordering spreads 
the load evenly (a rough sketch of what I mean is below).  It may also 
work around the problem of the lost message, but we still need to figure 
out why messages are being lost.
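Here is the kind of schedule I have in mind, written against the same 
names as the loop quoted above (send_pkt, recv_pkt, statarray, 
ADDR_EXCHANGE_TAG, and the comm fields).  I haven't compiled or tested 
this against the MVAPICH tree, so please treat it as an illustration 
only, not a patch:

    int s, src, dst;

    for (s = 1; s < comm->np; s++) {
        /* At step s, send to the rank s places to my right and receive
         * from the rank s places to my left.  Every send posted at step s
         * is matched by a receive posted at the same step on the
         * destination rank, so no single rank has np-1 messages aimed at
         * it at once the way rank 0 does in the current loop. */
        dst = (comm->local_rank + s) % comm->np;
        src = (comm->local_rank - s + comm->np) % comm->np;

        MPI_Sendrecv((void*)&send_pkt, sizeof(struct Coll_Addr_Exch),
                MPI_BYTE, comm->lrank_to_grank[dst], ADDR_EXCHANGE_TAG,
                (void*)&(recv_pkt[src]), sizeof(struct Coll_Addr_Exch),
                MPI_BYTE, comm->lrank_to_grank[src], ADDR_EXCHANGE_TAG,
                MPI_COMM_WORLD, &(statarray[src]));
    }

Since every rank walks the ring at the same offset, each step is a clean 
shift: everyone has exactly one incoming and one outgoing message in 
flight, instead of the whole job converging on rank 0 first.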
-Adam


Lior Ofer wrote:

>Added Chris to the email list.
>He will contact you tomorrow.
>We will check if there is a hardcoded barrier in the code for 1024.
>
>Lior
>
>______________________________________________________    
>Lior Ofer | 978.439.5416 (o)  | 339.221.1451 (m)
>Manager, US Customer Support center 
>Voltaire - The Grid Backbone
>www.voltaire.com
>lioro at voltaire.com
>No problem can withstand the assault of sustained thinking.  (Voltaire
>1778)
>
>
>-----Original Message-----
>From: Ira Weiny [mailto:weiny2 at llnl.gov] 
>Sent: Thursday, October 26, 2006 9:17 PM
>To: Lior Ofer
>Cc: tdhooge at llnl.gov; moody20 at llnl.gov; mhaskell at llnl.gov;
>mlleinin at hpcn.ca.sandia.gov
>Subject: Re: "MPI_init" MPI jobs hang on startup.
>
>Yes it seems that < 1024 is OK.
>
>Also, Adam has some more information which he will email out soon.
>
>Ira
>
>On Fri, 27 Oct 2006 03:06:08 +0200
>"Lior Ofer" <lioro at voltaire.com> wrote:
>
>  
>
>>Hi Ira,
>>Are you able to run it with fewer than 1024? If yes, what is the max?
>>Chris will contact you tomorrow morning.
>>Lior 
>>______________________________________________________    
>>Lior Ofer | 978.439.5416 (o)  | 339.221.1451 (m)
>>Manager, US Customer Support center 
>>Voltaire - The Grid Backbone
>>www.voltaire.com
>>lioro at voltaire.com
>>No problem can withstand the assault of sustained thinking.  (Voltaire
>>1778)
>>
>>
>>-----Original Message-----
>>From: Ira Weiny [mailto:weiny2 at llnl.gov] 
>>Sent: Thursday, October 26, 2006 8:24 PM
>>To: support
>>Cc: Trent D'Hooge; Adam Moody; Mike Haskell; Matt Leininger
>>Subject: "MPI_init" MPI jobs hang on startup.
>>
>>Running Mellanox MPI version : 0.9.7_mlx2.2.0_1.0.4
>>OpenIB version               : ofed 1.1 rc7
>>
>>When running with more than about 2 tasks per node and > 1024 tasks
>>total, we are getting a high number of hangs.  This happens primarily
>>with the linpack benchmark but has happened with other user codes.
>>
>>We have been able to get stack traces from many of the processes
>>running on the various nodes.  I have included this file.  One of the
>>similarities we have seen is that many of the tasks are in the
>>"smpi_net_lookup" function.  Furthermore, some of the tasks seem to get
>>past MPI_init but the others do not.  In this situation it seems MPI
>>ranks 0 to X get past MPI_init and X+1 to n-1 do not.
>>
>>As far as our records show, we have _not_ ever been able to run linpack
>>at full scale.
>>
>>We are hoping to be able to run > 4096 tasks soon.  What can we do?
>>
>>Thanks,
>>Ira Weiny
>>weiny2 at llnl.gov
>>
>>    
>>
>
>
>  
>

