[mvapich-discuss] Re: "MPI_init" MPI jobs hang on startup.

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Oct 27 09:53:43 EDT 2006


Adam, 

Thanks for letting us know that you are experiencing problems with runs
of 2K tasks or more on your new cluster and do not see this problem at
up to 1K tasks. We have started taking a look at this problem and will
get back to you soon.

Thanks, 

DK

> Hi Lior, Chris, and the MVAPICH team,
> In our linpack runs, we are losing messages in the Sendrecv loop of 
> comm_exch_addr() on lines 192-201 of src/context/comm_rdma_init.c:
> 
>     for(i = 0; i < comm->np; i++) {
>         /* Don't send to myself */
>         if(i == comm->local_rank) continue;
> 
>         MPI_Sendrecv((void*)&send_pkt, sizeof(struct Coll_Addr_Exch),
>                 MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
>                 (void*)&(recv_pkt[i]),sizeof(struct Coll_Addr_Exch),
>                 MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
>                 MPI_COMM_WORLD, &(statarray[i]));
>     }
> 
> Runs of 1024 processes or fewer get through MPI_Init().  However, runs of 
> 2048 tasks or more sometimes hang (==> race condition).  In the cases 
> where we get hangs, typically MPI ranks 0 thru X get through MPI_Init, 
> but ranks X+1 thru N-1 get stuck.  The value of X varies from one run to 
> the next.
> 
> Attaching with TotalView during one of these hangs, I could see that ranks 
> 0-12 made it through but 13-2047 were stuck.  In this case, I could also 
> see that ranks 13-995 were stuck waiting for a message from 996, while 
> 996-2047 were all waiting for a message from rank 12.  So it seems that 
> the messages from rank 12 to 996-2047 never made it through, even though 
> rank 12 apparently sent messages to everyone and received messages from 
> everyone, since it made it past MPI_Init().  Given that we didn't get 
> any apparent IB errors (like code=12), I'm wondering whether a message 
> buffer may have been overwritten before it was processed.  Or maybe a 
> message was delivered to the wrong buffer?
> 
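For reference (not from the original thread), here is a minimal standalone
sketch that exercises the same all-pairs MPI_Sendrecv pattern at the
application level; the payload and tag are arbitrary stand-ins rather than
the actual Coll_Addr_Exch packet.  If this also hangs at 2K+ tasks, the
problem likely lies in the exchange pattern or transport rather than in
linpack itself.

    /* all-pairs Sendrecv test -- sketch only, arbitrary payload and tag */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define EXCH_TAG 4242   /* arbitrary tag for this test */

    int main(int argc, char **argv)
    {
        int rank, np, i;
        long send_val, *recv_vals;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        send_val  = (long)rank;             /* stand-in for the address packet */
        recv_vals = malloc(np * sizeof(long));

        /* Same structure as the loop quoted above: every rank exchanges
         * with every other rank in increasing-rank order, skipping itself. */
        for (i = 0; i < np; i++) {
            if (i == rank) continue;
            MPI_Sendrecv(&send_val, 1, MPI_LONG, i, EXCH_TAG,
                         &recv_vals[i], 1, MPI_LONG, i, EXCH_TAG,
                         MPI_COMM_WORLD, &status);
        }

        /* Sanity check: each slot should hold the sender's rank. */
        for (i = 0; i < np; i++) {
            if (i != rank && recv_vals[i] != (long)i)
                fprintf(stderr, "rank %d: bad data from rank %d\n", rank, i);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("all %d ranks completed the exchange\n", np);

        free(recv_vals);
        MPI_Finalize();
        return 0;
    }
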
> On a side note, it would seem that this startup loop would run much 
> faster if nodes started communicating in a staggered manner.  It appears 
> that everyone starts by sending to rank 0, and then a pipeline forms.  
> Wouldn't it be better to have everyone start with their neighbor to the 
> right and progress this way?  I think the current implementation will 
> fire a flurry of messages at a single node, while the staggered approach 
> distributes the load better.  It may also work around the problem of the 
> lost message, but we still need to figure out why messages are being lost.
> -Adam
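
A sketch of the staggered ordering suggested above -- an illustration only,
not code from the MVAPICH tree, reusing the names from the quoted
comm_exch_addr() loop.  At step 'shift' every rank sends to the neighbor
'shift' places to its right and receives from the neighbor 'shift' places
to its left, so each send is matched by a receive posted on the partner in
the same iteration and no single rank is flooded at startup:

    int shift, send_to, recv_from;

    for (shift = 1; shift < comm->np; shift++) {
        /* shifted partners: send right, receive from the left */
        send_to   = (comm->local_rank + shift) % comm->np;
        recv_from = (comm->local_rank - shift + comm->np) % comm->np;

        MPI_Sendrecv((void*)&send_pkt, sizeof(struct Coll_Addr_Exch),
                MPI_BYTE, comm->lrank_to_grank[send_to], ADDR_EXCHANGE_TAG,
                (void*)&(recv_pkt[recv_from]), sizeof(struct Coll_Addr_Exch),
                MPI_BYTE, comm->lrank_to_grank[recv_from], ADDR_EXCHANGE_TAG,
                MPI_COMM_WORLD, &(statarray[recv_from]));
    }
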
> 
> 
> Lior Ofer wrote:
> 
> >Added Chris to the email list. 
> >He will contact you tomorrow. 
> >We will check if there is a hard-coded barrier in the code for 1024. 
> >
> >Lior
> >
> >______________________________________________________    
> >Lior Ofer | 978.439.5416 (o)  | 339.221.1451 (m)
> >Manager, US Customer Support center 
> >Voltaire - The Grid Backbone
> >www.voltaire.com
> >lioro at voltaire.com
> >No problem can withstand the assault of sustained thinking.  (Voltaire
> >1778)
> >
> >
> >-----Original Message-----
> >From: Ira Weiny [mailto:weiny2 at llnl.gov] 
> >Sent: Thursday, October 26, 2006 9:17 PM
> >To: Lior Ofer
> >Cc: tdhooge at llnl.gov; moody20 at llnl.gov; mhaskell at llnl.gov;
> >mlleinin at hpcn.ca.sandia.gov
> >Subject: Re: "MPI_init" MPI jobs hang on startup.
> >
> >Yes, it seems that < 1024 is OK.
> >
> >Also, Adam has some more information which he will email out soon.
> >
> >Ira
> >
> >On Fri, 27 Oct 2006 03:06:08 +0200
> >"Lior Ofer" <lioro at voltaire.com> wrote:
> >
> >  
> >
> >>Hi Ira, 
> >>Are you able to run it with fewer than 1024 tasks? If yes, what is the max? 
> >>Chris will contact you tomorrow morning. 
> >>Lior 
> >>______________________________________________________    
> >>Lior Ofer | 978.439.5416 (o)  | 339.221.1451 (m)
> >>Manager, US Customer Support center 
> >>Voltaire - The Grid Backbone
> >>www.voltaire.com
> >>lioro at voltaire.com
> >>No problem can withstand the assault of sustained thinking.  (Voltaire
> >>1778)
> >>
> >>
> >>-----Original Message-----
> >>From: Ira Weiny [mailto:weiny2 at llnl.gov] 
> >>Sent: Thursday, October 26, 2006 8:24 PM
> >>To: support
> >>Cc: Trent D'Hooge; Adam Moody; Mike Haskell; Matt Leininger
> >>Subject: "MPI_init" MPI jobs hang on startup.
> >>
> >>Running Mellanox MPI version : 0.9.7_mlx2.2.0_1.0.4
> >>OpenIB version               : ofed 1.1 rc7
> >>
> >>When running with more than about 2 tasks per node and > 1024 tasks
> >>total we are getting a high number of hangs.  This happens primarily
> >>with the linpack benchmark but has happened with other user codes.
> >>
> >>We have been able to get stack traces from many of the processes
> >>running on the various nodes.  I have included this file.  One of the
> >>similarities we have seen is that many of the tasks are in the
> >>"smpi_net_lookup" function.  Furthermore, some of the tasks seem to get
> >>past MPI_init but the others do not.  In this situation it seems MPI
> >>ranks 0 through X get past MPI_init, while ranks X+1 through n-1 do not.
> >>
> >>As far as our records show, we have _not_ ever been able to run
> >>linpack at full scale.
> >>
> >>We are hoping to be able to run > 4096 tasks soon.  What can we do?
> >>
> >>Thanks,
> >>Ira Weiny
> >>weiny2 at llnl.gov
> >>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 


