[mvapich-discuss] Re: "MPI_init" MPI jobs hang on startup.
Adam Moody
moody20 at llnl.gov
Thu Oct 26 21:48:06 EDT 2006
Hi Lior, Chris, and the MVAPICH team,
In our linpack runs, we are losing messages in the Sendrecv loop of
comm_exch_addr() on lines 192-201 of src/context/comm_rdma_init.c:
    for(i = 0; i < comm->np; i++) {
        /* Don't send to myself */
        if(i == comm->local_rank) continue;
        MPI_Sendrecv((void*)&send_pkt, sizeof(struct Coll_Addr_Exch),
                     MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
                     (void*)&(recv_pkt[i]), sizeof(struct Coll_Addr_Exch),
                     MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
                     MPI_COMM_WORLD, &(statarray[i]));
    }
Runs of 1024 processes or fewer get through MPI_Init(). However, runs of
2048 tasks or more sometimes hang (==> race condition). In the cases
where we get hangs, MPI ranks 0 through X typically get through
MPI_Init() while ranks X+1 through N-1 get stuck. The value of X varies
from one run to the next.
Attaching TotalView to one of these hangs, I could see that ranks 0-12
had made it through but 13-2047 were stuck. In this case, I could also
see that ranks 13-995 were waiting for a message from 996, while
996-2047 were all waiting for a message from rank 12. So it seems that
the messages from rank 12 to ranks 996-2047 never made it through, even
though rank 12 must have sent messages to everyone and received messages
from everyone, since it made it past MPI_Init(). Given that we didn't
see any apparent IB errors (like code=12), I'm wondering whether a
message buffer is being overwritten before it is processed, or whether a
message is being delivered to the wrong buffer.
On a side note, it seems this startup loop would run much faster if
nodes staggered their communication. As written, everyone starts by
sending to rank 0, and then a pipeline forms. Wouldn't it be better for
each rank to start with its neighbor to the right and shift the partner
on each iteration? The current implementation fires a flurry of messages
at a single node, while the staggered version distributes the load
evenly. It might also work around the problem of the lost message, but
we still need to figure out why messages are being lost.
-Adam
Lior Ofer wrote:
>Added Chris to the email list
>He will contact you tomorrow
>We will check if there is a hardcoded barrier in the code at 1024
>
>Lior
>
>______________________________________________________
>Lior Ofer | 978.439.5416 (o) | 339.221.1451 (m)
>Manager, US Customer Support center
>Voltaire - The Grid Backbone
>www.voltaire.com
>lioro at voltaire.com
>No problem can withstand the assault of sustained thinking. (Voltaire
>1778)
>
>
>-----Original Message-----
>From: Ira Weiny [mailto:weiny2 at llnl.gov]
>Sent: Thursday, October 26, 2006 9:17 PM
>To: Lior Ofer
>Cc: tdhooge at llnl.gov; moody20 at llnl.gov; mhaskell at llnl.gov;
>mlleinin at hpcn.ca.sandia.gov
>Subject: Re: "MPI_init" MPI jobs hang on startup.
>
>Yes it seems that < 1024 is OK.
>
>Also, Adam has some more information which he will email out soon.
>
>Ira
>
>On Fri, 27 Oct 2006 03:06:08 +0200
>"Lior Ofer" <lioro at voltaire.com> wrote:
>
>
>
>>Hi Ira
>>Are you able to run it with fewer than 1024 tasks? If yes, what is
>>the max? Chris will contact you tomorrow morning.
>>Lior
>>______________________________________________________
>>Lior Ofer | 978.439.5416 (o) | 339.221.1451 (m)
>>Manager, US Customer Support center
>>Voltaire - The Grid Backbone
>>www.voltaire.com
>>lioro at voltaire.com
>>No problem can withstand the assault of sustained thinking. (Voltaire
>>1778)
>>
>>
>>-----Original Message-----
>>From: Ira Weiny [mailto:weiny2 at llnl.gov]
>>Sent: Thursday, October 26, 2006 8:24 PM
>>To: support
>>Cc: Trent D'Hooge; Adam Moody; Mike Haskell; Matt Leininger
>>Subject: "MPI_init" MPI jobs hang on startup.
>>
>>Running Mellanox MPI version : 0.9.7_mlx2.2.0_1.0.4
>>OpenIB version : ofed 1.1 rc7
>>
>>When running with more than about 2 tasks per node and > 1024 tasks
>>total, we are getting a high number of hangs. This happens primarily
>>with the linpack benchmark but has happened with other user codes.
>>
>>We have been able to get stack traces from many of the processes
>>running on the various nodes. I have included this file. One of the
>>similarities we have seen is that many of the tasks are in the
>>"smpi_net_lookup" function. Furthermore, some of the tasks seem to get
>>past MPI_init but the others do not. In this situation it seems MPI
>>ranks 0-X get past MPI_init, while X+1 - n-1 do not.
>>
>>As far as our records show, we have _not_ ever been able to run
>>linpack at full scale.
>>
>>We are hoping to be able to run > 4096 tasks soon. What can we do?
>>
>>Thanks,
>>Ira Weiny
>>weiny2 at llnl.gov
>>
>>
>>
>
>
>
>