[mvapich-discuss] Hang in Bcast

Adam Moody moody20 at llnl.gov
Wed Oct 4 20:07:54 EDT 2006


Hi all,
A user is hitting a hang in MPI_Bcast().  We are running a 
Mellanox-modified MVAPICH 0.9.7.  That modification may have something 
to do with it, as the user claims he was able to run with the 
unmodified 0.9.7.  When I get the chance, I'll try to reproduce against 
an unmodified 0.9.7 or 0.9.8.

The job runs 4 tasks, and rank 0 broadcasts a single integer near the 
end of execution.  Ranks {0,1} run on one node, while ranks {2,3} run 
on another.  Ranks {2,3} get stuck in the MPI_Bcast, while {0,1} pass 
through the broadcast just fine and then hang on the barrier inside 
MPI_Finalize.  Tracing under TotalView, I can see that rank 2 is stuck 
in an RDMA polling loop waiting for the broadcast message from rank 0.  
I checked the RDMA buffer addresses, and they looked OK.  The internode 
message never seems to arrive.
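
For reference, here is a stripped-down reconstruction of the pattern 
that hangs.  This is my own sketch, not the user's actual code, and the 
variable names and payload value are made up:

    /* 4 tasks; rank 0 broadcasts one integer near the end of the run. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            value = 42;  /* placeholder; the real job computes this */

        /* Ranks 2 and 3 never return from this call; ranks 0 and 1
         * pass through and then block in MPI_Finalize's barrier. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d got %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }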

When the user runs under batch, he sees the following error:

srun: mvapich: Received ABORT message from MPI Rank 0
[0] Abort: [vertex3:0] Got completion with error, code=1, dest rank=2
 at line 382 in file viacheck.c
srun: error: vertex4: task3: Killed

What does this mean?  I've not yet seen it in my interactive runs, but 
it may only show up after some timeout that I haven't waited long 
enough to hit.

Do you have any ideas?  Are there any environment variables to disable 
the RDMA fast path as a test?
Thanks,
-Adam
