[mvapich-discuss] Hang in Bcast
Adam Moody
moody20 at llnl.gov
Wed Oct 4 20:07:54 EDT 2006
Hi all,
A user is hitting a hang in MPI_Bcast(). We are running a
Mellanox-modified MVAPICH 0.9.7, which may be a factor: the user reports
that the same code ran fine against the unmodified 0.9.7. When I get the
chance, I'll try reproducing against an unmodified 0.9.7 or 0.9.8.
The user is running a 4-task job in which rank 0 broadcasts a single
integer near the end of execution. Ranks {0,1} run on one node and ranks
{2,3} on another. Ranks {2,3} get stuck in the MPI_Bcast, while {0,1}
pass through the broadcast just fine and then hang in the barrier inside
MPI_Finalize. Tracing under TotalView, I can see that rank 2 is stuck in
an RDMA polling loop waiting for the broadcast message from rank 0. I
checked the RDMA buffer addresses and they appeared to be valid; the
internode message simply never arrives.
When the user runs under batch, he has seen the following error:
srun: mvapich: Received ABORT message from MPI Rank 0
[0] Abort: [vertex3:0] Got completion with error, code=1, dest rank=2
at line 382 in file viacheck.c
srun: error: vertex4: task3: Killed
What does this mean? I've not yet seen this error in my interactive
runs, but it may only appear after a timeout longer than I've waited.
Do you have any ideas? Is there an environment variable to disable the
RDMA fast path as a test?
Thanks,
-Adam