[mvapich-discuss] Launch hangs with multiple tasks per node
Kenny, Joseph P
jpkenny at sandia.gov
Wed Nov 7 19:50:50 EST 2018
Hi,
I’m testing RoCE on Mellanox 100G Ethernet hardware and am seeing consistent hangs during job launch with mvapich2-2.3 whenever I run more than one MPI task per node. Everything runs fine with one task per node.
My configuration:
../mvapich2-2.3/configure --prefix=$HOME/install/mvapich2-2.3 --with-device=ch3:mrail --with-rdma=gen2
My run command:
HYDRA_DEBUG=1 MV2_USE_RoCE=1 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl
The last chunk of debug output that I get is:
[proxy:0:24@en273.eth] got pmi command (from 4): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000030 value=0000000300000006
[proxy:0:24@en273.eth] cached command: ARCH-HCA-00000030=0000000300000006
[proxy:0:24@en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24@en273.eth] got pmi command (from 5): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000031 value=0000000300000006
[proxy:0:24@en273.eth] cached command: ARCH-HCA-00000031=0000000300000006
[proxy:0:24@en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24@en273.eth] got pmi command (from 4): barrier_in
[proxy:0:24@en273.eth] got pmi command (from 5): barrier_in
[proxy:0:24@en273.eth] flushing 2 put command(s) out
[proxy:0:24@en273.eth] forwarding command (cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006) upstream
[proxy:0:24@en273.eth] forwarding command (cmd=barrier_in) upstream
[mpiexec@en249.eth] [pgid: 0] got PMI command: cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006
[mpiexec@en249.eth] [pgid: 0] got PMI command: cmd=barrier_in
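One way to narrow down where the launch stalls might be to tally, per proxy, how many barrier_in commands were received locally and how many were forwarded upstream across the full debug output. The Python below is only a minimal sketch of that idea and is not part of MVAPICH2 or Hydra. The function names summarize_barriers and lagging_proxies are hypothetical, and the code assumes that every proxy line has the same shape as the ones in the excerpt above. The number after "from" is treated as an opaque client identifier.

```python
import re
from collections import defaultdict

# Matches proxy prefixes such as "[proxy:0:24@en273.eth]". The archived
# spelling "[proxy:0:24 at en273.eth]" is accepted as well.
PROXY_RE = re.compile(r"\[proxy:(\d+):(\d+)(?:@| at )([^\]]+)\]\s*(.*)")
LOCAL_BARRIER_RE = re.compile(r"got pmi command \(from (\d+)\): barrier_in")


def summarize_barriers(lines):
    """Tally barrier_in activity per proxy id from debug output lines."""
    proxies = defaultdict(
        lambda: {"host": None, "clients": set(), "forwards": 0}
    )
    for line in lines:
        match = PROXY_RE.search(line)
        if not match:
            continue  # mpiexec lines and continuation lines are skipped
        proxy_id = int(match.group(2))
        entry = proxies[proxy_id]
        entry["host"] = match.group(3)
        rest = match.group(4)
        local = LOCAL_BARRIER_RE.match(rest)
        if local:
            # A local client on this node entered the barrier.
            entry["clients"].add(int(local.group(1)))
        elif rest.startswith("forwarding command (cmd=barrier_in)"):
            # The proxy passed the barrier on to mpiexec.
            entry["forwards"] += 1
    return dict(proxies)


def lagging_proxies(summary):
    """Return ids of proxies that forwarded fewer barriers than the maximum."""
    most = max((e["forwards"] for e in summary.values()), default=0)
    return sorted(p for p, e in summary.items() if e["forwards"] < most)


# Example use, assuming the output was saved to a file named hydra.log:
#     with open("hydra.log") as fh:
#         summary = summarize_barriers(fh)
#     print(lagging_proxies(summary))
```

A proxy that appears in the lagging list, or that never appears in the log at all, would be a candidate node to inspect first. This only reflects what the process manager logged and says nothing about the state of the RoCE connections themselves.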
Any tips on getting this working?
Thanks in advance,
Joe