[mvapich-discuss] Launch hangs with multiple tasks per node

Kenny, Joseph P jpkenny at sandia.gov
Wed Nov 7 19:50:50 EST 2018


Hi,

I’m testing RoCE on Mellanox 100G Ethernet hardware, and with mvapich2-2.3 the job consistently hangs during launch whenever I run more than one MPI task per node. Everything runs fine with one task per node.

My configuration:

../mvapich2-2.3/configure --prefix=$(HOME)/install/mvapich2-2.3 --with-device=ch3:mrail --with-rdma=gen2

My run command:

HYDRA_DEBUG=1 MV2_USE_RoCE=1 $(HOME)/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl

The last chunk of debug output that I get is:

[proxy:0:24@en273.eth] got pmi command (from 4): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000030 value=0000000300000006
[proxy:0:24@en273.eth] cached command: ARCH-HCA-00000030=0000000300000006
[proxy:0:24@en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24@en273.eth] got pmi command (from 5): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000031 value=0000000300000006
[proxy:0:24@en273.eth] cached command: ARCH-HCA-00000031=0000000300000006
[proxy:0:24@en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24@en273.eth] got pmi command (from 4): barrier_in
[proxy:0:24@en273.eth] got pmi command (from 5): barrier_in
[proxy:0:24@en273.eth] flushing 2 put command(s) out
[proxy:0:24@en273.eth] forwarding command (cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006) upstream
[proxy:0:24@en273.eth] forwarding command (cmd=barrier_in) upstream
[mpiexec@en249.eth] [pgid: 0] got PMI command: cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006
[mpiexec@en249.eth] [pgid: 0] got PMI command: cmd=barrier_in
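
Since the output stops at a PMI barrier, one way to narrow this down might be to check, per proxy, whether every local process that issued a put also reached barrier_in. Below is a minimal sketch of that check, assuming the full HYDRA_DEBUG output has been captured as text. The names barrier_status and stalled are hypothetical helpers written for illustration, not part of Hydra or MVAPICH2, and the number after "from" is treated only as an opaque per-process identifier.

```python
import re
from collections import defaultdict

# Matches proxy debug lines of the form shown above, e.g.
#   [proxy:0:24@en273.eth] got pmi command (from 4): barrier_in
# The list archive renders "@" as " at ", so both spellings are accepted.
PROXY_CMD = re.compile(
    r"\[proxy:(\d+):(\d+)(?:@| at )([^\]]+)\] got pmi command \(from (\d+)\): (\w+)"
)


def barrier_status(log_text):
    """Map (proxy_id, host) to the sources seen issuing put and barrier_in."""
    seen = defaultdict(lambda: {"put": set(), "barrier_in": set()})
    for match in PROXY_CMD.finditer(log_text):
        _pgid, proxy_id, host, source, cmd = match.groups()
        if cmd in ("put", "barrier_in"):
            seen[(int(proxy_id), host)][cmd].add(int(source))
    return dict(seen)


def stalled(status):
    """Proxies where some source issued a put but never reached barrier_in."""
    return sorted(
        key for key, cmds in status.items() if cmds["put"] - cmds["barrier_in"]
    )
```

Running barrier_status over the whole captured log and passing the result to stalled would list any proxies with a process that has not yet entered the barrier. A proxy that never appears in the log at all would not show up here, so comparing the keys against hosts.txt is a separate check.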



Any tips on getting this working?

Thanks in advance,

Joe