[mvapich-discuss] [EXTERNAL] RE: Launch hangs with multiple tasks per node
Kenny, Joseph P
jpkenny at sandia.gov
Thu Nov 8 12:19:31 EST 2018
Hi Hari,
MV2_USE_RDMA_CM=0 gets things running. Can you educate me on what’s going on here?
I had actually tried MV2_USE_RDMA_CM=1, but not MV2_USE_RDMA_CM=0… 😊
Joe
From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Wednesday, November 7, 2018 at 7:06 PM
To: "Kenny, Joseph P" <jpkenny at sandia.gov>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: [EXTERNAL] RE: Launch hangs with multiple tasks per node
Hi, Kenny.
Can you try setting MV2_USE_RDMA_CM=0 and MV2_USE_RoCE=1 and see if things pass?
HYDRA_DEBUG=1 MV2_USE_RoCE=1 MV2_USE_RDMA_CM=0 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl
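For repeated runs, the same workaround can be applied once by exporting the variables instead of prefixing every command. A minimal sketch, using only the settings and path quoted in this thread:

```shell
# Workaround discussed in this thread: enable RoCE but disable the RDMA
# connection manager for connection setup.
export MV2_USE_RoCE=1
export MV2_USE_RDMA_CM=0
export HYDRA_DEBUG=1   # optional: keep the launcher's debug output on

# Then launch as before (path as given earlier in the thread):
# $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl
```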
Thx,
Hari.
From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Kenny, Joseph P
Sent: Wednesday, November 7, 2018 7:51 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Launch hangs with multiple tasks per node
Hi,
I’m testing RoCE on Mellanox 100G Ethernet hardware and have been encountering consistent hangs with mvapich2-2.3 during job launch when running more than one MPI task per node. Everything runs fine when I use one task per node.
My configuration:
../mvapich2-2.3/configure --prefix=$HOME/install/mvapich2-2.3 --with-device=ch3:mrail --with-rdma=gen2
My run command:
HYDRA_DEBUG=1 MV2_USE_RoCE=1 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl
The last chunk of debug output that I get is:
[proxy:0:24 at en273.eth] got pmi command (from 4): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000030 value=0000000300000006
[proxy:0:24 at en273.eth] cached command: ARCH-HCA-00000030=0000000300000006
[proxy:0:24 at en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24 at en273.eth] got pmi command (from 5): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000031 value=0000000300000006
[proxy:0:24 at en273.eth] cached command: ARCH-HCA-00000031=0000000300000006
[proxy:0:24 at en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24 at en273.eth] got pmi command (from 4): barrier_in
[proxy:0:24 at en273.eth] got pmi command (from 5): barrier_in
[proxy:0:24 at en273.eth] flushing 2 put command(s) out
[proxy:0:24 at en273.eth] forwarding command (cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006) upstream
[proxy:0:24 at en273.eth] forwarding command (cmd=barrier_in) upstream
[mpiexec at en249.eth] [pgid: 0] got PMI command: cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006
[mpiexec at en249.eth] [pgid: 0] got PMI command: cmd=barrier_in
Any tips on getting this working?
Thanks in advance,
Joe