[mvapich-discuss] [EXTERNAL] RE: Launch hangs with multiple tasks per node
Subramoni, Hari
subramoni.1 at osu.edu
Fri Nov 30 07:09:41 EST 2018
Hi, Kenny.
Can you try setting MV2_BCAST_HWLOC_TOPOLOGY=0 and rerunning it?
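For example, with the same xhpl launch as before, something along these lines:
MV2_BCAST_HWLOC_TOPOLOGY=0 HYDRA_DEBUG=1 MV2_USE_RoCE=1 MV2_USE_RDMA_CM=0 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl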
Thx,
Hari.
From: Kenny, Joseph P <jpkenny at sandia.gov>
Sent: Thursday, November 29, 2018 9:19 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] [EXTERNAL] RE: Launch hangs with multiple tasks per node
Well, I had mvapich2 RoCE running for a while, but I’m getting a new error and job failure now as I try to tune performance:
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in SMPI_LOAD_HWLOC_TOPOLOGY:2116
Oddly, it seems that this started occurring when I made some changes to the mlx5 driver config, as suggested to me by Mellanox (https://community.mellanox.com/docs/DOC-2881).
ib_send_bw still works fine. osu_bw prints these errors but still completes. My HPL benchmark, however, hangs.
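For reference, a minimal osu_bw run along these lines shows the errors (the benchmark path and one-rank-per-host placement here are illustrative, not my exact command):
MV2_USE_RoCE=1 MV2_USE_RDMA_CM=0 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 2 -ppn 1 ./osu_bw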
I understand that this is likely something going wrong with hwloc – I get plenty of warnings from hwloc – but before I dig into those issues I was wondering if anybody had experienced this and knows of a workaround.
Thanks,
Joe
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of "Kenny, Joseph P" <jpkenny at sandia.gov>
Date: Thursday, November 8, 2018 at 9:20 AM
To: "Subramoni, Hari" <subramoni.1 at osu.edu<mailto:subramoni.1 at osu.edu>>, "mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>" <mvapich-discuss at mailman.cse.ohio-state.edu<mailto:mvapich-discuss at mailman.cse.ohio-state.edu>>
Subject: Re: [mvapich-discuss] [EXTERNAL] RE: Launch hangs with multiple tasks per node
Hi Hari,
MV2_USE_RDMA_CM=0 gets things running. Can you educate me on what’s going on here?
I had actually tried MV2_USE_RDMA_CM=1, but not MV2_USE_RDMA_CM=0… 😊
Joe
From: "Subramoni, Hari" <subramoni.1 at osu.edu<mailto:subramoni.1 at osu.edu>>
Date: Wednesday, November 7, 2018 at 7:06 PM
To: "Kenny, Joseph P" <jpkenny at sandia.gov<mailto:jpkenny at sandia.gov>>, "mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>" <mvapich-discuss at mailman.cse.ohio-state.edu<mailto:mvapich-discuss at mailman.cse.ohio-state.edu>>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu<mailto:subramoni.1 at osu.edu>>
Subject: [EXTERNAL] RE: Launch hangs with multiple tasks per node
Hi, Kenny.
Can you try setting MV2_USE_RDMA_CM=0 and MV2_USE_RoCE=1 and see if things pass?
HYDRA_DEBUG=1 MV2_USE_RoCE=1 MV2_USE_RDMA_CM=0 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl
Thx,
Hari.
From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Kenny, Joseph P
Sent: Wednesday, November 7, 2018 7:51 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Launch hangs with multiple tasks per node
Hi,
I’m testing RoCE on Mellanox 100G Ethernet hardware and have been encountering consistent hangs with mvapich2-2.3 during job launch when running more than one MPI task per node. Everything runs fine when I use one task per node.
My configuration:
../mvapich2-2.3/configure --prefix=$HOME/install/mvapich2-2.3 --with-device=ch3:mrail --with-rdma=gen2
My run command:
HYDRA_DEBUG=1 MV2_USE_RoCE=1 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl
The last chunk of debug output that I get is:
[proxy:0:24 at en273.eth] got pmi command (from 4): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000030 value=0000000300000006
[proxy:0:24 at en273.eth] cached command: ARCH-HCA-00000030=0000000300000006
[proxy:0:24 at en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24 at en273.eth] got pmi command (from 5): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000031 value=0000000300000006
[proxy:0:24 at en273.eth] cached command: ARCH-HCA-00000031=0000000300000006
[proxy:0:24 at en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24 at en273.eth] got pmi command (from 4): barrier_in
[proxy:0:24 at en273.eth] got pmi command (from 5): barrier_in
[proxy:0:24 at en273.eth] flushing 2 put command(s) out
[proxy:0:24 at en273.eth] forwarding command (cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006) upstream
[proxy:0:24 at en273.eth] forwarding command (cmd=barrier_in) upstream
[mpiexec at en249.eth] [pgid: 0] got PMI command: cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006
[mpiexec at en249.eth] [pgid: 0] got PMI command: cmd=barrier_in
Any tips on getting this working?
Thanks in advance,
Joe