[mvapich-discuss] [EXTERNAL] RE: Launch hangs with multiple tasks per node

Kenny, Joseph P jpkenny at sandia.gov
Thu Nov 29 21:19:06 EST 2018


Well, I had mvapich2 running over RoCE for a while, but now, as I try to tune performance, I’m getting a new error and job failure:

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in SMPI_LOAD_HWLOC_TOPOLOGY:2116

Oddly, it seems that this started occurring when I made some changes to the mlx5 driver config, as suggested to me by Mellanox (https://community.mellanox.com/docs/DOC-2881).

ib_send_bw still works fine.  osu_bw displays these errors but still runs fine.  My HPL benchmark, however, hangs.

I understand that this is likely something going wrong with hwloc (I get plenty of warnings from hwloc), but before I dig into those issues I was wondering whether anybody has experienced this and knows of a workaround.
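
(As a first step on my end I’ll probably check whether plain hwloc can even load the topology on these nodes, independent of MVAPICH2, beyond just eyeballing the lstopo output. Something like the little standalone program below should be enough; this is only a sketch, and it assumes the hwloc development headers and library are installed on the compute node.)

/* hwloc_check.c: standalone sanity check that hwloc can load this node's topology.
 * Sketch only; assumes the hwloc headers and library are installed.
 * Build:  cc hwloc_check.c -o hwloc_check -lhwloc
 * Run it on the affected compute node, outside of any MPI job. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init(&topo) != 0) {
        fprintf(stderr, "hwloc_topology_init failed\n");
        return 1;
    }
    if (hwloc_topology_load(topo) != 0) {
        fprintf(stderr, "hwloc_topology_load failed\n");
        hwloc_topology_destroy(topo);
        return 1;
    }

    /* Count a couple of object types to confirm the topology looks sane. */
    printf("cores: %d, PUs: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE),
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    hwloc_topology_destroy(topo);
    return 0;
}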

Thanks,
Joe

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of "Kenny, Joseph P" <jpkenny at sandia.gov>
Date: Thursday, November 8, 2018 at 9:20 AM
To: "Subramoni, Hari" <subramoni.1 at osu.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] [EXTERNAL] RE: Launch hangs with multiple tasks per node

Hi Hari,

MV2_USE_RDMA_CM=0 gets things running.  Can you educate me on what’s going on here?

I had actually tried MV2_USE_RDMA_CM=1, but not MV2_USE_RDMA_CM=0… 😊

Joe

From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Wednesday, November 7, 2018 at 7:06 PM
To: "Kenny, Joseph P" <jpkenny at sandia.gov>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: [EXTERNAL] RE: Launch hangs with multiple tasks per node

Hi, Kenny.

Can you try setting MV2_USE_RDMA_CM=0 and MV2_USE_RoCE=1 and see if things pass?

HYDRA_DEBUG=1 MV2_USE_RoCE=1 MV2_USE_RDMA_CM=0 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Kenny, Joseph P
Sent: Wednesday, November 7, 2018 7:51 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Launch hangs with multiple tasks per node


Hi,

I’m testing RoCE on Mellanox 100G Ethernet hardware and have been encountering consistent hangs with mvapich2-2.3 during job launch when running more than one MPI task per node.  Everything runs fine when I use one task per node.
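
(To separate launch/initialization problems from HPL itself, a minimal reproducer along the lines of the sketch below should be enough. It uses nothing beyond standard MPI calls, so any trivial init/barrier program would do the same job.)

/* mpi_hello.c: minimal reproducer for the multi-task-per-node launch hang.
 * Sketch only; standard MPI calls, nothing MVAPICH2-specific.
 * Build:  mpicc mpi_hello.c -o mpi_hello
 * Run:    MV2_USE_RoCE=1 mpirun -f hosts.txt -np 64 -ppn 2 ./mpi_hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    /* If the hang is in startup, execution never reaches this point. */
    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}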

My configuration:

../mvapich2-2.3/configure --prefix=$HOME/install/mvapich2-2.3 --with-device=ch3:mrail --with-rdma=gen2

My run command:

HYDRA_DEBUG=1 MV2_USE_RoCE=1 $HOME/install/mvapich2-2.3-carnac/bin/mpirun -f hosts.txt -np 64 -ppn 2 ./xhpl

The last chunk of debug output that I get is:

[proxy:0:24 at en273.eth] got pmi command (from 4): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000030 value=0000000300000006
[proxy:0:24 at en273.eth] cached command: ARCH-HCA-00000030=0000000300000006
[proxy:0:24 at en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24 at en273.eth] got pmi command (from 5): put
kvsname=kvs_14402_0 key=ARCH-HCA-00000031 value=0000000300000006
[proxy:0:24 at en273.eth] cached command: ARCH-HCA-00000031=0000000300000006
[proxy:0:24 at en273.eth] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:24 at en273.eth] got pmi command (from 4): barrier_in
[proxy:0:24 at en273.eth] got pmi command (from 5): barrier_in
[proxy:0:24 at en273.eth] flushing 2 put command(s) out
[proxy:0:24 at en273.eth] forwarding command (cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006) upstream
[proxy:0:24 at en273.eth] forwarding command (cmd=barrier_in) upstream
[mpiexec at en249.eth] [pgid: 0] got PMI command: cmd=put ARCH-HCA-00000030=0000000300000006 ARCH-HCA-00000031=0000000300000006
[mpiexec at en249.eth] [pgid: 0] got PMI command: cmd=barrier_in

Any tips on getting this working?

Thanks in advance,

Joe