[mvapich-discuss] mvapich2 hang on startup

Chakraborty, Sourav chakraborty.52 at buckeyemail.osu.edu
Wed Sep 26 18:18:17 EDT 2018


Hi Joe,

Can you please set the environment variable MV2_USE_RoCE=1 and see if it fixes the hang?

The user guide has more information on setting up a RoCE environment:

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3-userguide.html#x1-420005.2.7
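
As a rough sketch of how the variable could be set for a test run, assuming the Hydra mpiexec launcher that appears in the backtrace: the host names node1 and node2 and the binary ./a.out are hypothetical placeholders, and the launch is guarded so the snippet does nothing where they are absent.

```shell
# Sketch only: node1, node2 and ./a.out are placeholders, not taken from this thread.

# Set the variable in the launching shell's environment.
export MV2_USE_RoCE=1

# Also pass it explicitly to every rank with Hydra's -genv option.
# The guard skips the launch if mpiexec or the test binary is not present.
if command -v mpiexec >/dev/null 2>&1 && [ -x ./a.out ]; then
    mpiexec -genv MV2_USE_RoCE 1 -n 2 -hosts node1,node2 ./a.out
fi
```

Passing the variable with -genv makes the setting visible on the command line, which helps when comparing a run with and without it.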

Thanks,
Sourav


On Wed, Sep 26, 2018 at 5:13 PM Kenny, Joseph P <jpkenny at sandia.gov> wrote:
Hi,

I’m trying to get mvapich2-2.3 up and running for testing on a Mellanox 100G Ethernet system (I’d like to test RoCE).  I have a ‘--with-device=ch3:nemesis:tcp’ build that is working fine, but my ‘--with-device=ch3:mrail --with-rdma=gen2’ build hangs during startup:


PMI response: cmd=barrier_out

#0  0x00007fd946fbda20 in __poll_nocancel () from /lib64/libc.so.6
#1  0x000000000042bb55 in HYDT_dmxu_poll_wait_for_event (wtime=-1)
    at ../../../../mvapich2-2.3/src/pm/hydra/tools/demux/demux_poll.c:39
#2  0x000000000040b10b in HYD_pmci_wait_for_completion (timeout=-1)
    at ../../../../mvapich2-2.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:195
#3  0x0000000000403f3a in main (argc=<optimized out>, argv=<optimized out>)
    at ../../../../mvapich2-2.3/src/pm/hydra/ui/mpich/mpiexec.c:339


The behavior looks very similar to this previous thread:

http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-June/005634.html



Details on my HCA and OFED are:

02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        CA type: MT4119
        Firmware version: 16.23.1020
        Hardware version: 0
MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7)



I imagine it’s something that I’m misconfiguring.  Any pointers on debugging this?



Thanks,

Joe




