[mvapich-discuss] mvapich2 hang on startup

Kenny, Joseph P jpkenny at sandia.gov
Wed Sep 26 17:12:59 EDT 2018


Hi,

I’m trying to get mvapich2-2.3 up and running for testing on a Mellanox 100G Ethernet system (I’d like to test RoCE). I have a '--with-device=ch3:nemesis:tcp' build that is working fine, but my '--with-device=ch3:mrail --with-rdma=gen2' build hangs during startup:


PMI response: cmd=barrier_out


#0  0x00007fd946fbda20 in __poll_nocancel () from /lib64/libc.so.6
#1  0x000000000042bb55 in HYDT_dmxu_poll_wait_for_event (wtime=-1)
    at ../../../../mvapich2-2.3/src/pm/hydra/tools/demux/demux_poll.c:39
#2  0x000000000040b10b in HYD_pmci_wait_for_completion (timeout=-1)
    at ../../../../mvapich2-2.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:195
#3  0x0000000000403f3a in main (argc=<optimized out>, argv=<optimized out>)
    at ../../../../mvapich2-2.3/src/pm/hydra/ui/mpich/mpiexec.c:339



The behavior looks very similar to this previous thread:

http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-June/005634.html



Details on my HCA and OFED are:

02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        CA type: MT4119
        Firmware version: 16.23.1020
        Hardware version: 0
MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7)



I imagine it’s something that I’m misconfiguring.  Any pointers on debugging this?



Thanks,

Joe



