[mvapich-discuss] mpirun hang on simple cpi job with mvapich 2.2 and Ubuntu 16.04 with MOFED 4.0

Rick Warner rick@microway.com
Tue May 23 17:26:22 EDT 2017


Hi all,

I'm seeing some strange behavior with MVAPICH2 2.2 on a small Ubuntu 
16.04 cluster.  Every node has a ConnectX-4 EDR IB HCA, and the compute 
nodes have nine GeForce 1080s each.  The nodes are named master and 
node2 through node5.

I installed MOFED 4.0 on the cluster to begin with; OpenMPI from that 
stack works fine.  CUDA 8 is also installed.

I first installed mvapich2-gdr, but when I tried running an example job 
(the basic cpi test) it hung.  I then did some reading that indicated 
mvapich2-gdr is only for Tesla/Quadro and not for GeForce, so I removed 
mvapich2-gdr and built regular mvapich2 from source instead.  Is that 
true?  Should I be using the gdr build with GeForce cards?

With the copy I built from source, I reproduced the same hang running a 
basic 2-process job on 2 of the compute nodes.  However, I found that if 
I use the master as 1 of the 2 systems, the job works fine (I hadn't 
tried this with the gdr build before removing it; it might have been the 
same there).  It only fails if I use 2 (or more) different compute nodes 
together.  It also works if I send both processes to the same node.
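
For reference, the binary is just the stock cpi example from the mvapich2 
source tree, compiled with the new install's mpicc, roughly like this (the 
source path is from my build tree and may differ on your end):

/usr/local/mpi/gcc/mvapich2-2.2/bin/mpicc -o cpi-mvapich2 \
    /mcms/build/mvapich/source/mvapich2-2.2/examples/cpi.c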


microway@master:~$ mpirun -np 2 --host master,node2 -env MV2_USE_CUDA 0 ./cpi-mvapich2
NVIDIA: no NVIDIA devices found
Process 0 of 2 on master
Process 1 of 2 on node2
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 1.092004
*******WORKED*******

microway@master:~$ mpirun -np 2 --host master,node3 -env MV2_USE_CUDA 0 ./cpi-mvapich2
NVIDIA: no NVIDIA devices found
Process 0 of 2 on master
Process 1 of 2 on node3
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.820147
*******WORKED*******

microway@master:~$ mpirun -np 2 --host node2,node2 -env MV2_USE_CUDA 0 ./cpi-mvapich2
Process 0 of 2 on node2
Process 1 of 2 on node2
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.005124
*******WORKED*******

microway@master:~$ mpirun -np 2 --host node2,node3 -env MV2_USE_CUDA 0 ./cpi-mvapich2
*******HANGS HERE - NEVER RETURNS UNTIL CTRL-C*******


I'm setting the MV2_USE_CUDA environment variable to 0 because the master 
does not have any CUDA devices.

However, mpirun_rsh works:
microway@master:~$ mpirun_rsh -np 2 node2 node3 MV2_USE_CUDA=0 ./cpi-mvapich2
Process 0 of 2 on node2
Process 1 of 2 on node3
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.128403



This isn't making sense to me.  The debugging I've done so far with 
strace and gdb shows that rank 0 is waiting around line 1630 of 
src/mpid/ch3/channels/mrail/src/rdma/ch3_smp_progress.c, in the function 
MPIDI_CH3I_CM_SHMEM_Sync (a rough sketch of how I poke at the hung rank 
is after the output below).  Here is a backtrace I generated by sending 
SIGSEGV to the process:
microway@master:~$ mpirun -np 2 --host node2,node3 -env MV2_USE_CUDA 0 ./cpi-mvapich2
[node2:9777 :0] Caught signal 11 (Segmentation fault)
==== backtrace ====
     0  /opt/mellanox/mxm/lib/libmxm.so.2(+0x3c69c) [0x7fab0802f69c]
     1  /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fab0ad944b0]
     2  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPIDI_CH3I_CM_SHMEM_Sync+0x86) [0x7fab0b5c6e7b]
     3  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPIDI_CH3I_CM_Create_region+0x280) [0x7fab0b5c73ff]
     4  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPIDI_CH3I_MRAIL_CM_Alloc+0x2c) [0x7fab0b5e3883]
     5  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPIDI_CH3_Init+0x638) [0x7fab0b5b2c3d]
     6  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPID_Init+0x323) [0x7fab0b59abf0]
     7  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPIR_Init_thread+0x411) [0x7fab0b48fb01]
     8  /usr/local/mpi/gcc/mvapich2-2.2/lib64/libmpi.so.12(MPI_Init+0x19a) [0x7fab0b48ea49]
     9  ./cpi-mvapich2() [0x400aed]
    10  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fab0ad7f830]
    11  ./cpi-mvapich2() [0x400989]
===================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 9777 RUNNING AT node2
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@node3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:1@node3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@node3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
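
For completeness, this is roughly how I get at the hung rank once the job 
is stuck (the PID is a placeholder for whatever pgrep reports on node2):

# on node2: find the stuck rank and pull a backtrace with gdb
pgrep -f cpi-mvapich2
gdb -p <PID> -batch -ex "bt"
# or trip the MXM signal handler so it prints its own backtrace (as above)
kill -SEGV <PID>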

Here is mpiexec -info:
microway@master:~$ mpiexec -info
HYDRA build details:
     Version:                                 3.1.4
     Release Date:                            Wed Sep  7 14:33:43 EDT 2016
     CC:                              gcc
     CXX:                             g++
     F77:                             gfortran
     F90:                             gfortran
     Configure options: '--disable-option-checking' 
'--prefix=/usr/local/mpi/gcc/mvapich2-2.2' '--localstatedir=/var' 
'--disable-static' '--enable-shared' '--with-mxm=/opt/mellanox/mxm' 
'--with-hcoll=/opt/mellanox/hcoll' '--with-knem=/opt/knem-1.1.2.90mlnx1' 
'--without-slurm' '--disable-mcast' '--without-cma' 
'--without-hydra-ckpointlib' '--enable-g=dbg' '--enable-cuda' 
'--with-cuda=/usr/local/cuda' '--enable-fast=ndebug' 
'--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -DNDEBUG 
-DNVALGRIND -g' 'LDFLAGS=-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib 
-L/lib -L/lib -L/opt/mellanox/hcoll/lib64 -L/opt/mellanox/hcoll/lib 
-L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/lib -L/lib' 
'LIBS=-lcudart -lcuda -lrdmacm -libumad -libverbs -ldl -lrt -lm 
-lpthread ' 'CPPFLAGS=-I/usr/local/cuda/include 
-I/opt/mellanox/hcoll/include 
-I/mcms/build/mvapich/source/mvapich2-2.2/src/mpl/include 
-I/mcms/build/mvapich/source/mvapich2-2.2/src/mpl/include 
-I/mcms/build/mvapich/source/mvapich2-2.2/src/openpa/src 
-I/mcms/build/mvapich/source/mvapich2-2.2/src/openpa/src -D_REENTRANT 
-I/mcms/build/mvapich/source/mvapich2-2.2/src/mpi/romio/include 
-I/include -I/include -I/include -I/include'
     Process Manager:                         pmi
     Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
     Topology libraries available:            hwloc
     Resource management kernels available:   user slurm ll lsf sge pbs cobalt
     Checkpointing libraries available:
     Demux engines available:                 poll select
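
For readability, that configure invocation boils down to roughly the 
following (reconstructed from the options above; the CC/CFLAGS/LDFLAGS/
LIBS/CPPFLAGS entries were filled in by configure itself):

./configure --prefix=/usr/local/mpi/gcc/mvapich2-2.2 --localstatedir=/var \
    --disable-static --enable-shared \
    --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
    --with-knem=/opt/knem-1.1.2.90mlnx1 \
    --without-slurm --disable-mcast --without-cma --without-hydra-ckpointlib \
    --enable-g=dbg --enable-fast=ndebug --enable-cuda --with-cuda=/usr/local/cuda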



If there's any other info I can provide, please let me know.

Thanks,
Rick




