[mvapich-discuss] applications hanging - InfiniBand EDR - Mellanox OFED-4.2-1.2.0

Christiane Pousa cpousari at id.ethz.ch
Sun Feb 18 14:16:48 EST 2018


Hi,

we have observed applications hanging when run with MVAPICH2. We tried
several MVAPICH2 versions, but none worked on our cluster. Our nodes
have Mellanox EDR HCAs (ConnectX-4), CentOS 7.4, and Mellanox
OFED-4.2-1.2.0.

Have you seen this too? ssh to and from the nodes is working; we
checked it roughly as sketched below. Information about the test
installation follows.
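#ssh connectivity check (a sketch of what we ran; any passwordless
#round trip between the two nodes is equivalent)
ssh eu-a6-011-01 hostname
ssh eu-a6-011-02 hostname
ssh eu-a6-011-01 ssh eu-a6-011-02 hostname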

#build mvapich from mvapich2-2.3b.tar.gz
tar xzvf mvapich2-2.3b.tar.gz
cd mvapich2-2.3b
./configure --disable-option-checking '--prefix=/scratch/mvapich' \
    --disable-versioning --with-atomic-primitives=auto_allow_emulation \
    '--with-device=ch3:mrail' '--with-rdma=gen2' '--enable-static=no' \
    '--disable-static' '--enable-hybrid' '--enable-cma' \
    '--disable-wrapper-rpath' '--enable-shared' '--disable-rpath' \
    '--enable-versioning' '--enable-romio' \
    '--with-file-system=panfs+nfs+ufs' \
    'CC=/cluster/apps/gcc/4.8.2/bin/gcc' \
    'CFLAGS=-O2 -ftree-vectorize -march=corei7-avx -mavx -DNDEBUG -DNVALGRIND' \
    'CPP=/cluster/apps/gcc/4.8.2/bin/cpp' \
    'CXX=/cluster/apps/gcc/4.8.2/bin/g++' \
    'CXXFLAGS=-O2 -ftree-vectorize -march=corei7-avx -mavx -DNDEBUG -DNVALGRIND' \
    'FC=/cluster/apps/gcc/4.8.2/bin/gfortran' \
    'FCFLAGS=-O2 -ftree-vectorize -march=corei7-avx -mavx' \
    'F77=/cluster/apps/gcc/4.8.2/bin/gfortran' \
    'FFLAGS=-O2 -ftree-vectorize -march=corei7-avx -mavx' \
    --cache-file=/dev/null --srcdir=.
make
make install
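
For reference, the installed library can be sanity-checked with
mpiname, which ships with MVAPICH2 (it reports the version and the
configure line):

#optional sanity check on the installed library
/scratch/mvapich/bin/mpiname -a

The mpi_hello_world.c used below is not attached to this mail; a
minimal equivalent (our sketch, not necessarily the exact file) would
be:

#recreate the test program (sketch; the real file may differ)
cat > mpi_hello_world.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* the PMI traces below suggest the hang happens inside MPI_Init */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("hello from %s (rank %d of %d)\n", name, rank, size);
    MPI_Finalize();
    return 0;
}
EOF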

#compile & run
LD_LIBRARY_PATH="/scratch/mvapich/lib" /scratch/mvapich/bin/mpicc -g \
    -o mpi_hello_world mpi_hello_world.c \
    -I/scratch/mvapich/include -L/scratch/mvapich/lib
LD_LIBRARY_PATH="/scratch/mvapich/lib" /scratch/mvapich/bin/mpirun \
    -verbose -n 2 -ppn 1 -env LD_LIBRARY_PATH "/scratch/mvapich/lib" \
    -hosts eu-a6-011-01,eu-a6-011-02 /scratch/mpi_hello_world

host: eu-a6-011-01
host: eu-a6-011-02

==================================================================================================
mpiexec options:
----------------
    Base path: /scratch/mvapich/bin/
    Launcher: (null)
    Debug level: 1
    Enable X: -1
   _=/scratch/mvapich/bin/mpirun

    Hydra internal environment:
    ---------------------------
      GFORTRAN_UNBUFFERED_PRECONNECTED=y


      Proxy information:
      *********************
        [1] proxy: eu-a6-011-01 (1 cores)
        Exec list: /scratch/mpi_hello_world (1 processes);

        [2] proxy: eu-a6-011-02 (1 cores)
        Exec list: /scratch/mpi_hello_world (1 processes);

[mpiexec at eu-a6-011-01] Timeout set to -1 (-1 means infinite)
[mpiexec at eu-a6-011-01] Got a control port string of eu-a6-011-01:42316

Proxy launch args: /scratch/mvapich/bin/hydra_pmi_proxy --control-port eu-a6-011-01:42316 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id

[mpiexec at eu-a6-011-01] Launch arguments: /scratch/mvapich/bin/hydra_pmi_proxy --control-port eu-a6-011-01:42316 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
[mpiexec at eu-a6-011-01] Launch arguments: /usr/bin/ssh -x eu-a6-011-02 "/scratch/mvapich/bin/hydra_pmi_proxy" --control-port eu-a6-011-01:42316 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
[proxy:0:0 at eu-a6-011-01] got pmi command (from 0): init pmi_version=1 pmi_subversion=1
[proxy:0:0 at eu-a6-011-01] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_maxes
[proxy:0:0 at eu-a6-011-01] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_appnum
[proxy:0:0 at eu-a6-011-01] PMI response: cmd=appnum appnum=0
[proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_my_kvsname
[proxy:0:0 at eu-a6-011-01] PMI response: cmd=my_kvsname kvsname=kvs_63220_0
[proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_my_kvsname
[proxy:0:0 at eu-a6-011-01] PMI response: cmd=my_kvsname kvsname=kvs_63220_0
[proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get kvsname=kvs_63220_0 key=PMI_process_mapping
[proxy:0:0 at eu-a6-011-01] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:1 at eu-a6-011-02] got pmi command (from 4): init pmi_version=1 pmi_subversion=1
[proxy:0:1 at eu-a6-011-02] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_maxes
[proxy:0:1 at eu-a6-011-02] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_appnum
[proxy:0:1 at eu-a6-011-02] PMI response: cmd=appnum appnum=0
[proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_my_kvsname
....

{hangs}
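
For what it is worth, a backtrace of a stuck rank can be taken with a
plain gdb attach (generic sketch; <pid> stands for the pid of the hung
mpi_hello_world process on either node, not a value from our run):

#backtrace of a hung rank (sketch)
gdb -batch -p <pid> -ex 'thread apply all bt'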


#if run with a timeout (exported so mpirun picks it up)
export MPIEXEC_TIMEOUT=10
LD_LIBRARY_PATH="/scratch/mvapich/lib" /scratch/mvapich/bin/mpirun \
    -verbose -n 2 -ppn 1 -env LD_LIBRARY_PATH "/scratch/mvapich/lib" \
    -hosts eu-a6-011-01,eu-a6-011-02 /scratch/mpi_hello_world
....
[proxy:0:1 at eu-a6-011-02] PMI response: cmd=my_kvsname kvsname=kvs_6505_0
[proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at eu-a6-011-02] PMI response: cmd=my_kvsname kvsname=kvs_6505_0
[proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get kvsname=kvs_6505_0 key=PMI_process_mapping
[proxy:0:1 at eu-a6-011-02] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[mpiexec at eu-a6-011-01] APPLICATION TIMED OUT

Thank you,

-- 
Dr. Christiane Pousa
High Performance Computing, Scientific IT Services, ETH Zurich
WEC D 15, Weinbergstrasse 11, Zurich, Switzerland
Phone number: +41 44 633 91 74
http://www.id.ethz.ch/

