[mvapich-discuss] applications hanging - Infiniband EDR - Mellanox OFED-4.2-1.2.0

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Tue Feb 20 14:56:39 EST 2018


Hi Dr. Pousa,

We have fixed an issue that caused a hang with the hydra launcher; the fix
is included in the latest release, MVAPICH2-2.3rc1.

Can you please try the new version and let us know whether it resolves the
issue? It is available from the following URL:
http://mvapich.cse.ohio-state.edu/downloads/
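For anyone following the thread, rebuilding against the new release might look like the following (a sketch only: the exact tarball name and download path for 2.3rc1 are assumptions, and the configure flags are trimmed to the essentials from the original report — check the downloads page for the real file):

```shell
# Fetch and build MVAPICH2-2.3rc1 (tarball name and URL path assumed;
# verify against http://mvapich.cse.ohio-state.edu/downloads/)
wget http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3rc1.tar.gz
tar xzf mvapich2-2.3rc1.tar.gz
cd mvapich2-2.3rc1

# Same device/transport selection as the failing 2.3b build
./configure --prefix=/scratch/mvapich \
            --with-device=ch3:mrail --with-rdma=gen2 \
            --enable-shared --disable-static
make -j8 && make install
```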

Thanks,
Sourav


On Sun, Feb 18, 2018 at 2:16 PM, Christiane Pousa <cpousari at id.ethz.ch>
wrote:

> Hi,
>
> we have observed applications hanging when run with mvapich2. We tried
> different versions of mvapich2, but none worked in our cluster. We have
> nodes with Mellanox EDR (ConnectX-4), CentOS 7.4, and Mellanox
> OFED-4.2-1.2.0.
>
> Have you seen this too? We have checked that ssh to/from the nodes works.
> Some information about our test installation:
>
> #build mvapich
> mvapich2-2.3b.tar.gz
> tar xzvf mvapich2-2.3b.tar.gz
> cd mvapich2-2.3b
> ./configure --disable-option-checking '--prefix=/scratch/mvapich'
> --disable-versioning --with-atomic-primitives=auto_allow_emulation
> '--with-device=ch3:mrail' '--with-rdma=gen2' '--enable-static=no'
> '--disable-static' '--enable-hybrid' '--enable-cma'
> '--disable-wrapper-rpath' '--enable-shared' '--disable-rpath'
> '--enable-versioning'  '--enable-romio'
> '--with-file-system=panfs+nfs+ufs'
> 'CC=/cluster/apps/gcc/4.8.2/bin/gcc'
> 'CFLAGS=-O2 -ftree-vectorize -march=corei7-avx -mavx   -DNDEBUG
> -DNVALGRIND' 'CPP=/cluster/apps/gcc/4.8.2/bin/cpp'
> 'CXX=/cluster/apps/gcc/4.8.2/bin/g++' 'CXXFLAGS=-O2 -ftree-vectorize
> -march=corei7-avx -mavx  -DNDEBUG -DNVALGRIND'
> 'FC=/cluster/apps/gcc/4.8.2/bin/gfortran' 'FCFLAGS=-O2 -ftree-vectorize
> -march=corei7-avx -mavx ' 'F77=/cluster/apps/gcc/4.8.2/bin/gfortran'
> 'FFLAGS=-O2 -ftree-vectorize -march=corei7-avx -mavx '
> --cache-file=/dev/null --srcdir=.
> make
> make install
>
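For context, the test program compiled below is presumably a minimal MPI hello world along these lines (a sketch; the actual mpi_hello_world.c is not included in this message). Building and running it requires an MPI installation such as the mvapich2 built above.

```c
/* Minimal MPI hello world (sketch; the actual mpi_hello_world.c used
 * in the test is not shown in the thread). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */
    MPI_Get_processor_name(host, &len);     /* node the rank runs on */

    printf("host: %s (rank %d of %d)\n", host, rank, size);

    MPI_Finalize();
    return 0;
}
```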
> #compile & run
> LD_LIBRARY_PATH="/scratch/mvapich/lib" /scratch/mvapich/bin/mpicc -g
> -o mpi_hello_world mpi_hello_world.c -I/scratch/mvapich/include
> -L/scratch/mvapich/lib
> LD_LIBRARY_PATH="/scratch/mvapich/lib" /scratch/mvapich/bin/mpirun
> -verbose -n 2 -ppn 1 -env LD_LIBRARY_PATH "/scratch/mvapich/lib"
> -hosts eu-a6-011-01,eu-a6-011-02 /scratch/mpi_hello_world
>
> host: eu-a6-011-01
> host: eu-a6-011-02
>
> ==========================================================================================
> mpiexec options:
> ----------------
>    Base path: /scratch/mvapich/bin/
>    Launcher: (null)
>    Debug level: 1
>    Enable X: -1
>   _=/scratch/mvapich/bin/mpirun
>
>    Hydra internal environment:
>    ---------------------------
>      GFORTRAN_UNBUFFERED_PRECONNECTED=y
>
>
>      Proxy information:
>      *********************
>        [1] proxy: eu-a6-011-01 (1 cores)
>        Exec list: /scratch/mpi_hello_world (1 processes);
>
>        [2] proxy: eu-a6-011-02 (1 cores)
>        Exec list: /scratch/mpi_hello_world (1 processes);
>
> [mpiexec at eu-a6-011-01] Timeout set to -1 (-1 means infinite)
> [mpiexec at eu-a6-011-01] Got a control port string of eu-a6-011-01:42316
>
> Proxy launch args: /scratch/mvapich/bin/hydra_pmi_proxy --control-port
> eu-a6-011-01:42316 --debug --rmk user --launcher ssh --demux poll
> --pgid 0 --retries 10 --usize -2 --proxy-id
>
> [mpiexec at eu-a6-011-01] Launch arguments:
> /scratch/mvapich/bin/hydra_pmi_proxy --control-port eu-a6-011-01:42316
> --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10
> --usize -2 --proxy-id 0
> [mpiexec at eu-a6-011-01] Launch arguments: /usr/bin/ssh -x eu-a6-011-02
> "/scratch/mvapich/bin/hydra_pmi_proxy" --control-port eu-a6-011-01:42316
> --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10
> --usize -2 --proxy-id 1
> [proxy:0:0 at eu-a6-011-01] got pmi command (from 0): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:0 at eu-a6-011-01] PMI response: cmd=response_to_init
> pmi_version=1 pmi_subversion=1 rc=0
> [proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_maxes
>
> [proxy:0:0 at eu-a6-011-01] PMI response: cmd=maxes kvsname_max=256
> keylen_max=64 vallen_max=1024
> [proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_appnum
>
> [proxy:0:0 at eu-a6-011-01] PMI response: cmd=appnum appnum=0
> [proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_my_kvsname
>
> [proxy:0:0 at eu-a6-011-01] PMI response: cmd=my_kvsname
> kvsname=kvs_63220_0
> [proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get_my_kvsname
>
> [proxy:0:0 at eu-a6-011-01] PMI response: cmd=my_kvsname
> kvsname=kvs_63220_0
> [proxy:0:0 at eu-a6-011-01] got pmi command (from 0): get
> kvsname=kvs_63220_0 key=PMI_process_mapping
> [proxy:0:0 at eu-a6-011-01] PMI response: cmd=get_result rc=0
> msg=success
> value=(vector,(0,2,1))
> [proxy:0:1 at eu-a6-011-02] got pmi command (from 4): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:1 at eu-a6-011-02] PMI response: cmd=response_to_init
> pmi_version=1 pmi_subversion=1 rc=0
> [proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_maxes
>
> [proxy:0:1 at eu-a6-011-02] PMI response: cmd=maxes kvsname_max=256
> keylen_max=64 vallen_max=1024
> [proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_appnum
>
> [proxy:0:1 at eu-a6-011-02] PMI response: cmd=appnum appnum=0
> [proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_my_kvsname
> ....
>
> {hangs}
>
>
> # if run with a timeout (MPIEXEC_TIMEOUT exported in the environment)
> export MPIEXEC_TIMEOUT=10
> LD_LIBRARY_PATH="/scratch/mvapich/lib" /scratch/mvapich/bin/mpirun
> -verbose -n 2 -ppn 1 -env LD_LIBRARY_PATH "/scratch/mvapich/lib"
> -hosts eu-a6-011-01,eu-a6-011-02 /scratch/mpi_hello_world
> ....
> [proxy:0:1 at eu-a6-011-02] PMI response: cmd=my_kvsname
> kvsname=kvs_6505_0
> [proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get_my_kvsname
>
> [proxy:0:1 at eu-a6-011-02] PMI response: cmd=my_kvsname
> kvsname=kvs_6505_0
> [proxy:0:1 at eu-a6-011-02] got pmi command (from 4): get
> kvsname=kvs_6505_0 key=PMI_process_mapping
> [proxy:0:1 at eu-a6-011-02] PMI response: cmd=get_result rc=0
> msg=success
> value=(vector,(0,2,1))
> [mpiexec at eu-a6-011-01] APPLICATION TIMED OUT
>
> Thank you,
>
> --
> Dr. Christiane Pousa
> High Performance Computing, Scientific IT Services, ETH Zurich
> WEC D 15, Weinbergstrasse 11, Zurich, Switzerland
> Phone number: +41 44 633 91 74
> http://www.id.ethz.ch/
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

