[mvapich-discuss] Help troubleshooting hang issues with mpi-hello-world on small cluster

Chris Green greenc at fnal.gov
Mon May 8 17:00:22 EDT 2017


Hi,

I'm trying to troubleshoot the installation and operation of mvapich2 
2.3a on a small cluster of RHEL7-ish machines connected via InfiniBand 
ConnectX 2/3 (32-core AMD 61XX CPUs). The home and /usr/local 
directories are common, and the node names corresponding to the 
InfiniBand interfaces are: cluckib, ibgrunt{1,2,3,4,5}. The output from 
mpiexec -info is:

[greenc at cluck] ~ $ mpiexec -info
HYDRA build details:
     Version:                                 3.2
     Release Date:                            Wed Mar 29 16:05:27 EDT 2017
     CC:                              gcc
     CXX:                             g++  -std=c++14
     F77:                             gfortran
     F90:                             gfortran
     Configure options:                       '--disable-option-checking' '--prefix=/usr/local/mvapich2-2.3a' '--exec-prefix=/usr/local/mvapich2-2.3a/Linux64bit+3.10-2.17-e14' '--includedir=/usr/local/mvapich2-2.3a/Linux64bit+3.10-2.17-e14/include' '--enable-error-checking=all' '--enable-error-messages=all' '--enable-timing=none' '--enable-mpit-pvars=none' '--enable-fast=O3,ndebug' '--enable-fortran=all' '--enable-romio' '--enable-threads=runtime' '--enable-rdma-cm' '--disable-static' '--disable-dependency-tracking' '--disable-wrapper-rpath' '--enable-time-type=clock_gettime' '--enable-debuginfo' '--enable-versioning' '--disable-libxml2' '--enable-g=debug' '--enable-hybrid' 'CXXFLAGS=-std=c++14 -DNDEBUG -DNVALGRIND -g -O3' '--cache-file=/dev/null' '--srcdir=../../../../mvapich2-2.3a/src/pm/hydra' 'CC=gcc' 'CFLAGS= -DNDEBUG -DNVALGRIND -g -O3' 'LDFLAGS=-L/lib -L/lib -L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/lib -L/lib' 'LIBS=-libmad -lrdmacm -libumad -libverbs -ldl -lrt -lm -lpthread ' 'CPPFLAGS= -I/home/greenc/software/build-e14/src/mpl/include -I/home/greenc/software/mvapich2-2.3a/src/mpl/include -I/home/greenc/software/mvapich2-2.3a/src/openpa/src -I/home/greenc/software/build-e14/src/openpa/src -D_REENTRANT -I/home/greenc/software/build-e14/src/mpi/romio/include -I/include -I/include -I/include -I/include'
     Process Manager:                         pmi
     Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
     Topology libraries available:            hwloc
     Resource management kernels available:   user slurm ll lsf sge pbs cobalt
     Checkpointing libraries available:
     Demux engines available:                 poll select

In addition, passwordless ssh and rsh have been set up between the 
nodes. It is worth noting that MVAPICH2 has been compiled with a 
non-default compiler. I'm a little worried about the prevalence of /lib 
in the configuration, when this is a 64-bit OS and I would have expected 
to see /lib64. Simply doing ldd `which mpiexec` seems to point at the 
64-bit libraries, however.

The following command works:

[greenc at cluck] ~ $ mpiexec -launcher rsh -localhost cluckib -envall -np 30 -hosts XXXXX ./mpi-hello-world

for any single XXXXX among the InfiniBand node names of the cluster.
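Since single-node launches work, it may also be worth ruling out an interactive prompt or a blocked connection on the node-to-node hops the launcher makes. A quick sketch, using the hostnames from above (BatchMode makes ssh fail immediately instead of hanging on a prompt):

```shell
# Attempt a prompt-free ssh to each InfiniBand hostname; any node that
# prints FAILED here would also stall an mpiexec launch including it.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" hostname 2>/dev/null \
        || echo "FAILED: $1"
}

for h in cluckib ibgrunt1 ibgrunt2 ibgrunt3 ibgrunt4 ibgrunt5; do
    check_ssh "$h"
done
```

A node that answers an interactive `ssh` but fails here (host-key prompt, password prompt, chatty login script) is a classic cause of multi-node launch hangs.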

However, as soon as I specify more than one node name, the system hangs. 
Note that the firewall on all the machines is set to allow TCP and UDP 
protocols on the "internal" network, of which ib0 is a member.
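One way to bisect a hang like this is to take hydra out of the picture and launch the same binary over exactly two nodes with MVAPICH2's own mpirun_rsh launcher. A hypothetical minimal hostfile (node choice is illustrative, names from above):

```
cluckib
ibgrunt1
```

With that saved as, say, ./hosts, `mpirun_rsh -ssh -np 2 -hostfile ./hosts ./mpi-hello-world` launches without the hydra process manager; if that hangs as well, the problem is more likely in the transport/fabric layer than in mpiexec itself. Growing the hostfile one node at a time can also show whether a particular node or IB link is the culprit.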

Not being any kind of an expert, I'm at a loss. Does anyone have any 
pointers?

Thanks for any help,

Chris Green.
