[mvapich-discuss] Help troubleshooting hang issues with mpi-hello-world on small cluster
Chris Green
greenc at fnal.gov
Mon May 8 17:00:22 EDT 2017
Hi,
I'm trying to troubleshoot the installation and operation of MVAPICH2
2.3a on a small cluster of RHEL7-ish machines (32-core AMD 61XX CPUs)
connected via InfiniBand ConnectX 2/3. The home and /usr/local
directories are shared by all nodes, and the node names corresponding to
the InfiniBand interfaces are: cluckib, ibgrunt{1,2,3,4,5}. The output
from mpiexec -info is:
[greenc@cluck] ~ $ mpiexec -info
HYDRA build details:
Version: 3.2
Release Date: Wed Mar 29 16:05:27 EDT 2017
CC: gcc
CXX: g++ -std=c++14
F77: gfortran
F90: gfortran
Configure options: '--disable-option-checking' '--prefix=/usr/local/mvapich2-2.3a' '--exec-prefix=/usr/local/mvapich2-2.3a/Linux64bit+3.10-2.17-e14' '--includedir=/usr/local/mvapich2-2.3a/Linux64bit+3.10-2.17-e14/include' '--enable-error-checking=all' '--enable-error-messages=all' '--enable-timing=none' '--enable-mpit-pvars=none' '--enable-fast=O3,ndebug' '--enable-fortran=all' '--enable-romio' '--enable-threads=runtime' '--enable-rdma-cm' '--disable-static' '--disable-dependency-tracking' '--disable-wrapper-rpath' '--enable-time-type=clock_gettime' '--enable-debuginfo' '--enable-versioning' '--disable-libxml2' '--enable-g=debug' '--enable-hybrid' 'CXXFLAGS=-std=c++14 -DNDEBUG -DNVALGRIND -g -O3' '--cache-file=/dev/null' '--srcdir=../../../../mvapich2-2.3a/src/pm/hydra' 'CC=gcc' 'CFLAGS= -DNDEBUG -DNVALGRIND -g -O3' 'LDFLAGS=-L/lib -L/lib -L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/lib -L/lib' 'LIBS=-libmad -lrdmacm -libumad -libverbs -ldl -lrt -lm -lpthread ' 'CPPFLAGS= -I/home/greenc/software/build-e14/src/mpl/include -I/home/greenc/software/mvapich2-2.3a/src/mpl/include -I/home/greenc/software/mvapich2-2.3a/src/openpa/src -I/home/greenc/software/build-e14/src/openpa/src -D_REENTRANT -I/home/greenc/software/build-e14/src/mpi/romio/include -I/include -I/include -I/include -I/include'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
In addition, passwordless ssh and rsh have been set up between the
nodes. It is worth noting that MVAPICH2 has been compiled with a
non-default compiler. I'm a little worried about the prevalence of /lib
in the configuration, when this is a 64-bit OS and I would have expected
to see /lib64. Simply doing ldd `which mpiexec` seems to point at the
64-bit libraries, however.
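To make that check less of an eyeball exercise, the ldd output can be reduced to the set of directories the libraries actually resolve from. This is only a sketch: the helper name lib_dirs is made up for illustration, and it assumes the usual glibc ldd line format of "name => /path/to/lib (address)".

```shell
# Hypothetical helper: read ldd output on stdin and print the unique
# directories that the shared libraries resolve from.
lib_dirs() {
    awk '$2 == "=>" && $3 ~ /^\// { sub(/\/[^\/]*$/, "", $3); print $3 }' | sort -u
}

# Only run the check where mpiexec is actually on the PATH.
if command -v mpiexec >/dev/null 2>&1; then
    ldd "$(command -v mpiexec)" | lib_dirs
fi
```

If everything is resolving from the 64-bit locations, no bare /lib entry should appear in the list.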
The following command works:
[greenc@cluck] ~ $ mpiexec -launcher rsh -localhost cluckib -envall -np 30 -hosts XXXXX ./mpi-hello-world
for any single XXXXX among the InfiniBand node names of the cluster.
However, as soon as I specify more than one node name, the system hangs.
Note that the firewall on all the machines is set to allow TCP and UDP
protocols on the "internal" network, of which ib0 is a member.
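For concreteness, a multi-node invocation of the kind that hangs can be put together as in the sketch below. The three node names are just an example subset of the list above, and the script only prints the command rather than running it; Hydra's -hosts option takes a comma-separated list.

```shell
# Example subset of the InfiniBand node names; any two or more show the hang.
nodes="cluckib ibgrunt1 ibgrunt2"

# Join the names with commas, the form -hosts expects.
hosts=""
for n in $nodes; do
    hosts="${hosts:+$hosts,}$n"
done

# Same options as the working single-node command, with the longer host list.
cmd="mpiexec -launcher rsh -localhost cluckib -envall -np 30 -hosts $hosts ./mpi-hello-world"
echo "$cmd"
```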
Not being any kind of an expert, I'm at a loss. Does anyone have any
pointers?
Thanks for any help,
Chris Green.