[mvapich-discuss] Help troubleshooting hang issues with mpi-hello-world on small cluster

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Tue May 9 18:30:17 EDT 2017


Hi Chris,

Thanks for the detailed report. There seems to be no obvious issue with
your configuration.

Can you please check whether the hang also occurs with the mpirun_rsh
launcher? The MVAPICH user guide has more details on how to use mpirun_rsh
in different scenarios:
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3a-userguide.html#x1-290005.2.1
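
As a rough sketch of such a test, assuming two of the node names from your
report (cluckib and ibgrunt1) and a hypothetical hostfile called hosts.ib,
a two-node run could be assembled like this. The command is only built and
printed here so it can be checked before launching:

```shell
# Hypothetical hostfile name; the node names are taken from the report.
hostfile=./hosts.ib
printf '%s\n' cluckib ibgrunt1 > "$hostfile"

# One process per listed host; mpirun_rsh uses ssh unless told otherwise.
cmd="mpirun_rsh -np 2 -hostfile $hostfile ./mpi-hello-world"

# Print the command for inspection; run it with: eval "$cmd"
echo "$cmd"
```

If this two-node run completes while the mpiexec run hangs, the launcher
is the more likely suspect than the InfiniBand setup.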

Can you please also try manually selecting the HCA by setting the
MV2_IBA_HCA environment variable?
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3a-userguide.html#x1-19700011.24
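
A minimal sketch of what that could look like, assuming the HCA is named
mlx4_0 (a guess based on the ConnectX 2/3 hardware; `ibv_devices` on each
node lists the actual names). With mpirun_rsh, environment variables go on
the command line as NAME=VALUE just before the executable:

```shell
# Assumed device name; replace with whatever ibv_devices reports.
hca=mlx4_0

# Hosts are given inline here; node names are taken from the report.
cmd="mpirun_rsh -np 2 cluckib ibgrunt1 MV2_IBA_HCA=$hca ./mpi-hello-world"

# Print the command for inspection; run it with: eval "$cmd"
echo "$cmd"
```

With the Hydra mpiexec launcher, the equivalent would be to pass
-genv MV2_IBA_HCA mlx4_0 on the mpiexec command line.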

Thanks,
Sourav




On Mon, May 8, 2017 at 5:00 PM, Chris Green <greenc at fnal.gov> wrote:

> Hi,
>
> I'm trying to troubleshoot the installation and operation of mvapich2 2.3a
> on a small cluster of RHEL7-ish machines (32-core AMD 61XX CPUs) connected
> via InfiniBand ConnectX 2/3. The home and /usr/local directories are
> shared across the nodes, and the node names corresponding to the InfiniBand
> interfaces are: cluckib, ibgrunt{1,2,3,4,5}. The output from mpiexec -info
> is:
>
> [greenc@cluck] ~ $ mpiexec -info
> HYDRA build details:
>     Version:                                 3.2
>     Release Date:                            Wed Mar 29 16:05:27 EDT 2017
>     CC:                              gcc
>     CXX:                             g++  -std=c++14
>     F77:                             gfortran
>     F90:                             gfortran
>     Configure options:                       '--disable-option-checking'
> '--prefix=/usr/local/mvapich2-2.3a'
> '--exec-prefix=/usr/local/mvapich2-2.3a/Linux64bit+3.10-2.17-e14'
> '--includedir=/usr/local/mvapich2-2.3a/Linux64bit+3.10-2.17-e14/include'
> '--enable-error-checking=all'
> '--enable-error-messages=all' '--enable-timing=none'
> '--enable-mpit-pvars=none' '--enable-fast=O3,ndebug' '--enable-fortran=all'
> '--enable-romio' '--enable-threads=runtime' '--enable-rdma-cm'
> '--disable-static' '--disable-dependency-tracking'
> '--disable-wrapper-rpath' '--enable-time-type=clock_gettime'
> '--enable-debuginfo' '--enable-versioning' '--disable-libxml2'
> '--enable-g=debug' '--enable-hybrid' 'CXXFLAGS=-std=c++14 -DNDEBUG
> -DNVALGRIND -g -O3' '--cache-file=/dev/null' '--srcdir=../../../../mvapich2-2.3a/src/pm/hydra'
> 'CC=gcc' 'CFLAGS= -DNDEBUG -DNVALGRIND -g -O3' 'LDFLAGS=-L/lib -L/lib
> -L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/lib -L/lib' 'LIBS=-libmad
> -lrdmacm -libumad -libverbs -ldl -lrt -lm -lpthread ' 'CPPFLAGS=
> -I/home/greenc/software/build-e14/src/mpl/include
> -I/home/greenc/software/mvapich2-2.3a/src/mpl/include
> -I/home/greenc/software/mvapich2-2.3a/src/openpa/src
> -I/home/greenc/software/build-e14/src/openpa/src -D_REENTRANT
> -I/home/greenc/software/build-e14/src/mpi/romio/include -I/include
> -I/include -I/include -I/include'
>     Process Manager:                         pmi
>     Launchers available:                     ssh rsh fork slurm ll lsf sge
> manual persist
>     Topology libraries available:            hwloc
>     Resource management kernels available:   user slurm ll lsf sge pbs
> cobalt
>     Checkpointing libraries available:
>     Demux engines available:                 poll select
>
> In addition, passwordless ssh and rsh have been set up between the nodes.
> It is worth noting that MVAPICH2 has been compiled with a non-default
> compiler. I'm a little worried about the prevalence of /lib in the
> configuration, when this is a 64-bit OS and I would have expected to see
> /lib64. Simply doing ldd `which mpiexec` seems to point at the 64-bit
> libraries, however.
>
> The following command works:
>
> [greenc@cluck] ~ $ mpiexec -launcher rsh -localhost cluckib -envall \
>     -np 30 -hosts XXXXX ./mpi-hello-world
>
> for any single XXXXX among the InfiniBand node names of the cluster.
>
> However, as soon as I specify more than one node name, the system hangs.
> Note that the firewall on all the machines is set to allow TCP and UDP
> protocols on the "internal" network, of which ib0 is a member.
>
> Not being any kind of an expert, I'm at a loss. Does anyone have any
> pointers?
>
> Thanks for any help,
>
> Chris Green.