[mvapich-discuss] segfault when launching parallel application through slurm

Dominikus Heinzeller climbfuji at ymail.com
Wed Sep 14 23:37:01 EDT 2016


Dear mvapich developers,

I am facing a problem with mvapich2 after upgrading some test nodes. Specifically, I successfully compiled and used mvapich2-2.2b (and mvapich2-2.2) on RH SL 7.2 with the following verbs libraries:

libibverbs-1.1.8-8.el7.x86_64
libipathverbs-1.3-2.el7.x86_64
libibverbs-devel-1.1.8-8.el7.x86_64
libibverbs-utils-1.1.8-8.el7.x86_64

After setting the correct MV environment variables, namely

declare -x MV2_DEFAULT_PORT="1"
declare -x MV2_IBA_HCA="mlx5_1"
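
For reference, a minimal sketch of the working launch environment (the srun line is commented out here; the node/task counts and the binary name are placeholders, not values from my actual jobs):

```shell
# Working environment on the old image (values from above)
export MV2_DEFAULT_PORT=1
export MV2_IBA_HCA=mlx5_1

# Launch across nodes via Slurm -- placeholder counts and binary name:
# srun -N 2 -n 16 ./mpi_app

echo "MV2_IBA_HCA=$MV2_IBA_HCA"
```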

I can run mpi codes across nodes. However, we just created a new image to boot from, which uses the following verbs libraries to allow for GPFS RDMA:

libibverbs-devel-static-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
libibverbs-devel-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
libibverbs-utils-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64

Whenever I try to run an srun job across multiple nodes using the *previously* compiled mvapich2 version, the code crashes (segfault etc.). Unfortunately, I cannot recompile mvapich2 right now, as the login nodes (where all the compilers etc. are installed) cannot be moved to the new image.

One strange thing I noticed: ibv_devinfo correctly labels the InfiniBand network adapter as mlx5_1 on the new system, while ibhosts shows "WHATEVERNODENAME HCA-2" for the new image; on the old image, both ibv_devinfo and ibhosts consistently show "mlx5_1".
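
The comparison above can be reproduced with something like the following sketch; the grep patterns are my assumptions about the output format, and the branches only guard against the utilities being absent on a node:

```shell
# Compare how the adapter is named at the verbs level vs. by the fabric
# tools. Guarded so the snippet degrades gracefully where the InfiniBand
# utilities are not installed.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | grep -i hca_id        # verbs-level name, e.g. "hca_id: mlx5_1"
else
    echo "ibv_devinfo not available"
fi

if command -v ibhosts >/dev/null 2>&1; then
    ibhosts | grep -i "$(hostname -s)"  # node description as seen by the fabric
else
    echo "ibhosts not available"
fi
```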

Another interesting thing: mpich-3.1.4, compiled on the old image (libibverbs-1.1.8-8.el7.x86_64), runs without complaints on the new image (libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64) as long as the interface name (ib0) is defined properly in the environment variables. This, together with the ib test utilities, suggests that the physical InfiniBand connection is working fine.

I am wondering whether I need to recompile mvapich2 against the new verbs libraries, or if the problem lies elsewhere. The mvapich2 library on the old image was configured as follows (with ch3:mrail and rdma=gen2 as the default values on Linux):

./configure \
--prefix=/app/mvapich2-2.2b/intel-15.0.4 \
--enable-fast \
--enable-f77   \
--enable-fc    \
--enable-cxx   \
--with-pm=slurm \
--with-pmi=pmi1 \
--enable-strict 2>&1 | tee log.config
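
One thing I plan to check is which libibverbs the existing build actually resolves to at run time on the new image. A sketch, where the shared-object name under the --prefix above is my assumption (adjust the path as needed):

```shell
# Resolve the verbs dependency of the old mvapich2 build on the new image.
# The .so name below is an assumption based on the --prefix used above.
MV2_LIB=/app/mvapich2-2.2b/intel-15.0.4/lib/libmpi.so

if [ -e "$MV2_LIB" ]; then
    ldd "$MV2_LIB" | grep -i ibverbs   # should now point at the OFED 3.3 libibverbs
else
    echo "library not found at $MV2_LIB -- adjust the path"
fi
```

If the old build resolves to the Mellanox OFED libibverbs but was compiled against the stock EL7 one, an ABI mismatch could explain the segfaults.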

I would appreciate any help regarding this matter.

Thanks heaps in advance.

Cheers

Dom