[mvapich-discuss] segfault when launching parallel application through slurm
Dominikus Heinzeller
climbfuji at ymail.com
Wed Sep 14 23:37:01 EDT 2016
Dear mvapich developers,
I am facing a problem with mvapich2 after upgrading some test nodes. Specifically, I successfully compiled and used mvapich2-2.2b (and mvapich2-2.2) on RH SL 7.2 with the following verbs libraries:
libibverbs-1.1.8-8.el7.x86_64
libipathverbs-1.3-2.el7.x86_64
libibverbs-devel-1.1.8-8.el7.x86_64
libibverbs-utils-1.1.8-8.el7.x86_64
After setting the correct MV2 environment variables, namely
declare -x MV2_DEFAULT_PORT="1"
declare -x MV2_IBA_HCA="mlx5_1"
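For reference, a minimal launch sketch under this (working, old-image) setup; the node/task counts and the application name ./mpi_app are placeholders, not taken from our actual jobs:

```shell
# Select the HCA and port that MVAPICH2 should use (values from our working setup)
export MV2_DEFAULT_PORT=1
export MV2_IBA_HCA=mlx5_1

# Launch across two nodes through slurm; ./mpi_app is a placeholder binary
srun --nodes=2 --ntasks-per-node=1 ./mpi_app
```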
I can run MPI codes across nodes. However, we just created a new image to boot from, which uses the following verbs libraries to allow for GPFS RDMA:
libibverbs-devel-static-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
libibverbs-devel-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
libibverbs-utils-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
Whenever I try to run an srun job across multiple nodes using the *previously* compiled mvapich2 version, the code crashes (segfault, etc.). Unfortunately, I cannot recompile mvapich2 right now, as the login nodes (where all the compilers etc. are installed) cannot be moved to the new image.
One strange thing I noticed: ibv_devinfo labels the InfiniBand network adapter correctly as mlx5_1 on the new system, while ibhosts shows "WHATEVERNODENAME HCA-2" for the new image; on the old image, both ibv_devinfo and ibhosts consistently show "mlx5_1".
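For anyone wanting to reproduce the comparison, this is roughly how I checked the two views on a node (the grep patterns are my own; both tools come with the verbs / infiniband-diags packages):

```shell
# Device name as the verbs library reports it on the local node
ibv_devinfo | grep hca_id

# Node description as the subnet manager / fabric sees it
ibhosts | grep "$(hostname)"
```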
Another interesting thing: mpich-3.1.4, compiled on the old image (libibverbs-1.1.8-8.el7.x86_64), runs without complaints on the new image (libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64) as long as the interface name (ib0) is defined properly in the environment variables. This, together with the ib test utilities, suggests that the physical InfiniBand connection is working fine.
I am wondering whether I need to recompile mvapich2 with the new verbs libraries or if the problem lies elsewhere. The mvapich2 library on the old image was configured as follows (with ch3:mrail and rdma=gen2 as default values on linux):
./configure \
--prefix=/app/mvapich2-2.2b/intel-15.0.4 \
--enable-fast \
--enable-f77 \
--enable-fc \
--enable-cxx \
--with-pm=slurm \
--with-pmi=pmi1 \
--enable-strict 2>&1 | tee log.config
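In case a rebuild against the Mellanox OFED stack turns out to be necessary, I would try something along these lines once a login node with the new image is available; the --with-ib-include/--with-ib-libpath options are my reading of the MVAPICH2 user guide, and the install prefix and OFED paths are placeholders for our setup:

```shell
./configure \
--prefix=/app/mvapich2-2.2/intel-15.0.4 \
--enable-fast \
--enable-f77 \
--enable-fc \
--enable-cxx \
--with-pm=slurm \
--with-pmi=pmi1 \
--with-ib-include=/usr/include/infiniband \
--with-ib-libpath=/usr/lib64 \
--enable-strict 2>&1 | tee log.config
```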
I would appreciate any help regarding this matter.
Thanks heaps in advance.
Cheers
Dom