[mvapich-discuss] segfault when launching parallel application through slurm

Hari Subramoni subramoni.1 at osu.edu
Fri Sep 16 23:04:09 EDT 2016


Hello,

Sorry to hear that you're facing issues with running MVAPICH2. Can you
please provide us with some details about the sort of failure you're seeing
(a backtrace from the core dump, if any, etc.)? Re-running with
MV2_DEBUG_SHOW_BACKTRACE=2 would be helpful, as it may give you a stack trace.
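
As a rough sketch, such a re-run could look like the following. The
application path and the node/task counts are hypothetical placeholders, and
the core-file limit only helps if your Slurm setup propagates resource limits
to the compute nodes:

```shell
#!/bin/bash
# Sketch: re-run a failing job with MVAPICH2 backtraces and core dumps enabled.
# APP, NODES and TASKS are hypothetical placeholders; adjust them to the real job.
APP="${APP:-./my_mpi_app}"
NODES="${NODES:-2}"
TASKS="${TASKS:-2}"

# Ask MVAPICH2 to print a backtrace when a process crashes.
export MV2_DEBUG_SHOW_BACKTRACE=2

# Allow core files so a debugger can inspect the crash afterwards.
# Ignore the error if a hard limit forbids raising it.
ulimit -c unlimited 2>/dev/null || true

# Launch only if srun and the application are present; otherwise just
# print the command that would have been run.
if command -v srun >/dev/null 2>&1 && [ -x "$APP" ]; then
    srun -N "$NODES" -n "$TASKS" "$APP"
else
    echo "would run: srun -N $NODES -n $TASKS $APP"
fi
```

The other MV2_* variables you already export (MV2_IBA_HCA and so on) are
simply inherited from the environment.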

On the MPICH front, can you tell us what netmod you're using? Is it the
sockets netmod?
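
If you are not sure, one way to check is the mpichversion tool that ships
with MPICH. This is only a sketch: it assumes mpichversion from the
mpich-3.1.4 install is in your PATH, and show_mpich_device is just an
illustrative helper name. The netmod is listed explicitly only if it was
given at configure time:

```shell
#!/bin/bash
# Sketch: print the device and configure lines of an MPICH installation.
# show_mpich_device is a hypothetical helper name used for illustration.
show_mpich_device() {
    if command -v mpichversion >/dev/null 2>&1; then
        # Keep only the lines that mention the device or the configure options.
        mpichversion | grep -i -E 'device|configure' \
            || echo "no device or configure line found"
    else
        echo "mpichversion not found in PATH"
    fi
}

show_mpich_device
```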

Regards,
Hari.

On Wed, Sep 14, 2016 at 11:37 PM, Dominikus Heinzeller <climbfuji at ymail.com>
wrote:

> Dear mvapich developers,
>
> I am facing a problem with mvapich2 after upgrading some test nodes.
> Specifically, I compiled and used successfully mvapich2-2.2b (and
> mvapich2-2.2) on RH SL 7.2 with the following verbs libraries:
>
> libibverbs-1.1.8-8.el7.x86_64
> libipathverbs-1.3-2.el7.x86_64
> libibverbs-devel-1.1.8-8.el7.x86_64
> libibverbs-utils-1.1.8-8.el7.x86_64
>
> After setting the correct MV environment variables, namely
>
> declare -x MV2_DEFAULT_PORT="1"
> declare -x MV2_IBA_HCA="mlx5_1"
>
> I can run mpi codes across nodes. However, we just created a new image to
> boot from, which uses the following verbs libraries to allow for GPFS RDMA:
>
> libibverbs-devel-static-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
> libibverbs-devel-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
> libibverbs-utils-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
> libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
>
> Whenever I try to run an srun job across multiple nodes, using the
> *previously* compiled mvapich2 version, the code crashes (segfault etc).
> Unfortunately, right now I cannot recompile mvapich2 as the login nodes
> (where all the compilers etc are installed) cannot be moved to the new
> image.
>
> One strange thing I noticed: ibv_devinfo labels the Infiniband network
> adapter correctly as mlx5_1 on the new system, while an ibhosts shows
> "WHATEVERNODENAME HCA-2" for the new image; for the old image, both
> ibv_devinfo as well as ibhosts consistently show "mlx5_1".
>
> Another interesting thing: mpich-3.1.4, compiled on the old image
> (libibverbs-1.1.8-8.el7.x86_64) runs without complaints on the new image
> (libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64) as long as the
> interface name is defined properly in the environment variables (ib0). This
> and the ib test utilities suggest that the physical infiniband connection
> is working fine.
>
> I am wondering whether I need to recompile mvapich2 with the new verbs
> libraries or if the problem lies elsewhere. The mvapich2 library on the old
> image was configured as follows (with ch3:mrail and rdma=gen2 as default
> values on linux):
>
> ./configure \
> --prefix=/app/mvapich2-2.2b/intel-15.0.4 \
> --enable-fast \
> --enable-f77   \
> --enable-fc    \
> --enable-cxx   \
> --with-pm=slurm \
> --with-pmi=pmi1 \
> --enable-strict 2>&1 | tee log.config
>
> I would appreciate any help regarding this matter.
>
> Thanks heaps in advance.
>
> Cheers
>
> Dom
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>