[mvapich-discuss] Crash in rdma_open_hca ()

Subramoni, Hari subramoni.1 at osu.edu
Thu Oct 15 10:54:41 EDT 2020


Hi, Ben.

Sorry to hear that you’re facing issues. Can you please send the output of ibstat/ibv_devinfo -v on the node that is crashing?
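
For example, collecting these on the failing host into files that you can attach (using host1060 from your log; the output file names are arbitrary):

$ ibstat > host1060_ibstat.txt
$ ibv_devinfo -v > host1060_ibv_devinfo.txt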

I do not think the MVAPICH2-GDR RPMs have been built with support for PMI2. Can you try launching with PMI1 and see if it works?
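
For example, something along these lines, keeping the rest of your job script unchanged (just a sketch; it assumes the default Slurm MPI plugin on your cluster still provides the PMI1 interface, so dropping the --mpi=pmi2 flag falls back to PMI1):

srun /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init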

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Ben Weigand
Sent: Thursday, October 15, 2020 1:55 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Crash in rdma_open_hca ()


Hi,

I’m new to MVAPICH2, and I have a question about a crash that I’m seeing during MPI initialization with mvapich2-gdr-mcast.cuda11.0.mofed5.0.gnu7.3.0.slurm-2.3.4-1.
It happens when running the osu_init benchmark that ships in the pre-built RPM (on CentOS 7).

The crash only happens on a few nodes, but it is consistent on the affected hosts, and it occurs *only* when I run under Slurm.

Without Slurm (standalone), it works fine:
$ /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init
# OSU MPI Init Test v5.6.3
nprocs: 1, min: 282 ms, max: 282 ms, avg: 282 ms


With Slurm (19.05):
## sbatch osu_init.sh:
==================
export LD_PRELOAD=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
export LD_LIBRARY_PATH=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64:/lib64:$LD_LIBRARY_PATH
export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/usr/lib64/libgdrapi.so
export PATH=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/bin/:$PATH

srun --mpi=pmi2 /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init
==================


$ cat osu_mpi_run.host1060.839282.err
[host1060:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[host1061:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: host1060: task 0: Segmentation fault (core dumped)
srun: error: host1061: task 1: Segmentation fault (core dumped)


GDB Backtrace from core:
===========================
(gdb) bt full
#0  0x00007fd4d8b25b6c in __strncmp_sse42 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fd4dccc7455 in rdma_open_hca () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#2  0x00007fd4dccd4a1e in rdma_get_control_parameters () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#3  0x00007fd4dcca130b in MPIDI_CH3_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#4  0x00007fd4dcc9472d in MPID_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#5  0x00007fd4dcbef28f in MPIR_Init_thread () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#6  0x00007fd4dcbeecfe in PMPI_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#7  0x00000000004009ee in main ()
No symbol table info available.
===========================



Relevant source of ‘osu_init.c’:
===========================
     16 #include <mpi.h>
     17 #include <stdlib.h>
     18 #include <stdio.h>
     19 #include <time.h>
     20
     21 int
     22 main (int argc, char *argv[])
     23 {
     24     int myid, numprocs;
     25     struct timespec tp_before, tp_after;
     26     long duration = 0, min, max, avg;
     27
     28     clock_gettime(CLOCK_REALTIME, &tp_before);
     29     MPI_Init(&argc, &argv);
===========================


All these hosts should be identical. Has anyone seen a similar crash before? Any pointers?


Thanks,
B