[mvapich-discuss] Crash in rdma_open_hca ()

Subramoni, Hari subramoni.1 at osu.edu
Thu Oct 15 14:17:17 EDT 2020


Hi, Ben.

Do things pass if you add MV2_IBA_HCA=mlx5_0 or MV2_IBA_HCA=mlx5_0:mlx5_1:mlx5_2:mlx5_3?
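In the sbatch script from your original message, that would look something like the sketch below (the `mlx5_*` device names are just the ones suggested above; substitute whatever `ibstat` / `ibv_devinfo` actually report on the affected hosts):

```shell
# Sketch: restrict MVAPICH2 to specific HCAs via MV2_IBA_HCA,
# exported before the srun line in the sbatch script.
export MV2_IBA_HCA=mlx5_0                            # single HCA
# export MV2_IBA_HCA=mlx5_0:mlx5_1:mlx5_2:mlx5_3    # or a colon-separated list
echo "MV2_IBA_HCA=$MV2_IBA_HCA"
```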

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Ben Weigand
Sent: Thursday, October 15, 2020 2:14 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Crash in rdma_open_hca ()

Hi Hari,

ibstat / ibv_devinfo from a failing host are attached.

I added the '--mpi=pmi2' flag when I was trying to debug. I was originally running without an '--mpi=' flag, and it seems to work on most of my hosts, just not these.

My slurm build has the following mpi versions available:

$ srun --mpi=list
srun: MPI types are...
srun: openmpi
srun: none
srun: pmi2


When I run 'osu_latency' (in Slurm) without specifying '--mpi=', I get believable results, so it looks like communication is happening (on the *good* hosts).
===============
# OSU MPI Latency Test v5.6.3
# Size          Latency (us)
0                       1.88
1                       1.94
2                       1.93
4                       1.95
8                       1.95
16                      1.98
32                      1.98
64                      1.98
128                     2.05
256                     2.83
512                     2.93
1024                    3.09
2048                    3.45
4096                    4.32
8192                    5.41
16384                  10.44
32768                  10.08
65536                  11.53
131072                 13.54
262144                 17.27
524288                 26.79
1048576                40.50
2097152                70.41
4194304               125.20
===============

Thank you,

Ben


From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Thursday, October 15, 2020 at 7:55 AM
To: Ben Weigand <bweigand at fb.com>, mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] Crash in rdma_open_hca ()

Hi, Ben.

Sorry to hear that you’re facing issues. Can you please send the output of ibstat/ibv_devinfo -v on the node that is crashing?

I do not think the MVAPICH2-GDR RPMs have been built with support for PMI2. Can you try launching with PMI1 and see if it works?
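A sketch of what that could look like, assuming the plugin Slurm lists as `none` falls back to PMI1 on this build (an assumption to verify against your Slurm installation, not a confirmed behavior); the install path is copied from the sbatch script in the original message:

```shell
# Sketch only: drop --mpi=pmi2 and let srun use its default plugin.
# Whether that default supplies PMI1 depends on how Slurm was built --
# check with `srun --mpi=list` before relying on it.
MV2_PREFIX=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0
LAUNCH="srun --mpi=none $MV2_PREFIX/libexec/osu-micro-benchmarks/mpi/startup/osu_init"
echo "$LAUNCH"
```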

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Ben Weigand
Sent: Thursday, October 15, 2020 1:55 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Crash in rdma_open_hca ()


Hi,

I’m new to MVAPICH2, and I have a question about a crash that I’m seeing during MPI initialization with mvapich2-gdr-mcast.cuda11.0.mofed5.0.gnu7.3.0.slurm-2.3.4-1.
This happens when running the osu_init benchmark that ships in the pre-built RPM (on CentOS 7).

The crash only happens on a few nodes, but it’s consistently crashing on the affected hosts, and *only* when I run it with Slurm.

Standalone, without Slurm, it works fine:
$ /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init
# OSU MPI Init Test v5.6.3
nprocs: 1, min: 282 ms, max: 282 ms, avg: 282 ms


With Slurm (19.05):
## sbatch osu_init.sh:
==================
export LD_PRELOAD=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
export LD_LIBRARY_PATH=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64:/lib64:$LD_LIBRARY_PATH
export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/usr/lib64/libgdrapi.so
export PATH=/opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/bin/:$PATH

srun --mpi=pmi2 /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init
==================


$ cat osu_mpi_run.host1060.839282.err
[host1060:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[host1061:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: host1060: task 0: Segmentation fault (core dumped)
srun: error: host1061: task 1: Segmentation fault (core dumped)


GDB Backtrace from core:
===========================
(gdb) bt full
#0  0x00007fd4d8b25b6c in __strncmp_sse42 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fd4dccc7455 in rdma_open_hca () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#2  0x00007fd4dccd4a1e in rdma_get_control_parameters () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#3  0x00007fd4dcca130b in MPIDI_CH3_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#4  0x00007fd4dcc9472d in MPID_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#5  0x00007fd4dcbef28f in MPIR_Init_thread () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#6  0x00007fd4dcbeecfe in PMPI_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so
No symbol table info available.
#7  0x00000000004009ee in main ()
No symbol table info available.
===========================



Relevant source of ‘osu_init.c’:
===========================
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int
main (int argc, char *argv[])
{
    int myid, numprocs;
    struct timespec tp_before, tp_after;
    long duration = 0, min, max, avg;

    clock_gettime(CLOCK_REALTIME, &tp_before);
    MPI_Init(&argc, &argv);
===========================


All these hosts should be identical. Has anyone seen a similar crash before? Any pointers?


Thanks,
B