[mvapich-discuss] Segfault in malloc_consolidate() running osu_init

Ben Weigand bweigand at fb.com
Wed Oct 21 01:37:34 EDT 2020


Hi,


I'm hopefully at my last hardware issue; this one is on a different machine type (an NVIDIA DGX-2).



If I build and run 'osu_init' against openmpi-4.0.3rc4, it runs as expected:

$ ./mpi/startup/osu_init
# OSU MPI Init Test v5.6.3
nprocs: 1, min: 539 ms, max: 539 ms, avg: 539 ms


But when I run the osu_init installed by 'mvapich2-gdr-mcast.cuda11.0.mofed5.0.gnu7.3.0.slurm-2.3.4-1.el7.x86_64.rpm' (linked against its libmpi.so), it segfaults:

$ /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init
[dgx2-b03-07:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
Segmentation fault (core dumped)


I see the following backtrace:

(gdb) bt full
#0  0x00007f0fa87da1e7 in malloc_consolidate ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#1  0x00007f0fa87dbfb4 in _int_malloc () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#2  0x00007f0fa87dcc0a in malloc () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#3  0x00007f0fa889481e in hwloc_tma_malloc () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#4  0x00007f0fa889c7ed in hwloc__topology_init ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#5  0x00007f0fa889c9dc in hwloc_topology_init ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#6  0x00007f0fa8891ac8 in smpi_load_hwloc_topology ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#7  0x00007f0fa87d81be in mv2_get_arch_type ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#8  0x00007f0fa87d9c98 in mv2_new_get_arch_hca_type ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#9  0x00007f0fa87b4acb in rdma_get_control_parameters ()
   from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#10 0x00007f0fa878130b in MPIDI_CH3_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#11 0x00007f0fa877472d in MPID_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#12 0x00007f0fa86cf28f in MPIR_Init_thread () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#13 0x00007f0fa86cecfe in PMPI_Init () from /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/lib64/libmpi.so.12
No symbol table info available.
#14 0x00000000004009ee in main ()
No symbol table info available.
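
Frames #0-#2 resolve inside libmpi.so.12 rather than libc, which suggests the crash is in the malloc implementation MVAPICH2 bundles for its registration cache (ptmalloc), with hwloc's first allocation during topology discovery as the trigger. To separate the two, a minimal standalone probe against the system hwloc should exercise the same hwloc_topology_init/load path seen in frames #4-#5 without going through libmpi.so's allocator. This is just a sketch, assuming the hwloc development headers are installed (hwloc_probe.c is my name for it):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    /* Same entry point as frame #5 (hwloc_topology_init) above. */
    if (hwloc_topology_init(&topo) != 0) {
        fprintf(stderr, "hwloc_topology_init failed\n");
        return 1;
    }
    /* Discover the machine topology, as MPI_Init does internally. */
    if (hwloc_topology_load(topo) != 0) {
        fprintf(stderr, "hwloc_topology_load failed\n");
        hwloc_topology_destroy(topo);
        return 1;
    }
    printf("topology depth: %d\n", hwloc_topology_get_depth(topo));
    hwloc_topology_destroy(topo);
    return 0;
}

Built and run with:

$ gcc hwloc_probe.c -o hwloc_probe -lhwloc && ./hwloc_probe

If that runs cleanly, the problem is more likely the bundled allocator interacting with something on this machine than topology discovery itself.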


Following the suggestion in http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-September/005703.html for a similar stack trace, I tried setting MV2_USE_LAZY_MEM_UNREGISTER=0, but no luck.
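
For completeness, that attempt looked like the following (it still hits the same segfault):

$ MV2_USE_LAZY_MEM_UNREGISTER=0 \
      /opt/mvapich2/gdr/2.3.4/mcast/no-openacc/cuda11.0/mofed5.0/slurm/gnu7.3.0/libexec/osu-micro-benchmarks/mpi/startup/osu_init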


I have MV2_IBA_HCA defined as:
$ cat /etc/mvapich2.conf
# This defines the Infiniband HCA's for this platform
MV2_IBA_HCA=mlx5_0:mlx5_1:mlx5_2:mlx5_3:mlx5_6:mlx5_7:mlx5_8:mlx5_9

All of these IB interfaces are up and active.
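
For reference, the port state of each device can be confirmed with the standard verbs tooling, e.g. (the grep pattern is just illustrative):

$ ibv_devinfo | grep -E 'hca_id|state'

Each of the eight mlx5 devices above should report state: PORT_ACTIVE.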


Any ideas what could be the cause?


Thanks,

Ben