[mvapich-discuss] Segmentation fault in osu_bw D D

YAMAMOTO KAZUMA(山本 和磨) kz-yamamoto at nec.com
Wed Apr 22 04:11:30 EDT 2020


Hello, 

A segmentation fault occurs when running 'osu_bw D D' with mvapich2-2.3.3.
'osu_bw' without the 'D D' arguments finished normally.

I used the osu_bw binary installed under the libexec directory.

$ ./configure --prefix=<install_path> --enable-cuda --with-cuda=<cuda_path>  --enable-multi-subnet
$ make 
$ make install

$ mpirun -np 2 -hosts gnode11,gnode06 -genv MV2_DEBUG_SHOW_BACKTRACE 1 ./osu_bw D D
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.07
2                       0.14
4                       0.29
8                       0.59
16                      1.19
32                      2.52
64                      4.77
128                    10.76
256                    22.68
512                    46.52
1024                   95.65
2048                  186.11
4096                  361.63
8192                  666.12
[gnode06:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[gnode06:mpi_rank_1][print_backtrace]   0: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7f75c47e817c]
[gnode06:mpi_rank_1][print_backtrace]   1: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(error_sighandler+0x59) [0x7f75c47e8279]
[gnode06:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5f0) [0x7f75c174a5f0]
[gnode06:mpi_rank_1][print_backtrace]   3: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Rendezvous_rget_push+0x868) [0x7f75c47c6b08]
[gnode06:mpi_rank_1][print_backtrace]   4: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3_Rendezvous_push+0x23b) [0x7f75c4783acb]
[gnode06:mpi_rank_1][print_backtrace]   5: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Process_rndv+0x7d) [0x7f75c4783f2d]
[gnode06:mpi_rank_1][print_backtrace]   6: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3I_Progress+0xec) [0x7f75c478062c]
[gnode06:mpi_rank_1][print_backtrace]   7: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIR_Waitall_impl+0x32d) [0x7f75c46e5c5d]
[gnode06:mpi_rank_1][print_backtrace]   8: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPI_Waitall+0xe3) [0x7f75c46e6293]
[gnode06:mpi_rank_1][print_backtrace]   9: /home/NEC-GRP04/nec-usr04/work/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw() [0x401fd6]
[gnode06:mpi_rank_1][print_backtrace]  10: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f75c0b70505]
[gnode06:mpi_rank_1][print_backtrace]  11: /home/NEC-GRP04/nec-usr04/work/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw() [0x402192]

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 94610 RUNNING AT gnode06
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@gnode11] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0@gnode11] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@gnode11] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec@gnode11] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@gnode11] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@gnode11] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@gnode11] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion


-System information
--HW (per node)
  Intel Xeon Gold 6126 x2
  NVIDIA Tesla V100 x4
  InfiniBand ConnectX-6 (HDR100) x4

--SW
  CentOS Linux release 7.7.1908 (3.10.0-1062.18.1.el7.x86_64)
  MLNX_OFED_LINUX-4.7-3.2.9.0 (OFED-4.7-3.2.9)
  CUDA 10.1 Update 2, Driver Version: 440.33.01

Best regards,
----
YAMAMOTO Kazuma
NEC Corporation



