[mvapich-discuss] Segmentation fault in osu_bw D D
YAMAMOTO KAZUMA(山本 和磨)
kz-yamamoto at nec.com
Wed Apr 22 04:11:30 EDT 2020
Hello,
A segmentation fault occurs when executing 'osu_bw D D' with mvapich2-2.3.3.
Plain 'osu_bw' (run without the 'D D' arguments) finished normally.
I used the osu_bw binary installed under the libexec directory.
$ ./configure --prefix=<install_path> --enable-cuda --with-cuda=<cuda_path> --enable-multi-subnet
$ make
$ make install
$ mpirun -np 2 -hosts gnode11,gnode06 -genv MV2_DEBUG_SHOW_BACKTRACE 1 ./osu_bw D D
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.07
2 0.14
4 0.29
8 0.59
16 1.19
32 2.52
64 4.77
128 10.76
256 22.68
512 46.52
1024 95.65
2048 186.11
4096 361.63
8192 666.12
[gnode06:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[gnode06:mpi_rank_1][print_backtrace] 0: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7f75c47e817c]
[gnode06:mpi_rank_1][print_backtrace] 1: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(error_sighandler+0x59) [0x7f75c47e8279]
[gnode06:mpi_rank_1][print_backtrace] 2: /lib64/libpthread.so.0(+0xf5f0) [0x7f75c174a5f0]
[gnode06:mpi_rank_1][print_backtrace] 3: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Rendezvous_rget_push+0x868) [0x7f75c47c6b08]
[gnode06:mpi_rank_1][print_backtrace] 4: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3_Rendezvous_push+0x23b) [0x7f75c4783acb]
[gnode06:mpi_rank_1][print_backtrace] 5: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Process_rndv+0x7d) [0x7f75c4783f2d]
[gnode06:mpi_rank_1][print_backtrace] 6: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIDI_CH3I_Progress+0xec) [0x7f75c478062c]
[gnode06:mpi_rank_1][print_backtrace] 7: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPIR_Waitall_impl+0x32d) [0x7f75c46e5c5d]
[gnode06:mpi_rank_1][print_backtrace] 8: /work/1/NEC-GRP00/nec-usr04/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/lib/libmpi.so.12(MPI_Waitall+0xe3) [0x7f75c46e6293]
[gnode06:mpi_rank_1][print_backtrace] 9: /home/NEC-GRP04/nec-usr04/work/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw() [0x401fd6]
[gnode06:mpi_rank_1][print_backtrace] 10: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f75c0b70505]
[gnode06:mpi_rank_1][print_backtrace] 11: /home/NEC-GRP04/nec-usr04/work/20200422_mvapich2/install/mvapich2-2.3.3_gcc4.8.5-cuda10.1.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw() [0x402192]
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 94610 RUNNING AT gnode06
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@gnode11] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0@gnode11] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@gnode11] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec@gnode11] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@gnode11] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@gnode11] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@gnode11] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
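To pin down where the run stops, the output above can be scanned for the last bandwidth row printed before the signal handler message. The following is only a rough sketch for reading a saved log: the helper name `last_completed_size` and the shortened `sample` text are made up for illustration, and the "next size" line simply assumes that the message size doubles at each step, as it does in the table above.

```python
import re


def last_completed_size(output):
    """Return (size_bytes, bandwidth_mb_s) of the last data row, or None.

    A data row is taken to be a line holding exactly two numbers,
    e.g. "8192 666.12". Header lines starting with '#' and the
    backtrace lines do not match and are skipped.
    """
    last = None
    for line in output.splitlines():
        m = re.fullmatch(r"\s*(\d+)\s+(\d+(?:\.\d+)?)\s*", line)
        if m:
            last = (int(m.group(1)), float(m.group(2)))
    return last


# Shortened copy of the log above (hypothetical sample for illustration).
sample = """\
# Size Bandwidth (MB/s)
4096 361.63
8192 666.12
[gnode06:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
"""

size, bw = last_completed_size(sample)
print(f"last completed size: {size} bytes ({bw} MB/s)")
# Assumption: sizes double each step, so the failing step would be size * 2.
print(f"next size in the sweep: {size * 2} bytes")
```

For this log the last completed row is 8192 bytes, so the crash appears to happen while the 16384-byte step is running.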
-System information
--HW (per node)
Intel Xeon Gold 6126 x2
NVIDIA Tesla V100 x4
InfiniBand ConnectX-6 (HDR100) x4
--SW
CentOS Linux release 7.7.1908 (3.10.0-1062.18.1.el7.x86_64)
MLNX_OFED_LINUX-4.7-3.2.9.0 (OFED-4.7-3.2.9)
CUDA 10.1 Update 2, Driver Version: 440.33.01
Best regards,
----
YAMAMOTO Kazuma
NEC Corporation