[mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault

Subramoni, Hari subramoni.1 at osu.edu
Thu Jan 17 09:06:39 EST 2019


Hi, Yussuf.

Can you please try again after setting LD_PRELOAD to the MVAPICH2-GDR library?
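
For example, something along the following lines (the install prefix is taken from your build paths and backtrace below, so please adjust it for your site; the exact library name may also be libmpi.so.12, as shown in your backtrace):

export MV2_USE_CUDA=1
export LD_PRELOAD=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw D D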

Thx,
Hari.

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Yussuf Ali
Sent: Wednesday, January 16, 2019 11:45 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault

Dear MVAPICH developers and users,

We want to measure the performance of a new V100 cluster with the OSU benchmarks 5.5, but the benchmark fails with a segmentation fault for the D D case.
Interestingly, our application, which uses device buffers and MVAPICH2-GDR 2.3rc1, runs fine.

The following tests for host and managed memory run fine:
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw H H
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw M M

The following test fails with a segfault:

export MV2_USE_CUDA=1
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw D D

We built the benchmark with the following command:

./configure \
  CC=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicc \
  CXX=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicxx \
  --enable-cuda \
  --with-cuda-include=/apps/cuda/9.2.88.1/include \
  --with-cuda-libpath=/apps/cuda/9.2.88.1/lib64
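
If it helps, the runtime library resolution of the resulting binary can be checked with something like the following (a generic sketch using ldd; output paths will differ per system):

ldd ./mpi/pt2pt/osu_bibw | grep -i libmpi
ldd ./mpi/pt2pt/osu_bibw | grep -i cuda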

If we also set MV2_DEBUG_SHOW_BACKTRACE=1, the following output is generated:

# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.5
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[g0040.abci.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_0][print_backtrace]   0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2aed95454c3c]
[g0040.abci.local:mpi_rank_0][print_backtrace]   1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2aed95454d39]
[g0040.abci.local:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5e0) [0x2aed94b6d5e0]
[g0040.abci.local:mpi_rank_0][print_backtrace]   3: /lib64/libc.so.6(+0x14d780) [0x2aed95e6f780]
[g0040.abci.local:mpi_rank_0][print_backtrace]   4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2aed953e9bee]
[g0040.abci.local:mpi_rank_0][print_backtrace]   5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2aed953d0237]
[g0040.abci.local:mpi_rank_0][print_backtrace]   6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2aed953d731e]
[g0040.abci.local:mpi_rank_0][print_backtrace]   7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2aed9533c75e]
[g0040.abci.local:mpi_rank_0][print_backtrace]   8: ./mpi/pt2pt/osu_bibw() [0x401ed6]
[g0040.abci.local:mpi_rank_0][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aed95d43c05]
[g0040.abci.local:mpi_rank_0][print_backtrace]  10: ./mpi/pt2pt/osu_bibw() [0x40224b]
[g0040.abci.local:mpi_rank_1][print_backtrace]   0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2ad7816fcc3c]
[g0040.abci.local:mpi_rank_1][print_backtrace]   1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2ad7816fcd39]
[g0040.abci.local:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5e0) [0x2ad780e155e0]
[g0040.abci.local:mpi_rank_1][print_backtrace]   3: /lib64/libc.so.6(+0x14d780) [0x2ad782117780]
[g0040.abci.local:mpi_rank_1][print_backtrace]   4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2ad781691bee]
[g0040.abci.local:mpi_rank_1][print_backtrace]   5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2ad781678237]
[g0040.abci.local:mpi_rank_1][print_backtrace]   6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2ad78167f31e]
[g0040.abci.local:mpi_rank_1][print_backtrace]   7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2ad7815e475e]
[g0040.abci.local:mpi_rank_1][print_backtrace]   8: ./mpi/pt2pt/osu_bibw() [0x401fb3]
[g0040.abci.local:mpi_rank_1][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2ad781febc05]
[g0040.abci.local:mpi_rank_1][print_backtrace]  10: ./mpi/pt2pt/osu_bibw() [0x40224b]

Is there something missing in our setup?

Thank you for your help,
Yussuf

