[mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault

Yussuf Ali yussuf.ali at jaea.go.jp
Wed Jan 16 23:45:18 EST 2019


Dear MVAPICH developers and users,

 

We want to measure the performance of a new V100 cluster with the OSU
benchmarks 5.5, but the benchmark fails with a segmentation fault for the
D D case (both send and receive buffers on the device).

The interesting thing is that our own application, which uses device
buffers with MVAPICH2-GDR 2.3rc1, runs fine.

 

The following tests for host and managed memory run fine:

mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw H H

mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw M M

 

The following test fails with a segfault:

 

export MV2_USE_CUDA=1

mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw D D
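
In case it helps to narrow this down, below is a minimal stand-alone
reproducer we can try on our side. It is only a sketch of what we assume
osu_bibw does for the D D case (cudaMalloc'd buffers passed to non-blocking
point-to-point calls); the buffer size, tag, and file name are our own and
not taken from the benchmark source.

/* dd_repro.c: exchange cudaMalloc'd buffers between two ranks.
 * Build with the same mpicc as the benchmark and run with
 * MV2_USE_CUDA=1 mpiexec -n 2 ./get_local_rank ./dd_repro */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const size_t nbytes = 1 << 20;            /* 1 MiB, arbitrary */
    char *d_send = NULL, *d_recv = NULL;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Device buffers; MV2_USE_CUDA=1 tells MVAPICH2-GDR to expect them. */
    if (cudaMalloc((void **)&d_send, nbytes) != cudaSuccess ||
        cudaMalloc((void **)&d_recv, nbytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed on rank %d\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaMemset(d_send, rank, nbytes);

    int peer = 1 - rank;
    MPI_Irecv(d_recv, (int)nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_send, (int)nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0) printf("device-to-device exchange completed\n");

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}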

 

We build the benchmark with the following command:

 

./configure \
    CC=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicc \
    CXX=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicxx \
    --enable-cuda \
    --with-cuda-include=/apps/cuda/9.2.88.1/include \
    --with-cuda-libpath=/apps/cuda/9.2.88.1/lib64

 

If we also set MV2_DEBUG_SHOW_BACKTRACE=1, the following output is
generated:

 

# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.5

# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# Size      Bandwidth (MB/s)

[g0040.abci.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_0][print_backtrace]   0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2aed95454c3c]
[g0040.abci.local:mpi_rank_0][print_backtrace]   1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2aed95454d39]
[g0040.abci.local:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5e0) [0x2aed94b6d5e0]
[g0040.abci.local:mpi_rank_0][print_backtrace]   3: /lib64/libc.so.6(+0x14d780) [0x2aed95e6f780]
[g0040.abci.local:mpi_rank_0][print_backtrace]   4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2aed953e9bee]
[g0040.abci.local:mpi_rank_0][print_backtrace]   5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2aed953d0237]
[g0040.abci.local:mpi_rank_0][print_backtrace]   6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2aed953d731e]
[g0040.abci.local:mpi_rank_0][print_backtrace]   7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2aed9533c75e]
[g0040.abci.local:mpi_rank_0][print_backtrace]   8: ./mpi/pt2pt/osu_bibw() [0x401ed6]
[g0040.abci.local:mpi_rank_0][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aed95d43c05]
[g0040.abci.local:mpi_rank_0][print_backtrace]  10: ./mpi/pt2pt/osu_bibw() [0x40224b]
[g0040.abci.local:mpi_rank_1][print_backtrace]   0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2ad7816fcc3c]
[g0040.abci.local:mpi_rank_1][print_backtrace]   1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2ad7816fcd39]
[g0040.abci.local:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5e0) [0x2ad780e155e0]
[g0040.abci.local:mpi_rank_1][print_backtrace]   3: /lib64/libc.so.6(+0x14d780) [0x2ad782117780]
[g0040.abci.local:mpi_rank_1][print_backtrace]   4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2ad781691bee]
[g0040.abci.local:mpi_rank_1][print_backtrace]   5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2ad781678237]
[g0040.abci.local:mpi_rank_1][print_backtrace]   6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2ad78167f31e]
[g0040.abci.local:mpi_rank_1][print_backtrace]   7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2ad7815e475e]
[g0040.abci.local:mpi_rank_1][print_backtrace]   8: ./mpi/pt2pt/osu_bibw() [0x401fb3]
[g0040.abci.local:mpi_rank_1][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2ad781febc05]
[g0040.abci.local:mpi_rank_1][print_backtrace]  10: ./mpi/pt2pt/osu_bibw() [0x40224b]
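
The crash happens in the shared-memory eager path (mv2_smp_fast_write_contig
reached from MPI_Isend via MPIDI_CH3_EagerContigShortSend), so it looks to us
as if the device pointer is being handled as host memory for the short first
message. As a sanity check on our side, we can use the helper below (our own
hypothetical code, not part of the benchmark or the library) to confirm that
the buffer handed to MPI_Isend is really device memory:

/* check_ptr.c: report whether a pointer is known to CUDA.
 * Uses the CUDA 9.2 attribute field name memoryType; newer toolkits
 * renamed it to 'type'. */
#include <cuda_runtime.h>
#include <stdio.h>

void check_device_pointer(const void *buf, const char *label)
{
    struct cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, buf);

    if (err != cudaSuccess) {
        /* Plain host memory never touched by CUDA typically comes back
         * as cudaErrorInvalidValue here. */
        printf("%s: not known to CUDA (%s)\n", label, cudaGetErrorString(err));
        cudaGetLastError();   /* clear the sticky error state */
        return;
    }
    printf("%s: memoryType=%d on device %d\n",
           label, (int)attr.memoryType, attr.device);
}

If that is useful, we can call this on the send and receive buffers right
before the failing MPI_Isend in osu_bibw and report the output.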

 

Is there something missing in our setup?

 

Thank you for your help

Yussuf
