[mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault

Yussuf Ali yussuf.ali at jaea.go.jp
Fri Jan 18 03:08:38 EST 2019


Dear Hari,

Thank you for your quick response! Yes, setting LD_PRELOAD solves the
problem!
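
For anyone who hits the same issue, here is a minimal sketch of the kind of
environment setup meant here (assuming the MVAPICH2-GDR install prefix from
our build; the exact library file name may differ on other systems):

export MV2_USE_CUDA=1
export LD_PRELOAD=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw D D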

Thank you for your help,
Yussuf 

-----Original Message-----
From: Subramoni, Hari [mailto:subramoni.1 at osu.edu] 
Sent: Thursday, January 17, 2019 11:07 PM
To: Yussuf Ali <yussuf.ali at jaea.go.jp>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Subramoni, Hari <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault

Hi, Yussuf.

Can you please try after setting LD_PRELOAD?

Thx,
Hari.

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Yussuf Ali
Sent: Wednesday, January 16, 2019 11:45 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault

Dear MVAPICH developers and users,

We want to measure the performance of a new V100 cluster with the OSU
Micro-Benchmarks 5.5, but the benchmark fails with a segmentation fault for
the D D case.
The interesting thing is that our own application, which uses device buffers
and MVAPICH2-GDR 2.3rc1, runs fine.

The following tests for host and managed memory run fine:

mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw H H
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw M M

The following test fails with a segfault:

export MV2_USE_CUDA=1
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw D D

We built the benchmark with the following command:

./configure \
    CC=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicc \
    CXX=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicxx \
    --enable-cuda \
    --with-cuda-include=/apps/cuda/9.2.88.1/include \
    --with-cuda-libpath=/apps/cuda/9.2.88.1/lib64

If we also set MV2_DEBUG_SHOW_BACKTRACE=1, the following output is
generated:

# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.5
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[g0040.abci.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_0][print_backtrace]   0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2aed95454c3c]
[g0040.abci.local:mpi_rank_0][print_backtrace]   1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2aed95454d39]
[g0040.abci.local:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5e0) [0x2aed94b6d5e0]
[g0040.abci.local:mpi_rank_0][print_backtrace]   3: /lib64/libc.so.6(+0x14d780) [0x2aed95e6f780]
[g0040.abci.local:mpi_rank_0][print_backtrace]   4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2aed953e9bee]
[g0040.abci.local:mpi_rank_0][print_backtrace]   5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2aed953d0237]
[g0040.abci.local:mpi_rank_0][print_backtrace]   6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2aed953d731e]
[g0040.abci.local:mpi_rank_0][print_backtrace]   7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2aed9533c75e]
[g0040.abci.local:mpi_rank_0][print_backtrace]   8: ./mpi/pt2pt/osu_bibw() [0x401ed6]
[g0040.abci.local:mpi_rank_0][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aed95d43c05]
[g0040.abci.local:mpi_rank_0][print_backtrace]  10: ./mpi/pt2pt/osu_bibw() [0x40224b]
[g0040.abci.local:mpi_rank_1][print_backtrace]   0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2ad7816fcc3c]
[g0040.abci.local:mpi_rank_1][print_backtrace]   1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2ad7816fcd39]
[g0040.abci.local:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0(+0xf5e0) [0x2ad780e155e0]
[g0040.abci.local:mpi_rank_1][print_backtrace]   3: /lib64/libc.so.6(+0x14d780) [0x2ad782117780]
[g0040.abci.local:mpi_rank_1][print_backtrace]   4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2ad781691bee]
[g0040.abci.local:mpi_rank_1][print_backtrace]   5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2ad781678237]
[g0040.abci.local:mpi_rank_1][print_backtrace]   6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2ad78167f31e]
[g0040.abci.local:mpi_rank_1][print_backtrace]   7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2ad7815e475e]
[g0040.abci.local:mpi_rank_1][print_backtrace]   8: ./mpi/pt2pt/osu_bibw() [0x401fb3]
[g0040.abci.local:mpi_rank_1][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2ad781febc05]
[g0040.abci.local:mpi_rank_1][print_backtrace]  10: ./mpi/pt2pt/osu_bibw() [0x40224b]

Is there something missing in our setup?

Thank you for your help,
Yussuf


