[mvapich-discuss] OSU benchmark on V100 for the case D D fails with seg fault
Yussuf Ali
yussuf.ali at jaea.go.jp
Fri Jan 18 03:08:38 EST 2019
Dear Hari,
Thank you for your quick response! Yes, setting LD_PRELOAD solves the
problem!
Thank you for your help,
Yussuf
-----Original Message-----
From: Subramoni, Hari [mailto:subramoni.1 at osu.edu]
Sent: Thursday, January 17, 2019 11:07 PM
To: Yussuf Ali <yussuf.ali at jaea.go.jp>; mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Subramoni, Hari <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] OSU benchmark on V100 for the case D D fails
with seg fault
Hi, Yussuf.
Can you please try after setting LD_PRELOAD?
Thx,
Hari.
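[Editor's note: for later readers, the preload step can be sketched as below. This assumes the MVAPICH2-GDR install prefix from the configure line quoted later in this thread; adjust the path for your own site.]

```shell
# Preload the MVAPICH2-GDR MPI library so CUDA-aware paths are used.
# The prefix below matches the build discussed in this thread (assumption);
# replace it with your own MVAPICH2-GDR installation prefix.
MV2_GDR_HOME=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2
export LD_PRELOAD=$MV2_GDR_HOME/lib64/libmpi.so
export MV2_USE_CUDA=1
echo "$LD_PRELOAD"
```

With these variables exported, launch the D D test with the same mpiexec command as before.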
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf
Of Yussuf Ali
Sent: Wednesday, January 16, 2019 11:45 PM
To: mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] OSU benchmark on V100 for the case D D fails with
seg fault
Dear MVAPICH developers and users,
We want to measure the performance of a new V100 cluster with the OSU
benchmark 5.5, but for some reason the benchmark fails with a segmentation
fault for the D D case.
Interestingly, our application, which uses device buffers and
MVAPICH2-GDR 2.3rc1, runs fine.
The following tests for host (H) and managed (M) memory run fine:
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw H H
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw M M
The following test fails with a segfault:
export MV2_USE_CUDA=1
mpiexec -n 2 ./get_local_rank ./mpi/pt2pt/osu_bibw D D
We built the benchmark with the following command:
./configure \
    CC=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicc \
    CXX=/apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/bin/mpicxx \
    --enable-cuda \
    --with-cuda-include=/apps/cuda/9.2.88.1/include \
    --with-cuda-libpath=/apps/cuda/9.2.88.1/lib64
If we also set MV2_DEBUG_SHOW_BACKTRACE=1, the following output is
generated:
# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.5
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[g0040.abci.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[g0040.abci.local:mpi_rank_0][print_backtrace] 0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2aed95454c3c]
[g0040.abci.local:mpi_rank_0][print_backtrace] 1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2aed95454d39]
[g0040.abci.local:mpi_rank_0][print_backtrace] 2: /lib64/libpthread.so.0(+0xf5e0) [0x2aed94b6d5e0]
[g0040.abci.local:mpi_rank_0][print_backtrace] 3: /lib64/libc.so.6(+0x14d780) [0x2aed95e6f780]
[g0040.abci.local:mpi_rank_0][print_backtrace] 4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2aed953e9bee]
[g0040.abci.local:mpi_rank_0][print_backtrace] 5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2aed953d0237]
[g0040.abci.local:mpi_rank_0][print_backtrace] 6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2aed953d731e]
[g0040.abci.local:mpi_rank_0][print_backtrace] 7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2aed9533c75e]
[g0040.abci.local:mpi_rank_0][print_backtrace] 8: ./mpi/pt2pt/osu_bibw() [0x401ed6]
[g0040.abci.local:mpi_rank_0][print_backtrace] 9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aed95d43c05]
[g0040.abci.local:mpi_rank_0][print_backtrace] 10: ./mpi/pt2pt/osu_bibw() [0x40224b]
[g0040.abci.local:mpi_rank_1][print_backtrace] 0: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(print_backtrace+0x1c) [0x2ad7816fcc3c]
[g0040.abci.local:mpi_rank_1][print_backtrace] 1: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(error_sighandler+0x59) [0x2ad7816fcd39]
[g0040.abci.local:mpi_rank_1][print_backtrace] 2: /lib64/libpthread.so.0(+0xf5e0) [0x2ad780e155e0]
[g0040.abci.local:mpi_rank_1][print_backtrace] 3: /lib64/libc.so.6(+0x14d780) [0x2ad782117780]
[g0040.abci.local:mpi_rank_1][print_backtrace] 4: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(mv2_smp_fast_write_contig+0x33e) [0x2ad781691bee]
[g0040.abci.local:mpi_rank_1][print_backtrace] 5: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xb7) [0x2ad781678237]
[g0040.abci.local:mpi_rank_1][print_backtrace] 6: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPID_Isend+0x74e) [0x2ad78167f31e]
[g0040.abci.local:mpi_rank_1][print_backtrace] 7: /apps/mvapich2-gdr/2.3rc1/gcc4.8.5_cuda9.2/lib64/libmpi.so.12(MPI_Isend+0x65e) [0x2ad7815e475e]
[g0040.abci.local:mpi_rank_1][print_backtrace] 8: ./mpi/pt2pt/osu_bibw() [0x401fb3]
[g0040.abci.local:mpi_rank_1][print_backtrace] 9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2ad781febc05]
[g0040.abci.local:mpi_rank_1][print_backtrace] 10: ./mpi/pt2pt/osu_bibw() [0x40224b]
Is there something missing in our setup?
Thank you for your help,
Yussuf