[mvapich-discuss] Segmentation fault in MPI_Allreduce with data on GPU memory

Mayuko Ishii m-ishii at dr.jp.nec.com
Thu Oct 31 00:35:45 EDT 2019


Hi MVAPICH Team:

Using MVAPICH2-GDR 2.3.2 PGI (mvapich2-gdr-mcast.cuda10.1.mofed4.4.pgi19.1-2.3.2-1.el7.x86_64.rpm),
the program crashes with a segmentation fault in MPI_Allreduce when the data buffers reside in GPU memory.

-- Error Output 

[gnode19:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
rank  1:   0.0000000000000000E+00   0.1280000000000000E+03   0.0000000000000000E+00
[gnode22:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[gnode19:mpi_rank_0][print_backtrace]   0: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(print_backtrace+0x26) [0x2ad5e1671e66]
[gnode19:mpi_rank_0][print_backtrace]   1: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(error_sighandler+0x9f) [0x2ad5e167206f]
[gnode19:mpi_rank_0][print_backtrace]   2: /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5d0) [0x2ad5f4c5a5d0]
[gnode19:mpi_rank_0][print_backtrace]   3: /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(+0x153d69) [0x2ad5f5a23d69]
[gnode19:mpi_rank_0][print_backtrace]   4: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(MPIR_Localcopy+0x5fa) [0x2ad5e145a11a]
[gnode19:mpi_rank_0][print_backtrace]   5: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(MPIR_Allreduce_two_level_MV2+0x28d) [0x2ad5e0e6d87d]
[gnode19:mpi_rank_0][print_backtrace]   6: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(MPIR_Allreduce_index_tuned_intra_MV2+0x12d4) [0x2ad5e0e72c94]
[gnode19:mpi_rank_0][print_backtrace]   7: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(MPIR_Allreduce_MV2+0xba) [0x2ad5e0e745da]
[gnode19:mpi_rank_0][print_backtrace]   8: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(MPIR_Allreduce_impl+0x86) [0x2ad5e0db6f56]
[gnode19:mpi_rank_0][print_backtrace]   9: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpi.so.12(PMPI_Allreduce+0x1342) [0x2ad5e0db8402]
[gnode19:mpi_rank_0][print_backtrace]  10: /system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64/libmpifort.so.12(mpi_allreduce_+0xb0) [0x2ad5e09f69b0]
[gnode19:mpi_rank_0][print_backtrace]  11: ./a.out() [0x402cc8]
[gnode19:mpi_rank_0][print_backtrace]  12: ./a.out() [0x401dc6]
[gnode19:mpi_rank_0][print_backtrace]  13: /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(__libc_start_main+0xf5) [0x2ad5f58f23d5]
[gnode19:mpi_rank_0][print_backtrace]  14: ./a.out() [0x401c99]
-----

- How to reproduce

I have attached a sample program.

export CUDA_HOME=/usr/local/cuda-10.1
export CUDA_PATH=/usr/local/cuda-10.1
export MV2_PATH=/system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1
export PATH=/system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/bin:/usr/local/cuda-10.1/bin:/system/apps/pgi/19.1/linux86-64/19.1/bin:$PATH
export LD_LIBRARY_PATH=/system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64:/usr/local/cuda-10.1/lib64:/system/apps/pgi/19.1/linux86-64/19.1/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/lib64:/usr/local/cuda-10.1/lib64:$LIBRARY_PATH
export CPATH=/system/apps/mvapich2-gdr/2.3.2/cuda10.1.mofed4.4.pgi19.1/include:/usr/local/cuda-10.1/include:$CPATH
export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy/libgdrapi.so
export MV2_DEBUG_SHOW_BACKTRACE=1

mpif90 -Mcuda=cc70 ./mpi_reduction_test.CUF
mpirun_rsh -np 2 -hostfile hostfile ./a.out 0 32
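
For reference, the attached test passes device-resident buffers directly to MPI_Allreduce. The following minimal CUDA Fortran sketch shows that pattern; it is illustrative only (fixed 32-element buffers in place of the attached program's command-line arguments) and is not the attached mpi_reduction_test.cuf itself.

-- Sketch (illustrative only)
program allreduce_gpu_sketch
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, n
  real(8), device, allocatable :: d_send(:), d_recv(:)
  real(8), allocatable :: h_recv(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  n = 32
  allocate(d_send(n), d_recv(n), h_recv(n))
  d_send = real(rank + 1, 8)        ! initialize directly in device memory

  ! Device buffers passed straight to MPI_Allreduce (relies on MV2_USE_CUDA=1)
  call MPI_Allreduce(d_send, d_recv, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)

  h_recv = d_recv                   ! copy result back to host for printing
  if (rank == 0) print *, 'result(1) =', h_recv(1)

  deallocate(d_send, d_recv, h_recv)
  call MPI_Finalize(ierr)
end program allreduce_gpu_sketch
-----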

- System information
--HW (per node)
  Intel Xeon Gold 6126 x2
  NVIDIA Tesla V100 x4
  InfiniBand HDR100 x4

--SW
  CentOS Linux release 7.6.1810 (3.10.0-957.5.1.el7.x86_64)
  MLNX_OFED_LINUX-4.6-1.0.1.1 (OFED-4.6-1.0.1)
  CUDA 10.1


Thanks.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_reduction_test.cuf
Type: application/octet-stream
Size: 5394 bytes
Desc: mpi_reduction_test.cuf
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20191031/94dcfe1a/attachment.obj>
