[mvapich-discuss] CUDA runtime issues when compiling from source
Adam Guymon
aguymon at nvidia.com
Thu Sep 27 16:10:01 EDT 2018
Hello,
I am hitting a runtime segmentation fault when I build MVAPICH2 from source and run the OSU collective benchmark with CUDA device buffers. I believe it may be related to this earlier report: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-April/005595.html It was unclear to me whether that issue was ever resolved. Additional details on my configuration and how I am running the test are below; any information you could provide would be a big help. Note that when I run the same test against the installed MV2-GDR 2.3rc1 release it works fine. It only fails when I build from source.
I configured the build to match the MV2-GDR 2.3rc1 configuration:
./configure --prefix=/opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5 --disable-rpath --disable-static --enable-shared --disable-rdma-cm --with-core-direct --enable-cuda --with-cuda-include=/usr/local/cuda-9.2/include --with-cuda-libpath=/usr/local/cuda-9.2/lib64/
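For completeness, the full from-source build sequence looks roughly like the following (the configure flags are the ones above; the make parallelism and the assumption of a freshly unpacked source tree are illustrative, not taken from the original report):

```shell
# Sketch of the from-source build (run inside the unpacked mvapich2 source tree).
# Flags match the MV2-GDR 2.3rc1 configure line; -j8 is an arbitrary choice.
./configure --prefix=/opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5 \
    --disable-rpath --disable-static --enable-shared --disable-rdma-cm \
    --with-core-direct --enable-cuda \
    --with-cuda-include=/usr/local/cuda-9.2/include \
    --with-cuda-libpath=/usr/local/cuda-9.2/lib64/
make -j8
make install
```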
$ MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_DEBUG_SHOW_BACKTRACE=1 mpirun -np 2 /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d cuda
# OSU MPI-CUDA Allgather Latency Test v5.4.4
# Size Avg Latency(us)
1 26.07
2 21.49
4 20.09
8 18.70
16 18.39
32 17.79
64 18.24
128 18.19
256 18.41
512 18.79
1024 19.16
2048 19.86
4096 21.53
8192 25.90
16384 33.27
32768 52.74
65536 92.77
131072 163.99
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 0: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(print_backtrace+0x2f) [0x7f90791cf01f]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 1: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(error_sighandler+0x63) [0x7f90791cf163]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f9078a0f4b0]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 3: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_SMP_iStartMsg+0x150) [0x7f907916df00]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 4: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_iStartMsg+0x14e) [0x7f907916e0de]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 5: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_Rendezvous_push+0xc9) [0x7f9079172b19]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 6: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_Process_rndv+0x81) [0x7f9079172f81]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 7: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0xfb) [0x7f907916fe7b]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 8: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Waitall_impl+0x3b6) [0x7f90790d80f6]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 9: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIC_Waitall+0xa2) [0x7f9079101f22]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 10: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_cuda_intra_MV2+0x64a) [0x7f9078ea8c2a]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 11: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_index_tuned_intra_MV2+0x1e0) [0x7f9078e71800]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 12: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_MV2+0x8b) [0x7f9078e726ab]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 13: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_impl+0x29) [0x7f9078e39619]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 14: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPI_Allgather+0x8d0) [0x7f9078e39f70]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 15: /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather() [0x401d23]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 16: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f90789fa830]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace] 17: /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather() [0x402189]
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 19095 RUNNING AT SC-FAT-EHPC-CF-BDW50
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Thanks,
Adam Guymon