[mvapich-discuss] CUDA runtime issues when compiling from source

Adam Guymon aguymon at nvidia.com
Thu Sep 27 16:10:01 EDT 2018


Hello,


I am hitting a runtime segmentation fault when running the OSU collective benchmark with CUDA against an MVAPICH2 build compiled from source. I believe it may be related to this earlier thread: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-April/005595.html -- it was unclear to me whether that issue was ever resolved. The same test runs fine against the installed MV2-GDR 2.3rc1 packages; it only fails with my from-source build. Configuration details and the exact command are below. Any information you could provide would be a big help.


The source build was configured to match the MV2-GDR 2.3rc1 configuration:

./configure --prefix=/opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5 --disable-rpath --disable-static --enable-shared --disable-rdma-cm --with-core-direct --enable-cuda --with-cuda-include=/usr/local/cuda-9.2/include --with-cuda-libpath=/usr/local/cuda-9.2/lib64/
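
In case it helps narrow things down, here is how one can double-check that the benchmark actually resolves the from-source library at runtime, and dump the configure options a given build was compiled with (mpiname is MVAPICH2's build-information utility; the bin path below just assumes the prefix from the configure line above):

$ /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/bin/mpiname -a
$ ldd /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather | grep libmpi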



$ MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_DEBUG_SHOW_BACKTRACE=1 mpirun -np 2 /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d cuda

# OSU MPI-CUDA Allgather Latency Test v5.4.4
# Size       Avg Latency(us)
1                      26.07
2                      21.49
4                      20.09
8                      18.70
16                     18.39
32                     17.79
64                     18.24
128                    18.19
256                    18.41
512                    18.79
1024                   19.16
2048                   19.86
4096                   21.53
8192                   25.90
16384                  33.27
32768                  52.74
65536                  92.77
131072                163.99
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   0: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(print_backtrace+0x2f) [0x7f90791cf01f]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   1: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(error_sighandler+0x63) [0x7f90791cf163]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f9078a0f4b0]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   3: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_SMP_iStartMsg+0x150) [0x7f907916df00]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   4: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_iStartMsg+0x14e) [0x7f907916e0de]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   5: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_Rendezvous_push+0xc9) [0x7f9079172b19]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   6: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_Process_rndv+0x81) [0x7f9079172f81]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   7: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0xfb) [0x7f907916fe7b]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   8: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Waitall_impl+0x3b6) [0x7f90790d80f6]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   9: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIC_Waitall+0xa2) [0x7f9079101f22]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  10: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_cuda_intra_MV2+0x64a) [0x7f9078ea8c2a]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  11: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_index_tuned_intra_MV2+0x1e0) [0x7f9078e71800]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  12: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_MV2+0x8b) [0x7f9078e726ab]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  13: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_impl+0x29) [0x7f9078e39619]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  14: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPI_Allgather+0x8d0) [0x7f9078e39f70]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  15: /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather() [0x401d23]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  16: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f90789fa830]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  17: /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather() [0x402189]
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 19095 RUNNING AT SC-FAT-EHPC-CF-BDW50
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
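
In case a standalone reproducer is useful, below is a minimal sketch along the lines of what osu_allgather is doing at the failing step: an MPI_Allgather on CUDA device buffers. This is my own simplified code, not the benchmark source, and the 256 KB message size assumes the crash hits at the size after the last one printed (131072 bytes).

/* repro_allgather.c - minimal device-buffer MPI_Allgather sketch.
 * Build (paths assume the CUDA 9.2 install from the configure line):
 *   mpicc repro_allgather.c -o repro_allgather \
 *     -I/usr/local/cuda-9.2/include -L/usr/local/cuda-9.2/lib64 -lcudart
 * Run: MV2_USE_CUDA=1 mpirun -np 2 ./repro_allgather
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int msg = 256 * 1024;  /* assumed failing per-rank size */
    char *sendbuf, *recvbuf;
    /* Both ranks use device 0 here, which is good enough for a
     * single-node sketch. */
    if (cudaMalloc((void **)&sendbuf, (size_t)msg) != cudaSuccess ||
        cudaMalloc((void **)&recvbuf, (size_t)msg * size) != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaMemset(sendbuf, rank, (size_t)msg);

    /* The backtrace above shows the segfault under
     * MPIR_Allgather_cuda_intra_MV2, reached from this call when the
     * buffers are device pointers and MV2_USE_CUDA=1 is set. */
    MPI_Allgather(sendbuf, msg, MPI_CHAR,
                  recvbuf, msg, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allgather of %d bytes x %d ranks completed\n", msg, size);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}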

Thanks,
Adam Guymon
