[mvapich-discuss] CUDA runtime issues when compiling from source

Subramoni, Hari subramoni.1 at osu.edu
Fri Sep 28 10:51:27 EDT 2018


Hello Adam,

When you say you build from source, I assume you mean that you download the MVAPICH2 source tarball from our website and configure and build it yourself - correct?
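
If so, a backtrace with debug symbols would help narrow this down. Assuming the standard MPICH-style debug options that MVAPICH2 inherits, you can reconfigure with your existing options plus the two debug flags:

./configure <your existing options> --enable-g=dbg --disable-fast

This keeps symbols and disables the aggressive optimizations, so the frames in the backtrace can be resolved to source locations with gdb or addr2line instead of showing raw offsets.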

Please note that MVAPICH2 and MVAPICH2-GDR are separate code bases. MVAPICH2-GDR has many more bug fixes and performance optimizations for GPU-enabled clusters, and I would recommend using it for your GPU/CUDA-enabled applications.
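
To confirm which library a job is actually picking up at run time, you can use the mpiname utility that ships with MVAPICH2 (the install prefix below is the one from your mail):

$ /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/bin/mpiname -a

It prints the exact version and the configure flags the library was built with, so you can tell immediately whether a run went against the GDR build or your source build.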

Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Adam Guymon
Sent: Thursday, September 27, 2018 4:10 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] CUDA runtime issues when compiling from source

Hello,

I am hitting a runtime failure when I compile MVAPICH2 from source and run the collective benchmarks with CUDA. I believe it may be related to this issue: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-April/005595.html - it was unclear to me whether that issue was ever resolved. Below are the configuration details and how I am running the test; any information you could provide would be a big help. When I run the test against the installed MV2-GDR 2.3rc1 build it works fine; it only fails when I build from source.

Configured to match the MV2-GDR 2.3rc1 configuration:

./configure --prefix=/opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5 --disable-rpath --disable-static --enable-shared --disable-rdma-cm --with-core-direct --enable-cuda --with-cuda-include=/usr/local/cuda-9.2/include --with-cuda-libpath=/usr/local/cuda-9.2/lib64/

$ MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_DEBUG_SHOW_BACKTRACE=1 mpirun -np 2 /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d cuda

# OSU MPI-CUDA Allgather Latency Test v5.4.4
# Size       Avg Latency(us)
1                      26.07
2                      21.49
4                      20.09
8                      18.70
16                     18.39
32                     17.79
64                     18.24
128                    18.19
256                    18.41
512                    18.79
1024                   19.16
2048                   19.86
4096                   21.53
8192                   25.90
16384                  33.27
32768                  52.74
65536                  92.77
131072                163.99
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   0: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(print_backtrace+0x2f) [0x7f90791cf01f]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   1: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(error_sighandler+0x63) [0x7f90791cf163]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f9078a0f4b0]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   3: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_SMP_iStartMsg+0x150) [0x7f907916df00]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   4: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_iStartMsg+0x14e) [0x7f907916e0de]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   5: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3_Rendezvous_push+0xc9) [0x7f9079172b19]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   6: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_Process_rndv+0x81) [0x7f9079172f81]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   7: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0xfb) [0x7f907916fe7b]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   8: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Waitall_impl+0x3b6) [0x7f90790d80f6]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]   9: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIC_Waitall+0xa2) [0x7f9079101f22]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  10: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_cuda_intra_MV2+0x64a) [0x7f9078ea8c2a]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  11: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_index_tuned_intra_MV2+0x1e0) [0x7f9078e71800]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  12: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_MV2+0x8b) [0x7f9078e726ab]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  13: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPIR_Allgather_impl+0x29) [0x7f9078e39619]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  14: /opt/mvapich2/gdr/2.3rc1/mcast/no-openacc/cuda9.2/mofed4.2/mpirun/gnu4.8.5/lib64/libmpi.so.12(MPI_Allgather+0x8d0) [0x7f9078e39f70]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  15: /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather() [0x401d23]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  16: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f90789fa830]
[SC-FAT-EHPC-CF-BDW50:mpi_rank_0][print_backtrace]  17: /usr/local/osumb/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather() [0x402189]
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 19095 RUNNING AT SC-FAT-EHPC-CF-BDW50
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
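
For reference, a minimal standalone program exercising the same pattern the benchmark hits (MPI_Allgather on cudaMalloc'd buffers) is sketched below; the 256 KB message size is an illustrative choice just past the point where the run above dies, and the build paths are the CUDA 9.2 locations from my configure line.

/*
 * allgather_cuda.c - minimal sketch of a device-buffer MPI_Allgather,
 * assuming a CUDA-aware MPI build (run with MV2_USE_CUDA=1).
 *
 * Build (CUDA paths as in the configure line above):
 *   mpicc allgather_cuda.c -o allgather_cuda \
 *     -I/usr/local/cuda-9.2/include -L/usr/local/cuda-9.2/lib64 -lcudart
 * Run:
 *   MV2_USE_CUDA=1 mpirun -np 2 ./allgather_cuda
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 256 KB per rank: an illustrative size just past the 131072-byte
     * point where the benchmark output above stops. */
    const size_t count = 256 * 1024;

    char *d_send = NULL, *d_recv = NULL;
    if (cudaMalloc((void **)&d_send, count) != cudaSuccess ||
        cudaMalloc((void **)&d_recv, count * size) != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaMemset(d_send, rank, count);

    /* Device pointers go straight to MPI; a CUDA-aware build must detect
     * them and stage the transfer itself. */
    MPI_Allgather(d_send, (int)count, MPI_CHAR,
                  d_recv, (int)count, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allgather of %zu bytes x %d ranks completed\n", count, size);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}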

Thanks,
Adam Guymon