[mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers

Akshay Venkatesh akvenkatesh at nvidia.com
Wed Mar 21 21:47:06 EDT 2018


Thanks, Hari.


With MV2_USE_EAGER_FAST_SEND=0 set, the test still fails with a segfault, but at a different point:

[hsw0:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[hsw0:mpi_rank_0][print_backtrace]   0: $MPI_HOME/lib/libmpi.so.12(print_backtrace+0x2f) [0x7f2c9a66187f]
[hsw0:mpi_rank_0][print_backtrace]   1: $MPI_HOME/lib/libmpi.so.12(error_sighandler+0x63) [0x7f2c9a6619c3]
[hsw0:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f2c987624b0]
[hsw0:mpi_rank_0][print_backtrace]   3: /lib/x86_64-linux-gnu/libc.so.6(+0x14e02a) [0x7f2c9887b02a]
[hsw0:mpi_rank_0][print_backtrace]   4: $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3I_SMP_write_contig+0x360) [0x7f2c9a608540]
[hsw0:mpi_rank_0][print_backtrace]   5: $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_ContigSend+0x2e4) [0x7f2c9a60d794]
[hsw0:mpi_rank_0][print_backtrace]   6: $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_EagerContigSend+0x55) ...
________________________________
From: Subramoni, Hari <subramoni.1 at osu.edu>
Sent: Wednesday, March 21, 2018 6:36 PM
To: Akshay Venkatesh; mvapich-discuss at cse.ohio-state.edu
Cc: Subramoni, Hari
Subject: RE: [mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers

Hi, Akshay.

Can you try with "MV2_USE_EAGER_FAST_SEND=0"?
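
It can be passed on the mpirun_rsh command line along with your other settings, e.g. (your earlier command with the variable added):

$ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0 MV2_USE_EAGER_FAST_SEND=0 MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/collective/osu_bcast -d cuda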

As you know, we do not recommend using basic MVAPICH2 for GPU-based testing or GPU-enabled applications. Please use MVAPICH2-GDR for this.


Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Akshay Venkatesh
Sent: Wednesday, March 21, 2018 8:57 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers


Hi, Dr. Panda and Hari.



I'm using 2.3RC1, available here (http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3rc1.tar.gz), and trying to run osu_bcast with GPU buffers within a single node. I see the same error with osu_latency D D as well. CMA isn't available on the system, so I'm disabling it.
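
In case it helps to see what the benchmark exercises, the failing pattern reduces to an MPI_Bcast on a cudaMalloc'd device buffer. Here is a minimal sketch (my own reduction for illustration, not the actual OSU source; it assumes the get_local_rank script exports LOCAL_RANK for device selection):

/* Minimal sketch: broadcast from a CUDA device buffer, as
 * osu_bcast -d cuda does. Build with the mvapich2 mpicc and
 * link against -lcudart. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Pick a GPU per local rank (assumes get_local_rank exports
     * LOCAL_RANK); fall back to device 0 on a single-GPU node. */
    const char *lr = getenv("LOCAL_RANK");
    cudaSetDevice(lr ? atoi(lr) : 0);

    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 1);

    /* In the failing runs the crash happens before the first
     * (smallest) size is printed, i.e. on a 1-byte broadcast
     * going through the eager/SMP fast path. */
    MPI_Bcast(d_buf, 1, MPI_CHAR, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}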


$ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0 MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/collective/osu_bcast -d cuda

# OSU MPI-CUDA Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
[hsw0:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[hsw0:mpi_rank_0][print_backtrace]   0: $MPI_HOME/lib/libmpi.so.12(print_backtrace+0x2f) [0x7efc63d9d87f]
[hsw0:mpi_rank_0][print_backtrace]   1: $MPI_HOME/lib/libmpi.so.12(error_sighandler+0x63) [0x7efc63d9d9c3]
[hsw0:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7efc61e9e4b0]
[hsw0:mpi_rank_0][print_backtrace]   3: /lib/x86_64-linux-gnu/libc.so.6(+0x14e02a) [0x7efc61fb702a]
[hsw0:mpi_rank_0][print_backtrace]   4: $MPI_HOME/lib/libmpi.so.12(mv2_smp_fast_write_contig+0x3fe) [0x7efc63d3dfce]
[hsw0:mpi_rank_0][print_backtrace]   5: $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xa5) [0x7efc63d24fa5]
[hsw0:mpi_rank_0][print_backtrace]   6: $MPI_HOME/lib/libmpi.so.12(MPID_Send+0x886) [0x7efc63d2e336]
[hsw0:mpi_rank_0][print_backtrace]   7: $MPI_HOME/lib/libmpi.so.12(MPIC_Send+0x57) [0x7efc63ccad47]
[hsw0:mpi_rank_0][print_backtrace]   8: $MPI_HOME/lib/libmpi.so.12(+0x7fe05) [0x7efc63a34e05]
[hsw0:mpi_rank_0][print_backtrace]   9: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x205) [0x7efc63a366d5]
[hsw0:mpi_rank_0][print_backtrace]  10: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b) [0x7efc63a9669b]
[hsw0:mpi_rank_0][print_backtrace]  11: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
[hsw0:mpi_rank_0][print_backtrace]  12: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x507) [0x7efc63a369d7]
[hsw0:mpi_rank_0][print_backtrace]  13: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b) [0x7efc63a9669b]
[hsw0:mpi_rank_0][print_backtrace]  14: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
[hsw0:mpi_rank_0][print_backtrace]  15: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_impl+0x1b) [0x7efc63a3731b]
[hsw0:mpi_rank_0][print_backtrace]  16: $MPI_HOME/lib/libmpi.so.12(MPI_Bcast+0x601) [0x7efc63a37aa1]
[hsw0:mpi_rank_0][print_backtrace]  17: mpi/collective/osu_bcast() [0x401d43]
[hsw0:mpi_rank_0][print_backtrace]  18: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7efc61e89830]
[hsw0:mpi_rank_0][print_backtrace]  19: mpi/collective/osu_bcast() [0x402149]
[hsw0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
[hsw0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[hsw0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 255621) terminated with signal 11 -> abort job
[hsw0:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node hsw0 aborted: Error while reading a PMI socket (4)

I'm using cuda-9.1 and configuring in the following way:

./configure --prefix=$PWD/install --enable-cuda --with-cuda=/usr/local/cuda-9.1
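
If line-level backtraces would help, I can also rebuild with debugging enabled; assuming the MPICH-style configure options apply here, something like:

./configure --prefix=$PWD/install --enable-cuda --with-cuda=/usr/local/cuda-9.1 --enable-g=dbg --disable-fast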

Please let me know if I'm missing a parameter or if there's a way to get around the problem.

Thanks