[mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers

Subramoni, Hari subramoni.1 at osu.edu
Wed Mar 21 21:36:14 EDT 2018


Hi, Akshay.

Can you try with "MV2_USE_EAGER_FAST_SEND=0"?
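For example, adding it to the mpirun_rsh command from your report (keeping your other settings) would look like this:

$ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0 MV2_USE_EAGER_FAST_SEND=0 MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/collective/osu_bcast -d cuda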

As you know, we do not recommend using basic MVAPICH2 for GPU-based testing or GPU-enabled applications. Please use MVAPICH2-GDR for this.


Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Akshay Venkatesh
Sent: Wednesday, March 21, 2018 8:57 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers


Hi, Dr. Panda and Hari.



I'm using 2.3rc1, available here (http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3rc1.tar.gz), and trying to run osu_bcast with GPU buffers within a single node. I see the same error with osu_latency D D (device-to-device buffers) as well. CMA isn't available on the system, so I'm disabling it.


$ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0 MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/collective/osu_bcast -d cuda

# OSU MPI-CUDA Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
[hsw0:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[hsw0:mpi_rank_0][print_backtrace]   0: $MPI_HOME/lib/libmpi.so.12(print_backtrace+0x2f) [0x7efc63d9d87f]
[hsw0:mpi_rank_0][print_backtrace]   1: $MPI_HOME/lib/libmpi.so.12(error_sighandler+0x63) [0x7efc63d9d9c3]
[hsw0:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7efc61e9e4b0]
[hsw0:mpi_rank_0][print_backtrace]   3: /lib/x86_64-linux-gnu/libc.so.6(+0x14e02a) [0x7efc61fb702a]
[hsw0:mpi_rank_0][print_backtrace]   4: $MPI_HOME/lib/libmpi.so.12(mv2_smp_fast_write_contig+0x3fe) [0x7efc63d3dfce]
[hsw0:mpi_rank_0][print_backtrace]   5: $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xa5) [0x7efc63d24fa5]
[hsw0:mpi_rank_0][print_backtrace]   6: $MPI_HOME/lib/libmpi.so.12(MPID_Send+0x886) [0x7efc63d2e336]
[hsw0:mpi_rank_0][print_backtrace]   7: $MPI_HOME/lib/libmpi.so.12(MPIC_Send+0x57) [0x7efc63ccad47]
[hsw0:mpi_rank_0][print_backtrace]   8: $MPI_HOME/lib/libmpi.so.12(+0x7fe05) [0x7efc63a34e05]
[hsw0:mpi_rank_0][print_backtrace]   9: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x205) [0x7efc63a366d5]
[hsw0:mpi_rank_0][print_backtrace]  10: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b) [0x7efc63a9669b]
[hsw0:mpi_rank_0][print_backtrace]  11: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
[hsw0:mpi_rank_0][print_backtrace]  12: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x507) [0x7efc63a369d7]
[hsw0:mpi_rank_0][print_backtrace]  13: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b) [0x7efc63a9669b]
[hsw0:mpi_rank_0][print_backtrace]  14: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
[hsw0:mpi_rank_0][print_backtrace]  15: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_impl+0x1b) [0x7efc63a3731b]
[hsw0:mpi_rank_0][print_backtrace]  16: $MPI_HOME/lib/libmpi.so.12(MPI_Bcast+0x601) [0x7efc63a37aa1]
[hsw0:mpi_rank_0][print_backtrace]  17: mpi/collective/osu_bcast() [0x401d43]
[hsw0:mpi_rank_0][print_backtrace]  18: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7efc61e89830]
[hsw0:mpi_rank_0][print_backtrace]  19: mpi/collective/osu_bcast() [0x402149]
[hsw0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
[hsw0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[hsw0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 255621) terminated with signal 11 -> abort job
[hsw0:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node hsw0 aborted: Error while reading a PMI socket (4)
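For reference, the osu_latency D D case I mentioned is launched the same way and fails identically (the pt2pt path below assumes the standard OSU Micro-Benchmarks install layout):

$ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0 MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/pt2pt/osu_latency D D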

I'm using CUDA 9.1 and configuring as follows:

./configure --prefix=$PWD/install --enable-cuda --with-cuda=/usr/local/cuda-9.1
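If it is useful, the configure options of the resulting build can be double-checked with the mpiname utility installed under the same prefix (assuming it is present there); its output should list --enable-cuda:

$ $PWD/install/bin/mpiname -a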

Please let me know if I'm missing a parameter or if there's a way to get around the problem.

Thanks