[mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers

Akshay Venkatesh akvenkatesh at nvidia.com
Wed Mar 21 20:56:45 EDT 2018


Hi, Dr. Panda and Hari.


I'm using 2.3RC1, available here (http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3rc1.tar.gz), and trying to run osu_bcast with GPU buffers within a single node. I see the same error with osu_latency D D as well. CMA isn't available on the system, so I'm disabling it.


$ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0 MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/collective/osu_bcast -d cuda

# OSU MPI-CUDA Broadcast Latency Test v5.4.1
# Size       Avg Latency(us)
[hsw0:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[hsw0:mpi_rank_0][print_backtrace]   0: $MPI_HOME/lib/libmpi.so.12(print_backtrace+0x2f) [0x7efc63d9d87f]
[hsw0:mpi_rank_0][print_backtrace]   1: $MPI_HOME/lib/libmpi.so.12(error_sighandler+0x63) [0x7efc63d9d9c3]
[hsw0:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7efc61e9e4b0]
[hsw0:mpi_rank_0][print_backtrace]   3: /lib/x86_64-linux-gnu/libc.so.6(+0x14e02a) [0x7efc61fb702a]
[hsw0:mpi_rank_0][print_backtrace]   4: $MPI_HOME/lib/libmpi.so.12(mv2_smp_fast_write_contig+0x3fe) [0x7efc63d3dfce]
[hsw0:mpi_rank_0][print_backtrace]   5: $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xa5) [0x7efc63d24fa5]
[hsw0:mpi_rank_0][print_backtrace]   6: $MPI_HOME/lib/libmpi.so.12(MPID_Send+0x886) [0x7efc63d2e336]
[hsw0:mpi_rank_0][print_backtrace]   7: $MPI_HOME/lib/libmpi.so.12(MPIC_Send+0x57) [0x7efc63ccad47]
[hsw0:mpi_rank_0][print_backtrace]   8: $MPI_HOME/lib/libmpi.so.12(+0x7fe05) [0x7efc63a34e05]
[hsw0:mpi_rank_0][print_backtrace]   9: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x205) [0x7efc63a366d5]
[hsw0:mpi_rank_0][print_backtrace]  10: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b) [0x7efc63a9669b]
[hsw0:mpi_rank_0][print_backtrace]  11: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
[hsw0:mpi_rank_0][print_backtrace]  12: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x507) [0x7efc63a369d7]
[hsw0:mpi_rank_0][print_backtrace]  13: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b) [0x7efc63a9669b]
[hsw0:mpi_rank_0][print_backtrace]  14: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
[hsw0:mpi_rank_0][print_backtrace]  15: $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_impl+0x1b) [0x7efc63a3731b]
[hsw0:mpi_rank_0][print_backtrace]  16: $MPI_HOME/lib/libmpi.so.12(MPI_Bcast+0x601) [0x7efc63a37aa1]
[hsw0:mpi_rank_0][print_backtrace]  17: mpi/collective/osu_bcast() [0x401d43]
[hsw0:mpi_rank_0][print_backtrace]  18: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7efc61e89830]
[hsw0:mpi_rank_0][print_backtrace]  19: mpi/collective/osu_bcast() [0x402149]
[hsw0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
[hsw0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[hsw0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 255621) terminated with signal 11 -> abort job
[hsw0:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node hsw0 aborted: Error while reading a PMI socket (4)

I'm using CUDA 9.1 and configuring as follows:

./configure --prefix=$PWD/install --enable-cuda --with-cuda=/usr/local/cuda-9.1

Please let me know if I'm missing a parameter or if there's a way to get around the problem.

Thanks


