[mvapich-discuss] Seeing segfault when CMA is disabled for CUDA device buffers

Ching-Hsiang Chu chu.368 at buckeyemail.osu.edu
Thu Mar 22 21:51:43 EDT 2018


Hi, Akshay,

Good to hear. Please feel free to let us know if you encounter any further
issues.

Thanks,

On Thu, Mar 22, 2018 at 11:19 AM Akshay Venkatesh <akvenkatesh at nvidia.com>
wrote:

> Ching,
>
> Thanks for the help. I had missed the MV2_USE_CUDA=1 environment variable. I'm
> able to run the experiments successfully after that.
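>
> For reference, the working invocation is presumably just the original
> command line (quoted further below) with MV2_USE_CUDA=1 added, on top of
> the MV2_USE_EAGER_FAST_SEND=0 setting suggested earlier, e.g.:
>
> $ mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 MV2_USE_EAGER_FAST_SEND=0 \
>       MV2_SMP_USE_CMA=0 MV2_DEBUG_SHOW_BACKTRACE=1 \
>       ./get_local_rank mpi/collective/osu_bcast -d cuda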
>
>
>
> On Wed, Mar 21, 2018 at 6:47 PM -0700, "Akshay Venkatesh" <
> akvenkatesh at nvidia.com> wrote:
>
> Thanks, Hari.
>>
>>
>> The test still fails with a segfault but at a different point:
>> [hsw0:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [hsw0:mpi_rank_0][print_backtrace]   0:
>> $MPI_HOME/lib/libmpi.so.12(print_backtrace+0x2f) [0x7f2c9a66187f]
>> [hsw0:mpi_rank_0][print_backtrace]   1:
>> $MPI_HOME/lib/libmpi.so.12(error_sighandler+0x63) [0x7f2c9a6619c3]
>> [hsw0:mpi_rank_0][print_backtrace]   2:
>> /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f2c987624b0]
>> [hsw0:mpi_rank_0][print_backtrace]   3:
>> /lib/x86_64-linux-gnu/libc.so.6(+0x14e02a) [0x7f2c9887b02a]
>> [hsw0:mpi_rank_0][print_backtrace]   4:
>> $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3I_SMP_write_contig+0x360)
>> [0x7f2c9a608540]
>> [hsw0:mpi_rank_0][print_backtrace]   5:
>> $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_ContigSend+0x2e4) [0x7f2c9a60d794]
>> [hsw0:mpi_rank_0][print_backtrace]   6:
>> $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_EagerContigSend+0x55) ...
>> ------------------------------
>> From: Subramoni, Hari <subramoni.1 at osu.edu>
>> Sent: Wednesday, March 21, 2018 6:36 PM
>> To: Akshay Venkatesh; mvapich-discuss at cse.ohio-state.edu
>> Cc: Subramoni, Hari
>> Subject: RE: [mvapich-discuss] Seeing segfault when CMA is disabled
>> for CUDA device buffers
>>
>>
>> Hi, Akshay.
>>
>>
>>
>> Can you try with "MV2_USE_EAGER_FAST_SEND=0"?
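>>
>> For example, added to the launch command from your message below, that
>> would presumably look like:
>>
>> $ mpirun_rsh -np 2 -hostfile hostfile MV2_USE_EAGER_FAST_SEND=0 \
>>       MV2_SMP_USE_CMA=0 MV2_DEBUG_SHOW_BACKTRACE=1 \
>>       ./get_local_rank mpi/collective/osu_bcast -d cuda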
>>
>>
>>
>> As you know, we do not recommend using basic MVAPICH2 for GPU-based
>> testing or GPU-enabled applications. Please use MVAPICH2-GDR for this.
>>
>>
>>
>>
>>
>> Regards,
>>
>> Hari.
>>
>>
>>
>> From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Akshay
>> Venkatesh
>> Sent: Wednesday, March 21, 2018 8:57 PM
>> To: mvapich-discuss at cse.ohio-state.edu <
>> mvapich-discuss at mailman.cse.ohio-state.edu>
>> Subject: [mvapich-discuss] Seeing segfault when CMA is disabled for
>> CUDA device buffers
>>
>>
>>
>> Hi, Dr. Panda and Hari.
>>
>>
>>
>> I'm using 2.3RC1, available here (
>> http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3rc1.tar.gz),
>> and trying to run osu_bcast with GPU buffers within a single node. I see
>> the same error with osu_latency D D as well. CMA isn't available on the
>> system, so I'm disabling it.
>>
>>
>>
>> $ mpirun_rsh -np 2 -hostfile hostfile MV2_SMP_USE_CMA=0
>> MV2_DEBUG_SHOW_BACKTRACE=1 ./get_local_rank mpi/collective/osu_bcast -d cuda
>>
>>
>>
>> # OSU MPI-CUDA Broadcast Latency Test v5.4.1
>>
>> # Size       Avg Latency(us)
>>
>> [hsw0:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>>
>> [hsw0:mpi_rank_0][print_backtrace]   0:
>> $MPI_HOME/lib/libmpi.so.12(print_backtrace+0x2f) [0x7efc63d9d87f]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   1:
>> $MPI_HOME/lib/libmpi.so.12(error_sighandler+0x63) [0x7efc63d9d9c3]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   2:
>> /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7efc61e9e4b0]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   3:
>> /lib/x86_64-linux-gnu/libc.so.6(+0x14e02a) [0x7efc61fb702a]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   4:
>> $MPI_HOME/lib/libmpi.so.12(mv2_smp_fast_write_contig+0x3fe) [0x7efc63d3dfce]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   5:
>> $MPI_HOME/lib/libmpi.so.12(MPIDI_CH3_EagerContigShortSend+0xa5)
>> [0x7efc63d24fa5]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   6:
>> $MPI_HOME/lib/libmpi.so.12(MPID_Send+0x886) [0x7efc63d2e336]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   7:
>> $MPI_HOME/lib/libmpi.so.12(MPIC_Send+0x57) [0x7efc63ccad47]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   8:
>> $MPI_HOME/lib/libmpi.so.12(+0x7fe05) [0x7efc63a34e05]
>>
>> [hsw0:mpi_rank_0][print_backtrace]   9:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x205) [0x7efc63a366d5]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  10:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b)
>> [0x7efc63a9669b]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  11:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  12:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_intra+0x507) [0x7efc63a369d7]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  13:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x46b)
>> [0x7efc63a9669b]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  14:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_MV2+0xb9) [0x7efc63a94089]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  15:
>> $MPI_HOME/lib/libmpi.so.12(MPIR_Bcast_impl+0x1b) [0x7efc63a3731b]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  16:
>> $MPI_HOME/lib/libmpi.so.12(MPI_Bcast+0x601) [0x7efc63a37aa1]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  17: mpi/collective/osu_bcast()
>> [0x401d43]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  18:
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7efc61e89830]
>>
>> [hsw0:mpi_rank_0][print_backtrace]  19: mpi/collective/osu_bcast()
>> [0x402149]
>>
>> [hsw0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7.
>> MPI process died?
>>
>> [hsw0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>>
>> [hsw0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 255621)
>> terminated with signal 11 -> abort job
>>
>> [hsw0:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node hsw0
>> aborted: Error while reading a PMI socket (4)
>>
>>
>> I'm using cuda-9.1 and configuring in the following way:
>>
>>
>>
>> ./configure --prefix=$PWD/install --enable-cuda
>> --with-cuda=/usr/local/cuda-9.1
>>
>>
>>
>> Please let me know if I'm missing a parameter or if there's a way to get
>> around the problem.
>>
>>
>>
>> Thanks
>>
-- 
Ching-Hsiang Chu