[mvapich-discuss] Segfault w/ GPUDirect MPISend fired from CUDA Callback (SMP)

Khaled Hamidouche hamidouc at cse.ohio-state.edu
Fri Feb 13 16:44:18 EST 2015


Hi Paul,

To help us debug this issue, could you please provide some more
information:

1) Could you provide a reproducer for this issue? That would help us
debug it faster.
2) Does your example use MPI_Init_thread? Your scenario involves
multiple threads calling MPI, so the library needs to be made aware of
it (see the sketch after this list).
3) Are the callback thread and the main thread accessing the same
buffer at the same time?
4) How is the thread created? Is it created with the same CUDA context
as the process?
5) In your system configuration, are both GPUs on the same socket
(i.e., can IPC be used)? If so, does enabling IPC lead to the same
issue (segfault in memcpy)?
6) What is the exact use case for this; in other words, why is
MPI_Pack (the MVAPICH2 kernel) not sufficient?
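
Regarding question 2, the sketch below (illustrative only, not your
actual code) shows what we mean: request full thread support at
initialization, since MPI calls are made both from your main thread
and from the thread the CUDA runtime uses for callbacks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided = MPI_THREAD_SINGLE;

        /* Ask for MPI_THREAD_MULTIPLE so that MPI_Isend may be called
         * safely from the CUDA callback thread as well. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n",
                    provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... launch pack kernels, register callbacks, post sends ... */

        MPI_Finalize();
        return 0;
    }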


Thanks a lot

On Fri, Feb 13, 2015 at 2:30 PM, Paul Sathre <sath6220 at cs.vt.edu> wrote:

> Hi,
>
> I am building a library that requires fully asynchronous "pack and
> send" functionality, with a custom pack kernel and (hopefully) a
> GPUDirect send. I have therefore set up a pipeline via CUDA's callback
> mechanism: when the custom pack kernel completes asynchronously, the
> CUDA runtime automatically triggers a small function that launches an
> MPI_Isend of the packed device buffer and stores the request for the
> user application to test later. We are currently only testing
> intra-node exchanges via SMP.
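>
> For concreteness, the pipeline looks roughly like the sketch below
> (the names are illustrative, not our real code; the actual
> implementation lives in metamorph_mpi.c):
>
>     #include <mpi.h>
>     #include <cuda_runtime.h>
>
>     /* Context handed from the enqueueing code to the callback. */
>     typedef struct {
>         void        *packed_dev_buf; /* device buffer filled by the pack kernel */
>         int          count;
>         int          dest, tag;
>         MPI_Comm     comm;
>         MPI_Request  req;            /* stored for the app to MPI_Test later */
>     } sap_send_ctx;
>
>     /* Runs on a CUDA runtime thread once the pack kernel has completed. */
>     static void CUDART_CB sap_isend_cb(cudaStream_t stream,
>                                        cudaError_t status, void *data)
>     {
>         sap_send_ctx *ctx = (sap_send_ctx *)data;
>         if (status != cudaSuccess)
>             return;
>         MPI_Isend(ctx->packed_dev_buf, ctx->count, MPI_BYTE,
>                   ctx->dest, ctx->tag, ctx->comm, &ctx->req);
>     }
>
>     /* Enqueue side: pack kernel, then the callback on the same stream. */
>     void sap_pack_and_send(cudaStream_t stream, sap_send_ctx *ctx)
>     {
>         /* custom_pack_kernel<<<grid, block, 0, stream>>>(..., ctx->packed_dev_buf); */
>         cudaStreamAddCallback(stream, sap_isend_cb, ctx, 0);
>     }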
>
> However, this segfaults with the following backtrace (for the eager
> protocol; the rendezvous protocol fails similarly in
> __memcpy_sse2_unaligned):
>
> #0  __memcpy_sse2_unaligned ()
>     at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:37
> #1  0x00007f0abf8c13d7 in MPIDI_CH3I_SMP_writev ()
>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
> #2  0x00007f0abf8b6026 in MPIDI_CH3_iSendv ()
>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
> #3  0x00007f0abf8a4c87 in MPIDI_CH3_EagerContigIsend ()
>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
> #4  0x00007f0abf8ab9c1 in MPID_Isend ()
>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
> #5  0x00007f0abf83272d in PMPI_Isend ()
>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
> #6  0x00007f0abfc79a1f in cuda_sap_isend_cb (stream=0x0,
> status=cudaSuccess,
>     data=0xb52d70) at metamorph_mpi.c:435
>
> I am able to transfer the same device buffer successfully from the
> primary thread of the application, but when the MPI_Isend is launched
> from the third thread (spawned by the CUDA driver to invoke the
> callback function), the library does not seem to recognize that the
> buffer is still a device pointer and cannot be copied with a CPU
> memcpy.
>
> Hao Wang, who is currently at our lab, suggested explicitly disabling
> IPC (and, separately, trying to *enable* SMP_IPC). I tried both, but
> neither helped.
>
> We are using MVAPICH2 2.1rc1.
> The configure line is:
>
> ../mvapich2-2.1rc1/configure --prefix=/home/psath/mvapich2-2.1rc1/build/install
>     --enable-cuda --disable-mcast
>     --with-ib-libpath=/home/psath/libibverbs/install/lib
>     --with-ib-include=/home/psath/libibverbs/install/include
>     --with-libcuda=/usr/local/cuda-6.0/lib64
>     --with-libcudart=/usr/local/cuda-6.0/lib64/
>
> The system has two K20x GPUs running NVIDIA driver 331.67. We are
> using a userspace build of libibverbs.so v1.1.8-1 from the Debian
> repos.
>
> Have you observed a use case like this before with similar segfaults?
> Do you have any further suggestions for tests or workarounds that
> preserve the GPUDirect behavior? (Forcing the callback to stall the
> transfer and place it on a helper list for the main thread to come
> back to later would incur additional polling overhead that should not
> be necessary, and it bends the asynchronous model we are trying to
> implement.)
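>
> (For completeness, that fallback would look roughly like the
> hypothetical sketch below, re-using the sap_send_ctx struct from the
> earlier sketch: the callback only records the ready transfer, and the
> main thread has to poll the list and issue the MPI_Isend itself.)
>
>     #include <pthread.h>
>
>     #define MAX_PENDING 64
>     static sap_send_ctx    *pending[MAX_PENDING];
>     static int              n_pending = 0;
>     static pthread_mutex_t  pending_lock = PTHREAD_MUTEX_INITIALIZER;
>
>     /* Callback never calls MPI; it just parks the context on a list. */
>     static void CUDART_CB sap_defer_cb(cudaStream_t stream,
>                                        cudaError_t status, void *data)
>     {
>         if (status != cudaSuccess)
>             return;
>         pthread_mutex_lock(&pending_lock);
>         if (n_pending < MAX_PENDING)
>             pending[n_pending++] = (sap_send_ctx *)data;
>         pthread_mutex_unlock(&pending_lock);
>     }
>
>     /* The main thread must call this periodically -- the extra
>      * polling overhead and broken asynchrony we want to avoid. */
>     void sap_drain_pending(void)
>     {
>         pthread_mutex_lock(&pending_lock);
>         for (int i = 0; i < n_pending; i++) {
>             sap_send_ctx *ctx = pending[i];
>             MPI_Isend(ctx->packed_dev_buf, ctx->count, MPI_BYTE,
>                       ctx->dest, ctx->tag, ctx->comm, &ctx->req);
>         }
>         n_pending = 0;
>         pthread_mutex_unlock(&pending_lock);
>     }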
>
>
> Thanks!
> -Paul Sathre
> Research Programmer - Synergy Lab
> Dept. of Computer Science
> Virginia Tech
>