[mvapich-discuss] Segfault w/ GPUDirect MPISend fired from CUDA Callback (SMP)

Jiri Kraus jkraus at nvidia.com
Sat Feb 14 09:28:09 EST 2015


Hi Paul,

If you pass a device pointer to MPI_Isend, MVAPICH2 needs to make some calls into the CUDA API, and these are not allowed from stream callbacks [1]. I am afraid that the best you can do is to use the polling approach that you described.

Jiri

[1] http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g4d2688e1c3f3cf7da4bf55121fc0b0a1
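For what it is worth, the polling approach amounts to having the callback do nothing but record the completed pack on a shared list, while the main thread polls that list and issues the MPI_Isend itself. A minimal host-side sketch of such a list (names are illustrative, not MVAPICH2 or CUDA API):

```c
#include <pthread.h>
#include <stdlib.h>
#include <stddef.h>

/* One pack-complete record, pushed by the stream callback and popped
 * by the main thread, which then issues the MPI_Isend. */
typedef struct pending {
    void           *dev_buf;  /* packed device buffer, ready to send */
    size_t          count;
    struct pending *next;
} pending_t;

typedef struct {
    pthread_mutex_t lock;
    pending_t      *head;
} pending_queue_t;

void pq_init(pending_queue_t *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = NULL;
}

/* Safe from a CUDA stream callback: no CUDA or MPI calls, just a
 * locked list push. */
void pq_push(pending_queue_t *q, void *dev_buf, size_t count)
{
    pending_t *p = malloc(sizeof *p);
    p->dev_buf = dev_buf;
    p->count   = count;
    pthread_mutex_lock(&q->lock);
    p->next = q->head;
    q->head = p;
    pthread_mutex_unlock(&q->lock);
}

/* Polled from the main thread; the caller issues MPI_Isend on the
 * returned record and frees it. Returns NULL when nothing is pending. */
pending_t *pq_pop(pending_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    pending_t *p = q->head;
    if (p)
        q->head = p->next;
    pthread_mutex_unlock(&q->lock);
    return p;
}
```

Since the pending sends are independent of each other, the LIFO order here is harmless; a FIFO with a tail pointer works just as well.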

Sent from my smartphone. Please excuse autocorrect typos.


---- Paul Sathre schrieb ----

Thanks for the quick reply, Khaled,

1) I will work on isolating a minimal test case next week.
2) No. Would it be safe to have the callback function call MPI_Init_thread immediately before the Isend (much later than the global MPI_Init at the start of the program)? Can I similarly call MPI_Finalize from the callback function?
3) No, the main thread is actually blocking on a return from cudaThreadSynchronize, waiting for the pack kernel to finish (because CUDA is blocking on the return of the segfaulting callback function).
4) The thread is created by the CUDA runtime. I am unsure whether or not it has the same context, but I would lean towards thinking the CUDA runtime is smart enough to ensure it does.
5) I am unsure whether they are on the same socket and capable of peer access; I will have to check. (I did test with both MV2_CUDA_IPC=0 and =1, though, and both segfaulted.)
6) I was under the impression MPI_Pack was for packing custom datatypes, whereas we are packing arbitrary regions of a multi-dimensional grid, based on arrays of user-specified offset/contig_length pairs. Am I incorrect, and is there a lightweight way to achieve this through custom datatypes? Also, we seamlessly interchange between OpenCL, CUDA, and OpenMP backends, so we are stuck implementing an OpenCL pack kernel for transparent execution in any case. However, the OpenCL callback chain isn't subject to this segfault, since we have to host-stage the buffer before the MPI transfer anyway... That is, unless you have a GPUDirect equivalent for OpenCL that I'm unaware of, in which case we'd be *very* interested =)
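For concreteness, the pack our custom kernel performs is semantically just a gather over those pairs; a minimal host-side reference of it (illustrative only, not our actual kernel) would be:

```c
#include <string.h>
#include <stddef.h>

/* Host-side reference for the custom pack: gather n arbitrary
 * (offset, contig_length) regions of src into a dense dst.
 * Returns the total number of bytes packed. */
size_t pack_regions(char *dst, const char *src,
                    const size_t *offsets, const size_t *lengths, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        memcpy(dst + out, src + offsets[i], lengths[i]);
        out += lengths[i];
    }
    return out;
}
```

Incidentally, this is also the shape that MPI_Type_indexed describes (block lengths plus displacements), if a datatype-based route turns out to be viable.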

Thanks again!

-Paul Sathre
Research Programmer - Synergy Lab
Dept. of Computer Science
Virginia Tech

On Fri, Feb 13, 2015 at 4:44 PM, khaled hamidouche <hamidouc at cse.ohio-state.edu> wrote:
Hi Paul,

In order to help debug this issue, can you please provide us with some more information:

1) Can we have a reproducer of this issue? That will help us debug it faster.
2) Does your example use MPI_Init_thread? Your scenario falls under the MPI_THREAD_MULTIPLE case, so MPI needs to be aware of it.
3) Are the callback thread and the main thread accessing the same buffer at the same time?
4) How is the thread created? Is it created with the same CUDA context as the process?
5) In your system configuration, are both GPUs on the same socket (i.e., can IPC be used)? If so, does enabling IPC reach the same issue (segfault in the memcpy)?
6) What is the exact use case? In other words, why is MPI_Pack (the MVAPICH2 pack kernel) not sufficient?


Thanks a lot

On Fri, Feb 13, 2015 at 2:30 PM, Paul Sathre <sath6220 at cs.vt.edu> wrote:
Hi,

I am constructing a library which requires fully asynchronous "pack and send" functionality, with a custom pack kernel and (hopefully) a GPUDirect send. Therefore I have set up a pipeline via CUDA's callback mechanism, such that when the custom pack kernel completes asynchronously, the CUDA runtime automatically triggers a small function which launches an MPI_Isend of the packed device buffer and stores the request for the user application to test later. We are currently only testing intra-node exchanges via SMP.
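The pipeline can be sketched roughly as follows (a reconstruction for discussion, not our actual code; all names are hypothetical):

```cuda
/* Sketch of the described pack-and-send pipeline (hypothetical names). */
struct exchange {
    void       *d_packed;  /* device buffer filled by the pack kernel */
    int         count;
    int         dest;
    MPI_Request req;       /* stored for the application to test later */
};

/* Invoked by the CUDA runtime, on a runtime-owned thread, once the
 * pack kernel on `stream` has completed. */
static void CUDART_CB pack_done(cudaStream_t stream, cudaError_t status,
                                void *userData)
{
    struct exchange *ex = (struct exchange *)userData;
    /* Segfaults here: with a device pointer, the MPI library must call
     * into the CUDA API, which is not permitted in a stream callback. */
    MPI_Isend(ex->d_packed, ex->count, MPI_BYTE, ex->dest, 0,
              MPI_COMM_WORLD, &ex->req);
}

/* Main thread: launch the pack kernel asynchronously, then register
 * the callback so the send fires as soon as packing finishes. */
pack_kernel<<<grid, block, 0, stream>>>(d_packed, d_grid,
                                        d_offsets, d_lengths);
cudaStreamAddCallback(stream, pack_done, ex, 0);
```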

However, this segfaults with the following backtrace (for the eager protocol; rendezvous fails similarly in __memcpy_sse2_unaligned):

#0  __memcpy_sse2_unaligned ()
    at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:37
#1  0x00007f0abf8c13d7 in MPIDI_CH3I_SMP_writev ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#2  0x00007f0abf8b6026 in MPIDI_CH3_iSendv ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#3  0x00007f0abf8a4c87 in MPIDI_CH3_EagerContigIsend ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#4  0x00007f0abf8ab9c1 in MPID_Isend ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#5  0x00007f0abf83272d in PMPI_Isend ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#6  0x00007f0abfc79a1f in cuda_sap_isend_cb (stream=0x0, status=cudaSuccess,
    data=0xb52d70) at metamorph_mpi.c:435

I am able to successfully transfer the same device buffer from the primary thread of the application, but when the MPI_Isend is launched from the third thread (created by the CUDA driver to invoke the callback function), the library seems not to understand that it is still a device pointer that cannot be copied with a CPU memcpy.

Hao Wang, who is currently at our lab, suggested explicitly disabling IPC (and separately trying to enable SMP_IPC), which I attempted, but neither helped.

We are using MVAPICH2 2.1rc1.
The configure line is:

../mvapich2-2.1rc1/configure --prefix=/home/psath/mvapich2-2.1rc1/build/install --enable-cuda --disable-mcast --with-ib-libpath=/home/psath/libibverbs/install/lib --with-ib-include=/home/psath/libibverbs/install/include --with-libcuda=/usr/local/cuda-6.0/lib64 --with-libcudart=/usr/local/cuda-6.0/lib64/

The system has 2 K20x GPUs running Nvidia driver 331.67. We are using a userspace build of libibverbs.so v1.1.8-1 from the Debian repos.

Have you observed a use case like this before, with similar segfaults? Do you have any further suggestions for tests or workarounds that preserve the GPUDirect behavior? (Forcing the callback to stall the transfer and place it on a helper list for the main thread to come back around to would incur additional polling overhead that should not be required, and bends the async model we are trying to implement.)


Thanks!
-Paul Sathre
Research Programmer - Synergy Lab
Dept. of Computer Science
Virginia Tech


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss




NVIDIA GmbH, Wuerselen, Germany, Amtsgericht Aachen, HRB 8361
Managing Director: Karen Theresa Burns


