[mvapich-discuss] Segfault w/ GPUDirect MPISend fired from CUDA Callback (SMP)

Paul Sathre sath6220 at cs.vt.edu
Fri Feb 13 14:30:43 EST 2015


Hi,

I am building a library that requires fully asynchronous "pack and
send" functionality, with a custom pack kernel and (hopefully) a GPUDirect
send. I have therefore set up a pipeline via CUDA's callback mechanism:
when the custom pack kernel completes asynchronously, the CUDA runtime
triggers a small function that launches an MPI_Isend of the packed device
buffer and stores the request for the user application to test later. We
are currently only testing intra-node exchanges via SMP.
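In outline, the pipeline looks roughly like the sketch below (heavily
simplified; the struct and function names here are placeholders, not the
actual code in metamorph_mpi.c):

/* Simplified sketch of the pack-and-send pipeline. */
#include <mpi.h>
#include <cuda_runtime.h>

struct send_ctx {
    void        *dev_buf;   /* packed buffer in device memory    */
    int          count;
    int          dest;
    int          tag;
    MPI_Request  req;       /* user application tests this later */
};

/* Runs on the CUDA driver's callback thread once the pack kernel
 * (and everything earlier on the stream) has finished. */
static void CUDART_CB isend_cb(cudaStream_t stream, cudaError_t status,
                               void *data)
{
    struct send_ctx *ctx = (struct send_ctx *)data;
    if (status != cudaSuccess) return;
    /* The segfault occurs inside this call when it runs on the
     * callback thread (see backtrace below). */
    MPI_Isend(ctx->dev_buf, ctx->count, MPI_BYTE, ctx->dest, ctx->tag,
              MPI_COMM_WORLD, &ctx->req);
}

void pack_and_send(struct send_ctx *ctx, cudaStream_t stream)
{
    /* pack_kernel<<<grid, block, 0, stream>>>(ctx->dev_buf, ...); */
    cudaStreamAddCallback(stream, isend_cb, ctx, 0);
}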

However, this segfaults with the following backtrace (this is the eager
protocol; rendezvous fails similarly inside __memcpy_sse2_unaligned):

#0  __memcpy_sse2_unaligned ()
    at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:37
#1  0x00007f0abf8c13d7 in MPIDI_CH3I_SMP_writev ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#2  0x00007f0abf8b6026 in MPIDI_CH3_iSendv ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#3  0x00007f0abf8a4c87 in MPIDI_CH3_EagerContigIsend ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#4  0x00007f0abf8ab9c1 in MPID_Isend ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#5  0x00007f0abf83272d in PMPI_Isend ()
   from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
#6  0x00007f0abfc79a1f in cuda_sap_isend_cb (stream=0x0, status=cudaSuccess,
    data=0xb52d70) at metamorph_mpi.c:435

I can successfully transfer the same device buffer from the primary
thread of the application, but when the MPI_Isend is issued from the third
thread (spawned by the CUDA driver to invoke the callback), MVAPICH2 does
not seem to recognize that the buffer is still a device pointer and cannot
be copied with a plain CPU memcpy.
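For what it is worth, a quick way to confirm that the runtime still
reports the buffer as device memory from that thread would be something
like the sketch below (debug-only, since CUDA API calls are officially
not permitted inside stream callbacks):

#include <stdio.h>
#include <cuda_runtime.h>

/* Debug-only check: does cudaPointerGetAttributes still classify the
 * buffer as device memory when queried from the callback thread? */
static void check_devptr(const void *buf)
{
    struct cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, buf);
    if (err == cudaSuccess && attr.memoryType == cudaMemoryTypeDevice)
        fprintf(stderr, "%p is device memory on GPU %d\n", buf, attr.device);
    else
        fprintf(stderr, "%p not reported as device memory (err=%d)\n",
                buf, (int)err);
}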

Hao Wang, who is currently at our lab, suggested explicitly disabling IPC
(and, separately, trying to *enable* SMP IPC). I tried both, but neither
helped.
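For reference, those runs looked roughly like the ones below (two ranks on
one node; host1 and ./test_exchange are placeholders, and MV2_CUDA_IPC /
MV2_CUDA_SMP_IPC are the parameter names as I understand them from the
user guide, so please correct me if different knobs were meant):

# CUDA support on, intra-node CUDA IPC explicitly disabled:
mpirun_rsh -np 2 host1 host1 MV2_USE_CUDA=1 MV2_CUDA_IPC=0 ./test_exchange

# ...and, separately, with the SMP IPC fast path enabled:
mpirun_rsh -np 2 host1 host1 MV2_USE_CUDA=1 MV2_CUDA_SMP_IPC=1 ./test_exchange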

We are using MVAPICH2 2.1rc1. The configure line is:

../mvapich2-2.1rc1/configure \
    --prefix=/home/psath/mvapich2-2.1rc1/build/install \
    --enable-cuda --disable-mcast \
    --with-ib-libpath=/home/psath/libibverbs/install/lib \
    --with-ib-include=/home/psath/libibverbs/install/include \
    --with-libcuda=/usr/local/cuda-6.0/lib64 \
    --with-libcudart=/usr/local/cuda-6.0/lib64/

The system has two Tesla K20x GPUs and runs NVIDIA driver 331.67. We are
using a userspace build of libibverbs.so v1.1.8-1 from the Debian repos.

Have you seen a use case like this, with similar segfaults, before? Do
you have any suggestions for further tests or workarounds that preserve
the GPUDirect behavior? (Forcing the callback to stall the transfer and
park it on a helper list for the main thread to come back around to would
incur polling overhead that should not be necessary, and bends the
asynchronous model we are trying to implement; a sketch of that fallback
is below for reference.)
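For concreteness, the fallback I would rather avoid would look roughly
like this (a minimal sketch only; ownership/freeing of the queue nodes is
omitted):

/* Fallback: the callback only queues the send, and the main thread
 * drains the queue and issues MPI_Isend itself. */
#include <pthread.h>
#include <mpi.h>
#include <cuda_runtime.h>

struct deferred_send {
    void                 *dev_buf;
    int                   count, dest, tag;
    MPI_Request          *req;      /* filled in by the main thread */
    struct deferred_send *next;
};

static struct deferred_send *pending = NULL;
static pthread_mutex_t       pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* CUDA callback thread: just enqueue, no MPI calls here. */
static void CUDART_CB defer_send_cb(cudaStream_t s, cudaError_t status,
                                    void *data)
{
    struct deferred_send *d = (struct deferred_send *)data;
    if (status != cudaSuccess) return;
    pthread_mutex_lock(&pending_lock);
    d->next = pending;
    pending = d;
    pthread_mutex_unlock(&pending_lock);
}

/* Main thread: must be polled periodically -- the overhead we object to. */
void drain_deferred_sends(void)
{
    pthread_mutex_lock(&pending_lock);
    struct deferred_send *d = pending;
    pending = NULL;
    pthread_mutex_unlock(&pending_lock);
    for (; d != NULL; d = d->next)
        MPI_Isend(d->dev_buf, d->count, MPI_BYTE, d->dest, d->tag,
                  MPI_COMM_WORLD, d->req);
}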


Thanks!
-Paul Sathre
Research Programmer - Synergy Lab
Dept. of Computer Science
Virginia Tech