[mvapich-discuss] Segfault w/ GPUDirect MPISend fired from CUDA Callback (SMP)

khaled hamidouche hamidouc at cse.ohio-state.edu
Sat Feb 14 11:37:14 EST 2015


Hi Paul,

I have two suggestions that might help in your case:

1) Check whether the CUDA-aware MPI_Pack in MVAPICH2 can satisfy your
requirements.
2) In your current design, have the main thread call MPI_Send from the GPU
buffer; the callback only notifies the main thread (a rough sketch follows).
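
Something along these lines, as a rough sketch (buffer and argument names are
placeholders, not taken from your code); the callback makes no CUDA or MPI
calls, it only signals the main thread, which then issues the send (an
MPI_Isend here, matching your use case) directly from the device buffer:

#include <cuda_runtime.h>
#include <mpi.h>
#include <pthread.h>

/* Shared notification state: the CUDA callback only flips a flag. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int pack_done = 0;

/* Runs on a CUDA runtime thread once the pack kernel finishes;
 * it must not call into the CUDA API, so it only signals. */
static void CUDART_CB pack_done_cb(cudaStream_t stream,
                                   cudaError_t status, void *data)
{
    (void)stream; (void)status; (void)data;
    pthread_mutex_lock(&lock);
    pack_done = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Called from the main thread: register the callback, wait for the
 * signal, then send directly from the device buffer (GPUDirect path). */
static void send_packed(void *d_packed, int nbytes, int dest,
                        cudaStream_t stream, MPI_Request *req)
{
    cudaStreamAddCallback(stream, pack_done_cb, NULL, 0);

    pthread_mutex_lock(&lock);
    while (!pack_done)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    MPI_Isend(d_packed, nbytes, MPI_BYTE, dest, 0, MPI_COMM_WORLD, req);
}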

Please let us know how we can be of any help.

Thanks

On Sat, Feb 14, 2015 at 11:06 AM, Paul Sathre <sath6220 at cs.vt.edu> wrote:

> Hmm, interesting, I had not noticed that caveat. Thanks for pointing it
> out! I'm curious, though: what API calls do you use?
>
> Our slower host-staged route for OpenCL and non-GPUDirect MPIs simply
> enqueues a D2H transfer after the pack kernel and registers the callback
> on the transfer instead, which sidesteps API calls inside the callback.
> We can use that for the time being, as it still preserves the fully-async
> behavior we want.
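>
> Roughly, that host-staged route looks like this (a sketch with placeholder
> names, not our exact code):
>
> #include <cuda_runtime.h>
> #include <mpi.h>
>
> typedef struct {
>     void        *h_packed;  /* pinned host staging buffer */
>     size_t       nbytes;
>     int          dest;
>     MPI_Request *req;
> } staged_send_args;
>
> /* Fired once the D2H copy completes; the send uses the *host* copy,
>  * so no CUDA API calls are needed inside the callback. */
> static void CUDART_CB staged_send_cb(cudaStream_t stream,
>                                      cudaError_t status, void *data)
> {
>     staged_send_args *a = (staged_send_args *)data;
>     (void)stream;
>     if (status == cudaSuccess)
>         MPI_Isend(a->h_packed, (int)a->nbytes, MPI_BYTE, a->dest, 0,
>                   MPI_COMM_WORLD, a->req);
> }
>
> /* Enqueue: pack kernel -> async D2H copy -> callback on the transfer. */
> static void pack_stage_and_send(void *d_packed, staged_send_args *a,
>                                 cudaStream_t s)
> {
>     /* pack_kernel<<<grid, block, 0, s>>>(...); */
>     cudaMemcpyAsync(a->h_packed, d_packed, a->nbytes,
>                     cudaMemcpyDeviceToHost, s);
>     cudaStreamAddCallback(s, staged_send_cb, a, 0);
> }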
>
> Thanks again!
>
> Sent from Note 3. Please excuse brevity and possible typos.
> On Feb 14, 2015 9:28 AM, "Jiri Kraus" <jkraus at nvidia.com> wrote:
>
>>  Hi Paul,
>>
>> If you pass a device pointer to Isend, it needs to make some calls into the
>> CUDA API, and these are not allowed from stream callbacks [1]. I am afraid
>> that the best you can do is to use the polling approach that you described.
>>
>> Jiri
>>
>> [1]
>> http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g4d2688e1c3f3cf7da4bf55121fc0b0a1
>>
>> Sent from my smartphone. Please excuse autocorrect typos.
>>
>>
>> ---- Paul Sathre wrote ----
>>
>>      Thanks for the quick reply, Khaled,
>>
>>  1) I will work on isolating a minimal test case next week.
>>  2) No. Would it be safe to have the callback function call MPI_Init_thread
>> immediately before the Isend (much later than the global MPI_Init at the
>> start of the program)? Can I similarly call MPI_finalize_thread in the
>> callback function?
>>  3) No, the main thread is actually blocking on a return from
>> cudaThreadSynchronize to force the pack kernel to finish (because CUDA is
>> blocking on the return of the segfaulting callback function.)
>>  4) The thread is created by the CUDA runtime. I am unsure whether or not
>> it has the same context, but I would lean towards thinking the CUDA runtime
>> is smart enough to ensure it does.
>>  5) I am unsure whether they are on the same socket and capable of peer
>> access; I will have to check. (I did check with MV2_CUDA_IPC=0 and =1
>> though, and both had segfaults.)
>>  6) I was under the impression that MPI_Pack was for packing custom data
>> types, whereas we are packing arbitrary regions of a multi-dimensional
>> grid, based on arrays of user-specified offset/contig_length pairs. Am I
>> incorrect, and is there a lightweight way to achieve this through custom
>> data types? Also, we seamlessly interchange between OpenCL, CUDA, and
>> OpenMP backends, so we are still stuck implementing an OpenCL pack kernel
>> for transparent execution. However, its callback chain isn't subject to
>> this segfault, since we have to host-stage the buffer before the MPI
>> transfer anyway. That is, unless you have a GPUDirect equivalent for OpenCL
>> that I'm unaware of - which we'd be *very* interested in =)
>>
>>  Thanks again!
>>
>>   -Paul Sathre
>> Research Programmer - Synergy Lab
>>  Dept. of Computer Science
>>  Virginia Tech
>>
>> On Fri, Feb 13, 2015 at 4:44 PM, khaled hamidouche <
>> hamidouc at cse.ohio-state.edu> wrote:
>>
>>> Hi Paul,
>>>
>>>  In order to help debug this issue, can you please provide us with some
>>> more information:
>>>
>>>  1) Can we have a reproducer of this issue? It will help us debug the
>>> issue faster.
>>> 2) Does your example use MPI_Init_thread? Your scenario involves multiple
>>> threads, so MPI needs to be aware of it (see the sketch after this list).
>>> 3) Are the callback thread and the main thread accessing the same buffer
>>> at the same time?
>>> 4) How is the thread created? Is the thread created with the same context
>>> as the process?
>>> 5) In your system configuration, are both GPUs on the same socket (i.e.,
>>> can IPC be used)? If so, does enabling IPC lead to the same issue
>>> (segfault at memcpy)?
>>> 6) What is the exact use case for this? In other words, why is MPI_Pack
>>> (the MVAPICH2 kernel) not sufficient?
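>>>
>>> For 2), we mean something along these lines at program startup (a minimal
>>> sketch, not taken from your code):
>>>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int provided;
>>>     /* Request full thread support instead of plain MPI_Init, since the
>>>      * MPI_Isend is issued from a CUDA runtime thread. */
>>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>     if (provided < MPI_THREAD_MULTIPLE) {
>>>         /* The library cannot support concurrent MPI calls from threads. */
>>>     }
>>>
>>>     /* ... application, CUDA callbacks, sends ... */
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }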
>>>
>>>
>>>  Thanks a lot
>>>
>>>  On Fri, Feb 13, 2015 at 2:30 PM, Paul Sathre <sath6220 at cs.vt.edu>
>>> wrote:
>>>
>>>>   Hi,
>>>>
>>>>  I am constructing a library which requires fully asynchronous "pack
>>>> and send" functionality, with a custom pack kernel and (hopefully) a
>>>> GPUDirect send. Therefore I have set up a pipeline via CUDA's callback
>>>> mechanism, such that when the custom pack kernel completes asynchronously,
>>>> the CUDA runtime automatically triggers a small function which launches an
>>>> MPI_Isend of the packed device buffer and stores the request for the user
>>>> application to test later. We are currently only testing intra-node
>>>> exchanges via SMP.
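>>>>
>>>>  A stripped-down sketch of that pipeline (placeholder names, not the
>>>> exact metamorph_mpi.c code):
>>>>
>>>> #include <cuda_runtime.h>
>>>> #include <mpi.h>
>>>>
>>>> typedef struct {
>>>>     void        *d_packed;  /* packed *device* buffer */
>>>>     int          nbytes;
>>>>     int          dest;
>>>>     MPI_Request *req;
>>>> } isend_args;
>>>>
>>>> /* Fired by the CUDA runtime on its own thread once the pack kernel
>>>>  * finishes; this is the call that segfaults below. */
>>>> static void CUDART_CB isend_cb(cudaStream_t stream, cudaError_t status,
>>>>                                void *data)
>>>> {
>>>>     isend_args *a = (isend_args *)data;
>>>>     (void)stream; (void)status;
>>>>     MPI_Isend(a->d_packed, a->nbytes, MPI_BYTE, a->dest, 0,
>>>>               MPI_COMM_WORLD, a->req);
>>>> }
>>>>
>>>> static void pack_and_isend(isend_args *a, cudaStream_t s)
>>>> {
>>>>     /* pack_kernel<<<grid, block, 0, s>>>(...); */
>>>>     cudaStreamAddCallback(s, isend_cb, a, 0);
>>>> }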
>>>>
>>>>  However, this segfaults with the following backtrace (shown for the
>>>> eager protocol; rendezvous similarly fails in __memcpy_sse2_unaligned):
>>>>
>>>> #0  __memcpy_sse2_unaligned ()
>>>>     at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:37
>>>> #1  0x00007f0abf8c13d7 in MPIDI_CH3I_SMP_writev ()
>>>>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
>>>> #2  0x00007f0abf8b6026 in MPIDI_CH3_iSendv ()
>>>>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
>>>> #3  0x00007f0abf8a4c87 in MPIDI_CH3_EagerContigIsend ()
>>>>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
>>>> #4  0x00007f0abf8ab9c1 in MPID_Isend ()
>>>>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
>>>> #5  0x00007f0abf83272d in PMPI_Isend ()
>>>>    from /home/psath/mvapich2-2.1rc1/build/install/lib/libmpi.so.12
>>>> #6  0x00007f0abfc79a1f in cuda_sap_isend_cb (stream=0x0,
>>>> status=cudaSuccess,
>>>>     data=0xb52d70) at metamorph_mpi.c:435
>>>>
>>>> I am able to transfer the same device buffer successfully from the
>>>> primary thread of the application, but when the MPI_Isend is launched from
>>>> the third thread (created by the CUDA driver to invoke the callback
>>>> function), the library seems not to recognize that it is still a device
>>>> pointer and cannot be copied with a CPU memcpy.
>>>>
>>>>  Hao Wang, who is currently at our lab, suggested explicitly disabling
>>>> IPC (and, separately, trying to *enable* SMP_IPC), which I attempted,
>>>> but it didn't help.
>>>>
>>>>  We are using MVAPICH2 2.1rc1.
>>>>  The configure line is:
>>>>
>>>> ../mvapich2-2.1rc1/configure \
>>>>     --prefix=/home/psath/mvapich2-2.1rc1/build/install \
>>>>     --enable-cuda --disable-mcast \
>>>>     --with-ib-libpath=/home/psath/libibverbs/install/lib \
>>>>     --with-ib-include=/home/psath/libibverbs/install/include \
>>>>     --with-libcuda=/usr/local/cuda-6.0/lib64 \
>>>>     --with-libcudart=/usr/local/cuda-6.0/lib64/
>>>>
>>>>  The system has two K20x GPUs running NVIDIA driver 331.67. We are using
>>>> a userspace build of libibverbs.so v1.1.8-1 from the Debian repos.
>>>>
>>>>  Have you observed a use case like this before, with similar segfaults?
>>>> Do you have any further suggestions for tests or workarounds that will
>>>> preserve the GPUDirect behavior? (Forcing the callback to stall the
>>>> transfer and place it on a helper list for the main thread to come back
>>>> around to would incur additional polling overhead that should not be
>>>> required, and it bends the async model we are trying to implement.)
>>>>
>>>>
>>>>  Thanks!
>>>>   -Paul Sathre
>>>> Research Programmer - Synergy Lab
>>>>  Dept. of Computer Science
>>>>  Virginia Tech
>>>>
>>>>
>>>>
>>>>
>>>
>>
>