[mvapich-discuss] MVAPICH and CUDA IPC

Chu, Ching-Hsiang chu.368 at buckeyemail.osu.edu
Fri Feb 8 14:39:39 EST 2019


Hi, Kate,

MVAPICH and MVAPICH2-GDR currently do not support the scenario you described. We are looking into it. In the meantime, it would be very helpful if you could answer the following questions.

  1.  Why is the application using CUDA IPC to transfer data between GPUs instead of just using MPI point-to-point? Is it because IPC is used in some CUDA kernels for computation?
  2.  Are you aware of any CUDA runtime/driver APIs that can detect whether an IPC memory handle has already been opened? Or do you think the application has a way to pass such information to the MPI runtime?
  3.  Is it possible for you to provide a simple reproducer?

Thanks,

________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Kate Clark <mclark at nvidia.com>
Sent: Thursday, February 7, 2019 6:29 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] MVAPICH and CUDA IPC


Hi MVAPICH developers,



I’m seeing occasional lockups (public MVAPICH) or segmentation faults (MVAPICH-GDR) when using MVAPICH for CUDA IPC message exchange within a node, in the case where the send/recv buffers have already been registered for CUDA IPC before the call to MPI, i.e., their memory handles have already been exchanged between the source and destination processes. The issue doesn’t seem to arise if the buffers are not registered prior to MPI.



As per the CUDA documentation, a given memory handle can only be opened once per context per device:



https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g01050a29fefde385b1042081ada4cde9



  *   cudaIpcMemHandles from each device in a given process may only be opened by one context per device per other process.



For CUDA IPC, I was wondering whether MVAPICH checks if a given buffer has already had its memory handle opened, and reuses it, as opposed to potentially failing? If not, could this situation be made more robust? For example, by checking whether a given memory handle has already been opened and, if so, reusing it. Similarly, if a handle is marked as having been opened by the calling application, the closing of the memory handle could be deferred to the calling application as well.
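The cache-and-defer scheme described above could be sketched as a small per-process table keyed by the raw handle bytes. The following is a minimal illustration only, not MVAPICH's actual implementation: plain C++ with a caller-supplied `open_handle`/`close_handle` standing in where real code would call `cudaIpcOpenMemHandle`/`cudaIpcCloseMemHandle`, and all names (`IpcHandleCache`, `app_owned`, etc.) are hypothetical:

```cpp
#include <array>
#include <cstdint>
#include <map>

// Stand-in for cudaIpcMemHandle_t (an opaque 64-byte blob in the real API).
using IpcHandle = std::array<std::uint8_t, 64>;

struct CacheEntry {
    void* mapped_ptr;  // address the open call returned for this handle
    bool  app_owned;   // true if the application opened the handle first;
                       // if so, the MPI runtime must not close it
};

class IpcHandleCache {
public:
    // Return the already-mapped pointer if this handle was opened before;
    // otherwise open it exactly once and remember the mapping. This avoids
    // the "may only be opened by one context per device" failure mode.
    void* open_or_reuse(const IpcHandle& h, bool app_owned,
                        void* (*open_handle)(const IpcHandle&)) {
        auto it = cache_.find(h);
        if (it != cache_.end())
            return it->second.mapped_ptr;        // reuse; never open twice
        void* p = open_handle(h);                // real code: cudaIpcOpenMemHandle
        cache_[h] = CacheEntry{p, app_owned};
        return p;
    }

    // Close only runtime-owned mappings; defer app-owned ones to the app.
    void release(const IpcHandle& h, void (*close_handle)(void*)) {
        auto it = cache_.find(h);
        if (it == cache_.end()) return;
        if (!it->second.app_owned)
            close_handle(it->second.mapped_ptr); // real code: cudaIpcCloseMemHandle
        cache_.erase(it);
    }

private:
    std::map<IpcHandle, CacheEntry> cache_;      // keyed by raw handle bytes
};
```

The map lookup makes the second "open" of the same handle a no-op that returns the cached pointer, and the `app_owned` flag implements the deferred-close behavior suggested above.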



Thanks for your continued development of the MVAPICH library ☺



Kate.

________________________________
This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
________________________________

