[mvapich-discuss] MVAPICH and CUDA IPC

Kate Clark mclark at nvidia.com
Wed Feb 13 20:11:40 EST 2019


Hi Ching-Hsiang,

Thank you for your response, and sorry for taking a couple of days to get back to you.  Let me first answer your questions:

1.) Yes, I am using direct load/store as well as using MPI for exchange of data, which is why this use case comes up.

2.) When using the driver API, if you try to open an IPC memory handle that is already open, you will receive the error “CUDA_ERROR_ALREADY_MAPPED”.  See, for example, this code in UCX: https://github.com/openucx/ucx/blob/master/src/uct/cuda/cuda_ipc/cuda_ipc_cache.c#L58.  So this condition can be detected at runtime and reported with a sensible error message.
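
For reference, a rough sketch of that driver-API check (not MVAPICH or UCX code; the helper name is mine):

#include <cuda.h>
#include <stdio.h>

/* Try to map a remote IPC memory handle into this process.  If the handle
 * has already been opened in this context (e.g. by the application itself),
 * the driver returns CUDA_ERROR_ALREADY_MAPPED, so the condition can be
 * detected at runtime and an existing mapping reused instead of failing. */
static CUresult open_ipc_handle(CUipcMemHandle handle, CUdeviceptr *dptr)
{
    CUresult err = cuIpcOpenMemHandle(dptr, handle,
                                      CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    if (err == CUDA_ERROR_ALREADY_MAPPED) {
        /* look up the previously opened mapping (e.g. in a handle cache)
         * rather than treating this as a fatal error */
        fprintf(stderr, "IPC handle already mapped; reusing existing mapping\n");
    }
    return err;
}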

3.) Sure, I can do this if needed.  I imagine a reproducer would be fairly easy to put together.

Anyway, I have been doing more testing since my first email, and I now think I am encountering two separate issues rather than a single one.  I am seeing occasional hangs in MVAPICH 2.3.  I initially thought these were related to the handle issue; however, I have found that the hangs persist even if I disable CUDA IPC communication in my app (i.e., I do not open memory handles in my app) and let MVAPICH be the only user of CUDA IPC.  Moreover, the hangs do not happen with 2.2.

What I’m specifically seeing is that when I run 3 processes on my workstation, each mapped to a different GPU, and do point-to-point communication between all GPUs, one GPU completes all of its communication while the other two get stuck waiting on each other to finish.  Looking at the profiler output, I can see that all processes have started their respective send and recv communication (MPI_Start is called for all persistent requests), but MPI_Test never reports completion for one pair of messages.  This does not happen deterministically, and it never happens with 2.2.  Attaching a debugger to the processes confirms what the visual profiler indicates.  I can provide stack traces, or any other information you need, if that helps.
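
To make the communication pattern concrete, here is a simplified sketch of the kind of exchange I am doing (not the actual QUDA code; the counts, tags, and polling loop are placeholders):

#include <mpi.h>

/* Simplified sketch: persistent send/recv requests on device buffers,
 * started with MPI_Startall and polled with MPI_Testall.  In the hang,
 * all ranks start their requests but one pair never tests complete. */
void exchange(void *d_sendbuf, void *d_recvbuf, int count, int peer)
{
    MPI_Request req[2];
    MPI_Send_init(d_sendbuf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(d_recvbuf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);

    MPI_Startall(2, req);

    int done = 0;
    while (!done) {
        MPI_Testall(2, req, &done, MPI_STATUSES_IGNORE);
        /* ... overlap with other work ... */
    }

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
}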

So, in summary, the two issues I think I am seeing are:

  1.  MVAPICH2-GDR leads to runtime errors when I open IPC memory handles in my app (QUDA) and also let MVAPICH do point-to-point communication using the same base pointers (see the sketch after this list).  Note that this issue is also present in UCX, and I have reported it to those developers as well: https://github.com/openucx/ucx/issues/3192
  2.  MVAPICH2-2.3 occasionally hangs when I use CUDA IPC communication with persistent communication requests, running on 3 processes / 3 GPUs.  MVAPICH2-2.2 does not hang.  I have not tested whether this issue arises in the GDR variant of MVAPICH.
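
To illustrate issue 1, a rough sketch of the usage pattern (not the actual QUDA code; sizes, tags, and ranks are placeholders) would be:

#include <mpi.h>
#include <cuda_runtime.h>

/* Rough sketch of the issue-1 pattern: exchange and open IPC handles in the
 * application for direct load/store, then later pass the same device
 * pointers to MPI point-to-point calls. */
void setup_and_exchange(int peer)
{
    void *d_buf, *d_recv, *d_peer_buf;
    size_t bytes = 1 << 20;
    cudaMalloc(&d_buf, bytes);
    cudaMalloc(&d_recv, bytes);

    /* application-side IPC: exchange handles and open the peer's buffer
     * for direct load/store from CUDA kernels */
    cudaIpcMemHandle_t mine, theirs;
    cudaIpcGetMemHandle(&mine, d_buf);
    MPI_Sendrecv(&mine, (int)sizeof(mine), MPI_BYTE, peer, 0,
                 &theirs, (int)sizeof(theirs), MPI_BYTE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaIpcOpenMemHandle(&d_peer_buf, theirs, cudaIpcMemLazyEnablePeerAccess);

    /* ... CUDA kernels access d_peer_buf directly ... */

    /* the same device buffers are then also used for MPI point-to-point,
     * so the MPI library may try to open handles the application already
     * holds open, which is where the runtime errors appear */
    MPI_Sendrecv(d_buf, (int)bytes, MPI_BYTE, peer, 1,
                 d_recv, (int)bytes, MPI_BYTE, peer, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}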

Let me know what information you need to help make progress here.

Regards,

Kate.

From: "Chu, Ching-Hsiang" <chu.368 at buckeyemail.osu.edu>
Date: Friday, February 8, 2019 at 11:40 AM
To: Kate Clark <mclark at nvidia.com>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: MVAPICH and CUDA IPC

Hi, Kate,

MVAPICH and MVAPICH2-GDR currently do not support the scenario you described. We are looking into it. In the meantime, it would be very helpful if you could answer the following questions.
1.      Why is the application trying to use IPC to transfer data between GPUs instead of just using MPI point-to-point? Is it because IPC is used in some CUDA kernels for computation?
2.      Are you aware of any CUDA runtime/driver APIs that can detect whether an IPC memory handle has already been opened? Or do you think applications have a way to pass such information to the MPI runtime?
3.      Is it possible for you to provide a simple reproducer?
Thanks,

________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Kate Clark <mclark at nvidia.com>
Sent: Thursday, February 7, 2019 6:29 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] MVAPICH and CUDA IPC


Hi MVAPICH developers,



I’m seeing occasional lock-ups (public MVAPICH) or segmentation faults (MVAPICH-GDR) when using MVAPICH for CUDA IPC message exchange within a node, where the send/recv buffers have already been registered for CUDA IPC before the call to MPI, i.e., their memory handles have already been exchanged between the source and destination processes.  The issue doesn’t seem to arise if the buffers are not registered prior to MPI.



As per the CUDA documentation, a given memory handle can only be opened once per context per device:



https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g01050a29fefde385b1042081ada4cde9



·         cudaIpcMemHandles from each device in a given process may only be opened by one context per device per other process.



For CUDA IPC, I was wondering whether MVAPICH checks if a given buffer has already had its memory handle opened and reuses that mapping, as opposed to potentially failing.  If not, could this be made more robust?  For example, by checking whether a given memory handle has already been opened and, if so, reusing it; similarly, if a handle is marked as having been opened by the calling application, deferring the closing of the memory handle to the calling application as well.
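
As a rough sketch of what I have in mind (a hypothetical helper and data structure, not MVAPICH internals):

#include <cuda.h>
#include <string.h>

/* Hypothetical cache entry mapping a remote IPC handle to its local mapping,
 * plus a flag recording whether the application (rather than the MPI library)
 * opened the handle and should therefore be the one to close it. */
typedef struct {
    CUipcMemHandle handle;
    CUdeviceptr    base;
    int            opened_by_app;
} ipc_cache_entry;

/* Reuse an existing mapping when the handle is already cached; only call
 * cuIpcOpenMemHandle for handles seen for the first time.  If the driver
 * reports CUDA_ERROR_ALREADY_MAPPED, the application opened the handle, so
 * record that and defer cuIpcCloseMemHandle to the application (how the
 * application's base pointer is communicated back is left out here). */
CUresult get_mapping(ipc_cache_entry *entry, CUipcMemHandle handle)
{
    if (entry->base && memcmp(&entry->handle, &handle, sizeof(handle)) == 0)
        return CUDA_SUCCESS;                       /* cache hit: reuse */

    CUresult err = cuIpcOpenMemHandle(&entry->base, handle,
                                      CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    if (err == CUDA_ERROR_ALREADY_MAPPED)
        entry->opened_by_app = 1;                  /* do not close this one */

    entry->handle = handle;
    return err;
}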



Thanks for your continued development of the MVAPICH library ☺



Kate.
