[mvapich-discuss] CUDA-aware MVAPICH with persistent communicators
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Sat Sep 23 16:54:25 EDT 2017
Hi,
Thanks for your note. Last week one of your colleagues asked us a similar question, and we sent some feedback; we hope you received it. We assume both issues come from the same application. Thanks for providing details on this issue. This is the first concrete use case we have seen combining persistent communication requests and GPU Direct RDMA. We will investigate this further and plan to optimize it.
Thanks,
DK
________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Kate Clark [mclark at nvidia.com]
Sent: Saturday, September 23, 2017 12:42 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] CUDA-aware MVAPICH with persistent communicators
Hi,
I’ve been testing MVAPICH with CUDA-awareness enabled, for P2P exchange within a node and for GPU Direct RDMA exchange between nodes. What I am finding is that MVAPICH appears to be poorly optimized for persistent communication requests, e.g., MPI_Send_init / MPI_Recv_init / MPI_Start / MPI_Wait, when used with two GPUs.
A couple of immediate issues worth mentioning:
- There appears to be no optimization for persistent requests. For example, my application reuses the same handles thousands of times, yet every call to MPI_Start initiates a fresh query of the pointer location, i.e., cuPointerGetAttribute is called repeatedly. This adds noticeable latency for both CPU and GPU messages, around 0.7 us of API overhead per call. My application is bound by CPU-side CUDA API latency, so removing any unneeded API calls is highly desirable. See the trace below, taken from profiling my application with nvprof; every call to cuPointerGetAttribute is coming from MVAPICH.
            Time(%)      Time      Calls       Avg       Min        Max  Name
API calls:   72.60%  146.888s   2.35e+08     624ns     333ns   57.367ms  cudaEventQuery
             11.45%  23.1676s    2000002  11.583us  4.2710us   275.34ms  cudaMemcpy2DAsync
              4.95%  10.0224s    5000011  2.0040us     419ns   32.029ms  cudaEventRecord
              4.65%  9.40582s   14001154     671ns     211ns   28.149ms  cuPointerGetAttribute
              4.09%  8.27296s    4000004  2.0680us     464ns   28.110ms  cudaStreamWaitEvent
- Comparing persistent message handles against MPI_Isend / MPI_Irecv shows that the persistent handles can actually be slower than regular nonblocking requests. This holds both for peer-to-peer exchange and for GPU Direct RDMA. It suggests that, under the hood, persistent handles are not simply falling back to MPI_Isend / MPI_Irecv, and that some other overhead is being introduced. This is surprising: in principle, persistent handles should give the lowest latency and the best performance of any exchange mechanism, since all setup overheads can be amortized.
Are there any plans to better optimize persistent message handles in MVAPICH?
Thanks,
Kate.