[mvapich-discuss] CUDA-aware MVAPICH with persistent communicators
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Sat Sep 23 16:54:25 EDT 2017
Hi,
Thanks for your note. Last week one of your colleagues asked us a similar question, and we sent some feedback; we hope you received it. We assume both issues come from the same application. Thanks for providing details on this issue. This is the first concrete use case we have seen combining persistent communication requests and GPU Direct RDMA. We will investigate this further and plan to optimize it.
Thanks,
DK
________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Kate Clark [mclark at nvidia.com]
Sent: Saturday, September 23, 2017 12:42 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] CUDA-aware MVAPICH with persistent communicators
Hi,
I’ve been testing MVAPICH with CUDA-awareness enabled, for P2P exchange within a node and for GPU Direct RDMA exchange between nodes. What I am finding is that MVAPICH appears to be poorly optimized for persistent communication requests, e.g., MPI_Send_init / MPI_Recv_init / MPI_Start / MPI_Wait, when used with two GPUs.
A couple of immediate issues worth mentioning:
- There appears to be no optimization for persistent requests. For example, my application reuses the same handles thousands of times, yet every call to MPI_Start initiates a fresh query of the pointer location, i.e., cuPointerGetAttribute is called repeatedly. This adds noticeable latency for both CPU and GPU messages, around 0.7 us of API overhead per call. My application is bound by CPU-side CUDA API latency, so removing any unneeded API calls is highly desirable. See the trace below, taken from profiling my application with nvprof; every call to cuPointerGetAttribute is coming from MVAPICH.
            Time(%)      Time      Calls       Avg       Min        Max  Name
API calls:   72.60%  146.888s   2.35e+08     624ns     333ns   57.367ms  cudaEventQuery
             11.45%  23.1676s    2000002  11.583us  4.2710us   275.34ms  cudaMemcpy2DAsync
              4.95%  10.0224s    5000011  2.0040us     419ns   32.029ms  cudaEventRecord
              4.65%  9.40582s   14001154     671ns     211ns   28.149ms  cuPointerGetAttribute
              4.09%  8.27296s    4000004  2.0680us     464ns   28.110ms  cudaStreamWaitEvent
- Comparing persistent message handles against MPI_Isend / MPI_Irecv shows that the persistent handles can actually be slower than regular nonblocking requests. This holds both for peer-to-peer exchange and for GPU Direct RDMA. It suggests that, under the hood, persistent handles are not simply falling back to MPI_Isend / MPI_Irecv, and that some other overhead is being introduced. This is surprising: in principle, persistent handles should give the lowest latency and the best performance of any exchange mechanism, since all setup overheads can be amortized.
Are there any plans to better optimize persistent message handles in MVAPICH?
Thanks,
Kate.