[mvapich-discuss] CUDA-aware MVAPICH with persistent communicators

Kate Clark mclark at nvidia.com
Sat Sep 23 12:42:33 EDT 2017


Hi,

I’ve been testing MVAPICH with CUDA-awareness enabled, both for P2P exchange within a node and for GPU Direct RDMA exchange between nodes.  What I am finding is that MVAPICH appears to be poorly optimized for persistent communication requests (MPI_Send_init / MPI_Recv_init / MPI_Start / MPI_Wait) when used with two GPUs.
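
For reference, the pattern I’m describing looks roughly like the sketch below.  This is a minimal illustration assuming a CUDA-aware MPI build, not my actual application; the message size, iteration count and ring-style peer choice are just placeholders.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;            /* illustrative message size */
    const int peer  = (rank + 1) % size;  /* simple ring exchange */

    /* device buffers handed directly to the CUDA-aware MPI library */
    float *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, count * sizeof(float));
    cudaMalloc((void **)&recvbuf, count * sizeof(float));

    /* the persistent handles are created once ... */
    MPI_Request req[2];
    MPI_Send_init(sendbuf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(recvbuf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req[1]);

    /* ... and restarted thousands of times, which is where any per-start
       pointer query shows up as pure overhead */
    for (int iter = 0; iter < 10000; iter++) {
        MPI_Startall(2, req);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}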

A couple of immediate issues worth mentioning:

- There appears to be no optimization for persistent communication requests.  For example, my application reuses the same handles thousands of times, yet on every call to MPI_Start the pointer location is queried again, i.e., cuPointerGetAttribute is called repeatedly (a sketch of the kind of query involved, and where its result could be cached, follows this list).  This adds a noticeable latency of around 0.7 us of API overhead for both CPU and GPU messages.  My application is bound by CPU-side CUDA API latency, so removing any unneeded API calls is highly desirable.  See the trace below, taken from profiling my application with nvprof: every call to cuPointerGetAttribute comes from MVAPICH.


                  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   72.60%  146.888s  2.35e+08     624ns     333ns  57.367ms  cudaEventQuery
                   11.45%  23.1676s   2000002  11.583us  4.2710us  275.34ms  cudaMemcpy2DAsync
                    4.95%  10.0224s   5000011  2.0040us     419ns  32.029ms  cudaEventRecord
                    4.65%  9.40582s  14001154     671ns     211ns  28.149ms  cuPointerGetAttribute
                    4.09%  8.27296s   4000004  2.0680us     464ns  28.110ms  cudaStreamWaitEvent



- Comparing persistent message handles against MPI_Isend / MPI_Irecv shows that the persistent handles can actually be slower than the regular non-blocking calls.  This is true both for peer-to-peer exchange within a node and for GPU Direct RDMA between nodes.  It suggests that, under the hood, persistent handles are not simply falling back to MPI_Isend / MPI_Irecv but are incurring some additional overhead.  This is surprising: in principle, persistent handles should give the lowest latency and the best performance of any exchange method, since all setup overheads can be amortized across the thousands of restarts.
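
To make the first point concrete, the per-start query is of the following kind.  This is only an illustration, not MVAPICH source: the helper and struct names are made up, and only the cuPointerGetAttribute call with CU_POINTER_ATTRIBUTE_MEMORY_TYPE is the real driver API.  The point is that a persistent handle fixes its buffer at *_init time, so this result could be computed once and cached in the request rather than re-queried on every MPI_Start.

#include <cuda.h>
#include <stdint.h>

/* Returns 1 if ptr refers to device memory, 0 otherwise.  An error return
   (e.g. for an ordinary, unregistered host pointer) is treated as host
   memory. */
static int buffer_is_on_device(const void *ptr)
{
    CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
    CUresult err = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)ptr);
    return (err == CUDA_SUCCESS) && (mem_type == CU_MEMORYTYPE_DEVICE);
}

/* Hypothetical cache attached to a persistent request: the buffer cannot
   change between restarts, so the device/host decision could be made once
   at MPI_Send_init / MPI_Recv_init time ... */
struct persistent_req_info {
    const void *buf;
    int         buf_is_device;
};

/* ... and MPI_Start would then consult buf_is_device instead of calling
   cuPointerGetAttribute again, removing the ~0.7 us of per-restart API
   overhead seen in the trace above. */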

Are there any plans to better optimize persistent message handles in MVAPICH?

Thanks,

Kate.
