[mvapich-discuss] CUDA-aware MVAPICH with persistent communicators

Kate Clark mclark at nvidia.com
Thu Sep 28 20:19:35 EDT 2017


Hi DK,

Thanks for your response.  Yes, I have now touched base with the colleague in question.  That report was indeed motivated by a conversation about the same application.

For the motivation for persistent communicators (which isn’t necessarily GPU specific), there are multiple applications that use these.  In particular, in the lattice quantum chromodynamics (LQCD) community, which deals with a stencil-on-a-grid type of problem, the use of persistent communicators is prevalent.  For example, in the SciDAC-funded USQCD software program, there was an effort to coalesce upon common building blocks shared by all the main LQCD applications (you may have heard of some of these, as they are heavily used at different supercomputer sites: MILC, Chroma, CPS, QUDA (the latter being GPU specific)).  These all build on top of a common communications framework, QMP (https://github.com/usqcd-software/qmp), which is intended to abstract the communications with different backends for MPI, BlueGene SPI, etc.  For MPI, which is by far the most heavily used backend, this uses persistent MPI communicators.  When running on GPUs, we can simply use the MPI backend with CUDA-aware MPI, and everything just works.  As noted in my prior email, the unfortunate aspect of this is that persistent communicators don’t seem to be well optimized in CUDA-aware MVAPICH.

I look forward to hearing more about persistent communicator optimizations in CUDA-aware MVAPICH!

Regards,

Kate.

From: "Panda, Dhabaleswar" <panda at cse.ohio-state.edu>
Date: Saturday, September 23, 2017 at 1:54 PM
To: Kate Clark <mclark at nvidia.com>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: RE: [mvapich-discuss] CUDA-aware MVAPICH with persistent communicators

Hi,

Thanks for your note. During the last week, one of your colleagues also asked us a similar question and we had sent some feedback. We hope you have received it. We are assuming that both issues are coming from the same application. Thanks for providing details on this issue. This is the first time we are seeing a concrete use case for persistent communicators and GPU Direct RDMA; we had never seen such a use case earlier. We will investigate this further and plan to optimize it.

Thanks,

DK
________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Kate Clark [mclark at nvidia.com]
Sent: Saturday, September 23, 2017 12:42 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] CUDA-aware MVAPICH with persistent communicators
Hi,

I’ve been testing MVAPICH with CUDA-awareness enabled, for P2P within a node and for GPU Direct RDMA exchange between nodes.  What I am finding is that MVAPICH appears to be poorly optimized for persistent communicators, e.g., MPI_Send_init / MPI_Recv_init / MPI_Start / MPI_Wait, when used with two GPUs (a minimal sketch of the pattern I mean follows below).
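For reference, here is a minimal sketch of the usage pattern I am describing: the persistent requests are set up once, and the same handles are then reused on every iteration with device buffers passed directly to a CUDA-aware MPI.  This is purely illustrative (the message size, ring-neighbour choice and iteration count are arbitrary), not code from our application:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nbytes = 1 << 20;           /* illustrative message size */
    void *sendbuf, *recvbuf;
    cudaMalloc(&sendbuf, nbytes);         /* device buffers handed directly to MPI */
    cudaMalloc(&recvbuf, nbytes);

    int right = (rank + 1) % size;        /* simple ring exchange */
    int left  = (rank - 1 + size) % size;

    /* Set up the persistent requests once... */
    MPI_Request req[2];
    MPI_Send_init(sendbuf, nbytes, MPI_BYTE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(recvbuf, nbytes, MPI_BYTE, left,  0, MPI_COMM_WORLD, &req[1]);

    /* ...then reuse the same handles thousands of times. */
    for (int iter = 0; iter < 1000; iter++) {
        MPI_Startall(2, req);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}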

A couple of immediate issues worth mentioning:

-          There appears to be no optimization for persistent communicators.  For example, my application uses the same handles thousands of times over, but for every call to MPI_Start a query of the pointer location is initiated, i.e., cuPointerGetAttribute is called repeatedly.  This adds a noticeable latency for both CPU and GPU messages, e.g., around 0.7 us of API overhead.  My application is bound by CPU CUDA API latency, and removing any unneeded API calls is highly desirable.  For example, see the trace below, taken from profiling my application with nvprof; every call to cuPointerGetAttribute is coming from MVAPICH.


      API calls:   72.60%  146.888s  2.35e+08     624ns     333ns  57.367ms  cudaEventQuery
                   11.45%  23.1676s   2000002  11.583us  4.2710us  275.34ms  cudaMemcpy2DAsync
                    4.95%  10.0224s   5000011  2.0040us     419ns  32.029ms  cudaEventRecord
                    4.65%  9.40582s  14001154     671ns     211ns  28.149ms  cuPointerGetAttribute
                    4.09%  8.27296s   4000004  2.0680us     464ns  28.110ms  cudaStreamWaitEvent


-          Comparing the performance between persistent message handles and MPI_Isend / MPI_Irecv shows that persistent message handles can actually be slower than using regular message handles.  This is true both for peer-to-peer exchange within a node and when using GPU Direct RDMA between nodes.  This suggests that, under the hood, persistent message handles are not simply falling back to MPI_Isend / MPI_Irecv, and that some other overhead is being introduced.  This is surprising since, in principle, persistent message handles should have the lowest latency and the best performance of any exchange method, as all setup overheads can be amortized (a rough comparison sketch follows just below).
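To illustrate the comparison I am making, here is a rough sketch of the two timing paths.  These are illustrative helper routines only (the names and arguments are mine, and the buffers are assumed to be device pointers allocated as in the earlier sketch), not the actual benchmark code:

#include <mpi.h>

/* Average time per iteration using persistent requests:
 * set up once, then MPI_Startall / MPI_Waitall in the loop. */
double time_persistent(void *sbuf, void *rbuf, int nbytes, int right, int left, int iters)
{
    MPI_Request req[2];
    MPI_Send_init(sbuf, nbytes, MPI_BYTE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(rbuf, nbytes, MPI_BYTE, left,  0, MPI_COMM_WORLD, &req[1]);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Startall(2, req);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
    return (t1 - t0) / iters;
}

/* Average time per iteration using regular nonblocking calls:
 * fresh MPI_Isend / MPI_Irecv requests on every iteration. */
double time_nonblocking(void *sbuf, void *rbuf, int nbytes, int right, int left, int iters)
{
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Request req[2];
        MPI_Isend(sbuf, nbytes, MPI_BYTE, right, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(rbuf, nbytes, MPI_BYTE, left,  0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();
    return (t1 - t0) / iters;
}

One would expect time_persistent to be at least as low as time_nonblocking, since the argument checking and pointer-type queries could in principle be done once at init time; what I observe is the opposite.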

Are there any plans to better optimize persistent message handles in MVAPICH?

Thanks,

Kate.