[Mvapich-discuss] osu_latency improvement opportunity

Mon Jul 26 12:46:30 EDT 2021

Hi, Adam/Todd.

Please accept my apologies for not getting back to you earlier. This one slipped through the cracks.

Thanks for your continued interest in and support of the OSU microbenchmarks. We appreciate it.

Thanks for the report and the proof of concept. We will discuss this internally and get back to you.

Best,
Hari.

-----Original Message-----
From: Goldman, Adam <adam.goldman at intel.com> 
Sent: Wednesday, July 7, 2021 12:57 PM
To: mvapich-discuss at lists.osu.edu; Subramoni, Hari <subramoni.1 at osu.edu>
Cc: Rimmer, Todd <todd.rimmer at intel.com>
Subject: osu_latency improvement opportunity

Thank you for looking into the recent issue we encountered.  The improved fix you provided is working well.

During our use of osu_latency, we encountered a couple of interesting improvement opportunities:

  - While many of the OSU pt2pt microbenchmarks are uni-directional (such as osu_bw), osu_latency is bi-directional due to its "ping-pong" approach.  For homogeneous tests, such as CPU to CPU and GPU to GPU, this works well.  However, for GPU to CPU or CPU to GPU runs, such as "osu_latency D H", one node uses the GPU and the other uses the CPU for both sending and receiving, so "D H" and "H D" are essentially the same test.  When tuning GPU data movement algorithms, we have found it useful to measure GPU send latency separately from GPU recv latency.  The attached patch changes how osu_latency interprets the D H and H D options: "D H" measures a GPU buffer sending to a CPU buffer, and "H D" measures a CPU buffer sending to a GPU buffer.  For example, with "D H" both ranks allocate a GPU sbuf and a CPU rbuf.  In the patch, the change is wrapped in #if 1 so that we could easily revert to the prior behavior.

  - There are a number of other MPI data movement options, such as synchronous send (MPI_Ssend), which are not covered in the latency benchmark.  The attached patch includes a quick change that lets MPI_Ssend be used instead of MPI_Send by adjusting comments in the source.
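
To make the first item concrete, here is a minimal sketch of the two interpretations of the buffer-location letters.  It is not taken from the attached diff or from the OSU source: the names buf_loc, rank_bufs, place_per_rank and place_per_direction are hypothetical, and the "prior" rule is our reading of the existing behavior.  It only models where each rank's sbuf and rbuf would live; no MPI or GPU calls are made.

```c
#include <assert.h>

/* Hypothetical names for illustration; not identifiers from the OSU source. */
typedef enum { BUF_HOST, BUF_DEVICE } buf_loc;

typedef struct {
    buf_loc sbuf; /* location of this rank's send buffer */
    buf_loc rbuf; /* location of this rank's receive buffer */
} rank_bufs;

/* Map a command-line letter to a location: 'D' is device (GPU), 'H' is host (CPU). */
static buf_loc parse_loc(char c)
{
    assert(c == 'D' || c == 'H');
    return c == 'D' ? BUF_DEVICE : BUF_HOST;
}

/* Prior interpretation (assumed): the first letter places both buffers of
 * rank 0, and the second letter places both buffers of rank 1. */
static rank_bufs place_per_rank(char first, char second, int rank)
{
    assert(rank == 0 || rank == 1);
    buf_loc loc = parse_loc(rank == 0 ? first : second);
    rank_bufs b = { loc, loc };
    return b;
}

/* Patched interpretation: the first letter places every send buffer and the
 * second letter places every receive buffer, regardless of rank. */
static rank_bufs place_per_direction(char first, char second, int rank)
{
    assert(rank == 0 || rank == 1);
    rank_bufs b = { parse_loc(first), parse_loc(second) };
    return b;
}
```

With place_per_rank, "D H" gives rank 0 two GPU buffers and rank 1 two CPU buffers, so each round trip contains one GPU send and one GPU recv, and "H D" merely swaps the ranks.  With place_per_direction, "D H" gives both ranks a GPU sbuf and a CPU rbuf, so every message in the ping-pong goes from a GPU buffer to a CPU buffer.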

The attached diff contains these changes as a functional "proof of concept".  If you agree these features would be useful, it would be desirable to turn them into a more official feature of the test, perhaps by simply changing the definition of "osu_latency D H" and "osu_latency H D" as is done in this diff.  The diff was made against the 5.6.3 rev of the OSU benchmarks, but the basics are applicable to newer revs.
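
As one possible shape for a more official MPI_Ssend option, the sketch below maps an option value to a send mode at run time instead of relying on edited comments.  Everything here is an assumption made for illustration: osu_latency has no such option today, and the values "send" and "ssend" as well as the names send_mode, parse_send_mode and send_call_name are hypothetical.  The sketch covers option parsing only and makes no MPI calls.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical: neither this enum nor these option values exist in osu_latency. */
typedef enum { SEND_STANDARD, SEND_SYNCHRONOUS, SEND_INVALID } send_mode;

/* Translate an option value into a send mode; unknown values are rejected. */
static send_mode parse_send_mode(const char *arg)
{
    assert(arg != NULL);
    if (strcmp(arg, "send") == 0)
        return SEND_STANDARD;
    if (strcmp(arg, "ssend") == 0)
        return SEND_SYNCHRONOUS;
    return SEND_INVALID;
}

/* Name of the MPI call the timed loop would issue for a given mode.
 * Returns NULL for SEND_INVALID. */
static const char *send_call_name(send_mode m)
{
    switch (m) {
    case SEND_STANDARD:    return "MPI_Send";
    case SEND_SYNCHRONOUS: return "MPI_Ssend";
    default:               return NULL;
    }
}
```

Because MPI_Send and MPI_Ssend take the same argument list, the timed loop would only need to branch on the parsed mode at the send call; the rest of the ping-pong would stay unchanged.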

Thank you,

Todd Rimmer/Adam Goldman
Intel Corporation