[Mvapich-discuss] Question about osu_bw Benchmark Results

Shafie Khorassani, Kawthar shafiekhorassani.1 at buckeyemail.osu.edu
Fri Apr 22 13:09:12 EDT 2022


Hi Byungkwon,

For this question, it may help to reach out to NVIDIA/HPC-X or the developers of the MPI library you are using. The performance and behavior here can also depend on the underlying protocols the MPI library uses for the data transfer.



Thank you,


Kawthar Shafie Khorassani


________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+shafiekhorassani.1=buckeyemail.osu.edu at lists.osu.edu> on behalf of 최병권 via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Thursday, April 21, 2022 2:44 AM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Cc: 유준상 <js.louis.you at samsung.com>; 조상욱 <swkhan.cho at samsung.com>
Subject: [Mvapich-discuss] Question about osu_bw Benchmark Results


Dear all,



We are conducting a performance benchmark using osu_bw.

We want to see how much performance can be delivered when leveraging RDMA and GDR (NVIDIA GPUDirect RDMA).



We could not understand some of the benchmark results by ourselves and hope someone on this mailing list can help us with that.

Any advice is welcome, and we believe it will be very helpful.



Our environment:

  - Two NVIDIA DGX A100 machines are used.

  - They are connected over a 400Gbps InfiniBand fabric (each machine has two 200Gbps HDR InfiniBand HCAs).

  - We use osu_bw included in the NVIDIA HPC-X package, which provides precompiled Open MPI and UCX with CUDA support. (ref: Link<https://docs.nvidia.com/networking/display/GPUDirectRDMAv17/Benchmark+Tests#BenchmarkTests-RunningGPUDirectRDMAwithOpenMPI>)

  - We run four osu_bw instances in total using the mpirun command (roughly along the lines of the sketch below).
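
For reference, each instance is launched roughly as follows. The host names, HCA/GPU selections, and paths in this sketch are placeholders rather than our exact command:

    # Illustrative launch of one osu_bw pair from the HPC-X package.
    # Host names (dgx1/dgx2), device selections, and the osu_bw path are placeholders.
    mpirun -np 2 -H dgx1,dgx2 --map-by node \
        -x CUDA_VISIBLE_DEVICES=0 \
        -x UCX_NET_DEVICES=mlx5_0:1 \
        ./osu_bw -d cuda D D   # 'D D' = Device-to-Device; the H/D arguments select host or GPU buffers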





Result:

[Figure: measured osu_bw bandwidth for the Device-to-Device, Device-to-Host, Host-to-Device, and Host-to-Host cases, each with and without Device/CPU Affinity (see attached image)]

 - 'Device-to-Device' is the case where both the sender and the receiver of osu_bw use GPU memory.

 - 'Device-to-Host' is the case where the sender uses GPU memory and the receiver uses host memory.

 - 'Host-to-Device' is the case where the sender uses host memory and the receiver uses GPU memory.

 - 'Host-to-Host' is the case where both use host memory.

 - 'w/ Device Affinity' refers to the affinity between the GPU and the IB HCA. When we use a GPU and an IB HCA connected to the same PCIe root complex, we call it 'w/ Device Affinity'; when the GPU and the IB HCA are not located below the same root complex, we call it 'w/o Device Affinity'. When they are under the same root complex, the GDR feature can be used for the communication and delivers better performance because the host CPU is not involved in the transfer. (The PCIe topology can be checked as sketched after this list.)

 - 'w/ CPU Affinity' refers to the NUMA affinity between the IB HCA and the CPU cores. When we run the osu_bw benchmark on CPU cores that have affinity with the IB HCA, we call it 'w/ CPU Affinity'; when they do not have affinity, we call it 'w/o CPU Affinity'.
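
For reference, the PCIe relationship between a GPU and an IB HCA, and the NUMA node of an HCA, can be checked roughly as follows (device names are placeholders, and the exact output legend varies by system):

    # Shows how each GPU connects to each mlx5_* HCA. Roughly: PIX/PXB means the path
    # stays within one PCIe root complex ('w/ Device Affinity'); NODE/SYS means it
    # crosses the PCIe host bridge or CPU interconnect ('w/o Device Affinity').
    nvidia-smi topo -m

    # NUMA node of a given HCA, used to pick CPU cores for the 'w/ CPU Affinity' runs
    # (mlx5_0 is a placeholder device name).
    cat /sys/class/infiniband/mlx5_0/device/numa_node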



Question:

 We couldn't understand the results of the 'Device-to-Host' and 'Host-to-Device' cases w/o Device Affinity.

 We initially thought that one side in both of these cases could not benefit from RDMA and GDR at all, so the performance should be much lower than in the other cases.

 However, as you can see in the figure above, the results of 'Host-to-Device' w/o Device Affinity are 318Gbps and 325Gbps respectively, which are much higher than the result of 'Device-to-Host' (76Gbps).



 Our hypothesis is that the difference comes from the operation type: read or write.

   - A write operation cannot leverage GDR and RDMA w/o Device Affinity. In the 'Device-to-Host' case, the CPU is involved in the sender-side communication and the performance drops to 76Gbps.

   - A read operation can benefit from GDR and RDMA even w/o Device Affinity. In the 'Host-to-Device' case, the receiver is able to retrieve the data into GPU memory without the help of the CPU, so the performance drop is negligible.
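
If it helps to clarify our question, one way we could probe this hypothesis is to rerun the 'Host-to-Device' w/o Device Affinity case with GPUDirect RDMA explicitly disabled in UCX and see whether the result drops toward the 'Device-to-Host' number. The variable below is taken from the UCX documentation; whether it behaves the same way in our HPC-X build is an assumption on our part:

    # Sketch: 'Host-to-Device' rerun with GPUDirect RDMA disabled in UCX.
    # If the ~320Gbps result drops toward 76Gbps, the gap is likely GDR-related.
    # Host names, devices, and the osu_bw path are placeholders.
    mpirun -np 2 -H dgx1,dgx2 --map-by node \
        -x UCX_IB_GPU_DIRECT_RDMA=no \
        ./osu_bw -d cuda H D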



Could you give us any comments about our hypothesis?

Thank you so much for reading this long email.



Best regards,

Yours,



Byungkwon Choi




