[mvapich-discuss] Re: Performance of CUDA Managed Memory and Device Memory for GDR 2.3a

Yussuf Ali Yussuf.ali at jaea.go.jp
Fri Jan 12 02:08:04 EST 2018


Hi Ammar,

Thank you for your answer!

It is an x86 system with two nodes.
The output of “nvidia-smi topo -m” on both nodes is as follows:

 |Node1|
---
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  mlx5_1  CPU Affinity
GPU0     X      NV1     NV1     NV2     PIX     SOC     0-13
GPU1    NV1      X      NV2     NV1     PIX     SOC     0-13
GPU2    NV1     NV2      X      NV1     SOC     PIX     14-27
GPU3    NV2     NV1     NV1      X      SOC     PIX     14-27
mlx5_0  PIX     PIX     SOC     SOC      X      SOC
mlx5_1  SOC     SOC     PIX     PIX     SOC      X

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
---

|Node2|
---
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  mlx5_1  CPU Affinity
GPU0     X      NV1     NV1     NV2     PIX     SOC     0-13
GPU1    NV1      X      NV2     NV1     PIX     SOC     0-13
GPU2    NV1     NV2      X      NV1     SOC     PIX     14-27
GPU3    NV2     NV1     NV1      X      SOC     PIX     14-27
mlx5_0  PIX     PIX     SOC     SOC      X      SOC
mlx5_1  SOC     SOC     PIX     PIX     SOC      X

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Thank you for your help,
Yussuf 

From: Ammar Ahmad Awan
Sent: Friday, January 12, 2018 12:48 AM
To: Yussuf Ali
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Performance of CUDA Managed Memory and Device Memory for GDR 2.3a



Hi Yussuf,

Can you please share the details of your system? Is this an OpenPOWER or an x86 system? 

It will be helpful if you can share the output of 'nvidia-smi topo -m' as well. 

Regards,
Ammar

On Thu, Jan 11, 2018 at 2:18 AM, Yussuf Ali <Yussuf.ali at jaea.go.jp> wrote:
Dear MVAPICH2 developers and users,
 
I measured the intra-node performance of our GPU cluster system (4 x NVIDIA Tesla P100-SXM2-16GB, CUDA 8.0) with the OSU bi-directional bandwidth benchmark (osu_bibw) using the current MVAPICH2-GDR 2.3a release.
 
I executed the benchmark for: 
  Device Memory  <-> Device Memory 
and 
  Managed Memory <-> Managed Memory
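
The two modes differ only in how the communication buffers are allocated. As a rough illustration (a minimal sketch, not the actual OSU benchmark source), the two allocation paths look like this:
_______________________________________________
#include <cuda_runtime.h>
#include <stddef.h>

/* "Device Memory" mode: explicit device allocation,
 * resident on the GPU only. */
void *alloc_device(size_t bytes) {
    void *buf = NULL;
    cudaMalloc(&buf, bytes);
    return buf;
}

/* "Managed Memory" mode: unified allocation; the CUDA driver
 * migrates pages between host and device on demand. */
void *alloc_managed(size_t bytes) {
    void *buf = NULL;
    cudaMallocManaged(&buf, bytes, cudaMemAttachGlobal);
    return buf;
}
_______________________________________________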
 
The following environment variables were set during both benchmarks in the PBS script:
_______________________________________________
export MV2_USE_CUDA=1                              # enable CUDA-aware communication
export MV2_GPUDIRECT_GDRCOPY_LIB=./libgdrapi.so    # path to the gdrcopy library
export MV2_USE_GPUDIRECT=1                         # enable GPUDirect RDMA
export MV2_GPUDIRECT_GDRCOPY=1
export MV2_USE_GPUDIRECT_GDRCOPY=1                 # use gdrcopy for small messages
export MV2_CUDA_IPC=1                              # CUDA IPC for intra-node transfers
export MV2_CUDA_ENABLE_MANAGED=1                   # enable CUDA managed memory support
export MV2_CUDA_MANAGED_IPC=1                      # IPC for managed buffers
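
For reference, a quick way to confirm that the devices actually support managed memory before relying on MV2_CUDA_ENABLE_MANAGED (a plain CUDA runtime query, unrelated to MVAPICH2 itself):
_______________________________________________
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int dev = 0, managed = 0, concurrent = 0;
    /* Basic managed (unified) memory support. */
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
    /* Concurrent CPU/GPU access to managed memory (Pascal and later). */
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess, dev);
    printf("managed=%d concurrent=%d\n", managed, concurrent);
    return 0;
}
_______________________________________________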
 
I obtained the following results:
 
Size (bytes)     M <-> M (MB/s)     D <-> D (MB/s)
1                       3.1                1.1
2                       6.1                2.2
4                      12.3                4.4
8                      24.6                8.9
16                     49.3               17.4
32                     95.3               17.2
64                    182.0               34.0
128                   373.7               67.3
256                   663.5              130.9
512                 1,211.0              250.0
1,024               1,927.6              406.9
2,048               2,490.1              653.1
4,096               3,116.4              488.6
8,192               5,528.9              481.6
16,384              8,980.7            2,528.6
32,768              1,118.2            6,553.0
65,536              2,178.6           12,729.1
131,072             4,026.9           18,738.3
262,144             6,930.5           26,631.6
524,288            10,566.6           28,645.9
1,048,576           9,229.6           32,114.8
2,097,152           8,908.8           32,776.5
4,194,304           8,818.7           33,884.9
 
It seems that for message sizes up to 16,384 bytes, Managed Memory performs better than Device Memory.
For message sizes greater than or equal to 32,768 bytes, Device Memory achieves higher performance.
 
Is there a way to tune Managed Memory so that it matches Device Memory performance for message
sizes greater than or equal to 32,768 bytes? For convenience, we would like to use CUDA Managed Memory.
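
One application-side idea (a general CUDA 8.0 technique, not an MVAPICH2-GDR tuning parameter) would be to prefetch the managed buffers to the GPU before handing them to MPI, so that large transfers do not pay page-migration costs. A minimal sketch:
_______________________________________________
#include <cuda_runtime.h>
#include <stddef.h>

/* Hint the driver to keep a managed buffer resident on the given
 * GPU before it is used for communication. Illustrative only. */
void prefetch_to_gpu(void *managed_buf, size_t bytes, int device,
                     cudaStream_t stream) {
    cudaMemAdvise(managed_buf, bytes,
                  cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(managed_buf, bytes, device, stream);
    cudaStreamSynchronize(stream);
}
_______________________________________________

Would the large-message managed path be expected to benefit from such prefetching, or is there a runtime parameter we should set instead?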
 
Thank you for your help,
Yussuf
 

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



