[mvapich-discuss] Performance of CUDA Managed Memory and Device Memory for GDR 2.3a

Yussuf Ali Yussuf.ali at jaea.go.jp
Thu Jan 11 02:18:26 EST 2018


Dear MVAPICH2 developers and users,

I measured the intra-node performance of our GPU cluster system (4 x NVIDIA Tesla P100-SXM2-16GB, CUDA 8.0) with the OSU bi-directional bandwidth benchmark, using the current MVAPICH2-GDR 2.3a version.

I executed the benchmark for: 
  Device Memory  <-> Device Memory 
and 
  Managed Memory <-> Managed Memory

The following environment variables were set during both benchmarks in the PBS script:
_______________________________________________
export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=./libgdrapi.so
export MV2_USE_GPUDIRECT=1
export MV2_GPUDIRECT_GDRCOPY=1
export MV2_USE_GPUDIRECT_GDRCOPY=1
export MV2_CUDA_IPC=1
export MV2_CUDA_ENABLE_MANAGED=1   
export MV2_CUDA_MANAGED_IPC=1

I obtained the following results:

Size (bytes)    M<->M (MB/s)    D<->D (MB/s)
           1             3.1             1.1
           2             6.1             2.2
           4            12.3             4.4
           8            24.6             8.9
          16            49.3            17.4
          32            95.3            17.2
          64           182.0            34.0
         128           373.7            67.3
         256           663.5           130.9
         512         1,211.0           250.0
       1,024         1,927.6           406.9
       2,048         2,490.1           653.1
       4,096         3,116.4           488.6
       8,192         5,528.9           481.6
      16,384         8,980.7         2,528.6
      32,768         1,118.2         6,553.0
      65,536         2,178.6        12,729.1
     131,072         4,026.9        18,738.3
     262,144         6,930.5        26,631.6
     524,288        10,566.6        28,645.9
   1,048,576         9,229.6        32,114.8
   2,097,152         8,908.8        32,776.5
   4,194,304         8,818.7        33,884.9

It seems that for message sizes up to 16,384 bytes, Managed Memory performs better than Device Memory.
For message sizes larger than or equal to 32,768 bytes, Device Memory achieves higher performance: Managed Memory bandwidth drops from 8,980.7 at 16,384 bytes to 1,118.2 at 32,768 bytes.
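
The crossover can be checked directly from the measured numbers. The following is a minimal Python sketch with the bandwidth values copied from the results above (assumed to be in MB/s, as the OSU benchmark reports them); the function and variable names are illustrative only and not part of any MVAPICH2 or OSU tool:

```python
# Bi-directional bandwidth copied from the results table:
# message size in bytes -> (managed <-> managed, device <-> device).
results = {
    1: (3.1, 1.1),
    2: (6.1, 2.2),
    4: (12.3, 4.4),
    8: (24.6, 8.9),
    16: (49.3, 17.4),
    32: (95.3, 17.2),
    64: (182.0, 34.0),
    128: (373.7, 67.3),
    256: (663.5, 130.9),
    512: (1211.0, 250.0),
    1024: (1927.6, 406.9),
    2048: (2490.1, 653.1),
    4096: (3116.4, 488.6),
    8192: (5528.9, 481.6),
    16384: (8980.7, 2528.6),
    32768: (1118.2, 6553.0),
    65536: (2178.6, 12729.1),
    131072: (4026.9, 18738.3),
    262144: (6930.5, 26631.6),
    524288: (10566.6, 28645.9),
    1048576: (9229.6, 32114.8),
    2097152: (8908.8, 32776.5),
    4194304: (8818.7, 33884.9),
}


def crossover_size(rows):
    """Return the smallest message size where device memory beats managed memory."""
    for size in sorted(rows):
        managed, device = rows[size]
        if device > managed:
            return size
    return None


def device_to_managed_ratio(rows, size):
    """Return device bandwidth divided by managed bandwidth for one message size."""
    managed, device = rows[size]
    return device / managed


if __name__ == "__main__":
    print("crossover at", crossover_size(results), "bytes")
    for size in (32768, 1048576, 4194304):
        ratio = device_to_managed_ratio(results, size)
        print(size, "bytes: device/managed =", round(ratio, 2))
```

With these values the crossover is at 32,768 bytes, and at 4,194,304 bytes Device Memory is roughly 3.8 times faster than Managed Memory.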

Is there a way to tune Managed Memory performance so that it matches Device Memory
for message sizes larger than or equal to 32,768 bytes? For convenience, we
would like to use CUDA Managed Memory.

Thank you for your help,
Yussuf
