[mvapich-discuss] Improving the bandwidth performance of Managed Memory for MVAPICH2-GDR with cudaMemAdvise

Yussuf Ali yussuf.ali at jaea.go.jp
Wed Jun 20 00:03:47 EDT 2018


Dear MVAPICH2-GDR users and developers,

 

I ran the OSU benchmarks for managed memory with the following flags:

 

export MV2_USE_CUDA=1

export MV2_CUDA_ENABLE_MANAGED=1   

export MV2_CUDA_MANAGED_IPC=1
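
For reference, the intra-node runs were launched roughly as follows; the
launcher, process count/placement, and the "M M" buffer-type arguments are only
illustrative and may differ with the OMB version (the exports above are set in
the environment beforehand):

mpirun -np 2 ./osu_bibw M M    # M M = managed-memory send and receive buffers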

 

But now I realized that it is possible to get much higher performance for
managed memory if I remove export MV2_USE_CUDA=1 and run the benchmark with
only the following two flags:

 

export MV2_CUDA_ENABLE_MANAGED=1   

export MV2_CUDA_MANAGED_IPC=1

 

Do these flags somehow have side effects on each other? And does this imply
that it is not possible to mix device and managed memory buffers in a single
program?
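
To make the question concrete, here is a minimal sketch of the kind of mixing I
have in mind (buffer size, tags, and the missing error checking are just
placeholders):

#include <mpi.h>
#include <cuda_runtime.h>

#define BUF_BYTES (1 << 20)

int main(int argc, char **argv)
{
    int rank;
    char *dev_buf, *managed_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&dev_buf, BUF_BYTES);             /* device buffer  */
    cudaMallocManaged((void **)&managed_buf, BUF_BYTES);  /* managed buffer */

    if (rank == 0) {
        /* The same program passes both kinds of buffers to MPI. */
        MPI_Send(dev_buf, BUF_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Send(managed_buf, BUF_BYTES, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, BUF_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(managed_buf, BUF_BYTES, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dev_buf);
    cudaFree(managed_buf);
    MPI_Finalize();
    return 0;
}

Would such a program still handle the cudaMalloc buffer correctly if
MV2_USE_CUDA=1 is not set?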

 

Thank you for your help,

Yussuf Ali

 

 

From: Yussuf Ali [mailto:yussuf.ali at jaea.go.jp] 
Sent: Monday, June 11, 2018 5:05 PM
To: 'mvapich-discuss at cse.ohio-state.edu'
<mvapich-discuss at cse.ohio-state.edu>
Subject: Improving the bandwidth performance of Managed Memory for
MVAPICH2-GDR with cudaMemAdvise

 

Dear MVAPICH2-GDR users and developers,

 

When running the OSU bidirectional bandwidth benchmark on our cluster systems
(4 x P100 per node), we noticed a performance gap between device memory and
managed memory (MM).

For device to device we measured ~34 GB/s, but for MM to MM only around ~9
GB/s at a message size of 4,194,304 bytes in the intra-node case.

 

According to NVIDIA, it is recommended to insert hints into the source code in
order to improve performance
(https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/).
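
The hints in question are cudaMemAdvise / cudaMemPrefetchAsync calls of roughly
the following form; the helper name is mine, and the preferred-location and
prefetch hints are taken from the blog post rather than from our current
benchmark modification:

#include <cuda_runtime.h>

/* Apply unified-memory hints to a managed buffer before it is handed to MPI.
 * Whether and where this belongs relative to the MPI calls is exactly the
 * open question. */
static void advise_managed_buffer(void *buf, size_t bytes)
{
    int myDevice = 0;
    cudaGetDevice(&myDevice);

    /* Prefer to keep the pages resident on this GPU ... */
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, myDevice);
    /* ... and create mappings for this GPU to avoid first-touch faults. */
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetAccessedBy, myDevice);
    /* Optionally migrate the pages to the GPU up front. */
    cudaMemPrefetchAsync(buf, bytes, myDevice, 0);
    cudaDeviceSynchronize();
}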

 

We tried to modify the OSU benchmark to test these improvements; however, we
were not able to attain the same performance as with device-to-device memory.

 

/* Advise that the managed receive buffer will be accessed by this GPU. */
cudaMemAdvise(r_buf, size * sizeof(char), cudaMemAdviseSetAccessedBy, myDevice);

for (j = 0; j < window_size; j++)
{
    MPI_CHECK(MPI_Irecv(r_buf, size, MPI_CHAR, 1, 10, MPI_COMM_WORLD,
                        recv_request + j));
}

/* Same advice for the managed send buffer. */
cudaMemAdvise(s_buf, size * sizeof(char), cudaMemAdviseSetAccessedBy, myDevice);

for (j = 0; j < window_size; j++)
{
    MPI_CHECK(MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
                        send_request + j));
}

 

 

Is there a recommended setting of cudaMemAdvise to improve the performance of
managed memory when it is used with MVAPICH2-GDR?

 

Thank you for your help,

Yussuf Ali
