[mvapich-discuss] Improving the bandwidth performance of Managed Memory for MVAPICH2-GDR with cudaMemAdvise

Yussuf Ali yussuf.ali at jaea.go.jp
Mon Jun 11 04:04:36 EDT 2018


Dear MVAPICH2-GDR users and developers,

 

When running the OSU bidirectional bandwidth benchmark on our cluster
systems (4 x P100 per node), we noticed a performance gap between device
memory and managed memory (MM).

For device-to-device transfers we measured ~34 GB/s, but for MM-to-MM only
around ~9 GB/s at a message size of 4,194,304 bytes in the intra-node case.

 

According to NVIDIA, it is recommended to insert hints into the source code
in order to improve unified memory performance
(https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/).
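
For reference, the kind of hints the blog post describes would look roughly
like the following stand-alone sketch (buf, bytes and my_device are
placeholder names, not the actual benchmark variables):

#include <cuda_runtime.h>

int main(void)
{
    char  *buf = NULL;
    size_t bytes = 4 * 1024 * 1024;   /* same order as the 4,194,304-byte message above */
    int    my_device = 0;

    cudaGetDevice(&my_device);
    cudaMallocManaged((void **)&buf, bytes, cudaMemAttachGlobal);

    /* Keep the pages resident on this GPU ...                             */
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, my_device);
    /* ... while still letting the CPU map them without forcing migration. */
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    /* Populate the pages on the GPU up front so the first access does not fault. */
    cudaMemPrefetchAsync(buf, bytes, my_device, 0);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}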

 

We modified the OSU benchmark to test these hints; however, we were not
able to attain the same performance as with device-to-device memory. Our
modification looks like this:

 

/* Advise the driver that the receive buffer will be accessed by this GPU,
 * then post the non-blocking receives for the benchmark window.           */
cudaMemAdvise(r_buf, size * sizeof(char), cudaMemAdviseSetAccessedBy, myDevice);
for (j = 0; j < window_size; j++)
{
    MPI_CHECK(MPI_Irecv(r_buf, size, MPI_CHAR, 1, 10, MPI_COMM_WORLD,
                        recv_request + j));
}

/* Same advice for the send buffer before posting the non-blocking sends. */
cudaMemAdvise(s_buf, size * sizeof(char), cudaMemAdviseSetAccessedBy, myDevice);
for (j = 0; j < window_size; j++)
{
    MPI_CHECK(MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
                        send_request + j));
}
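
For comparison, the explicit prefetch step from the blog post, placed just
before the two loops above, would look roughly like this (sketch only; using
the default stream is our assumption):

/* Sketch: migrate both managed buffers to the GPU up front so the
 * transfers do not start on CPU-resident pages; the cudaMemAdvise
 * calls above stay unchanged.                                       */
cudaMemPrefetchAsync(r_buf, size * sizeof(char), myDevice, 0);
cudaMemPrefetchAsync(s_buf, size * sizeof(char), myDevice, 0);
cudaDeviceSynchronize();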

 

 

Is there a recommended setting of cudaMemAdvise that improves the
performance of managed memory when used with MVAPICH2-GDR?

 

Thank you for your help,

Yussuf Ali


