[mvapich-discuss] Improving the bandwidth performance of Managed Memory for MVAPICH2-GDR with cudaMemAdvise

Ammar Ahmad Awan ammar.ahmad.awan at gmail.com
Wed Jun 20 14:42:19 EDT 2018


Hi Yussuf,

Thanks for the update here. We are looking into all possible interactions
between these flags.

Regards,
Ammar


On Tue, Jun 19, 2018 at 9:04 PM Yussuf Ali <yussuf.ali at jaea.go.jp> wrote:

> Dear MVAPICH2-GDR users and developers,
>
>
>
> I ran the OSU benchmarks for managed memory with the following flags:
>
>
>
> export MV2_USE_CUDA=1
>
> export MV2_CUDA_ENABLE_MANAGED=1
>
> export MV2_CUDA_MANAGED_IPC=1
>
>
>
> But I have since realized that it is possible to get much higher performance
> for managed memory if I remove export MV2_USE_CUDA=1 and run the benchmark
> with only the following two flags:
>
>
>
> export MV2_CUDA_ENABLE_MANAGED=1
>
> export MV2_CUDA_MANAGED_IPC=1
>
>
>
> Do these flags somehow have side effects on each other? And does this imply
> that it is not possible to mix device and managed memory buffers in a single
> program?
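>
> (By mixing I mean something like the following sketch, where a single
> program sends from both a cudaMalloc buffer and a cudaMallocManaged
> buffer; it is only illustrative, not the actual benchmark code.)
>
> #include <mpi.h>
> #include <cuda_runtime.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     int size = 4 * 1024 * 1024;
>     char *dev_buf, *mgd_buf;
>     cudaMalloc((void **)&dev_buf, size);        /* device memory buffer  */
>     cudaMallocManaged((void **)&mgd_buf, size,  /* managed memory buffer */
>                       cudaMemAttachGlobal);
>
>     if (rank == 0) {
>         MPI_Send(dev_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>         MPI_Send(mgd_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
>     } else if (rank == 1) {
>         MPI_Recv(dev_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         MPI_Recv(mgd_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     }
>
>     cudaFree(dev_buf);
>     cudaFree(mgd_buf);
>     MPI_Finalize();
>     return 0;
> }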
>
>
>
> Thank you for your help,
>
> Yussuf Ali
>
>
>
>
>
> *From:* Yussuf Ali [mailto:yussuf.ali at jaea.go.jp]
> *Sent:* Monday, June 11, 2018 5:05 PM
> *To:* 'mvapich-discuss at cse.ohio-state.edu' <
> mvapich-discuss at cse.ohio-state.edu>
> *Subject:* Improving the bandwidth performance of Managed Memory for
> MVAPICH2-GDR with cudaMemAdvise
>
>
>
> Dear MVAPICH2-GDR users and developers,
>
>
>
> When running the OSU bidirectional bandwidth benchmark on our cluster
> systems (4 x P100 per node), we noticed a performance gap between device
> memory and managed memory (MM).
>
> For device-to-device transfers we measured ~34 GB/s, but for MM-to-MM only
> around ~9 GB/s at a message size of 4,194,304 bytes in the intra-node case.
>
>
>
> NVIDIA recommends inserting hints into the source code in order to improve
> performance (
> https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/).
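>
> For reference, the kind of hint the article describes looks roughly like
> the following sketch (buf, bytes, and myDevice are placeholders for the
> benchmark buffer, its size in bytes, and the local GPU, not actual
> benchmark variables):
>
> /* Prefer keeping the managed buffer resident on the local GPU and
>    prefetch it there before the transfers start. */
> cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, myDevice);
> cudaMemPrefetchAsync(buf, bytes, myDevice, 0);
> cudaDeviceSynchronize();  /* ensure the prefetch has completed */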
>
>
>
> We tried to modify the OSU benchmark in order to test these improvements;
> however, we were not able to attain the same performance as with
> device-to-device memory.
>
>
>
> /* Advise the driver that the local GPU will access the receive buffer,
>    then post the non-blocking receives. */
> cudaMemAdvise(r_buf, size * sizeof(char), cudaMemAdviseSetAccessedBy, myDevice);
> for (j = 0; j < window_size; j++) {
>     MPI_CHECK(MPI_Irecv(r_buf, size, MPI_CHAR, 1, 10, MPI_COMM_WORLD,
>                         recv_request + j));
> }
>
> /* Same hint for the send buffer, then post the non-blocking sends. */
> cudaMemAdvise(s_buf, size * sizeof(char), cudaMemAdviseSetAccessedBy, myDevice);
> for (j = 0; j < window_size; j++) {
>     MPI_CHECK(MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
>                         send_request + j));
> }
>
>
>
>
>
> Is there any recommended cudaMemAdvise setting to improve the performance
> of managed memory when used with MVAPICH2-GDR?
>
>
>
> Thank you for your help,
>
> Yussuf Ali
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>