[mvapich-discuss] G2G performance with 2.0 vs 1.9a2

Panda, Dhabaleswar panda at cse.ohio-state.edu
Thu Sep 18 10:15:00 EDT 2014


Hi, 

Thanks for your note. We would like to know some more details on your platform, GPU configuration, basic 
communication performance, etc. to find out what is going on here. We will follow up with you 
on these details off the mailing list. 

I also want to point out that advanced support for NVIDIA GPUs is available in the MVAPICH2-GDR (GPUDirect RDMA) 2.0 release, not in the regular MVAPICH2 2.0 release. Have you tried that release? If not, I would recommend using it. More details on this release are available from:

http://mvapich.cse.ohio-state.edu/overview/

Thanks, 

DK



________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Osuna Escamilla  Carlos [carlos.osuna at env.ethz.ch]
Sent: Thursday, September 18, 2014 5:12 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] G2G performance with 2.0 vs 1.9a2

Dear experts

I have a code which I use to test and profile the communication layer of a weather simulation model.
For some time we have been using it to benchmark communication on a fat node with 8 GPUs connected to the host via PCIe. In the past we used mvapich2 1.9a2 for this benchmark; recently I moved to mvapich2 2.0 for testing, where I discovered some strange behaviour.

The code is not simple, so it would be difficult to post it here, but I will try to explain the effects I observed, hoping that someone can give a hint on what's going on.

The code does halo exchanges using G2G, i.e. a combination of MPI_Isend and MPI_Irecv with GPU pointers.
For simplicity I test a 1D domain decomposition, so each GPU device communicates with 2 neighbours.
The test issues a total of 90 halo exchanges (each halo exchange consists of 2 lines of a field of size 520, but the sizes are probably not relevant here).
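To make the pattern concrete, the per-rank exchange is essentially the following minimal sketch. This is not the actual CommTest/GCL code: the field width of 520, the 2 halo lines, the je size of 350 and the periodicity in y are taken from the description and command-line options visible in the profiler output below, while the buffer layout, data type and device selection are purely illustrative assumptions.

/* Sketch of a CUDA-aware (G2G) halo exchange in a periodic 1D
 * decomposition: device pointers are passed directly to MPI. */
#include <mpi.h>
#include <cuda_runtime.h>

#define IE    520          /* points per line            */
#define NHALO 2            /* halo lines per side        */
#define JE    350          /* interior lines per rank    */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank % 8);                 /* one GPU per rank on the fat node */

    /* field laid out line by line: [halo | interior | halo] */
    size_t nelem = (size_t)(JE + 2 * NHALO) * IE;
    double *field;
    cudaMalloc((void **)&field, nelem * sizeof(double));

    int left  = (rank - 1 + size) % size;    /* periodic in y */
    int right = (rank + 1) % size;

    /* device pointers handed straight to MPI: this is what relies on
     * the CUDA-aware G2G support inside MVAPICH2 */
    double *send_lo = field + (size_t)NHALO * IE;          /* first interior lines */
    double *send_hi = field + (size_t)JE * IE;             /* last interior lines  */
    double *recv_lo = field;                               /* bottom halo          */
    double *recv_hi = field + (size_t)(JE + NHALO) * IE;   /* top halo             */

    MPI_Request req[4];
    MPI_Irecv(recv_lo, NHALO * IE, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_hi, NHALO * IE, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_lo, NHALO * IE, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_hi, NHALO * IE, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    cudaFree(field);
    MPI_Finalize();
    return 0;
}

The real test repeats this 90 times with the send and wait phases timed separately, which is where the numbers below come from.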

Overall with mvapich2 2.0 I see a degradation of a factor of 2 in communication times with respect to mvapich2 1.9a2.
Here are the timers for mvapich2 1.9a2:
send -->        0.00154489
wait -->        0.0344899

and for mvapich2 2.0
send -->        0.0420051
wait -->        0.041773

In order to follow what is going on, I profiled the run with nvprof. I attach the main output of nvprof -s below, but in summary:
there are 180 CUDA memcpy DtoD, which is what I would expect (90 halo exchanges times 2 neighbours). In addition there are twice as many CUDA memcpy HtoD and DtoH whose purpose I don't know (but I guess I can ignore them).
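My guess (just an assumption on my side, not something I verified in the MVAPICH2 sources) is that these come from the library staging device buffers through pinned host memory whenever it cannot do a direct device-to-device copy between a pair of GPUs. Schematically, one such staged transfer would look like the sketch below and show up in nvprof as a paired DtoH/HtoD:

/* Illustrative sketch of a host-staged GPU-to-GPU transfer (an assumption,
 * not MVAPICH2 internals): the sender drains its device buffer into a
 * pinned host buffer (DtoH), the payload travels over shared memory or the
 * network, and the receiver refills its device buffer from host (HtoD). */
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 2 * 520 * sizeof(double);   /* one halo message */
    double *d_src, *d_dst, *h_stage;

    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);
    cudaMallocHost((void **)&h_stage, bytes);        /* pinned staging buffer */

    /* "sender" side: device -> pinned host; this is where the message
     * would be handed to the transport */
    cudaMemcpy(h_stage, d_src, bytes, cudaMemcpyDeviceToHost);

    /* "receiver" side: pinned host -> device */
    cudaMemcpy(d_dst, h_stage, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_stage);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}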

Now if we look at the results from 2.0, GPUs 0, 3, 4, and 7 show 177 CUDA memcpy DtoD (fine, 3 of them are gone), but GPUs 1, 2, 5, and 6 double that to 354 CUDA memcpy DtoD. This imbalance could clearly explain the different behaviour in the timers.

Could anyone give me a hint or explain these differences in behaviour between the two versions?

Thanks for the help.

**************************************
NVPROF For mvapich2/1.9a2-gcc-opcode3-4.6.3

[0] ==29659== Profiling result:
[0] Time(%)      Time     Calls       Avg       Min       Max  Name
[0]  41.58%  11.586ms       180  64.366us  1.0240us  169.05us  [CUDA memcpy DtoD]
[0]  29.85%  8.3187ms       360  23.107us  9.1830us  31.167us  [CUDA memcpy HtoD]
[0]  28.57%  7.9600ms       360  22.110us  8.3190us  32.799us  [CUDA memcpy DtoH]

[1] ==29663== Profiling result:
[1] Time(%)      Time     Calls       Avg       Min       Max  Name
[1] 100.00%  17.903ms       360  49.729us  1.0240us  177.24us  [CUDA memcpy DtoD]

[2] ==29665== Profiling result:
[2] Time(%)      Time     Calls       Avg       Min       Max  Name
[2] 100.00%  22.544ms       360  62.621us  1.0240us  196.12us  [CUDA memcpy DtoD]

[3] ==29660== Profiling result:
[3] Time(%)      Time     Calls       Avg       Min       Max  Name
[3]  35.94%  9.4161ms       360  26.155us  8.8310us  43.678us  [CUDA memcpy DtoH]
[3]  32.44%  8.4987ms       180  47.215us  1.0240us  153.12us  [CUDA memcpy DtoD]
[3]  31.62%  8.2844ms       360  23.012us  8.7990us  31.167us  [CUDA memcpy HtoD]

[4] ==29662== Profiling result:
[4] Time(%)      Time     Calls       Avg       Min       Max  Name
[4]  40.80%  11.606ms       180  64.475us  1.1520us  169.85us  [CUDA memcpy DtoD]
[4]  30.38%  8.6423ms       360  24.006us  9.2480us  32.735us  [CUDA memcpy HtoD]
[4]  28.82%  8.1964ms       360  22.767us  8.4800us  37.279us  [CUDA memcpy DtoH]

[5] ==29661== Profiling result:
[5] Time(%)      Time     Calls       Avg       Min       Max  Name
[5] 100.00%  18.219ms       360  50.609us  1.1200us  166.49us  [CUDA memcpy DtoD]

[6] ==29666== Profiling result:
[6] Time(%)      Time     Calls       Avg       Min       Max  Name
[6] 100.00%  22.728ms       360  63.132us  1.1520us  191.99us  [CUDA memcpy DtoD]

[7] ==29664== Profiling result:
[7] Time(%)      Time     Calls       Avg       Min       Max  Name
[7]  36.04%  9.7502ms       360  27.083us  8.9920us  43.839us  [CUDA memcpy DtoH]
[7]  32.33%  8.7451ms       360  24.291us  9.0230us  31.935us  [CUDA memcpy HtoD]
[7]  31.63%  8.5558ms       180  47.532us  1.1520us  148.19us  [CUDA memcpy DtoD]


**************************************
NVPROF For mvapich2/2.0-gcc-opcode3-4.7.2

Profiling application: ./CommTest.exe --nprocx 1 --nprocy 8 --ie 520 --je 350 --enable-GCL --ntracer-perHandler 3 --nGCLHandlers 30 --lperi_y --disable-validation-report --nbl_exchg 2

[0] ==30333== Profiling result:
[0] Time(%)      Time     Calls       Avg       Min       Max  Name
[0]  40.13%  12.263ms       177  69.283us  1.0240us  180.73us  [CUDA memcpy DtoD]
[0]  32.41%  9.9053ms       360  27.514us  8.3520us  48.863us  [CUDA memcpy DtoH]
[0]  27.46%  8.3908ms       372  22.555us  8.8000us  33.151us  [CUDA memcpy HtoD]

[1] ==30341== Profiling result:
[1] Time(%)      Time     Calls       Avg       Min       Max  Name
[1]  97.52%  20.523ms       354  57.973us  1.0240us  187.74us  [CUDA memcpy DtoD]
[1]   1.26%  265.85us        12  22.154us  8.7680us  24.031us  [CUDA memcpy HtoD]
[1]   1.21%  255.48us        12  21.290us  8.3830us  24.672us  [CUDA memcpy DtoH]

[2] ==30339== Profiling result:
[2] Time(%)      Time     Calls       Avg       Min       Max  Name
[2]  97.62%  23.373ms       354  66.024us  1.0240us  200.63us  [CUDA memcpy DtoD]
[2]   2.38%  569.26us        24  23.719us  9.8560us  32.287us  [CUDA memcpy DtoH]

[3] ==30336== Profiling result:
[3] Time(%)      Time     Calls       Avg       Min       Max  Name
[3]  34.55%  11.036ms       177  62.350us  1.0240us  177.92us  [CUDA memcpy DtoD]
[3]  33.44%  10.683ms       372  28.717us  8.8630us  61.374us  [CUDA memcpy HtoD]
[3]  32.01%  10.226ms       360  28.405us  8.3510us  68.350us  [CUDA memcpy DtoH]

[4] ==30340== Profiling result:
[4] Time(%)      Time     Calls       Avg       Min       Max  Name
[4]  39.15%  12.117ms       177  68.458us  1.1520us  170.52us  [CUDA memcpy DtoD]
[4]  32.85%  10.168ms       360  28.244us  8.5120us  56.030us  [CUDA memcpy DtoH]
[4]  28.00%  8.6646ms       372  23.292us  8.8960us  56.479us  [CUDA memcpy HtoD]

[5] ==30332== Profiling result:
[5] Time(%)      Time     Calls       Avg       Min       Max  Name
[5]  96.24%  20.186ms       354  57.022us  1.1840us  187.29us  [CUDA memcpy DtoD]
[5]   2.16%  453.14us        12  37.761us  8.4480us  55.294us  [CUDA memcpy DtoH]
[5]   1.60%  334.65us        12  27.887us  10.240us  49.566us  [CUDA memcpy HtoD]

[6] ==30342== Profiling result:
[6] Time(%)      Time     Calls       Avg       Min       Max  Name
[6]  97.21%  23.102ms       354  65.258us  1.1520us  195.45us  [CUDA memcpy DtoD]
[6]   1.46%  346.55us        12  28.879us  13.088us  39.519us  [CUDA memcpy DtoH]
[6]   1.33%  316.35us        12  26.362us  14.656us  46.463us  [CUDA memcpy HtoD]

[7] Profiling result:
[7] Time(%)      Time     Calls       Avg       Min       Max  Name
[7]  33.63%  10.902ms       177  61.591us  1.1520us  156.44us  [CUDA memcpy DtoD]
[7]  33.43%  10.838ms       360  30.105us  9.1520us  63.135us  [CUDA memcpy HtoD]
[7]  32.93%  10.676ms       372  28.699us  8.5120us  61.023us  [CUDA memcpy DtoH]





_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss