[mvapich-discuss] G2G performance with 2.0 vs 1.9a2
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Thu Sep 18 10:15:00 EDT 2014
Hi,
Thanks for your note. We would like to know some more details on your platform, GPU configuration, basic
communication performance, etc. to find out what is going on here. We will follow up with you
on these details off the mailing list.
I also want to point out that advanced support for NVIDIA GPUs is available in the MVAPICH2-GDR (GPUDirect RDMA) 2.0 release, not the regular MVAPICH2 2.0 release. Have you tried that release? If not, I recommend using it. More details are available from:
http://mvapich.cse.ohio-state.edu/overview/
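As a side note on the runtime configuration (a sketch on my side, assuming the current parameter names; please verify them against the MVAPICH2 user guide): communication from GPU device buffers has to be enabled explicitly at run time, and MVAPICH2-GDR adds a GPUDirect RDMA knob:

    # Enable device-buffer (G2G) communication at run time.
    # MV2_USE_GPUDIRECT is specific to the MVAPICH2-GDR release
    # (verify both parameter names in the user guide).
    mpirun_rsh -np 8 -hostfile hosts MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 ./a.out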
Thanks,
DK
________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Osuna Escamilla Carlos [carlos.osuna at env.ethz.ch]
Sent: Thursday, September 18, 2014 5:12 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] G2G performance with 2.0 vs 1.9a2
Dear experts
I have a code that I use to test and profile the communication layer of a weather simulation model.
For some time we have been using it to benchmark communication on a fat node with 8 GPUs connected via PCIe to the host. In the past we used MVAPICH2 1.9a2 for the benchmark; recently I moved to MVAPICH2 2.0 for testing, where I discovered some strange behaviour.
The code is not simple, so it would be difficult to post it here, but I will try to explain the effects I observed, hoping that someone can give a hint on what's going on.
The code does halo exchanges using G2G, i.e. a combination of MPI_Isend and MPI_Irecv with GPU pointers.
For simplicity I test a 1D domain decomposition, so each GPU device communicates with 2 neighbours.
The test issues a total of 90 halo exchanges (each halo exchange consists of 2 lines of a field of size 520, but the sizes are probably not relevant here).
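A minimal sketch of the exchange pattern just described, assuming CUDA-aware MVAPICH2 (MV2_USE_CUDA=1) and a field stored line by line on the device; the names and layout are illustrative, not taken from the actual test code:

    #include <mpi.h>

    /* d_field is GPU memory (e.g. from cudaMalloc) holding ny lines of nx
     * doubles; the first and last `halo` lines are the halo regions.
     * Device pointers are passed straight to MPI_Isend/MPI_Irecv: the G2G
     * path. For edge ranks, up/down can be MPI_PROC_NULL, which turns the
     * corresponding calls into no-ops. */
    static void exchange_halos(double *d_field, int nx, int ny, int halo,
                               int up, int down, MPI_Comm comm)
    {
        const int n = halo * nx;                  /* doubles per halo block */
        double *top_halo  = d_field;              /* lines 0 .. halo-1      */
        double *top_inner = d_field + n;          /* first interior lines   */
        double *bot_inner = d_field + (ny - 2*halo) * nx;
        double *bot_halo  = d_field + (ny - halo) * nx;
        MPI_Request req[4];

        MPI_Irecv(top_halo,  n, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Irecv(bot_halo,  n, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Isend(top_inner, n, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Isend(bot_inner, n, MPI_DOUBLE, down, 0, comm, &req[3]);
        /* Waiting on all four requests; this is where a "wait" timer
         * like the one reported below would accumulate. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }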
Overall, with MVAPICH2 2.0 I see a degradation of a factor of 2 in communication times with respect to MVAPICH2 1.9a2.
Here are the timers for MVAPICH2 1.9a2:
send --> 0.00154489
wait --> 0.0344899
and for MVAPICH2 2.0:
send --> 0.0420051
wait --> 0.041773
In order to follow what is going on, I profiled the run with nvprof. I attach below the main output of nvprof -s, but in summary:
With 1.9a2 there are 180 CUDA memcpy DtoD, which is what I would expect: 90 halo exchanges with 2 neighbours each. In addition there are twice as many CUDA memcpy HtoD and DtoH whose purpose I don't know (but I guess I can ignore them).
Now if we look at the results from 2.0, GPUs 0, 3, 4, and 7 show 177 CUDA memcpy DtoD (oddly, 3 are gone), but GPUs 1, 2, 5, and 6 double that to 354 CUDA memcpy DtoD. This imbalance could clearly explain the different behaviour of the timers.
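For reference, one way to collect per-rank summaries like the dumps below is to run one nvprof instance per rank and key the log file on the launcher's rank variable; MV2_COMM_WORLD_RANK and the %q{VAR} file-name substitution are assumptions to verify against the MVAPICH2 and nvprof documentation:

    # One nvprof per MPI rank; the log file name carries the rank.
    # %q{MV2_COMM_WORLD_RANK} expands the env var set by the MVAPICH2
    # launcher (check this against your nvprof/MVAPICH2 versions).
    mpirun_rsh -np 8 -hostfile hosts MV2_USE_CUDA=1 \
        nvprof --log-file nvprof.rank%q{MV2_COMM_WORLD_RANK}.log ./CommTest.exe ...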
Could anyone give me a hint or explain these differences in behaviour between the two versions?
Thanks for the help.
**************************************
NVPROF For mvapich2/1.9a2-gcc-opcode3-4.6.3
[0] ==29659== Profiling result:
[0] Time(%)  Time      Calls  Avg       Min       Max       Name
[0] 41.58%   11.586ms  180    64.366us  1.0240us  169.05us  [CUDA memcpy DtoD]
[0] 29.85%   8.3187ms  360    23.107us  9.1830us  31.167us  [CUDA memcpy HtoD]
[0] 28.57%   7.9600ms  360    22.110us  8.3190us  32.799us  [CUDA memcpy DtoH]

[1] ==29663== Profiling result:
[1] Time(%)  Time      Calls  Avg       Min       Max       Name
[1] 100.00%  17.903ms  360    49.729us  1.0240us  177.24us  [CUDA memcpy DtoD]

[2] ==29665== Profiling result:
[2] Time(%)  Time      Calls  Avg       Min       Max       Name
[2] 100.00%  22.544ms  360    62.621us  1.0240us  196.12us  [CUDA memcpy DtoD]

[3] ==29660== Profiling result:
[3] Time(%)  Time      Calls  Avg       Min       Max       Name
[3] 35.94%   9.4161ms  360    26.155us  8.8310us  43.678us  [CUDA memcpy DtoH]
[3] 32.44%   8.4987ms  180    47.215us  1.0240us  153.12us  [CUDA memcpy DtoD]
[3] 31.62%   8.2844ms  360    23.012us  8.7990us  31.167us  [CUDA memcpy HtoD]

[4] ==29662== Profiling result:
[4] Time(%)  Time      Calls  Avg       Min       Max       Name
[4] 40.80%   11.606ms  180    64.475us  1.1520us  169.85us  [CUDA memcpy DtoD]
[4] 30.38%   8.6423ms  360    24.006us  9.2480us  32.735us  [CUDA memcpy HtoD]
[4] 28.82%   8.1964ms  360    22.767us  8.4800us  37.279us  [CUDA memcpy DtoH]

[5] ==29661== Profiling result:
[5] Time(%)  Time      Calls  Avg       Min       Max       Name
[5] 100.00%  18.219ms  360    50.609us  1.1200us  166.49us  [CUDA memcpy DtoD]

[6] ==29666== Profiling result:
[6] Time(%)  Time      Calls  Avg       Min       Max       Name
[6] 100.00%  22.728ms  360    63.132us  1.1520us  191.99us  [CUDA memcpy DtoD]

[7] ==29664== Profiling result:
[7] Time(%)  Time      Calls  Avg       Min       Max       Name
[7] 36.04%   9.7502ms  360    27.083us  8.9920us  43.839us  [CUDA memcpy DtoH]
[7] 32.33%   8.7451ms  360    24.291us  9.0230us  31.935us  [CUDA memcpy HtoD]
[7] 31.63%   8.5558ms  180    47.532us  1.1520us  148.19us  [CUDA memcpy DtoD]
**************************************
NVPROF For mvapich2/2.0-gcc-opcode3-4.7.2
Profiling application (identical command line on every rank):
./CommTest.exe --nprocx 1 --nprocy 8 --ie 520 --je 350 --enable-GCL --ntracer-perHandler 3 --nGCLHandlers 30 --lperi_y --disable-validation-report --nbl_exchg 2

[0] ==30333== Profiling result:
[0] Time(%)  Time      Calls  Avg       Min       Max       Name
[0] 40.13%   12.263ms  177    69.283us  1.0240us  180.73us  [CUDA memcpy DtoD]
[0] 32.41%   9.9053ms  360    27.514us  8.3520us  48.863us  [CUDA memcpy DtoH]
[0] 27.46%   8.3908ms  372    22.555us  8.8000us  33.151us  [CUDA memcpy HtoD]

[1] ==30341== Profiling result:
[1] Time(%)  Time      Calls  Avg       Min       Max       Name
[1] 97.52%   20.523ms  354    57.973us  1.0240us  187.74us  [CUDA memcpy DtoD]
[1] 1.26%    265.85us  12     22.154us  8.7680us  24.031us  [CUDA memcpy HtoD]
[1] 1.21%    255.48us  12     21.290us  8.3830us  24.672us  [CUDA memcpy DtoH]

[2] ==30339== Profiling result:
[2] Time(%)  Time      Calls  Avg       Min       Max       Name
[2] 97.62%   23.373ms  354    66.024us  1.0240us  200.63us  [CUDA memcpy DtoD]
[2] 2.38%    569.26us  24     23.719us  9.8560us  32.287us  [CUDA memcpy DtoH]

[3] ==30336== Profiling result:
[3] Time(%)  Time      Calls  Avg       Min       Max       Name
[3] 34.55%   11.036ms  177    62.350us  1.0240us  177.92us  [CUDA memcpy DtoD]
[3] 33.44%   10.683ms  372    28.717us  8.8630us  61.374us  [CUDA memcpy HtoD]
[3] 32.01%   10.226ms  360    28.405us  8.3510us  68.350us  [CUDA memcpy DtoH]

[4] ==30340== Profiling result:
[4] Time(%)  Time      Calls  Avg       Min       Max       Name
[4] 39.15%   12.117ms  177    68.458us  1.1520us  170.52us  [CUDA memcpy DtoD]
[4] 32.85%   10.168ms  360    28.244us  8.5120us  56.030us  [CUDA memcpy DtoH]
[4] 28.00%   8.6646ms  372    23.292us  8.8960us  56.479us  [CUDA memcpy HtoD]

[5] ==30332== Profiling result:
[5] Time(%)  Time      Calls  Avg       Min       Max       Name
[5] 96.24%   20.186ms  354    57.022us  1.1840us  187.29us  [CUDA memcpy DtoD]
[5] 2.16%    453.14us  12     37.761us  8.4480us  55.294us  [CUDA memcpy DtoH]
[5] 1.60%    334.65us  12     27.887us  10.240us  49.566us  [CUDA memcpy HtoD]

[6] ==30342== Profiling result:
[6] Time(%)  Time      Calls  Avg       Min       Max       Name
[6] 97.21%   23.102ms  354    65.258us  1.1520us  195.45us  [CUDA memcpy DtoD]
[6] 1.46%    346.55us  12     28.879us  13.088us  39.519us  [CUDA memcpy DtoH]
[6] 1.33%    316.35us  12     26.362us  14.656us  46.463us  [CUDA memcpy HtoD]

[7] Profiling result:
[7] Time(%)  Time      Calls  Avg       Min       Max       Name
[7] 33.63%   10.902ms  177    61.591us  1.1520us  156.44us  [CUDA memcpy DtoD]
[7] 33.43%   10.838ms  360    30.105us  9.1520us  63.135us  [CUDA memcpy HtoD]
[7] 32.93%   10.676ms  372    28.699us  8.5120us  61.023us  [CUDA memcpy DtoH]