[mvapich-discuss] Performance difference in MPI_Allreduce calls between MVAPICH2-GDR and OpenMPI

Yussuf Ali yussuf.ali at jaea.go.jp
Thu Jan 24 00:25:46 EST 2019


Dear Ammar,

thank you for your email!
No, it is not a DGX-2 system; we are using the ABCI supercomputer. The exact specifications can be found here: https://abci.ai/en/about_abci/computing_resource.html
At the moment we are not able to provide you access to the ABCI system because it is not our own. However, our organization
has purchased a DGX-2 system, which should be delivered within the next three months. At that time we may be able to provide you access to our DGX-2 system.

I have another question regarding the OSU benchmarks. I executed the osu_bibw benchmark on the ABCI system for MVAPICH2-GDR (2.3), Intel MPI, and OpenMPI
for host-to-host (H H) communication in the inter-node case.
MVAPICH2-GDR shows a much higher bandwidth for large messages than Intel MPI or OpenMPI. Are these results correct, or do we have a setup error in our benchmarking test? (A simplified sketch of what I understand the benchmark to measure follows the table.)

Size (bytes)   MVAPICH2-GDR 2.3   Intel MPI    OpenMPI     (bandwidth in MB/s)
         1              0.25          7.28         0
         2              1.29         14.95         0.01
         4             14.29         28.9          0.02
         8             26.32         63.53         0.05
        16             57.4         128.32         0.08
        32            113.45        234.19         0.17
        64            228.58        458.13         0.38
       128            461.36        855.57       502.7
       256            847.55       1583.76       960.93
       512           1682.65       2837.01      1856.56
      1024           3036.65       4750.74      3270.54
      2048           5136.46       7119.34      5138.45
      4096           7392.42       9262.95      7380.5
      8192           9936.17      11643.05      8366.76
     16384          11173.19      12779.45       308.49
     32768          19337.39      13080.18     18678.61
     65536          22878.94      12942.33     21546.44
    131072          23815.06      12821.71     22481.7
    262144          24305.1       15569.08     22718.96
    524288          47901.26      18774.32     22937.39
   1048576          48697.25      20891.05     23036.87
   2097152          49069.72      22002.18     23098.95
   4194304          49043.22      22557.32     23131.02
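
For reference, my understanding is that osu_bibw essentially times a window of non-blocking sends and receives posted in both directions at once and reports the aggregate bandwidth. Below is a simplified sketch of that pattern, not the actual benchmark source; the window size, iteration count, message size, and buffer handling are only illustrative.

/* Simplified sketch of a bidirectional bandwidth measurement
 * (host-to-host case), in the spirit of osu_bibw. Run with exactly
 * 2 MPI ranks, one per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW 64     /* messages in flight per direction (illustrative) */
#define ITERS  100    /* timed iterations (illustrative) */

int main(int argc, char **argv)
{
    int rank, size_bytes = 4 * 1024 * 1024;   /* example message size */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *sbuf = malloc(size_bytes);
    char *rbuf = malloc(size_bytes);
    memset(sbuf, 'a', size_bytes);
    MPI_Request sreq[WINDOW], rreq[WINDOW];
    int peer = 1 - rank;                      /* the other of the 2 ranks */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        /* both ranks post a full window of receives and sends, so data
         * flows in both directions simultaneously */
        for (int w = 0; w < WINDOW; w++) {
            MPI_Irecv(rbuf, size_bytes, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &rreq[w]);
            MPI_Isend(sbuf, size_bytes, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &sreq[w]);
        }
        MPI_Waitall(WINDOW, rreq, MPI_STATUSES_IGNORE);
        MPI_Waitall(WINDOW, sreq, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* both directions count toward the bidirectional bandwidth */
        double mb = 2.0 * (double)size_bytes * WINDOW * ITERS / 1.0e6;
        printf("%d bytes: %.2f MB/s\n", size_bytes, mb / (t1 - t0));
    }
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}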

Thank you for your help,
Yussuf

-----Original Message-----
From: Awan, Ammar Ahmad [mailto:awan.10 at buckeyemail.osu.edu] 
Sent: Thursday, January 24, 2019 4:14 AM
To: Yussuf Ali <yussuf.ali at jaea.go.jp>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Performance difference in MPI_Allreduce calls between MVAPICH2-GDR and OpenMPI

Hi Yussuf,

Sorry to hear that you are seeing performance degradation. I have a few questions and suggestions.

Can you kindly let us know if this is a DGX-2 system? If not, please share some more details like the GPU topology and the availability of NVLink(s) on your system.

We have some new designs for the DGX-2 system that will be available in the next MVAPICH2-GDR release. The new designs provide much better performance.

In the meantime, is it possible for us to get access to your system? This will enable us to help you in a better and faster manner.

Thanks,
Ammar


On Tue, Jan 22, 2019 at 8:13 PM Yussuf Ali <yussuf.ali at jaea.go.jp> wrote:
Dear MVAPICH developers and users,

In our software we noticed a performance degradation in the MPI_Allreduce calls when using MVAPICH2-GDR compared to OpenMPI.
The software (a Krylov solver) runs several iterations, and in each iteration data is reduced twice using MPI_Allreduce.
The send and receive buffers are both allocated as device memory on the GPU. We measured the total time spent in the MPI_Allreduce calls; a simplified sketch of the pattern is shown below.
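
The timed pattern looks roughly like the following. This is a simplified sketch, not our actual solver code; it assumes a CUDA-aware MPI build (e.g. MVAPICH2-GDR with MV2_USE_CUDA=1) and double-precision data (720 bytes = 90 doubles, 1,160 bytes = 145 doubles), and the buffer names and iteration count are only illustrative.

/* Simplified sketch: two MPI_Allreduce calls per solver iteration on
 * GPU device buffers, with their total time accumulated separately.
 * Requires a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* send and receive buffers live in GPU device memory */
    double *d_send1, *d_recv1, *d_send2, *d_recv2;
    cudaMalloc((void **)&d_send1,  90 * sizeof(double));   /* 720-byte reduction  */
    cudaMalloc((void **)&d_recv1,  90 * sizeof(double));
    cudaMalloc((void **)&d_send2, 145 * sizeof(double));   /* 1,160-byte reduction */
    cudaMalloc((void **)&d_recv2, 145 * sizeof(double));
    cudaMemset(d_send1, 0,  90 * sizeof(double));
    cudaMemset(d_send2, 0, 145 * sizeof(double));

    const int iters = 1000;          /* illustrative iteration count */
    double t1 = 0.0, t2 = 0.0, t0;

    for (int i = 0; i < iters; i++) {
        /* ... local Krylov-solver work would go here ... */

        t0 = MPI_Wtime();            /* first reduction: 720 bytes */
        MPI_Allreduce(d_send1, d_recv1,  90, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 += MPI_Wtime() - t0;

        t0 = MPI_Wtime();            /* second reduction: 1,160 bytes */
        MPI_Allreduce(d_send2, d_recv2, 145, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t2 += MPI_Wtime() - t0;
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("1. MPI_Allreduce: %.3f s   2. MPI_Allreduce: %.3f s\n", t1, t2);

    cudaFree(d_send1); cudaFree(d_recv1);
    cudaFree(d_send2); cudaFree(d_recv2);
    MPI_Finalize();
    return 0;
}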

16 GPU case (V100)

MVAPICH2-GDR (2.3)
1. MPI_Allreduce: 0.27 seconds
2. MPI_Allreduce: 1.9 seconds

OpenMPI
1. MPI_Allreduce: 0.10 seconds
2. MPI_Allreduce: 0.19 seconds

The data sizes are:
1. MPI_Allreduce: 720 bytes
2. MPI_Allreduce: 1,160 bytes

Are there any parameters to tune the MPI_Allreduce performance in MVAPICH2-GDR?

Thank you for your help,
Yussuf
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

