[mvapich-discuss] MVAPICH2 Allreduce Performance

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Aug 1 13:15:01 EDT 2008


Peter,

Thanks for reporting these performance numbers and the comparisons.
MVAPICH 0.9.9 is an older version. Several multi-core-aware collective
optimizations went into the MVAPICH 1.0 series.  Please check the latest
MVAPICH 1.0.1 version and let us know whether you still see the
performance degradation.

Similarly, the multi-core-aware collective optimizations have gone into
the latest MVAPICH2 1.2 series. Please check out the latest MVAPICH2 1.2
version from the trunk (not RC1, we have added some enhancements and
tuning after RC1 was released) and let us know if you still see the
performance degradation.

DK

On Wed, 30 Jul 2008, Peter Cebull wrote:

> We are looking at some scalability issues for a particular application
> on one of our clusters. Specifically, I plotted the MPI_Allreduce
> performance of MVAPICH2, MVAPICH, Intel MPI, and Open MPI as measured by
> the Intel MPI Allreduce Benchmark. The plot shows average time in
> microseconds vs the number of processes from 2 to 512 for a message size
> of 4 kB.
>
> The results show MVAPICH2 performing very well up to 128 processes, but
> for 256 and 512 processes the performance drops off by an order of
> magnitude to match the performance of MVAPICH and Intel MPI. Is this
> expected behavior, and is there a way to improve the scalability for
> 256+ processes? I didn't see this topic in the archive; I apologize if
> it's been discussed before.
>
> We are running dual quad-core EM64t nodes, OFED 1.2, Mellanox
> Technologies MT25204 [InfiniHost III Lx HCA]. This machine is an SGI
> Altix ICE with ProPack 5 SP3. The timing data are listed below.
>
> mpich2version
> Version:           mvapich2-1.0
> Device:            osu_ch3:mrail
> Configure Options:
> '--prefix=/usr/local/mvapich2/mvapich2-1.0.2/intel-opt'
> '--with-device=osu_ch3:mrail' '--with-rdma=gen2' '--with-pm=mpd'
> '--enable-shared=gcc' '--enable-sharedlibs=gcc' '--disable-romio'
> '--without-mpe' 'CC=icc' 'CFLAGS=-fPIC -D_EM64T_ -D_SMP_
> -DUSE_HEADER_CACHING  -DONE_SIDED -DMPIDI_CH3_CHANNEL_RNDV
> -DMPID_USE_SEQUENCE_NUMBERS  -DRDMA_CM   -I/usr/include -fPIC -O2'
> 'CXX=icpc' 'F77=ifort' 'F90=ifort' 'FFLAGS=-L/usr/lib64 -fPIC'
> CC:  icc -fPIC -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING  -DONE_SIDED
> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS  -DRDMA_CM
> -I/usr/include -fPIC -O2
> CXX: icpc
> F77: ifort -L/usr/lib64 -fPIC
> F90: ifort
>
> Thanks,
> Peter
>
> # processes vs time in us
> Intel MPI 3.1
> 2   7.12
> 4   14.82
> 8   26.07
> 16   83.85
> 32   543.00
> 64   1025.87
> 128   1492.71
> 256   1957.55
> 512   2445.58
>
> MVAPICH 0.9.9
> 2   13.44
> 4   20.72
> 8   37.08
> 16   84.59
> 32   545.56
> 64   1018.50
> 128   1509.70
> 256   1959.09
> 512   2481.70
>
> MVAPICH2 1.0.2
> 2   11.76
> 4   19.16
> 8   37.26
> 16   80.09
> 32   105.88
> 64   111.21
> 128   126.11
> 256   1942.33
> 512   2434.15
>
> Open MPI 1.2.6
> 2   13.23
> 4   30.25
> 8   63.63
> 16   95.66
> 32   155.05
> 64   272.42
> 128   512.11
> 256   752.29
> 512   999.50
>
> --
> Peter Cebull
> Idaho National Laboratory
>
>
>
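The cliff Peter describes at 256 processes can be read directly off the tables above; a quick Python sketch over the quoted numbers (it uses nothing beyond the data already in the post):

```python
# Average MPI_Allreduce time (microseconds) for a 4 kB message,
# copied from the tables in the post.
mvapich2 = {2: 11.76, 4: 19.16, 8: 37.26, 16: 80.09, 32: 105.88,
            64: 111.21, 128: 126.11, 256: 1942.33, 512: 2434.15}
mvapich = {2: 13.44, 4: 20.72, 8: 37.08, 16: 84.59, 32: 545.56,
           64: 1018.50, 128: 1509.70, 256: 1959.09, 512: 2481.70}

# Slowdown going from 128 to 256 processes (the "order of magnitude"):
print(f"128 -> 256 slowdown: {mvapich2[256] / mvapich2[128]:.1f}x")    # ~15x

# At 256 processes MVAPICH2 lands on MVAPICH 0.9.9's numbers:
print(f"MVAPICH2/MVAPICH at 256: {mvapich2[256] / mvapich[256]:.2f}")  # ~0.99
```

The roughly 15x jump between 128 and 256 processes, and the near-identical times to MVAPICH 0.9.9 beyond that point, are consistent with the multi-core-aware collective path no longer being used at those scales, which is the behavior DK's reply suggests the 1.2 series addresses.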
