[mvapich-discuss] Several 2.0-ga collective performance regressions (vs 1.9a2, 1.9b, 2.0a)

Peter Kjellström cap at nsc.liu.se
Thu Jul 24 08:27:03 EDT 2014


On Tue, 8 Jul 2014 18:46:48 +0000
"Panda, Dhabaleswar" <panda at cse.ohio-state.edu> wrote:

> Hi Peter, 
> 
> Thanks for your note and posting the detailed performance results. 

Thanks for replying, and I apologize for my delayed response (your
comments raised some non-trivial issues and it's vacation time...).

Please see my comments inline below.
 
> Please note that collectives in MVAPICH2 2.0 series have been
> optimized with OSU Collective Benchmarks (as a part of the OMB test
> suite, not IMB).

Valid point, the way the osu collective benchmarks measure (and report)
differs quite significantly from what IMB does (and in a way
illustrates how differently applications can react).

I reran my tests (same system, same build, etc.) with osu instead of
IMB. For Alltoall and Allgather it paints about the same picture
(confirming the reported regressions). Gather, though, is a different
thing entirely...

See attached graphs for mvp19b vs mvp20ga for these three collectives.

For small sizes on Gather the situation seems quite confusing at first
(not made any less confusing by osu printing only the avg by default):

My original report:
 mvp19b IMB ~6 us
 mvp20ga IMB ~33 us

Now osu:
 mvp19b osu ~3 us
 mvp20ga osu ~2 us (15x faster than IMB and shows 20 > 19b!)

The difference stems from the big variation in time across ranks
(natural, as gather is just a fire-and-forget send for most ranks). Osu
times each iteration on each rank and reports the avg over the ranks.
Since I have 127 quick ranks and 1 slow rank (the root), the number is
dominated by the quick fire-and-forget sends. Osu also syncs everything
up with a Barrier between each call.
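
For reference, here is a minimal sketch of that measurement scheme as I
understand it (illustrative only, not the actual osu_gather source;
message size and iteration count are made-up values):

/* Per-rank timing with a barrier before every call; min/avg/max are
 * then reduced over the ranks, not over the iterations. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, iters = 1000, msg = 64;     /* assumed values */
    double t, total = 0.0, avg, rmin, rmax, rsum;
    char *sbuf, *rbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(msg);
    rbuf = malloc((size_t)msg * size);

    for (i = 0; i < iters; i++) {
        MPI_Barrier(MPI_COMM_WORLD);               /* sync all ranks up */
        t = MPI_Wtime();
        MPI_Gather(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR, 0,
                   MPI_COMM_WORLD);
        total += MPI_Wtime() - t;
    }
    avg = total / iters * 1e6;                     /* per-rank avg in us */

    MPI_Reduce(&avg, &rmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&avg, &rmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&avg, &rsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("min %.2f  avg %.2f  max %.2f (us)\n",
               rmin, rsum / size, rmax);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}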

Looking at min/avg/max from osu it's clear there is a regression in
performance on the root rank but none on the other ranks (the root
rank now takes ~170 us vs. ~50 us on 1.9b).
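
Back-of-the-envelope (assuming the 127 non-root ranks are more or less
unchanged between the two versions), averaging over 128 ranks hides
almost all of that:

 (170 us - 50 us) / 128 ranks ~ 0.9 us shift in the reported avg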

With this understanding it's still hard to see how IMB could come up
with 33 us, right? It turns out IMB rotates the root role for the
collective between iterations and times all 1000 iterations together.
Overlapping the more expensive (~170 us) root role across ranks gives
min==avg==max==33 us.
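
For comparison, replacing the timed loop and the min/avg/max reporting
in the sketch above with something like the following mimics the
IMB-style behaviour described here (again just illustrative, not the
actual IMB source):

double t0 = MPI_Wtime();
for (i = 0; i < iters; i++) {
    int root = i % size;                       /* rotate the root role */
    MPI_Gather(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR, root,
               MPI_COMM_WORLD);                /* note: no barrier */
}
avg = (MPI_Wtime() - t0) / iters * 1e6;        /* aggregate us per call */
printf("rank %d: %.2f us per call\n", rank, avg);
/* The cheap non-root iterations overlap with the expensive (~170 us)
 * root iterations, so every rank reports roughly the same ~33 us. */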

Conclusions
 * There's no single correct way of measuring; the benchmarks simply
   differ
 * The regression is still there
 * osu benchmarks for non-symmetric collectives can be misleading by
   reporting (by default) only the avg over all ranks. Add to this that
   many people misread the numbers as min/avg/max _over the
   iterations_...

> There have been also design changes in 2.0 series to
> deliver better performance at the applications-level.
> 
> Do you see any applications-level performance degradation with 2.0GA
> compared to 1.9b? If so, please let us know. We will be happy to take
> a look at this issue in detail. 

I'm mostly a systems person, and this quite significant set of
regressions has more or less stopped 2.0ga from reaching our users and
application experts, so there is not much to go on yet.

We have run VASP and it was possibly a bit slower on 2.0ga vs 1.9b.

I know that there are a bunch of applications that care about small
alltoall (CPMD comes to mind). And given that there are significantly
more bad spots than good spots, it seems likely 1.9b will win (at least
for collectives, at around 128 ranks, on a dual-socket SNB with FDR).

But only time and real data will tell.

Regards,
 Peter K

> Thanks, 
> 
> DK
...
> Subject: [mvapich-discuss] Several 2.0-ga collective performance
> regressions (vs 1.9a2, 1.9b, 2.0a)
...
> Short summary:
> 
> I got around to building the final 2.0 release. I noticed best-ever
> performance on my non-blocking send/recv tests, yay, but several areas
> of performance regressions in the IMB (Intel MPI Benchmarks, formerly
> PMB; collective performance) results (vs. 1.9a2, 1.9b and 2.0a).
...
> there are four
> large (more than a few values / transfer sizes) bad areas (bright
> red, i.e. 2.0-ga taking more than 200% of the time 1.9b took):
> 
>  1) AG, AllGather. Increasingly bad but worst at large-ish sizes. Note
>  that the three largest sizes are ok (256K, 512K, 1M).
> 
>  2&3) G, Gather. Bad at small and at large sizes (but ok in the
> middle).
> 
>  4) AA, AlltoAll. Bad for small sizes
> 
>  (and a potential 5th would be Bc, Bcast, which is bad-ish for
>  everything but large sizes).
...
> Background information:
> 
> Hardware:
>  * dual socket Xeon E5 (2x 8-core) 32G each
>  * Mlnx FDR single switch (for this test)
> 
> Software:
>  * CentOS-6.5
>  * Intel compilers 14.0.2
>  * RHEL/CentOS IB stack
>  * slurm with cgroups (for this test only whole nodes)
>  * HT/SMT not enabled
> 
> MVAPICH build:
>  * configure opts: --enable-hybrid --enable-shared --prefix=...
>  * env CC, CXX, FC, F77 set for intel
>  * no rdmacm, writeable umad0, limic or other oddities
>  * 1.9b rebuilt in exact same env for verification
> 
> Job launch:
>  * verified correct rank pinning and launch
>  * launch cmd: "mpiexec.hydra -bootstrap slurm IMB..."
>  * 1.9b and 2.0-ga run on same node-set
>  * geometry: 128 ranks on 8 nodes
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mv19b_vs_mvp20ga_osu.png
Type: image/png
Size: 36227 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20140724/1458a560/attachment-0001.png>

