[mvapich-discuss] Several 2.0-ga collective performance regressions (vs 1.9a2, 1.9b, 2.0a)

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Jul 25 09:42:51 EDT 2014


Hi Peter,

Thanks for posting performance graphs for some of the collectives. Do you see these discrepancies only at 128 processes, or for other configurations as well? We need to check whether the correct algorithms are being selected on your platform. I will initiate an offline discussion with you about this and we will investigate the issue further.

Thanks,

DK

Sent from my iPhone

> On Jul 24, 2014, at 8:27 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
> 
> On Tue, 8 Jul 2014 18:46:48 +0000
> "Panda, Dhabaleswar" <panda at cse.ohio-state.edu> wrote:
> 
>> Hi Peter, 
>> 
>> Thanks for your note and posting the detailed performance results.
> 
> Thanks for replying, and I apologize for my delayed response (your
> comments raised some non-trivial issues and it's vacation time...).
> 
> Please see my comments inline below.
> 
>> Please note that collectives in MVAPICH2 2.0 series have been
>> optimized with OSU Collective Benchmarks (as a part of the OMB test
>> suite, not IMB).
> 
> Valid point; the way the osu collective benchmarks measure (and report)
> differs quite significantly from what IMB does (and, in a way,
> illustrates how differently applications can react).
> 
> I reran my tests (same system, same build, etc.) with osu instead of
> IMB. For Alltoall and Allgather it paints about the same picture
> (confirms reported regressions). Gather, though, is a different thing
> entirely...
> 
> See attached graphs for mvp19b vs mvp20ga for these three collectives.
> 
> For small sizes on Gather the situation seems quite confusing at first
> (not made any less confusing by osu printing only the avg by default):
> 
> My original report:
> mvp19b IMB ~6 us
> mvp20ga IMB ~33 us
> 
> Now osu:
> mvp19b osu ~3 us
> mvp20ga osu ~2 us (15x faster than the IMB number, and shows 2.0ga ahead of 1.9b!)
> 
> The difference stems from the big variation in time across ranks
> (natural, as gather is just a fire-and-forget send for most ranks). Osu
> times each iteration on each rank and reports the avg over the ranks.
> Since I have 127 quick ranks and 1 slow one (the root), the number is
> dominated by the quick fire-and-forget sends. Osu also syncs everything
> up with a Barrier between each call.
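> 
> Roughly, the measurement works like the sketch below (my own
> approximation in plain MPI C, not the actual OMB source; message size
> and iteration count are made up). The point is that each rank times
> only its own MPI_Gather calls and the statistics are taken over ranks:
> 
>   /* Sketch of an osu_gather-style measurement (my approximation, not
>    * the OMB source). Each rank times its own MPI_Gather calls; the
>    * per-rank averages are then reduced, so 127 cheap fire-and-forget
>    * sends and one expensive root dominate the reported avg. */
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   int main(int argc, char **argv)
>   {
>       int rank, size, i, iters = 1000, msg = 8;   /* 8-byte "small" size */
>       double t = 0.0, my_avg, sum, min, max;
>       char *sbuf, *rbuf;
> 
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>       sbuf = malloc(msg);
>       rbuf = malloc((size_t)msg * size);
> 
>       for (i = 0; i < iters; i++) {
>           double t0 = MPI_Wtime();
>           MPI_Gather(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR,
>                      0, MPI_COMM_WORLD);          /* fixed root: rank 0 */
>           t += MPI_Wtime() - t0;
>           MPI_Barrier(MPI_COMM_WORLD);            /* re-sync every iteration */
>       }
>       my_avg = t / iters * 1e6;                   /* us per call, this rank */
> 
>       /* reported numbers are min/avg/max over ranks, not iterations */
>       MPI_Reduce(&my_avg, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>       MPI_Reduce(&my_avg, &min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
>       MPI_Reduce(&my_avg, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
>       if (rank == 0)
>           printf("avg %.2f  min %.2f  max %.2f (us)\n", sum / size, min, max);
> 
>       free(sbuf); free(rbuf);
>       MPI_Finalize();
>       return 0;
>   }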
> 
> Looking at min/avg/max from osu it's clear there is a regression in
> performance on the root rank but none on the other ranks (the root rank
> now takes ~170 us on 2.0ga vs ~50 us on 1.9b).
> 
> With this understanding it's still hard to see how IMB could come up
> with 33 us, right? It turns out IMB rotates the root role for the
> collective between iterations and times all 1000 iterations together.
> Overlapping the more expensive (~170 us) root role across ranks gives
> min == avg == max == 33 us.
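> 
> For comparison, an IMB-Gather-style loop (again my approximation of the
> scheme, not the IMB source) rotates the root and only looks at the
> total time for the whole batch of iterations:
> 
>   /* Sketch of an IMB-Gather-style measurement (my approximation, not
>    * the IMB source): the root rotates every iteration and the whole
>    * batch is timed as one block, so the expensive root turns overlap
>    * across ranks and min/avg/max over ranks come out nearly equal. */
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   int main(int argc, char **argv)
>   {
>       int rank, size, i, iters = 1000, msg = 8;
>       double t0, per_call;
>       char *sbuf, *rbuf;
> 
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>       sbuf = malloc(msg);
>       rbuf = malloc((size_t)msg * size);
> 
>       MPI_Barrier(MPI_COMM_WORLD);                /* one sync, then the batch */
>       t0 = MPI_Wtime();
>       for (i = 0; i < iters; i++) {
>           MPI_Gather(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR,
>                      i % size, MPI_COMM_WORLD);   /* root rotates each call */
>       }
>       per_call = (MPI_Wtime() - t0) / iters * 1e6;
> 
>       /* every rank spends ~1/size of the loop as root, hence the flat 33 us */
>       printf("rank %d: %.2f us per call\n", rank, per_call);
> 
>       free(sbuf); free(rbuf);
>       MPI_Finalize();
>       return 0;
>   }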
> 
> Conclusions:
> * There's not really one correct way of measuring; the two simply differ.
> * The regression is still there.
> * The osu benchmarks can be misleading for non-symmetric collectives by
>   reporting (by default) only the avg over all ranks. Add to this that
>   many people misread these numbers as min/avg/max _over the
>   iterations_...
> 
>> There have been also design changes in 2.0 series to
>> deliver better performance at the applications-level.
>> 
>> Do you see any applications-level performance degradation with 2.0GA
>> compared to 1.9b? If so, please let us know. We will be happy to take
>> a look at this issue in detail.
> 
> I'm mostly a systems person, and this quite significant set of
> regressions has kind of stopped 2.0ga from reaching our users and
> application experts, so there is not much to go on yet.
> 
> We have run VASP and it was possibly a bit slower on 2.0ga vs 1.9b.
> 
> I know that there are a bunch of applications that care about small
> alltoall (CPMD comes to mind). And given that there are significantly
> more bad spots than good spots, it seems likely 1.9b will win (at least
> for collectives, at around 128 ranks, on a dual-socket SNB with FDR).
> 
> But only time and real data will tell.
> 
> Regards,
> Peter K
> 
>> Thanks, 
>> 
>> DK
> ...
>> Subject: [mvapich-discuss] Several 2.0-ga collective performance
>> regressions (vs 1.9a2, 1.9b, 2.0a)
> ...
>> Short summary:
>> 
>> I got around to building the final 2.0 release. I noticed best-ever
>> performance on my non-blocking send/recv tests, yay, but several areas
>> of performance regression in the IMB (Intel MPI Benchmarks, formerly
>> PMB; collective performance) results (vs. 1.9a2, 1.9b and 2.0a).
> ...
>> there are four
>> large (more than a few values / transfer sizes) bad areas (bright
>> red, i.e. 2.0-ga taking more than 200% of the time 1.9b took):
>> 
>> 1) AG, AllGather. Increasingly bad but worst at large-ish sizes. Note
>> that the three largest sizes are ok (256K, 512K, 1M).
>> 
>> 2&3) G, Gather. Bad at small and at large sizes (but OK in the
>> middle).
>> 
>> 4) AA, AlltoAll. Bad at small sizes.
>> 
>> (and a potential 5th would be Bc, Bcast, which is bad-ish for
>> everything but large).
> ...
>> Background information:
>> 
>> Hardware:
>> * dual socket Xeon E5 (2x 8-core) 32G each
>> * Mlnx FDR single switch (for this test)
>> 
>> Software:
>> * CentOS-6.5
>> * Intel compilers 14.0.2
>> * RHEL/CentOS IB stack
>> * slurm with cgroups (for this test only whole nodes)
>> * HT/SMT not enabled
>> 
>> MVAPICH build:
>> * configure opts: --enable-hybrid --enable-shared --prefix=...
>> * env CC, CXX, FC, F77 set for intel
>> * no rdmacm, writeable umad0, limic or other oddities
>> * 1.9b rebuilt in exact same env for verification
>> 
>> Job launch:
>> * verified correct rank pinning and launch
>> * launch cmd: "mpiexec.hydra -bootstrap slurm IMB..."
>> * 1.9b and 2.0-ga run on same node-set
>> * geometry: 128 ranks on 8 nodes
> 
> <mv19b_vs_mvp20ga_osu.png>

