[mvapich-discuss] Several 2.0-ga collective performance regressions (vs 1.9a2, 1.9b, 2.0a)
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Tue Jul 8 14:46:48 EDT 2014
Hi Peter,
Thanks for your note and posting the detailed performance results.
Please note that collectives in the MVAPICH2 2.0 series have been optimized with the OSU Collective Benchmarks
(part of the OMB test suite, not IMB). There have also been design changes in the 2.0 series to deliver better
performance at the application level.
Do you see any application-level performance degradation with 2.0-GA compared to 1.9b? If so, please
let us know. We will be happy to take a look at this issue in detail.
Thanks,
DK
________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Peter Kjellström [cap at nsc.liu.se]
Sent: Wednesday, June 25, 2014 9:39 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Several 2.0-ga collective performance regressions (vs 1.9a2, 1.9b, 2.0a)
Hi there MVAPICH team!
Short summary:
I got around to building the final 2.0 release. I noticed the best performance yet
on my non-blocking send/recv tests, yay, but also several areas of performance
regression in the IMB (Intel MPI Benchmarks, formerly PMB) collective results
(vs. 1.9a2, 1.9b and 2.0a).
Detailed description:
The attached png shows the difference in performance for my reference
IMB run (128 ranks on 8 full 16-core nodes) between 2.0-ga and 1.9b.
The data is 2.0-ga compared to 1.9b, that is, green is good for
2.0-ga and red is bad (grey is no difference). Size and brightness are
proportional to the magnitude of the difference:
grey: within +/- 10%
color size1: within +/- 50%
color size2: within +/- 100%
color size3: within +/- 200%
bright color size4: more than +/- 200%
The columns are one per IMB test (SR = SendRecv, AG = AllGather, etc.).
The rows are transfer sizes (first row smallest, last row 1M).
With that background it should be easy to see that there are four large
bad areas (each spanning more than a few transfer sizes; bright red,
i.e. 2.0-ga more than 200% slower than 1.9b):
1) AG, AllGather. Increasingly bad but worst at large-ish sizes. Note
that the three largest sizes are ok (256K, 512K, 1M).
2&3) G, Gather. Bad at small and at large sizes (but ok in the middle).
4) AA, AlltoAll. Bad at small sizes.
(A potential 5th would be Bc, Bcast, which is bad-ish at everything
but large sizes.)
Feel free to dig into the attached IMB output to discover the real
numbers behind the graphics...
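In case it helps to reproduce this outside of IMB, here is a minimal sketch of
timing a single MPI_Allgather size. This is illustrative only, not something the
numbers above came from; the message size and iteration count are placeholder
assumptions and would need to be swept to match the IMB rows.

/* allgather_check.c -- minimal sketch for timing one MPI_Allgather size.
 * The message size and iteration count below are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const size_t msg_bytes = 64 * 1024;   /* per-rank size: an assumption */
    const int iters = 100;                /* iteration count: an assumption */
    int rank, nranks, i;
    char *sendbuf, *recvbuf;
    double t0, t1, local_avg, max_avg;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    sendbuf = malloc(msg_bytes);
    recvbuf = malloc(msg_bytes * (size_t)nranks);
    memset(sendbuf, rank, msg_bytes);

    /* one untimed warm-up so connection setup is not measured */
    MPI_Allgather(sendbuf, (int)msg_bytes, MPI_CHAR,
                  recvbuf, (int)msg_bytes, MPI_CHAR, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Allgather(sendbuf, (int)msg_bytes, MPI_CHAR,
                      recvbuf, (int)msg_bytes, MPI_CHAR, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* report the slowest rank's average time per iteration */
    local_avg = (t1 - t0) / iters;
    MPI_Reduce(&local_avg, &max_avg, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Allgather %zu bytes/rank: %.2f us (max avg over ranks)\n",
               msg_bytes, max_avg * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with the same mpiexec.hydra command listed below,
sweeping msg_bytes over the IMB sizes should show whether the AllGather
regression reproduces with IMB out of the picture.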
Regards,
Peter
Background information:
Hardware:
* dual-socket Xeon E5 (2x 8-core), 32 GB each
* Mellanox FDR, single switch (for this test)
Software:
* CentOS-6.5
* Intel compilers 14.0.2
* RHEL/CentOS IB stack
* slurm with cgroups (for this test only whole nodes)
* HT/SMT not enabled
MVAPICH build:
* configure opts: --enable-hybrid --enable-shared --prefix=...
* env CC, CXX, FC, F77 set for intel
* no rdmacm, writable umad0, LiMIC2, or other oddities
* 1.9b rebuilt in exact same env for verification
Job launch:
* verified correct rank pinning and launch (see the pinning-check sketch below)
* launch cmd: "mpiexec.hydra -bootstrap slurm IMB..."
* 1.9b and 2.0-ga run on same node-set
* geometry: 128 ranks on 8 nodes
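One way to double-check the pinning (not necessarily how it was verified here) is
a trivial MPI program like the following sketch, assuming a glibc system where
sched_getcpu() is available:

/* pinning_check.c -- minimal sketch: each rank prints its host and the
 * core it is currently running on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(host, &namelen);

    /* sched_getcpu() reports the core the calling thread last ran on */
    printf("rank %3d/%d on %s core %d\n", rank, nranks, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}

With 128 ranks on 8 whole 16-core nodes this should show 16 ranks per node,
each on a distinct core.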