[mvapich-discuss] Several 2.0-ga collective performance regressions (vs 1.9a2, 1.9b, 2.0a)

Tue Jul 8 14:46:48 EDT 2014

Hi Peter, 

Thanks for your note and for posting the detailed performance results. 

Please note that collectives in the MVAPICH2 2.0 series have been optimized with the OSU collective
benchmarks (part of the OMB test suite, not IMB). There have also been design changes in the 2.0 series
to deliver better performance at the application level.

Do you see any application-level performance degradation with 2.0GA compared to 1.9b? If so, please
let us know. We will be happy to take a look at this issue in detail. 

Thanks, 

DK
________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Peter Kjellström [cap at nsc.liu.se]
Sent: Wednesday, June 25, 2014 9:39 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Several 2.0-ga collective performance regressions (vs 1.9a2, 1.9b, 2.0a)

Hi there MVAPICH team!

Short summary:

I got around to building the final 2.0 release. I noticed best-ever
performance on my non-blocking send/recv tests, yay, but several areas
of performance regression in the collective results from IMB (Intel MPI
Benchmarks, formerly PMB) vs. 1.9a2, 1.9b and 2.0a.

Detailed description:

The attached png shows the difference in performance for my reference
IMB run (128 ranks on 8 full 16-core nodes) between 2.0-ga and 1.9b.
The data is 2.0-ga as compared to 1.9b, that is, green is good for
2.0-ga and red is bad (grey is no difference). Size and brightness are
proportional to the size of the difference:

 grey: within +/- 10%
 color size1: within +/- 50%
 color size2: within +/- 100%
 color size3: within +/- 200%
 bright color size4: more than +/- 200%
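
For anyone who wants to reproduce the colouring from two sets of timings,
here is a minimal Python sketch of the bucketing described in the legend
above. It is not the script used to make the png. The function name
classify() and the definition of "difference" as (t_new - t_old) / t_old
are assumptions; the original post does not say how the percentage was
computed.

```python
def classify(t_new, t_old):
    """Bucket one (test, transfer size) cell for the comparison grid.

    t_new: time measured with the newer build (e.g. 2.0-ga)
    t_old: time measured with the reference build (e.g. 1.9b)
    Returns (color, level): level 0 is grey, 1-3 are the three marker
    sizes, 4 is the bright, largest marker.

    ASSUMPTION: the difference is the change in time relative to the
    reference, in percent. The original post does not define it.
    """
    if t_old <= 0:
        raise ValueError("reference time must be positive")

    diff_pct = (t_new - t_old) / t_old * 100.0
    magnitude = abs(diff_pct)

    if magnitude <= 10:
        return ("grey", 0)

    # Lower time is better, so a negative difference favours the new build.
    color = "green" if diff_pct < 0 else "red"

    if magnitude <= 50:
        level = 1
    elif magnitude <= 100:
        level = 2
    elif magnitude <= 200:
        level = 3
    else:
        level = 4
    return (color, level)


if __name__ == "__main__":
    # Made-up timings, only to show the call.
    for t_new, t_old in [(105.0, 100.0), (140.0, 100.0), (400.0, 100.0)]:
        print(t_new, t_old, classify(t_new, t_old))
```

Note that with this definition an improvement can never exceed -100%, so
a bright green cell would be impossible. If the png contains any, the
percentage must have been computed some other way (for example as a
ratio of the slower to the faster time).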

The columns are one per IMB test (SR = SendRecv, AG = AllGather, etc.).
The rows are transfer size (first row smallest, last row 1M).

With that background it should be easy to see that there are four large
(more than a few values / transfer sizes) bad areas (bright red, 2.0-ga
worse than +200% of the time it took 1.9b):

 1) AG, AllGather. Increasingly bad but worst at large-ish sizes. Note
 that the three largest sizes are ok (256K, 512K, 1M).

 2&3) G, Gather. Bad at small sizes and at large (but ok in the middle).

 4) AA, AlltoAll. Bad for small sizes.

 (A potential 5th would be Bc, Bcast, which is bad-ish for everything
 but large sizes.)

Feel free to dig into the attached IMB output to discover the real
numbers behind the graphics...

Regards,
 Peter

Background information:

Hardware:
 * dual socket Xeon E5 (2x 8-core) 32G each
 * Mlnx FDR single switch (for this test)

Software:
 * CentOS-6.5
 * Intel compilers 14.0.2
 * RHEL/CentOS IB stack
 * slurm with cgroups (for this test only whole nodes)
 * HT/SMT not enabled

MVAPICH build:
 * configure opts: --enable-hybrid --enable-shared --prefix=...
 * env CC, CXX, FC, F77 set for intel
 * no rdmacm, writeable umad0, limic or other oddities
 * 1.9b rebuilt in exact same env for verification

Job launch:
 * verified correct rank pinning and launch
 * launch cmd: "mpiexec.hydra -bootstrap slurm IMB..."
 * 1.9b and 2.0-ga run on same node-set
 * geometry: 128 ranks on 8 nodes
