[mvapich-discuss] Performance differences between mvapich2-1.0 and mvapich2-1.2
Bernd Kallies
kallies at zib.de
Tue Jul 29 12:45:38 EDT 2008
It seems to me that mvapich2-1.2rc1 is slower than previous versions
when compiled and used with defaults. I'd like to know if I forgot
some secret preprocessor flag or configure option for 1.2.
I compiled the nightly build of mvapich2-1.0 as of July 28 (I guess it
is something like mvapich2-1.0.5) with the following settings:
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
export CFLAGS='-D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
-DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -O2'
configure --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd
--disable-romio --enable-sharedlibs=gcc --without-mpe
I compiled the tarball source of mvapich2-1.2rc1 with
unset CFLAGS
./configure --enable-romio --with-file-system=lustre+nfs
--enable-fast=defopt --with-rdma=gen2 --with-thread-package
--enable-sharedlibs=gcc --without-mpe
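In both cases the remainder of the build is the usual configure/make
sequence; a minimal sketch (the install prefix below is a placeholder):
./configure <options as above> --prefix=/opt/mvapich2-<version>-intel
make
make install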
I get the following when running osu_alltoall with 1 task per node on
two nodes after setting MV2_NUM_PORTS=2 and MV2_ENABLE_AFFINITY=0.
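For reference, the runs are launched roughly like this (the hostfile
name is a placeholder; the MPD-based mpiexec passes the MV2_* variables
to the ranks via -genv):
mpdboot -n 2 -f ./hosts
mpiexec -genv MV2_NUM_PORTS 2 -genv MV2_ENABLE_AFFINITY 0 -n 2 ./osu_alltoall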
mvapich2-1.0.5-intel:
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1
# Size Latency (us)
1 1.62
2 1.71
4 1.66
8 1.64
16 1.68
32 1.74
64 1.97
128 3.04
256 3.42
512 4.01
1024 5.26
2048 6.62
4096 9.45
8192 15.20
16384 17.76
32768 23.21
65536 38.60
131072 76.32
262144 151.70
524288 296.74
1048576 591.68
mvapich2-1.2rc1-intel:
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1
# Size Latency (us)
1 1.87
2 1.80
4 1.81
8 1.82
16 1.86
32 1.92
64 2.10
128 3.16
256 3.53
512 4.07
1024 5.33
2048 6.79
4096 9.54
8192 15.34
16384 17.48
32768 22.88
65536 38.78
131072 76.55
262144 149.74
524288 297.11
1048576 591.25
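Up to about 8 KB the 1.2rc1 build is consistently 0.05-0.25 us slower
(up to ~15% for the smallest messages); above that the two builds agree
within noise. A quick way to compare, assuming both outputs are saved
to files (names are placeholders):
paste alltoall-1.0.5.txt alltoall-1.2rc1.txt | \
awk '$1 ~ /^[0-9]+$/ {printf "%8d %8.2f %8.2f %+6.1f%%\n", $1, $2, $4, 100*($4-$2)/$2}'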
Other OSU benchmarks yield no visible differences between the two
builds, e.g. osu_mbw_mr with 2 nodes and 4 tasks per node:
mvapich2-1.0.5-intel:
# OSU MPI Multiple Bandwidth / Message Rate Test v3.1
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/s Messages/s
1 3.45 3447336.26
2 6.93 3463236.43
4 13.83 3458551.26
8 27.68 3460000.08
16 62.91 3931824.03
32 109.74 3429389.41
64 213.14 3330258.12
128 353.90 2764881.74
256 624.27 2438548.84
512 980.57 1915173.15
1024 1241.38 1212281.33
2048 1463.71 714703.42
4096 1612.25 393616.25
8192 1721.11 210096.00
16384 1851.29 112993.94
32768 2051.28 62600.09
65536 2062.08 31464.92
131072 2065.59 15759.17
262144 2074.04 7911.82
524288 2082.66 3972.35
1048576 2087.94 1991.22
2097152 2090.20 996.69
4194304 2075.23 494.77
mvapich2-1.2rc1-intel:
# OSU MPI Multiple Bandwidth / Message Rate Test v3.1
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/s Messages/s
1 3.42 3424686.07
2 6.92 3459442.70
4 13.73 3431691.09
8 27.59 3449218.84
16 62.63 3914337.15
32 108.91 3403302.14
64 210.89 3295101.65
128 347.89 2717920.88
256 621.49 2427687.32
512 982.32 1918595.24
1024 1246.40 1217187.35
2048 1490.18 727625.11
4096 1684.54 411264.55
8192 1768.11 215833.58
16384 1852.36 113059.37
32768 2048.83 62525.18
65536 2062.01 31463.76
131072 2066.38 15765.20
262144 2074.90 7915.12
524288 2082.75 3972.54
1048576 2088.07 1991.34
2097152 2090.04 996.61
4194304 2077.47 495.31
I also compiled the quantum chemistry code CPMD 3.11.1 with both libs.
The code has its own profiling. A benchmark run with 64 nodes, 1 task
per node, 1 thread per task, application-defined task pinning, and
MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0 yields the results below.
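Such a run is launched analogously to the OSU runs above, sketched here
with placeholder file names (CPMD is conventionally started as cpmd.x
with the input file as argument):
mpdboot -n 64 -f ./hosts64
mpiexec -genv MV2_NUM_PORTS 2 -genv MV2_ENABLE_AFFINITY 0 -n 64 ./cpmd.x bench.inp > bench.out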
mvapich2-1.0.5-intel:
...
CPU TIME : 0 HOURS 17 MINUTES 7.53 SECONDS
ELAPSED TIME : 0 HOURS 17 MINUTES 40.26 SECONDS
...
================================================================
= COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
= SEND/RECEIVE 36385. BYTES 722421. =
= BROADCAST 37880. BYTES 368. =
= GLOBAL SUMMATION 393974. BYTES 10556. =
= GLOBAL MULTIPLICATION 0. BYTES 1. =
= ALL TO ALL COMM 484310. BYTES 46464. =
= PERFORMANCE TOTAL TIME =
= SEND/RECEIVE 681.133 MB/S 38.591 SEC =
= BROADCAST 87.115 MB/S 0.160 SEC =
= GLOBAL SUMMATION 1520.563 MB/S 16.410 SEC =
= GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
= ALL TO ALL COMM 86.898 MB/S 258.959 SEC =
= SYNCHRONISATION 1.750 SEC =
================================================================
mvapich2-1.2rc1-intel:
...
CPU TIME : 0 HOURS 18 MINUTES 59.23 SECONDS
ELAPSED TIME : 0 HOURS 19 MINUTES 31.68 SECONDS
...
================================================================
= COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
= SEND/RECEIVE 36385. BYTES 722421. =
= BROADCAST 37880. BYTES 368. =
= GLOBAL SUMMATION 393974. BYTES 10556. =
= GLOBAL MULTIPLICATION 0. BYTES 1. =
= ALL TO ALL COMM 484310. BYTES 46464. =
= PERFORMANCE TOTAL TIME =
= SEND/RECEIVE 699.651 MB/S 37.570 SEC =
= BROADCAST 87.114 MB/S 0.160 SEC =
= GLOBAL SUMMATION 1557.608 MB/S 16.020 SEC =
= GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
= ALL TO ALL COMM 61.302 MB/S 367.082 SEC =
= SYNCHRONISATION 1.950 SEC =
================================================================
The difference is reproducible (mvapich2-1.2rc1-intel is slower,
apparently because of the slow all-to-all communication: its time grows
from 258.959 s to 367.082 s, i.e. by about 108 s, which accounts for
almost all of the roughly 111 s difference in elapsed time), also when
compared to mvapich2-1.0.3 built from the tarball, or to mvapich2-1.0.1
and mvapich-0.9.9 (both precompiled binaries from SGI). Note that the
benchmarks are run with no intra-node communication.
Sincerely, BK
--
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies at zib.de