[mvapich-discuss] Performance differences between mvapich2-1.0
and mvapich2-1.2
Brian Curtis
curtisbr at cse.ohio-state.edu
Fri Aug 1 14:54:51 EDT 2008
Bernd,
I've looked over your configuration for MVAPICH2-1.2rc1 and I have some
suggestions. The default configuration of MVAPICH2-1.2 includes
--enable-fast=defopt,ndebug. By specifying only --enable-fast=defopt,
you lose the ndebug option, which results in assertions and other debug
code being compiled in. The "--with-thread-package" parameter is a no-op
unless you give it a value (as written, it selects pthread). I recommend
checking out our latest source from trunk and using the following
configuration:
./configure --with-file-system=lustre+nfs --without-mpe
--enable-sharedlibs=gcc
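For completeness, the full build would then look roughly like this (the
source directory and install prefix below are just illustrative):

cd mvapich2-trunk
# the defaults already include --enable-fast=defopt,ndebug
./configure --prefix=/opt/mvapich2-trunk --with-file-system=lustre+nfs \
    --without-mpe --enable-sharedlibs=gcc
make
make install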
Please note that ROMIO is enabled by default and gen2 is selected by
default for Linux. Let us know if your performance improves by using
the latest source from trunk and the recommended configuration.
Brian
Bernd Kallies wrote:
> It seems to me that mvapich2-1.2rc1 is slower than previous versions
> when compiled and used with the defaults. I'd like to know if I forgot
> some secret preprocessor flag or configure option for 1.2.
>
> I compiled the nightly build of mvapich2-1.0 as of July 28 (I guess it
> is something like mvapich2-1.0.5) with the following settings:
>
> export CC=icc
> export CXX=icpc
> export F77=ifort
> export F90=ifort
> export CFLAGS='-D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -O2'
> configure --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd
> --disable-romio --enable-sharedlibs=gcc --without-mpe
>
> I compiled the tarball source of mvapich2-1.2rc1 with
> unset CFLAGS
> ./configure --enable-romio --with-file-system=lustre+nfs
> --enable-fast=defopt --with-rdma=gen2 --with-thread-package
> --enable-sharedlibs=gcc --without-mpe
>
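> For reference, both builds are launched the same way, roughly as
> follows (hostnames are illustrative; this assumes the mpd process
> manager, which the 1.0 build selects explicitly with --with-pm=mpd):
>
> # hosts contains the two node names, one per line
> mpdboot -n 2 -f hosts
> mpiexec -genv MV2_NUM_PORTS 2 -genv MV2_ENABLE_AFFINITY 0 -n 2 ./osu_alltoall
>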
> I get the following when running osu_alltoall with 1 task per node on
> two nodes after setting MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0:
>
> mvapich2-1.0.5-intel:
> # OSU MPI All-to-All Personalized Exchange Latency Test v3.1
> # Size Latency (us)
> 1 1.62
> 2 1.71
> 4 1.66
> 8 1.64
> 16 1.68
> 32 1.74
> 64 1.97
> 128 3.04
> 256 3.42
> 512 4.01
> 1024 5.26
> 2048 6.62
> 4096 9.45
> 8192 15.20
> 16384 17.76
> 32768 23.21
> 65536 38.60
> 131072 76.32
> 262144 151.70
> 524288 296.74
> 1048576 591.68
>
> mvapich2-1.2rc1-intel:
> # OSU MPI All-to-All Personalized Exchange Latency Test v3.1
> # Size Latency (us)
> 1 1.87
> 2 1.80
> 4 1.81
> 8 1.82
> 16 1.86
> 32 1.92
> 64 2.10
> 128 3.16
> 256 3.53
> 512 4.07
> 1024 5.33
> 2048 6.79
> 4096 9.54
> 8192 15.34
> 16384 17.48
> 32768 22.88
> 65536 38.78
> 131072 76.55
> 262144 149.74
> 524288 297.11
> 1048576 591.25
>
> Other OSU benchmarks yield no visible differences between the two
> builds, e.g. osu_mbw_mr with 2 nodes and 4 tasks per node:
>
> mvapich2-1.0.5-intel:
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.1
> # [ pairs: 4 ] [ window size: 64 ]
> # Size MB/s Messages/s
> 1 3.45 3447336.26
> 2 6.93 3463236.43
> 4 13.83 3458551.26
> 8 27.68 3460000.08
> 16 62.91 3931824.03
> 32 109.74 3429389.41
> 64 213.14 3330258.12
> 128 353.90 2764881.74
> 256 624.27 2438548.84
> 512 980.57 1915173.15
> 1024 1241.38 1212281.33
> 2048 1463.71 714703.42
> 4096 1612.25 393616.25
> 8192 1721.11 210096.00
> 16384 1851.29 112993.94
> 32768 2051.28 62600.09
> 65536 2062.08 31464.92
> 131072 2065.59 15759.17
> 262144 2074.04 7911.82
> 524288 2082.66 3972.35
> 1048576 2087.94 1991.22
> 2097152 2090.20 996.69
> 4194304 2075.23 494.77
>
> mvapich2-1.2rc1-intel:
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.1
> # [ pairs: 4 ] [ window size: 64 ]
> # Size MB/s Messages/s
> 1 3.42 3424686.07
> 2 6.92 3459442.70
> 4 13.73 3431691.09
> 8 27.59 3449218.84
> 16 62.63 3914337.15
> 32 108.91 3403302.14
> 64 210.89 3295101.65
> 128 347.89 2717920.88
> 256 621.49 2427687.32
> 512 982.32 1918595.24
> 1024 1246.40 1217187.35
> 2048 1490.18 727625.11
> 4096 1684.54 411264.55
> 8192 1768.11 215833.58
> 16384 1852.36 113059.37
> 32768 2048.83 62525.18
> 65536 2062.01 31463.76
> 131072 2066.38 15765.20
> 262144 2074.90 7915.12
> 524288 2082.75 3972.54
> 1048576 2088.07 1991.34
> 2097152 2090.04 996.61
> 4194304 2077.47 495.31
>
> I also compiled the quantum chemistry code CPMD 3.11.1 with both libs.
> The code has its own profiling. A benchmark run with 64 nodes, 1 task
> per node, 1 thread per task, application-defined task pinning, and
> MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0 yields:
>
> mvapich2-1.0.5-intel:
> ...
> CPU TIME : 0 HOURS 17 MINUTES 7.53 SECONDS
> ELAPSED TIME : 0 HOURS 17 MINUTES 40.26 SECONDS
> ...
> ================================================================
> = COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
> = SEND/RECEIVE 36385. BYTES 722421. =
> = BROADCAST 37880. BYTES 368. =
> = GLOBAL SUMMATION 393974. BYTES 10556. =
> = GLOBAL MULTIPLICATION 0. BYTES 1. =
> = ALL TO ALL COMM 484310. BYTES 46464. =
> = PERFORMANCE TOTAL TIME =
> = SEND/RECEIVE 681.133 MB/S 38.591 SEC =
> = BROADCAST 87.115 MB/S 0.160 SEC =
> = GLOBAL SUMMATION 1520.563 MB/S 16.410 SEC =
> = GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
> = ALL TO ALL COMM 86.898 MB/S 258.959 SEC =
> = SYNCHRONISATION 1.750 SEC =
> ================================================================
>
> mvapich2-1.2rc1-intel:
> ...
> CPU TIME : 0 HOURS 18 MINUTES 59.23 SECONDS
> ELAPSED TIME : 0 HOURS 19 MINUTES 31.68 SECONDS
> ...
> ================================================================
> = COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
> = SEND/RECEIVE 36385. BYTES 722421. =
> = BROADCAST 37880. BYTES 368. =
> = GLOBAL SUMMATION 393974. BYTES 10556. =
> = GLOBAL MULTIPLICATION 0. BYTES 1. =
> = ALL TO ALL COMM 484310. BYTES 46464. =
> = PERFORMANCE TOTAL TIME =
> = SEND/RECEIVE 699.651 MB/S 37.570 SEC =
> = BROADCAST 87.114 MB/S 0.160 SEC =
> = GLOBAL SUMMATION 1557.608 MB/S 16.020 SEC =
> = GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
> = ALL TO ALL COMM 61.302 MB/S 367.082 SEC =
> = SYNCHRONISATION 1.950 SEC =
> ================================================================
>
> The difference is reproducible (mvapich2-1.2rc1-intel is slower; the
> slow all-to-all communication appears to be the reason), also when
> compared to mvapich2-1.0.3 built from the tarball, or to mvapich2-1.0.1
> and mvapich-0.9.9 (both precompiled and distributed by SGI). Note that
> the benchmarks are run with no intra-node communication.
>
> Sincerely, BK
>