[mvapich-discuss] Performance differences between mvapich2-1.0 and mvapich2-1.2 (fwd)

wei huang huanwei at cse.ohio-state.edu
Wed Jul 30 10:47:16 EDT 2008


Hi Bernd,

Thanks for trying out mvapich2-1.2rc1 and letting us know about the
problem. We are in the process of performance tuning and are looking
into this issue. We will get back to you soon. Thanks.

-- Wei

> ---------- Forwarded message ----------
> Date: Tue, 29 Jul 2008 18:45:38 +0200
> From: Bernd Kallies <kallies at zib.de>
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: [mvapich-discuss] Performance differences between mvapich2-1.0 and
>     mvapich2-1.2
>
> It seems to me that mvapich2-1.2rc1 is slower than previous versions
> when compiled and used with defaults. I'd like to know if I forgot
> some secret preprocessor flag or configure option for 1.2.
>
> I compiled the nightly build of mvapich2-1.0 as of July 28 (I guess it
> is something like mvapich2-1.0.5) with the following settings:
>
> export CC=icc
> export CXX=icpc
> export F77=ifort
> export F90=ifort
> export CFLAGS='-D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -O2'
> configure --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd
> --disable-romio --enable-sharedlibs=gcc --without-mpe
>
> I compiled the tarball source of mvapich2-1.2rc1 with
> unset CFLAGS
> ./configure --enable-romio --with-file-system=lustre+nfs
> --enable-fast=defopt --with-rdma=gen2 --with-thread-package
> --enable-sharedlibs=gcc --without-mpe
>
> I get the following when running osu_alltoall with 1 task per node on
> two nodes after setting MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0:
>
> mvapich2-1.0.5-intel:
> # OSU MPI All-to-All Personalized Exchange Latency Test v3.1
> # Size            Latency (us)
> 1                         1.62
> 2                         1.71
> 4                         1.66
> 8                         1.64
> 16                        1.68
> 32                        1.74
> 64                        1.97
> 128                       3.04
> 256                       3.42
> 512                       4.01
> 1024                      5.26
> 2048                      6.62
> 4096                      9.45
> 8192                     15.20
> 16384                    17.76
> 32768                    23.21
> 65536                    38.60
> 131072                   76.32
> 262144                  151.70
> 524288                  296.74
> 1048576                 591.68
>
> mvapich2-1.2rc1-intel:
> # OSU MPI All-to-All Personalized Exchange Latency Test v3.1
> # Size            Latency (us)
> 1                         1.87
> 2                         1.80
> 4                         1.81
> 8                         1.82
> 16                        1.86
> 32                        1.92
> 64                        2.10
> 128                       3.16
> 256                       3.53
> 512                       4.07
> 1024                      5.33
> 2048                      6.79
> 4096                      9.54
> 8192                     15.34
> 16384                    17.48
> 32768                    22.88
> 65536                    38.78
> 131072                   76.55
> 262144                  149.74
> 524288                  297.11
> 1048576                 591.25
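>
> For reference, a minimal sketch of the pattern behind these numbers is
> below. It is not the osu_alltoall source; message size and iteration
> counts are arbitrary. It simply times repeated MPI_Alltoall calls after
> a few warm-up iterations:
>
> /* Minimal sketch, not the osu_alltoall source: times repeated
>  * MPI_Alltoall calls after a few warm-up iterations.  Message size
>  * and iteration counts are arbitrary. */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, i;
>     const int msg = 1024;              /* bytes sent to each peer */
>     const int skip = 10, iters = 100;
>     double t0 = 0.0, t1;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     char *sbuf = malloc((size_t)msg * size);
>     char *rbuf = malloc((size_t)msg * size);
>
>     for (i = 0; i < skip + iters; i++) {
>         if (i == skip) {               /* discard warm-up iterations */
>             MPI_Barrier(MPI_COMM_WORLD);
>             t0 = MPI_Wtime();
>         }
>         MPI_Alltoall(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR,
>                      MPI_COMM_WORLD);
>     }
>     t1 = MPI_Wtime();
>
>     if (rank == 0)
>         printf("%d bytes: %.2f us per all-to-all\n",
>                msg, (t1 - t0) * 1e6 / iters);
>
>     free(sbuf);
>     free(rbuf);
>     MPI_Finalize();
>     return 0;
> }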
>
> Other OSU benchmarks yield no visible differences between the two
> builds, e.g. osu_mbw_mr with 2 nodes and 4 tasks per node:
>
> mvapich2-1.0.5-intel:
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.1
> # [ pairs: 4 ] [ window size: 64 ]
> # Size                    MB/s          Messages/s
> 1                         3.45          3447336.26
> 2                         6.93          3463236.43
> 4                        13.83          3458551.26
> 8                        27.68          3460000.08
> 16                       62.91          3931824.03
> 32                      109.74          3429389.41
> 64                      213.14          3330258.12
> 128                     353.90          2764881.74
> 256                     624.27          2438548.84
> 512                     980.57          1915173.15
> 1024                   1241.38          1212281.33
> 2048                   1463.71           714703.42
> 4096                   1612.25           393616.25
> 8192                   1721.11           210096.00
> 16384                  1851.29           112993.94
> 32768                  2051.28            62600.09
> 65536                  2062.08            31464.92
> 131072                 2065.59            15759.17
> 262144                 2074.04             7911.82
> 524288                 2082.66             3972.35
> 1048576                2087.94             1991.22
> 2097152                2090.20              996.69
> 4194304                2075.23              494.77
>
> mvapich2-1.2rc1-intel:
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.1
> # [ pairs: 4 ] [ window size: 64 ]
> # Size                    MB/s          Messages/s
> 1                         3.42          3424686.07
> 2                         6.92          3459442.70
> 4                        13.73          3431691.09
> 8                        27.59          3449218.84
> 16                       62.63          3914337.15
> 32                      108.91          3403302.14
> 64                      210.89          3295101.65
> 128                     347.89          2717920.88
> 256                     621.49          2427687.32
> 512                     982.32          1918595.24
> 1024                   1246.40          1217187.35
> 2048                   1490.18           727625.11
> 4096                   1684.54           411264.55
> 8192                   1768.11           215833.58
> 16384                  1852.36           113059.37
> 32768                  2048.83            62525.18
> 65536                  2062.01            31463.76
> 131072                 2066.38            15765.20
> 262144                 2074.90             7915.12
> 524288                 2082.75             3972.54
> 1048576                2088.07             1991.34
> 2097152                2090.04              996.61
> 4194304                2077.47              495.31
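>
> The bandwidth / message-rate test can likewise be sketched as a window
> of non-blocking sends per sender/receiver pair. Again, this is only an
> illustration, not the osu_mbw_mr source; window size, message size and
> iteration count are arbitrary, and it assumes an even number of ranks:
>
> /* Sketch of a windowed bandwidth / message-rate measurement, not the
>  * osu_mbw_mr source.  The first half of the ranks sends to the second
>  * half; assumes an even number of ranks (as in the 2 nodes x 4 tasks
>  * run above). */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> #define WINDOW 64
> #define MSG    4096
> #define ITERS  100
>
> int main(int argc, char **argv)
> {
>     int rank, size, i, w;
>     MPI_Request req[WINDOW];
>     char ack = 0;
>     double t0, t1;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     char *buf  = malloc((size_t)WINDOW * MSG);
>     int pairs  = size / 2;
>     int sender = (rank < pairs);
>     int peer   = sender ? rank + pairs : rank - pairs;
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     t0 = MPI_Wtime();
>     for (i = 0; i < ITERS; i++) {
>         for (w = 0; w < WINDOW; w++) {
>             if (sender)
>                 MPI_Isend(buf + (size_t)w * MSG, MSG, MPI_CHAR, peer, 0,
>                           MPI_COMM_WORLD, &req[w]);
>             else
>                 MPI_Irecv(buf + (size_t)w * MSG, MSG, MPI_CHAR, peer, 0,
>                           MPI_COMM_WORLD, &req[w]);
>         }
>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>         /* short acknowledgement so senders do not run ahead */
>         if (sender)
>             MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>         else
>             MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
>     }
>     t1 = MPI_Wtime();
>
>     if (rank == 0)      /* rough aggregate over all pairs */
>         printf("%.2f MB/s  %.0f messages/s\n",
>                pairs * (double)ITERS * WINDOW * MSG / 1e6 / (t1 - t0),
>                pairs * (double)ITERS * WINDOW / (t1 - t0));
>
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }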
>
> I also compiled the quantum chemistry code CPMD 3.11.1 with both
> libraries. The code has its own profiling. A benchmark run with 64
> nodes, 1 task per node, 1 thread per task, application-defined task
> pinning (sketched further below), and MV2_NUM_PORTS=2
> MV2_ENABLE_AFFINITY=0 yields:
>
> mvapich2-1.0.5-intel:
> ..
>        CPU TIME :    0 HOURS 17 MINUTES  7.53 SECONDS
>    ELAPSED TIME :    0 HOURS 17 MINUTES 40.26 SECONDS
> ..
>  ================================================================
>  = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
>  = SEND/RECEIVE               36385. BYTES             722421.  =
>  = BROADCAST                  37880. BYTES                368.  =
>  = GLOBAL SUMMATION          393974. BYTES              10556.  =
>  = GLOBAL MULTIPLICATION          0. BYTES                  1.  =
>  = ALL TO ALL COMM           484310. BYTES              46464.  =
>  =                             PERFORMANCE          TOTAL TIME  =
>  = SEND/RECEIVE              681.133  MB/S          38.591 SEC  =
>  = BROADCAST                  87.115  MB/S           0.160 SEC  =
>  = GLOBAL SUMMATION         1520.563  MB/S          16.410 SEC  =
>  = GLOBAL MULTIPLICATION       0.000  MB/S           0.001 SEC  =
>  = ALL TO ALL COMM            86.898  MB/S         258.959 SEC  =
>  = SYNCHRONISATION                                   1.750 SEC  =
>  ================================================================
>
> mvapich2-1.2rc1-intel:
> ..
>        CPU TIME :    0 HOURS 18 MINUTES 59.23 SECONDS
>    ELAPSED TIME :    0 HOURS 19 MINUTES 31.68 SECONDS
> ..
>  ================================================================
>  = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
>  = SEND/RECEIVE               36385. BYTES             722421.  =
>  = BROADCAST                  37880. BYTES                368.  =
>  = GLOBAL SUMMATION          393974. BYTES              10556.  =
>  = GLOBAL MULTIPLICATION          0. BYTES                  1.  =
>  = ALL TO ALL COMM           484310. BYTES              46464.  =
>  =                             PERFORMANCE          TOTAL TIME  =
>  = SEND/RECEIVE              699.651  MB/S          37.570 SEC  =
>  = BROADCAST                  87.114  MB/S           0.160 SEC  =
>  = GLOBAL SUMMATION         1557.608  MB/S          16.020 SEC  =
>  = GLOBAL MULTIPLICATION       0.000  MB/S           0.001 SEC  =
>  = ALL TO ALL COMM            61.302  MB/S         367.082 SEC  =
>  = SYNCHRONISATION                                   1.950 SEC  =
>  ================================================================
>
> The difference is reproducible: mvapich2-1.2rc1-intel is slower, and
> the slowdown appears to come from the slow all-to-all communication.
> The same holds when comparing against mvapich2-1.0.3 built from the
> tarball, or against mvapich2-1.0.1 and mvapich-0.9.9 (both available
> precompiled from SGI). Note that the benchmarks are run with no
> intra-node communication.
>
> Sincerely, BK
> --
> Dr. Bernd Kallies
> Konrad-Zuse-Zentrum für Informationstechnik Berlin
> Takustr. 7
> 14195 Berlin
> Tel: +49-30-84185-270
> Fax: +49-30-84185-311
> e-mail: kallies at zib.de
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



