[mvapich-discuss] Performance differences between mvapich2-1.0
and mvapich2-1.2
Brian Curtis
curtisbr at cse.ohio-state.edu
Fri Aug 1 14:54:51 EDT 2008
Bernd,
I've looked over your configuration for MVAPICH2-1.2rc1 and I have some
suggestions. The default configuration of MVAPICH2-1.2 includes
--enable-fast=defopt,ndebug. By specifying only --enable-fast=defopt,
you lose the ndebug option, which results in assertions and other debug
code being compiled in. The "--with-thread-package" parameter is a no-op
unless you give it a value (as written, it selects pthread). I recommend
checking out our latest source from trunk and using the following
configuration:
./configure --with-file-system=lustre+nfs --without-mpe
--enable-sharedlibs=gcc
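For completeness, the full build would then look roughly like this (the
source directory and install prefix below are just illustrative):

cd mvapich2-trunk
# the defaults already include --enable-fast=defopt,ndebug
./configure --prefix=/opt/mvapich2-trunk --with-file-system=lustre+nfs \
    --without-mpe --enable-sharedlibs=gcc
make
make install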
Please note that ROMIO is enabled by default and gen2 is selected by
default for Linux. Let us know if your performance improves by using
the latest source from trunk and the recommended configuration.
Brian
Bernd Kallies wrote:
> It seems to me that mvapich2-1.2rc1 is slower than previous versions
> when compiled and used with the defaults. I'd like to know if I forgot
> some secret preprocessor flag or configure option for 1.2.
>
> I compiled the nightly build of mvapich2-1.0 as of July 28 (I guess it
> is something like mvapich2-1.0.5) with the following settings:
>
> export CC=icc
> export CXX=icpc
> export F77=ifort
> export F90=ifort
> export CFLAGS='-D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -O2'
> configure --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd
> --disable-romio --enable-sharedlibs=gcc --without-mpe
>
> I compiled the tarball source of mvapich2-1.2rc1 with
> unset CFLAGS
> ./configure --enable-romio --with-file-system=lustre+nfs
> --enable-fast=defopt --with-rdma=gen2 --with-thread-package
> --enable-sharedlibs=gcc --without-mpe
>
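> For reference, both builds are launched the same way, roughly as
> follows (hostnames are illustrative; this assumes the mpd process
> manager, which the 1.0 build selects explicitly with --with-pm=mpd):
>
> # hosts contains the two node names, one per line
> mpdboot -n 2 -f hosts
> mpiexec -genv MV2_NUM_PORTS 2 -genv MV2_ENABLE_AFFINITY 0 -n 2 ./osu_alltoall
>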
> I get the following when running osu_alltoall with 1 task per node on
> two nodes after setting MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0:
>
> mvapich2-1.0.5-intel:
> # OSU MPI All-to-All Personalized Exchange Latency Test v3.1
> # Size Latency (us)
> 1 1.62
> 2 1.71
> 4 1.66
> 8 1.64
> 16 1.68
> 32 1.74
> 64 1.97
> 128 3.04
> 256 3.42
> 512 4.01
> 1024 5.26
> 2048 6.62
> 4096 9.45
> 8192 15.20
> 16384 17.76
> 32768 23.21
> 65536 38.60
> 131072 76.32
> 262144 151.70
> 524288 296.74
> 1048576 591.68
>
> mvapich2-1.2rc1-intel:
> # OSU MPI All-to-All Personalized Exchange Latency Test v3.1
> # Size Latency (us)
> 1 1.87
> 2 1.80
> 4 1.81
> 8 1.82
> 16 1.86
> 32 1.92
> 64 2.10
> 128 3.16
> 256 3.53
> 512 4.07
> 1024 5.33
> 2048 6.79
> 4096 9.54
> 8192 15.34
> 16384 17.48
> 32768 22.88
> 65536 38.78
> 131072 76.55
> 262144 149.74
> 524288 297.11
> 1048576 591.25
>
> Other OSU benchmarks yield no visible differences between the two
> builds, e.g. osu_mbw_mr with 2 nodes and 4 tasks per node:
>
> mvapich2-1.0.5-intel:
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.1
> # [ pairs: 4 ] [ window size: 64 ]
> # Size MB/s Messages/s
> 1 3.45 3447336.26
> 2 6.93 3463236.43
> 4 13.83 3458551.26
> 8 27.68 3460000.08
> 16 62.91 3931824.03
> 32 109.74 3429389.41
> 64 213.14 3330258.12
> 128 353.90 2764881.74
> 256 624.27 2438548.84
> 512 980.57 1915173.15
> 1024 1241.38 1212281.33
> 2048 1463.71 714703.42
> 4096 1612.25 393616.25
> 8192 1721.11 210096.00
> 16384 1851.29 112993.94
> 32768 2051.28 62600.09
> 65536 2062.08 31464.92
> 131072 2065.59 15759.17
> 262144 2074.04 7911.82
> 524288 2082.66 3972.35
> 1048576 2087.94 1991.22
> 2097152 2090.20 996.69
> 4194304 2075.23 494.77
>
> mvapich2-1.2rc1-intel:
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.1
> # [ pairs: 4 ] [ window size: 64 ]
> # Size MB/s Messages/s
> 1 3.42 3424686.07
> 2 6.92 3459442.70
> 4 13.73 3431691.09
> 8 27.59 3449218.84
> 16 62.63 3914337.15
> 32 108.91 3403302.14
> 64 210.89 3295101.65
> 128 347.89 2717920.88
> 256 621.49 2427687.32
> 512 982.32 1918595.24
> 1024 1246.40 1217187.35
> 2048 1490.18 727625.11
> 4096 1684.54 411264.55
> 8192 1768.11 215833.58
> 16384 1852.36 113059.37
> 32768 2048.83 62525.18
> 65536 2062.01 31463.76
> 131072 2066.38 15765.20
> 262144 2074.90 7915.12
> 524288 2082.75 3972.54
> 1048576 2088.07 1991.34
> 2097152 2090.04 996.61
> 4194304 2077.47 495.31
>
> I also compiled the quantum chemistry code CPMD 3.11.1 with both libs.
> The code has its own profiling. A benchmark run with 64 nodes, 1 task
> per node, 1 thread per task, application-defined task pinning, and
> MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0 yields:
>
> mvapich2-1.0.5-intel:
> ...
> CPU TIME : 0 HOURS 17 MINUTES 7.53 SECONDS
> ELAPSED TIME : 0 HOURS 17 MINUTES 40.26 SECONDS
> ...
> ================================================================
> = COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
> = SEND/RECEIVE 36385. BYTES 722421. =
> = BROADCAST 37880. BYTES 368. =
> = GLOBAL SUMMATION 393974. BYTES 10556. =
> = GLOBAL MULTIPLICATION 0. BYTES 1. =
> = ALL TO ALL COMM 484310. BYTES 46464. =
> = PERFORMANCE TOTAL TIME =
> = SEND/RECEIVE 681.133 MB/S 38.591 SEC =
> = BROADCAST 87.115 MB/S 0.160 SEC =
> = GLOBAL SUMMATION 1520.563 MB/S 16.410 SEC =
> = GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
> = ALL TO ALL COMM 86.898 MB/S 258.959 SEC =
> = SYNCHRONISATION 1.750 SEC =
> ================================================================
>
> mvapich2-1.2rc1-intel:
> ...
> CPU TIME : 0 HOURS 18 MINUTES 59.23 SECONDS
> ELAPSED TIME : 0 HOURS 19 MINUTES 31.68 SECONDS
> ...
> ================================================================
> = COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
> = SEND/RECEIVE 36385. BYTES 722421. =
> = BROADCAST 37880. BYTES 368. =
> = GLOBAL SUMMATION 393974. BYTES 10556. =
> = GLOBAL MULTIPLICATION 0. BYTES 1. =
> = ALL TO ALL COMM 484310. BYTES 46464. =
> = PERFORMANCE TOTAL TIME =
> = SEND/RECEIVE 699.651 MB/S 37.570 SEC =
> = BROADCAST 87.114 MB/S 0.160 SEC =
> = GLOBAL SUMMATION 1557.608 MB/S 16.020 SEC =
> = GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
> = ALL TO ALL COMM 61.302 MB/S 367.082 SEC =
> = SYNCHRONISATION 1.950 SEC =
> ================================================================
>
> The difference is reproducible (mvapich2-1.2rc1-intel is slower; the
> slow all-to-all communication appears to be the reason), also when
> compared to mvapich2-1.0.3 built from the tarball, or to mvapich2-1.0.1
> and mvapich-0.9.9 (both precompiled and distributed by SGI). Note that
> the benchmarks are run with no intra-node communication.
>
> Sincerely, BK
>