[mvapich-discuss] Performance differences between mvapich2-1.0 and mvapich2-1.2
Bernd Kallies
kallies at zib.de
Tue Jul 29 12:45:38 EDT 2008
It seems to me that mvapich2-1.2rc1 is slower than previous versions
when compiled and used with defaults. I'd like to know if I forgot
some secret preprocessor flag or configure option for 1.2.
I compiled the nightly build of mvapich2-1.0 as of July 28 (I guess it
is something like mvapich2-1.0.5) with the following settings:
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
export CFLAGS='-D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
-DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -O2'
configure --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd
--disable-romio --enable-sharedlibs=gcc --without-mpe
I compiled the tarball source of mvapich2-1.2rc1 with
unset CFLAGS
./configure --enable-romio --with-file-system=lustre+nfs
--enable-fast=defopt --with-rdma=gen2 --with-thread-package
--enable-sharedlibs=gcc --without-mpe
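In both cases the remainder of the build is the usual configure/make
sequence; a minimal sketch (the install prefix below is a placeholder):
./configure <options as above> --prefix=/opt/mvapich2-<version>-intel
make
make install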
I get the following when running osu_alltoall with 1 task per node on
two nodes after setting MV2_NUM_PORTS=2 and MV2_ENABLE_AFFINITY=0.
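For reference, the runs are launched roughly like this (the hostfile
name is a placeholder; the MPD-based mpiexec passes the MV2_* variables
to the ranks via -genv):
mpdboot -n 2 -f ./hosts
mpiexec -genv MV2_NUM_PORTS 2 -genv MV2_ENABLE_AFFINITY 0 -n 2 ./osu_alltoall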
mvapich2-1.0.5-intel:
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1
# Size Latency (us)
1 1.62
2 1.71
4 1.66
8 1.64
16 1.68
32 1.74
64 1.97
128 3.04
256 3.42
512 4.01
1024 5.26
2048 6.62
4096 9.45
8192 15.20
16384 17.76
32768 23.21
65536 38.60
131072 76.32
262144 151.70
524288 296.74
1048576 591.68
mvapich2-1.2rc1-intel:
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1
# Size Latency (us)
1 1.87
2 1.80
4 1.81
8 1.82
16 1.86
32 1.92
64 2.10
128 3.16
256 3.53
512 4.07
1024 5.33
2048 6.79
4096 9.54
8192 15.34
16384 17.48
32768 22.88
65536 38.78
131072 76.55
262144 149.74
524288 297.11
1048576 591.25
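Up to about 8 KB the 1.2rc1 build is consistently 0.05-0.25 us slower
(up to ~15% for the smallest messages); above that the two builds agree
within noise. A quick way to compare, assuming both outputs are saved
to files (names are placeholders):
paste alltoall-1.0.5.txt alltoall-1.2rc1.txt | \
awk '$1 ~ /^[0-9]+$/ {printf "%8d %8.2f %8.2f %+6.1f%%\n", $1, $2, $4, 100*($4-$2)/$2}'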
Other OSU benchmarks yield no visible differences between the two
builds, e.g. osu_mbw_mr with 2 nodes and 4 tasks per node:
mvapich2-1.0.5-intel:
# OSU MPI Multiple Bandwidth / Message Rate Test v3.1
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/s Messages/s
1 3.45 3447336.26
2 6.93 3463236.43
4 13.83 3458551.26
8 27.68 3460000.08
16 62.91 3931824.03
32 109.74 3429389.41
64 213.14 3330258.12
128 353.90 2764881.74
256 624.27 2438548.84
512 980.57 1915173.15
1024 1241.38 1212281.33
2048 1463.71 714703.42
4096 1612.25 393616.25
8192 1721.11 210096.00
16384 1851.29 112993.94
32768 2051.28 62600.09
65536 2062.08 31464.92
131072 2065.59 15759.17
262144 2074.04 7911.82
524288 2082.66 3972.35
1048576 2087.94 1991.22
2097152 2090.20 996.69
4194304 2075.23 494.77
mvapich2-1.2rc1-intel:
# OSU MPI Multiple Bandwidth / Message Rate Test v3.1
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/s Messages/s
1 3.42 3424686.07
2 6.92 3459442.70
4 13.73 3431691.09
8 27.59 3449218.84
16 62.63 3914337.15
32 108.91 3403302.14
64 210.89 3295101.65
128 347.89 2717920.88
256 621.49 2427687.32
512 982.32 1918595.24
1024 1246.40 1217187.35
2048 1490.18 727625.11
4096 1684.54 411264.55
8192 1768.11 215833.58
16384 1852.36 113059.37
32768 2048.83 62525.18
65536 2062.01 31463.76
131072 2066.38 15765.20
262144 2074.90 7915.12
524288 2082.75 3972.54
1048576 2088.07 1991.34
2097152 2090.04 996.61
4194304 2077.47 495.31
I also compiled the quantum chemistry code CPMD 3.11.1 with both libs.
The code has its own profiling. A benchmark run with 64 nodes, 1 task
per node, 1 thread per task, application-defined task pinning, and
MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0 yields the results below.
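Such a run is launched analogously to the OSU runs above, sketched here
with placeholder file names (CPMD is conventionally started as cpmd.x
with the input file as argument):
mpdboot -n 64 -f ./hosts64
mpiexec -genv MV2_NUM_PORTS 2 -genv MV2_ENABLE_AFFINITY 0 -n 64 ./cpmd.x bench.inp > bench.out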
mvapich2-1.0.5-intel:
...
CPU TIME : 0 HOURS 17 MINUTES 7.53 SECONDS
ELAPSED TIME : 0 HOURS 17 MINUTES 40.26 SECONDS
...
================================================================
= COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
= SEND/RECEIVE 36385. BYTES 722421. =
= BROADCAST 37880. BYTES 368. =
= GLOBAL SUMMATION 393974. BYTES 10556. =
= GLOBAL MULTIPLICATION 0. BYTES 1. =
= ALL TO ALL COMM 484310. BYTES 46464. =
= PERFORMANCE TOTAL TIME =
= SEND/RECEIVE 681.133 MB/S 38.591 SEC =
= BROADCAST 87.115 MB/S 0.160 SEC =
= GLOBAL SUMMATION 1520.563 MB/S 16.410 SEC =
= GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
= ALL TO ALL COMM 86.898 MB/S 258.959 SEC =
= SYNCHRONISATION 1.750 SEC =
================================================================
mvapich2-1.2rc1-intel:
...
CPU TIME : 0 HOURS 18 MINUTES 59.23 SECONDS
ELAPSED TIME : 0 HOURS 19 MINUTES 31.68 SECONDS
...
================================================================
= COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS =
= SEND/RECEIVE 36385. BYTES 722421. =
= BROADCAST 37880. BYTES 368. =
= GLOBAL SUMMATION 393974. BYTES 10556. =
= GLOBAL MULTIPLICATION 0. BYTES 1. =
= ALL TO ALL COMM 484310. BYTES 46464. =
= PERFORMANCE TOTAL TIME =
= SEND/RECEIVE 699.651 MB/S 37.570 SEC =
= BROADCAST 87.114 MB/S 0.160 SEC =
= GLOBAL SUMMATION 1557.608 MB/S 16.020 SEC =
= GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC =
= ALL TO ALL COMM 61.302 MB/S 367.082 SEC =
= SYNCHRONISATION 1.950 SEC =
================================================================
The difference is reproducible (mvapich2-1.2rc1-intel is slower,
apparently because of the slow all-to-all communication: its time grows
from 258.959 s to 367.082 s, i.e. by about 108 s, which accounts for
almost all of the roughly 111 s difference in elapsed time), also when
compared to mvapich2-1.0.3 built from the tarball, or to mvapich2-1.0.1
and mvapich-0.9.9 (both precompiled binaries from SGI). Note that the
benchmarks are run with no intra-node communication.
Sincerely, BK
--
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies at zib.de