[mvapich-discuss] About "Running Collectives with Hardware based Multicast support"
miaocb
miaocb at sina.cn
Tue Feb 11 23:08:22 EST 2014
Hi, all
The user manual of MVAPICH2 2.0b says that "In MVAPICH2, support for multicast based collectives has been enabled for MPI applications running over
OFA-IB-CH3 interface." Should collectives be faster with multicast than without? In my results, the multicast-based collectives are slower than the normal collectives. Is this normal?
The details of my test are given as follows:
1. MVAPICH2 configuration:
[wrf at TC5000 bin]$ ./mpichversion
MVAPICH2 Version: 2.0b
MVAPICH2 Release date: Fri Nov 8 11:17:40 EST 2013
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail --with-rdma=gen2 --prefix=/public/home/wrf/test/software/mvapich2-20b-intel
MVAPICH2 CC: icc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: icpc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: ifort -L/lib -L/lib -O2
MVAPICH2 FC: ifort -O2
2. MVAPICH2_INSTALL_DIR/libexec/mvapich2/osu_bcast is used to do the test.
A total of 960 processes (60 nodes, 16 processes per node) were launched, and the network is Mellanox InfiniBand FDR.
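For reference, the two runs were launched roughly as follows (the hostfile path is an assumption; the process count matches the 60 nodes x 16 processes per node described above):

```shell
# Run (a): default broadcast, multicast disabled
mpirun_rsh -np 960 -hostfile ./hosts ./osu_bcast

# Run (b): identical, but with hardware multicast based collectives enabled
# (mpirun_rsh passes VAR=value pairs as environment variables to the ranks)
mpirun_rsh -np 960 -hostfile ./hosts MV2_USE_MCAST=1 ./osu_bcast
```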
(a). In the first run, normal collectives are used, and the results are as follows.
==> bcast.out.71462.node221 <==
# OSU MPI Broadcast Latency Test
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 4.07 0.51 7.42 1000
2 3.42 0.49 6.57 1000
4 3.43 0.50 6.63 1000
8 3.40 0.47 6.57 1000
16 3.42 0.47 6.56 1000
32 3.70 0.53 7.03 1000
64 3.90 0.53 7.30 1000
128 4.11 0.54 7.57 1000
256 4.69 0.51 8.18 1000
512 5.05 0.58 8.65 1000
1024 6.16 0.78 9.92 1000
2048 8.68 1.14 12.93 1000
4096 13.29 2.29 18.99 1000
8192 19.49 4.15 27.91 1000
16384 28.54 7.20 40.66 100
32768 43.67 14.44 59.75 100
65536 100.79 66.94 120.17 100
131072 183.14 135.33 207.23 100
262144 504.61 497.89 512.42 100
524288 863.65 853.12 874.10 100
1048576 1746.41 1721.36 1771.10 100
(b). In the second run, the environment variable MV2_USE_MCAST=1 is set, so multicast-based collectives are used. The result is as follows:
==> bmcast.out.71464.node221 <==
/public/home/wrf/test/software/mvapich2-20b-intel/bin/mpirun
# OSU MPI Broadcast Latency Test
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 9.35 2.54 16.09 1000
2 8.61 2.56 15.21 1000
4 8.42 2.53 14.67 1000
8 9.96 2.50 18.09 1000
16 8.48 2.56 14.82 1000
32 8.64 2.61 14.84 1000
64 8.84 2.74 15.09 1000
128 9.69 2.41 16.91 1000
256 10.65 2.60 18.83 1000
512 11.17 2.69 19.95 1000
1024 12.65 3.11 22.09 1000
2048 15.60 4.05 26.30 1000
4096 19.08 5.50 29.90 1000
8192 27.51 8.31 40.73 1000
16384 36.64 16.81 49.95 100
32768 58.02 33.50 72.97 100
65536 100.98 66.34 119.24 100
131072 184.48 135.63 210.04 100
262144 515.71 507.25 523.48 100
524288 783.13 771.61 791.25 100
1048576 1817.53 1793.44 1843.86 100
3. For small messages, multicast-based bcast is clearly slower than normal bcast, while the difference is small for large messages.
Are these results expected? Shouldn't multicast-based bcast be faster than normal bcast?
I found some information related to "Multicast Group Configuration":
http://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-9-0_update1/2-9-0_release/element_manager/element/IB.html#wp1210200
Does the InfiniBand switch or subnet manager need to be configured in order to use the multicast-based collectives? Thanks.
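In case it helps with diagnosis, this is roughly how I would check the fabric's multicast state from a compute node (assuming the standard infiniband-diags tools are installed; I have not yet verified the output on this cluster):

```shell
# Check that a subnet manager is running and reachable on the fabric;
# IB multicast groups are created and managed by the SM
sminfo

# Query the subnet administrator for existing multicast group records
# (-g asks saquery for multicast group information)
saquery -g
```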
miaocb