[mvapich-discuss] About "Running Collectives with Hardware based Multicast support"

miaocb miaocb at sina.cn
Tue Feb 11 23:08:22 EST 2014


Hi all,
  The MVAPICH2 2.0b user manual says that "In MVAPICH2, support for multicast based collectives has been enabled for MPI applications running over
OFA-IB-CH3 interface." Should collectives be faster with multicast enabled than without it? In my results, the multicast-based collectives are slower than the normal collectives. Is this normal?
  
The details of my test are given as follows:
1. MVAPICH2 configuration:
[wrf at TC5000 bin]$ ./mpichversion 
MVAPICH2 Version:      2.0b
MVAPICH2 Release date: Fri Nov  8 11:17:40 EST 2013
MVAPICH2 Device:       ch3:mrail
MVAPICH2 configure:    CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail --with-rdma=gen2 --prefix=/public/home/wrf/test/software/mvapich2-20b-intel
MVAPICH2 CC:   icc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX:  icpc   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77:  ifort -L/lib -L/lib   -O2
MVAPICH2 FC:   ifort   -O2
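
(A side note on the configure line above: it does not pass any multicast-related option. My reading of the user guide is that multicast support is built in by default when libibumad is available; if it instead has to be requested explicitly, I assume the rebuild would look roughly like the sketch below, with --enable-mcast taken from the documentation. Please correct me if the flag differs.)

# Sketch only: rebuild with multicast support requested explicitly.
# All flags other than --enable-mcast are copied from the mpichversion output above.
./configure CC=icc CXX=icpc FC=ifort F77=ifort \
    --with-device=ch3:mrail --with-rdma=gen2 \
    --enable-mcast \
    --prefix=/public/home/wrf/test/software/mvapich2-20b-intel
make && make install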

2. MVAPICH2_INSTALL_DIR/libexec/mvapich2/osu_bcast was used for the test.
A total of 960 processes (60 nodes, 16 processes per node) were launched over a Mellanox InfiniBand FDR network; the two runs were started as sketched below.
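Roughly, the launch commands looked like this (the hostfile name and the way MV2_USE_MCAST is passed are placeholders for my actual job script):

MPI_HOME=/public/home/wrf/test/software/mvapich2-20b-intel
BCAST=$MPI_HOME/libexec/mvapich2/osu_bcast

# (a) normal collectives
$MPI_HOME/bin/mpirun -np 960 -hostfile ./hosts $BCAST

# (b) multicast-based collectives: export MV2_USE_MCAST=1 to every rank
$MPI_HOME/bin/mpirun -np 960 -hostfile ./hosts -genv MV2_USE_MCAST 1 $BCAST
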
(a) In the first run, normal collectives were used; the results are as follows:
==> bcast.out.71462.node221 <==
# OSU MPI Broadcast Latency Test
# Size         Avg Latency(us)     Min Latency(us)     Max Latency(us)  Iterations
1                         4.07                0.51                7.42        1000
2                         3.42                0.49                6.57        1000
4                         3.43                0.50                6.63        1000
8                         3.40                0.47                6.57        1000
16                        3.42                0.47                6.56        1000
32                        3.70                0.53                7.03        1000
64                        3.90                0.53                7.30        1000
128                       4.11                0.54                7.57        1000
256                       4.69                0.51                8.18        1000
512                       5.05                0.58                8.65        1000
1024                      6.16                0.78                9.92        1000
2048                      8.68                1.14               12.93        1000
4096                     13.29                2.29               18.99        1000
8192                     19.49                4.15               27.91        1000
16384                    28.54                7.20               40.66         100
32768                    43.67               14.44               59.75         100
65536                   100.79               66.94              120.17         100
131072                  183.14              135.33              207.23         100
262144                  504.61              497.89              512.42         100
524288                  863.65              853.12              874.10         100
1048576                1746.41             1721.36             1771.10         100

(b) In the second run, the environment variable MV2_USE_MCAST=1 was set, so multicast-based collectives were used; the results are as follows:
==> bmcast.out.71464.node221 <==
/public/home/wrf/test/software/mvapich2-20b-intel/bin/mpirun
# OSU MPI Broadcast Latency Test
# Size         Avg Latency(us)     Min Latency(us)     Max Latency(us)  Iterations
1                         9.35                2.54               16.09        1000
2                         8.61                2.56               15.21        1000
4                         8.42                2.53               14.67        1000
8                         9.96                2.50               18.09        1000
16                        8.48                2.56               14.82        1000
32                        8.64                2.61               14.84        1000
64                        8.84                2.74               15.09        1000
128                       9.69                2.41               16.91        1000
256                      10.65                2.60               18.83        1000
512                      11.17                2.69               19.95        1000
1024                     12.65                3.11               22.09        1000
2048                     15.60                4.05               26.30        1000
4096                     19.08                5.50               29.90        1000
8192                     27.51                8.31               40.73        1000
16384                    36.64               16.81               49.95         100
32768                    58.02               33.50               72.97         100
65536                   100.98               66.34              119.24         100
131072                  184.48              135.63              210.04         100
262144                  515.71              507.25              523.48         100
524288                  783.13              771.61              791.25         100
1048576                1817.53             1793.44             1843.86         100

3. For small messages, multicast-based bcast is clearly slower than normal bcast, while the difference is small for large messages.
Are these results expected? Shouldn't multicast-based bcast be faster than normal bcast?
   I also found some information related to "Multicast Group Configuration":
http://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-9-0_update1/2-9-0_release/element_manager/element/IB.html#wp1210200
  Does the InfiniBand switch (or the subnet manager) need to be configured in order to use the multicast-based collectives?
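  If it helps to narrow this down, these are the checks I would run on the fabric side (standard infiniband-diags tools; just a sketch of what I can look at):

sminfo        # confirm that a subnet manager (e.g. OpenSM) is active on the fabric
ibstat        # check HCA/port state and link rate on a compute node
saquery -g    # list the multicast groups currently known to the subnet manager

Thanks.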




miaocb