[mvapich-discuss] Broadcast performance with and without IB Multicast, and comparison to Unicast

Kanyok, James james.kanyok at lmco.com
Wed Jul 24 14:23:30 EDT 2013


I have been running some basic comparison tests between MPI_Bcast (with HW mcast on and off) and MPI_Isend/MPI_Irecv.

My setup is as follows:

    2 Supermicro X9DRG-QF motherboard-based servers
    2 Mellanox ConnectX-3 NICs
    1 Mellanox SX6015 switch (default config)
    The 2 servers are connected to each other through the NICs via the switch, running 56 Gb/s FDR
    mvapich2-1.9rc1 on RHEL 6.4
    I am using the osu_bw and osu_bcast benchmarks
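
For reference, the kind of invocation I mean is roughly the following (hostnames and paths are placeholders, and I am assuming MV2_USE_MCAST is the right knob for toggling hardware multicast):

    # unicast bandwidth between the two hosts
    mpirun_rsh -np 2 hostA hostB ./osu_bw

    # broadcast without hardware multicast
    mpirun_rsh -np 2 hostA hostB MV2_USE_MCAST=0 ./osu_bcast

    # broadcast with hardware multicast enabled
    mpirun_rsh -np 2 hostA hostB MV2_USE_MCAST=1 ./osu_bcast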

Since my application is mainly concerned with throughput rather than latency, I have been looking at the throughput numbers.

Results (all 1 MB messages, one sender and one receiver; data rates verified against the load reported by the NIC):

    MPI_Isend/MPI_Irecv unicast:                                              6.2 GB/s
    Broadcast (no HW mcast):                                                  4.1 GB/s
    Broadcast (HW mcast):                                                     2.7 GB/s
    Broadcast (no HW mcast, 2 concurrent send/receive pairs, aggregate BW):   2.2 GB/s
    Broadcast (HW mcast, 2 concurrent send/receive pairs, aggregate BW):       82 MB/s
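
For clarity, here is a minimal sketch of the two communication patterns being compared (this is not the OSU benchmark code, just the shape of the test, assuming a 1 MB buffer and exactly two ranks):

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)   /* 1 MB, matching the benchmark message size */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(MSG_SIZE);

        MPI_Init(&argc, &argv);                    /* run with exactly two ranks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Unicast pattern: rank 0 sends, rank 1 receives (osu_bw style). */
        MPI_Request req;
        if (rank == 0)
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* Broadcast pattern: rank 0 is the root (osu_bcast style). */
        MPI_Bcast(buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        free(buf);
        return 0;
    }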

My questions are:

    - Should Bcast be that much worse than unicast in a one-to-one transfer?
    - Why is HW mcast worse when doing Bcast? This conflicts with the results at the link below, but I don't have all the details of the configuration used for that test:
http://mvapich.cse.ohio-state.edu/performance/mvapich2/coll_multicast_ri.shtml
    - Why, when running two instances of the Bcast test (no HW mcast), doesn't the combined aggregate reach the unicast maximum? I am assuming the slower single-stream performance is caused by increased latency, so two independent streams should be able to saturate the IB link. I also assume the two MPI_COMM_WORLDs in the two instances, started with separate mpirun_rsh commands, know nothing about each other (see the launch sketch after this list). It is almost as if something is being serialized at the MPI layer, yet the aggregate is actually worse than the single-instance result.
    - Why do two concurrent streams of the Broadcast (HW mcast) test cause the aggregate bandwidth to collapse, far more severely than in the non-HW-mcast case?
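
For the two-instance tests mentioned above, the concurrent jobs are started along these lines (placeholder hostnames; each mpirun_rsh launches its own job with its own MPI_COMM_WORLD):

    mpirun_rsh -np 2 hostA hostB MV2_USE_MCAST=0 ./osu_bcast &
    mpirun_rsh -np 2 hostA hostB MV2_USE_MCAST=0 ./osu_bcast &
    wait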

Insights would be appreciated.