[mvapich-discuss] Profiling of osu_mbw_mr test

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Apr 16 09:52:30 EDT 2009


A couple of points here. When multiple processes are transferring data
concurrently, you need to make sure that they start at around the same
time and that their communication steps overlap. This is achieved by
running the test for a large number of iterations. Typically we skip some
of the initial iterations and take the average over the remaining ones.
You have reduced the number of iterations to a small value; you can try
increasing it and see the impact. Also, you may add double barriers at the
beginning to make sure that the processes are almost synchronized.

Hope this helps.

DK

On Thu, 16 Apr 2009, Maya Khaliullina wrote:

> Hello,
>
> We are developing a model of concurrent communications for the InfiniBand
> network of our HPC cluster:
> Node: 2xQuad Core Intel Xeon 2.33 GHz
> O/S: RHEL4.5
> File System: GPFS
>
> To investigate the behaviour of the InfiniBand interconnect, we profiled the
> osu_mbw_mr test from OMB with the Allinea Optimization & Profiling Tool (OPT).
>
> The MVAPICH2 version is 1.2.
>
> We reduced the number of iterations in the osu_mbw_mr test to 5 and used only
> 2 MB messages.
>
> We found that there are two main variants of communication behavior, resulting
> in different aggregate bandwidths:
>
> 1) (pic.1) In the first case we see that one pair of communicating processes
> (2 and 6) works faster than the others and finishes earlier. The corresponding
> bandwidth is ~950 MB/sec.
>
> 2) (pic.2) In the second case all pairs behave similarly, with an aggregate
> bandwidth of ~960 MB/sec.
>
> Could you please explain why we observe these two cases?
>
