[mvapich-discuss] Re: Question on bandwidth test

Abhinav Vishnu vishnu at cse.ohio-state.edu
Tue Jun 5 09:53:19 EDT 2007


Hi Wenli,

Thanks for using MVAPICH and reporting the performance issue to us.

IMHO, this is not a problem specific to the MPI layer; the performance
degradation should be visible in tests at the verbs layer as well.
I am assuming that you are using OFED-1.1 and HCA firmware version 3.3.3
or greater.
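
If you would like to double-check those versions first, something along
these lines should work on each node (assuming the standard OFED userspace
tools are installed):

% ofed_info | head -1
% ibv_devinfo | grep -E "hca_id|fw_ver"

ofed_info reports the installed OFED release, and ibv_devinfo reports the
firmware version (fw_ver) of each HCA.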

To see whether this is the case, may I request you to do the following:

Say you want to run the tests between inode28 and inode30:

1. On inode28 (this instance acts as the server):
% ib_rdma_bw -s1048576 -n100

2. On inode30 (this instance acts as the client, connecting to inode28):
% ib_rdma_bw -s1048576 -n100 inode28

I feel that you should see a performance degradation similar to the one
you are seeing at the MPI layer.
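
If it is more convenient to launch both sides in one step, a small wrapper
along the following lines should also work (just a sketch: it assumes
passwordless ssh to both nodes, that ib_rdma_bw is in the PATH there, and
it simply reuses the inode28/inode30 hostnames from the example above):

#!/bin/sh
# Sketch: run the verbs-level RDMA write bandwidth test between two nodes.
SERVER=inode28          # node that runs the ib_rdma_bw server
CLIENT=inode30          # node that runs the client
SIZE=1048576            # message size in bytes
ITERS=100               # number of iterations

# Start the server side in the background, give it a moment to come up,
# then point the client at it and wait for both sides to finish.
ssh $SERVER "ib_rdma_bw -s$SIZE -n$ITERS" &
sleep 2
ssh $CLIENT "ib_rdma_bw -s$SIZE -n$ITERS $SERVER"
wait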

My answers with respect to the multi-rail paper are inline; please scroll
down.
>  
>  
> My system is:
> -- 2.2GHz Dual Core AMD Opteron(tm) Processor 275, 8GB Mem
> -- Linux 2.6.9-42.ELsmp x86_64
> -- openib-1.1  { Detected the following HCAs: 1) mthca0 [ Mellanox PCI-X ] }
>  
> 1. Test inter-node bandwidth with -DVIADEV_RGET_SUPPORT .
> setup_ch_gen2 starts... -D_X86_64_ -DEARLY_SEND_COMPLETION -DMEMORY_SCALE
> -DVIADEV_RGET_SUPPORT -DLAZY_MEM_UNREGISTER -DCH_GEN2 -D_SMP_ -D_SMP_RNDV_
> -D_MLX_PCI_X_ -I/usr/local/ofed/include -O3
>  
> $ mpirun_rsh -rsh -np 2 inode28 inode30 ./osu_bw
> # OSU MPI Bandwidth Test (Version 2.2)
> # Size          Bandwidth (MB/s)
> 1               0.243180
> 2               0.507795
> 4               1.008787
> 8               2.030054
> 16              4.008455
> 32              8.113140
> 64              16.160978
> 128             33.764735
> 256             67.708075
> 512             161.522157
> 1024            335.222506
> 2048            491.421716
> 4096            568.259955
> 8192            606.043232
> 16384           662.063392
> 32768           738.589843
> 65536           783.586601
> 131072          807.462616
> 262144          820.750931
> 524288          685.880335
> 1048576         660.237959
> 2097152         659.233480
> 4194304         659.946110
> 2. Test inter-node bandwidth with -DVIADEV_RPUT_SUPPORT .
> setup_ch_gen2 starts... -D_X86_64_ -DEARLY_SEND_COMPLETION -DMEMORY_SCALE
> -DVIADEV_RPUT_SUPPORT -DLAZY_MEM_UNREGISTER -DCH_GEN2 -D_SMP_ -D_SMP_RNDV_
> -D_MLX_PCI_X_ -I/usr/local/ofed/include -O3
>  
> $ mpirun_rsh -rsh -np 2 inode28 inode30 ./osu_bw
> # OSU MPI Bandwidth Test (Version 2.2)
> # Size          Bandwidth (MB/s)
> 1               0.248081
> 2               0.516046
> 4               1.034260
> 8               2.069607
> 16              4.110799
> 32              8.282444
> 64              16.593745
> 128             34.620911
> 256             69.113305
> 512             163.455879
> 1024            341.066875
> 2048            496.503655
> 4096            569.049428
> 8192            606.183374
> 16384           624.840449
> 32768           713.280615
> 65536           769.011487
> 131072          800.359506
> 262144          814.869019
> 524288          679.025085
> 1048576         652.137840
> 2097152         650.207077
> 4194304         650.629356
> 3. Test intra-node bandwidth with -DVIADEV_RPUT_SUPPORT .
> $ mpirun_rsh -rsh -np 2 inode28 inode28 ./osu_bw
> # OSU MPI Bandwidth Test (Version 2.2)
> # Size          Bandwidth (MB/s)
> 1               2.173175
> 2               4.449079
> 4               9.049134
> 8               20.301348
> 16              42.489627
> 32              85.085168
> 64              153.869271
> 128             286.734337
> 256             480.187573
> 512             741.525232
> 1024            932.896797
> 2048            1145.834426
> 4096            1291.731546
> 8192            1388.989562
> 16384           1428.285773
> 32768           1453.529249
> 65536           1431.307671
> 131072          1445.227803
> 262144          1393.404399
> 524288          1168.315567
> 1048576         1071.952093
> 2097152         1072.327638
> 4194304         1064.196619
>  
> I have seen the test results on your homepage
> (http://mvapich.cse.ohio-state.edu/performance/mvapich/opteron/MVAPICH-opteron-gen2-DDR.shtml,
> http://mvapich.cse.ohio-state.edu/performance/mvapich/intra_opteron.shtml):
> the inter-node bandwidth results there seem normal, but the intra-node
> bandwidth results look like mine. The bandwidth results in your paper
> "Building Multirail InfiniBand Clusters: MPI-Level Design and Performance
> Evaluation" (SC2004, Fig. 9) suggest that the striping or binding
> optimizations would improve this problem.

Yes, striping the data across multiple paths helps the performance of
microbenchmarks and applications, as shown in the paper. However, on your
system you are using only one HCA and one port for communication, so these
scheduling policies are unlikely to help in your situation.

I think we will have a clearer idea about the point of performance
degradation once you have the results from ib_rdma_bw. Please let us
know the outcome of your experiments.

Thanks,

:- Abhinav


>  
> What do you think is the source of the problem in my bandwidth tests? In
> order to get the optimal bandwidth, what do you think I should modify,
> starting from the default options in the original MVAPICH 0.9.8 package?
>  
>  
> Any reply is appreciated! 
>  
> Thanks,
> Wenli
>  

