[mvapich-discuss] OSU benchmarks interpretation

Nikita Andreev lestat at kemsu.ru
Wed Mar 2 07:26:47 EST 2011


Peter,

You were right that I was actually using shared memory. It turned out that
for a nodes=2:ppn=1 job specification MAUI puts both processes on the same
node. Setting JOBNODEMATCHPOLICY EXACTNODE changes this policy.
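
To double-check the placement, a minimal sketch (the file name is just an
example) is to have each rank report the host it actually landed on:

  /* placement_check.c -- each rank prints the host it runs on, so a
   * nodes=2:ppn=1 job can be checked for the "both ranks on one node"
   * problem described above. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, len;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);

      /* Two ranks printing the same hostname means the benchmarks were
       * really measuring shared memory, not the InfiniBand fabric. */
      printf("rank %d runs on %s\n", rank, host);

      MPI_Finalize();
      return 0;
  }

Compiled with mpicc and launched the same way as the benchmarks (e.g.
mpirun_rsh -np 2 -hostfile $PBS_NODEFILE ./placement_check), the two ranks
should now print different hostnames.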

Now I have for mrail (all in MB/s):
osu_bw          1770.10
osu_bibw        3500.82
osu_put_bw      1768.58
osu_put_bibw    3405.17
osu_get_bw      1769.05

Thanks for the help.

Regards,
Nikita

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Peter
Kjellstrom
Sent: Wednesday, March 02, 2011 5:00 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] OSU benchmarks interpretation

On Wednesday, March 02, 2011 08:03:42 am Nikita Andreev wrote:
> I'm benchmarking bandwidth between two compute nodes equipped with 
> Mellanox ConnectX DDR InfiniBand two-port HCAs. I run benchmarks under 
> OpenMPI which supports dual-rail configurations.
> 
> Results for message size 4194304 bytes:
> 
> osu_bw        4917.75 MB/s
> osu_bibw      5007.49 MB/s
> osu_put_bw    3489.35 MB/s
> osu_put_bibw  3876.96 MB/s
> osu_get_bw    3482.18 MB/s

This is way too fast for a single DDR ConnectX; you're probably running the
test over shared memory on a single node.

Expected DDR performance (one port) is roughly:
 unidir PCIe 2.5GT: 1400 MB/s
 unidir PCIe 5.0GT: 1950 MB/s
 bidir: ~2x unidir
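
Back-of-the-envelope arithmetic behind those numbers, assuming the standard
4x DDR link and 8b/10b encoding:

 4 lanes x 5 Gbit/s signalling      = 20 Gbit/s raw per port
 20 Gbit/s x 8/10 (8b/10b encoding) = 16 Gbit/s = 2000 MB/s of data per port
 PCIe and protocol overhead then eat the rest, leaving the 1400-1950 MB/s above.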

Using both ports on 2.5GT PCIe is pointless (the bus can't even keep up with
one port), and on 5.0GT I'd guess you would max out at ~3000 MB/s
unidirectional, but I have not tried it myself.
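
Rough PCIe x8 numbers behind that guess (assuming the usual x8 slot for a
ConnectX and 8b/10b encoded PCIe gen1/gen2):

 8 lanes x 2.5 GT/s, 8b/10b = 16 Gbit/s = 2000 MB/s theoretical, ~1400-1600 MB/s real
 8 lanes x 5.0 GT/s, 8b/10b = 32 Gbit/s = 4000 MB/s theoretical, ~3000+ MB/s real

i.e. the PCIe slot, not the second IB port, runs out first.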

=> Multirail for performance pretty much needs two HCAs.

/Peter
 
> 
> I have several questions:
> 
> 1. As far as I understand, DDR IB has a 16 Gb/s data rate, so dual-rail
> has a 32 Gb/s, or 4 GB/s, theoretical peak throughput. But osu_bw shows a
> data rate higher than the theoretical one. How is that possible?
> 
> 2. osu_bw is a unidirectional test and osu_bibw is bidirectional, so I
> would expect roughly twice the throughput, but it is almost the same as
> the unidirectional result.
> 
> 3. RDMA put/get do not involve the target node in the operation and
> should be faster than ordinary send/recv. Why are they slower?
> 
> Regards,
> 
> Nikita

--
-= Peter Kjellström
-= National Supercomputer Centre




