[mvapich-discuss] IB benchmarks that are a bit strange

LEI CHAI chai.15 at osu.edu
Wed Mar 19 16:17:15 EDT 2008


Brian,

You can try the user-defined CPU affinity feature provided by mvapich-1.0. For example, if you want the two processes to run on core 2 of each node:

$ mpirun_rsh -np 2 node0 node1 VIADEV_CPU_MAPPING=2 ./a.out

More information can be found in the user guide:

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-1420009.6.6
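
As for your question about using a numa library: outside of MPI you can
also pin a process with numactl (or taskset). A rough, untested sketch
(I have not verified wrapping the binary this way under mpirun_rsh, and
mvapich's own default affinity may override it, so the
VIADEV_CPU_MAPPING approach above is the supported route):

$ mpirun_rsh -np 2 node0 node1 numactl --physcpubind=2 ./a.out

Here numactl --physcpubind=2 would bind each process to physical core 2
before the benchmark starts.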

You can make the master node the receiver (e.g. -np 2 slave master) and try various cores, as in the sketch below. One thing to note is that you might have more cores on the master node, so when you try a high core number, e.g. VIADEV_CPU_MAPPING=7, the slave node may report an error since it may not have a core 7. This shouldn't affect the result; the program should still run and you should be able to see the trend.
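
For example, something along these lines (a minimal sketch, assuming
the hosts are named slave1 and master and that the osu_bw binary is in
the current directory; adjust the names to your setup):

$ mpirun_rsh -np 2 slave1 master VIADEV_CPU_MAPPING=0 ./osu_bw
$ mpirun_rsh -np 2 slave1 master VIADEV_CPU_MAPPING=1 ./osu_bw
$ mpirun_rsh -np 2 slave1 master VIADEV_CPU_MAPPING=2 ./osu_bw
$ mpirun_rsh -np 2 slave1 master VIADEV_CPU_MAPPING=3 ./osu_bw

The single mapping value pins both ranks to that core number on their
respective nodes, but since the receive side matters the most,
comparing these runs (with the master listed second, i.e. as the
receiver) should show which core on the head node is closest to the
adapter.

Regarding your question about verifying the physical setup: a rough
check (assuming the HCA shows up as mthca0 and your kernel exposes the
numa_node attribute; a value of -1 means the kernel has no locality
information) is to compare the core layout with where the adapter is
attached:

$ numactl --hardware
$ cat /sys/class/infiniband/mthca0/device/numa_node

If the head node reports a different node number for the adapter than
the slave nodes do, that would confirm the topology difference.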

Hope this helps.

Lei


----- Original Message -----
From: Brian Budge <brian.budge at gmail.com>
Date: Wednesday, March 19, 2008 3:30 pm
Subject: Re: [mvapich-discuss] IB benchmarks that are a bit strange

> Hi Lei -
> 
> Thanks for the information.  I suppose that could be the problem.  The
> master node does have a different setup from the slaves; it is a
> 4-socket system, while the slaves are 2-socket systems.  Is there any
> way to verify this?
> 
> It makes sense that the bandwidth would be lower from one processor
> than another, but it still seems a little strange, since I would only
> expect that 200 MB/s difference to show up if we were talking about
> bandwidths over about 2 GB/s (I typically see about 2.3 GB/s per
> processor).
> 
> Is it possible to use a numa library to specify which physical
> processor to run on?  If so, I could attempt to run this on varying
> processors and observe the changes.
> 
> I would like to try without ipoib, but it may be some time before I
> can try that since this involves the head node and I have users other
> than myself on some of the cluster.
> 
> Thanks again for the help,
>  Brian
> 
> On Wed, Mar 19, 2008 at 12:07 PM, LEI CHAI <chai.15 at osu.edu> wrote:
> > Hi Brian,
> >
> >
> >  > Thanks DK.  The systems are Opteron NUMA.  They both run
> >  > processors at 2.4 GHz and memory is ddr2 667.  They are both
> >  > running Mellanox InfiniHost III Lx HCAs.  Both nodes are running
> >  > Linux 2.6.24 with the infiniband drivers built in.
> >
> >  Thanks for the system information.
> >
> >
> >  > It definitely has something to do with my head node.  I've been
> >  > running tests on 3 nodes to better characterize what I've been
> >  > seeing, and in my test I see close to 800 MB/s from slave to
> >  > slave and about 600 MB/s from either slave to head or head to
> >  > either slave.
> >
> >  Since they are Opteron NUMA systems, we suspect that it has
> >  something to do with the distance between the processor and the IB
> >  adapter.  I'm illustrating it in a diagram below:
> >
> >   node 0               node 1
> >   3----2               2----3
> >   |    |               |    |
> >   |    |               |    |
> >   1----0               0----1
> >        |               |
> >     adapter         adapter
> >
> >  In the above diagram core 0 is the closest to the adapter;
> >  therefore, if core 0's on the two nodes communicate with each other
> >  you can get the highest bandwidth. On the other hand, if core 3's
> >  communicate, the bandwidth can be 200 MB/s less. We have observed
> >  this kind of behavior on Opteron NUMA systems before and have seen
> >  that the receive process's position matters the most (the sender
> >  side has almost no impact).
> >
> >  Based on your description, it is very likely that the topology on
> >  your head node is different from that on the slave nodes -
> >  probably core 0 is the nearest to the adapter on the slave nodes
> >  while the farthest on the head node. Since by default the two
> >  processes run on core 0's, when the head node is the receiver you
> >  could observe lower bandwidth. Could you verify the physical setup
> >  of your head node? It's preferable to make it identical to the
> >  slave nodes.
> >
> >
> >  > If it's relevant, I'm also running ip over ib, and my head node
> >  > runs iptables.  Could this cause the slowdown?
> >
> >  We feel that this shouldn't be the cause. To make sure, could you
> >  disable ipoib and see if the bandwidth difference disappears?
> >
> >  Lei
> >
> >
> >
> >
> >  >
> >  > On Tue, Mar 18, 2008 at 7:36 PM, Dhabaleswar Panda
> >  > <panda at cse.ohio-state.edu> wrote:
> >  > >  What kind of platforms are you using - Opteron NUMA or Intel?
> >  > >  Are the two systems homogeneous in terms of processor speed
> >  > >  and memory speed? Do the two systems have identical NICs - hw
> >  > >  and firmware? Such information will help to understand this
> >  > >  problem better.
> >  > >
> >  > >  DK
> >  > >
> >  > >
> >  > >
> >  > >  On Sun, 16 Mar 2008, Brian Budge wrote:
> >  > >
> >  > >  > Hi all -
> >  > >  >
> >  > >  > I am running the osu_bw bandwidth test on my small cluster.
> >  > >  > I'm seeing some slightly strange behavior:  Let's say I have
> >  > >  > nodes 0 and 1.  When I launch the bw test from node 0 and
> >  > >  > run the test on 0 and 1, I see a max bandwidth of 650 MB/s.
> >  > >  > However, when I run from node 1 and run the test on 0 and 1,
> >  > >  > I see a max bandwidth of close to 850 MB/s.  Does anyone know
> >  > >  > how I might diagnose/fix this issue?  Has anyone seen it
> >  > >  > before?
> >  > >  >
> >  > >  > Thanks,
> >  > >  >   Brian
> >  > >  > _______________________________________________
> >  > >  > mvapich-discuss mailing list
> >  > >  > mvapich-discuss at cse.ohio-state.edu
> >  > >  > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >  > >  >
> >  > >
> >  > >
> >  > _______________________________________________
> >  > mvapich-discuss mailing list
> >  > mvapich-discuss at cse.ohio-state.edu
> >  > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >  >
> >
> >
> 


