[mvapich-discuss] help getting multirail working

Abhinav Vishnu vishnu at cse.ohio-state.edu
Fri Apr 14 13:18:46 EDT 2006


Hi Jimmy,

Thanks for your mail.


> Hi Abhinav,
>
> I didn't realise that it was possible to have more than one HCA in
> a machine, the last time I checked with our IB vendor, the drivers
> could only handle one HCA on a machine. I guess my last post was kinda
> pointless.
>
> Out of curiosity, is there a list of HCAs that are capable of the
> function that James is looking for? (more than one HCA in one machine)
>

AFAIK, there should not be a problem running multiple HCAs on either the
VAPI or the OpenIB Gen2 driver. We have tested and experimented with
MVAPICH using multiple HCAs on both stacks.

We have done experimentation with combinations of Mellanox PCI-X
based HCAs (MT23108). A performance evaluation of this combination is
available on the MVAPICH website:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ -> Performance

An in-depth evaluation of this setup was published at Supercomputing 2004
and is available at:

http://nowlab.cse.ohio-state.edu/publications/conf-papers/2004/liuj-sc04.pdf

In addition, we have also tested combinations of PCI-X and
PCI-Express (MemFull and MemFree) adapters, as well as different
combinations of the PCI-Express adapters themselves (MT25208,
SDR/DDR, MemFull and MemFree).

In short, there should not be a problem installing multiple
HCAs and using them simultaneously. We would also be happy
to help you out in running MVAPICH on such configurations.
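
If it helps as a first sanity check, you could run the same multirail
build with only one rail enabled and compare it against the two-rail
invocation quoted below (NUM_HCAS=1 on a multirail build is an assumption
on my part; the VAR=value pairs before the executable are the usual way
mpirun_rsh is given environment variables):

  # Single rail only: should behave like the non-multirail build.
  # If only the NUM_HCAS=2 run fails, the second HCA path is suspect.
  ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile \
      NUM_PORTS=1 NUM_HCAS=1 /root/OSU-benchmarks/osu_bw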

Thanks and best regards,

-- Abhinav

> Jim.
>
> On Fri, Apr 14, 2006 at 11:52:30AM -0400, Abhinav Vishnu wrote:
> > Hi James,
> >
> > Thanks for using multirail MVAPICH and reporting the problem.
> >
> > There are several parameters that control the number of ports per HCA
> > and the number of HCAs used for communication. By default, the number
> > of ports is 2. This can be changed with the environment variable
> > NUM_PORTS. Since you are using one port per HCA and two HCAs, I would
> > recommend using NUM_PORTS=1 and NUM_HCAS=2.
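> >
> > For example (hostfile and benchmark paths as in your mail; passing
> > the VAR=value pairs before the executable is how mpirun_rsh normally
> > forwards environment variables):
> >
> >   ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile \
> >       NUM_PORTS=1 NUM_HCAS=2 /root/OSU-benchmarks/osu_bw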
> >
> > Please let me know if the problem persists.
> >
> > With best regards,
> >
> > -- Abhinav
> > -------------------------------
> > Abhinav Vishnu,
> > Graduate Research Associate,
> > Department Of Comp. Sc. & Engg.
> > The Ohio State University.
> > -------------------------------
> >
> > On Fri, 14 Apr 2006, James T Klosowski wrote:
> >
> > > Hi,
> > >
> > > I'm trying to get the multirail feature working but have not had any
> > > success.  I have not found much documentation on how to do it.  If you can
> > > point me to some, I'd appreciate it.
> > >
> > >
> > > My current configuration is simply 2 nodes, each with 2 HCAs (MT23108).  I
> > > downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> > > TopSpin stack (3.1.0-113).
> > >
> > > I'm running on RHEL 4 U1 machines.  In one machine, both HCAs are on
> > > different PCI-X 133 MHz buses; in the other machine, one HCA is on a
> > > 133 MHz bus and the other is on a 100 MHz bus.
> > >
> > >
> > > I first compiled using make.mvapich.vapi and was able to run the OSU
> > > benchmarks without any problems.
> > >
> > > I then compiled successfully using make.mvapich.vapi_multirail, but when I
> > > tried to run the OSU benchmarks, I got VAPI_RETRY_EXC_ERR midway through
> > > the benchmark, presumably when the code finally tries to use the
> > > 2nd rail.
> > >
> > > Below is the output of my benchmark run.  It is consistent in that it will
> > > always fail after the 4096 test.  Again, using the version compiled
> > > without multirail support works just fine (without changing anything other
> > > than the version of mvapich I'm using).
> > >
> > > If you have any suggestions on what to try, I'd appreciate it.  I'm not
> > > exactly sure how I should set up the IP addresses... so I included that
> > > information below too.  I am using only one port on each of the two HCAs,
> > > and all four cables connect to the same TopSpin TS120 switch.
> > >
> > > I suspect a configuration problem on my part, but short of that, I was
> > > also thinking of trying the IBGD code from Mellanox.
> > >
> > >
> > > Thanks in advance!
> > >
> > > Jim
> > >
> > >
> > >
> > > ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> > > /root/OSU-benchmarks/osu_bw
> > >
> > > # OSU MPI Bandwidth Test (Version 2.2)
> > > # Size          Bandwidth (MB/s)
> > > 1               0.284546
> > > 2               0.645845
> > > 4               1.159683
> > > 8               2.591093
> > > 16              4.963886
> > > 32              10.483747
> > > 64              20.685824
> > > 128             36.271862
> > > 256             78.276241
> > > 512             146.724578
> > > 1024            237.888853
> > > 2048            295.633345
> > > 4096            347.127837
> > > [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
> > >         code=VAPI_RETRY_EXC_ERR, vendor code=81
> > >         at line 2114 in file viacheck.c
> > >         Timeout alarm signaled
> > >         Cleaning up all processes ...done.
> > >
> > >
> > > My machine file is just the 2 hostnames:
> > >
> > > cat /root/hostfile
> > > vis460
> > > vis30
> > >
> > >
> > >
> > >
> > > ifconfig
> > > eth0      Link encap:Ethernet  HWaddr 00:0D:60:98:20:B8
> > >           inet addr:9.2.12.221  Bcast:9.2.15.255  Mask:255.255.248.0
> > >           inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
> > >           TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:1000
> > >           RX bytes:926406322 (883.4 MiB)  TX bytes:94330491 (89.9 MiB)
> > >           Interrupt:185
> > >
> > > ib0       Link encap:Ethernet  HWaddr 93:C9:C9:6F:5D:7C
> > >           inet addr:10.10.5.46  Bcast:10.10.5.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> > >           RX packets:175 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:128
> > >           RX bytes:11144 (10.8 KiB)  TX bytes:11638 (11.3 KiB)
> > >
> > > ib2       Link encap:Ethernet  HWaddr 65:9A:4B:CF:8D:00
> > >           inet addr:12.12.5.46  Bcast:12.12.5.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> > >           RX packets:257 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:128
> > >           RX bytes:15180 (14.8 KiB)  TX bytes:15071 (14.7 KiB)
> > >
> > > lo        Link encap:Local Loopback
> > >           inet addr:127.0.0.1  Mask:255.0.0.0
> > >           inet6 addr: ::1/128 Scope:Host
> > >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> > >           RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:7521844 (7.1 MiB)  TX bytes:7521844 (7.1 MiB)
> > >
> > >
> > >
> > >
> >
> ---end quoted text---
>
> --
> Jimmy Tang
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin.
> http://www.tchpc.tcd.ie/
>


