[mvapich-discuss] help getting multirail working

Jimmy Tang jtang at tchpc.tcd.ie
Fri Apr 14 12:18:51 EDT 2006


Hi Abhinav,

I didn't realise that it was possible to have more than one HCA in
a machine; the last time I checked with our IB vendor, the drivers
could only handle one HCA per machine. I guess my last post was kinda
pointless.

Out of curiosity, is there a list of HCAs that are capable of the
functionality James is looking for (more than one HCA in one machine)?

Jim.

On Fri, Apr 14, 2006 at 11:52:30AM -0400, Abhinav Vishnu wrote:
> Hi James,
> 
> Thanks for using multirail MVAPICH and reporting the problem.
> 
> There are various parameters that define the number of ports per HCA
> and the number of HCAs to be used for communication. By default, the
> number of ports is 2; this can be changed with the environment
> variable NUM_PORTS. Since you are using one port per HCA and two HCAs,
> I would recommend setting NUM_PORTS=1 and NUM_HCAS=2.
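> 
> For example (a sketch only; with MVAPICH's mpirun_rsh, environment
> variables can usually be passed as VAR=value pairs on the command line
> before the executable, but the exact syntax may differ in your
> installation):
> 
>     ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile \
>         NUM_PORTS=1 NUM_HCAS=2 /root/OSU-benchmarks/osu_bw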
> 
> Please let me know if the problem persists.
> 
> With best regards,
> 
> -- Abhinav
> -------------------------------
> Abhinav Vishnu,
> Graduate Research Associate,
> Department Of Comp. Sc. & Engg.
> The Ohio State University.
> -------------------------------
> 
> On Fri, 14 Apr 2006, James T Klosowski wrote:
> 
> > Hi,
> >
> > I'm trying to get the multirail feature working but have not had any
> > success.  I have not found much documentation on how to do it.  If you can
> > point me to some, I'd appreciate it.
> >
> >
> > My current configuration is simply 2 nodes, each with 2 HCAs (MT23108).  I
> > downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> > TopSpin stack (3.1.0-113).
> >
> > I'm running on RHEL 4 U1 machines.  In one machine, both HCAs are on
> > different PCI-X 133 MHz buses; in the other machine, one HCA is on a
> > 133 MHz bus and the other is on a 100 MHz bus.
> >
> >
> > I first compiled using make.mvapich.vapi and was able to run the OSU
> > benchmarks without any problems.
> >
> > I then compiled successfully using make.mvapich.vapi_multirail, but when I
> > tried to run the OSU benchmarks, I got VAPI_RETRY_EXC_ERR midway through
> > the benchmark, presumably when the code finally tries to use the
> > 2nd rail.
> >
> > Below is the output of my benchmark run.  It is consistent in that it will
> > always fail after the 4096 test.  Again, using the version compiled
> > without multirail support works just fine (without changing anything other
> > than the version of mvapich I'm using).
> >
> > If you have any suggestions on what to try, I'd appreciate it.  I'm not
> > exactly sure how I should set up the IP addresses, so I included that
> > information below too.  I am using only one port on each of the two HCAs,
> > and all four cables connect to the same TopSpin TS120 switch.
> >
> > I suspect a configuration problem on my part, but short of that, I was
> > also thinking of trying the IBGD code from Mellanox.
> >
> >
> > Thanks in advance!
> >
> > Jim
> >
> >
> >
> > ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> > /root/OSU-benchmarks/osu_bw
> >
> > # OSU MPI Bandwidth Test (Version 2.2)
> > # Size          Bandwidth (MB/s)
> > 1               0.284546
> > 2               0.645845
> > 4               1.159683
> > 8               2.591093
> > 16              4.963886
> > 32              10.483747
> > 64              20.685824
> > 128             36.271862
> > 256             78.276241
> > 512             146.724578
> > 1024            237.888853
> > 2048            295.633345
> > 4096            347.127837
> > [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
> >         code=VAPI_RETRY_EXC_ERR, vendor code=81
> >         at line 2114 in file viacheck.c
> >         Timeout alarm signaled
> >         Cleaning up all processes ...done.
> >
> >
> > My machine file is just the 2 hostnames:
> >
> > cat /root/hostfile
> > vis460
> > vis30
> >
> >
> >
> >
> > ifconfig
> > eth0      Link encap:Ethernet  HWaddr 00:0D:60:98:20:B8
> >           inet addr:9.2.12.221  Bcast:9.2.15.255  Mask:255.255.248.0
> >           inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
> >           TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:926406322 (883.4 MiB)  TX bytes:94330491 (89.9 MiB)
> >           Interrupt:185
> >
> > ib0       Link encap:Ethernet  HWaddr 93:C9:C9:6F:5D:7C
> >           inet addr:10.10.5.46  Bcast:10.10.5.255  Mask:255.255.255.0
> >           inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> >           RX packets:175 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:11144 (10.8 KiB)  TX bytes:11638 (11.3 KiB)
> >
> > ib2       Link encap:Ethernet  HWaddr 65:9A:4B:CF:8D:00
> >           inet addr:12.12.5.46  Bcast:12.12.5.255  Mask:255.255.255.0
> >           inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> >           RX packets:257 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:15180 (14.8 KiB)  TX bytes:15071 (14.7 KiB)
> >
> > lo        Link encap:Local Loopback
> >           inet addr:127.0.0.1  Mask:255.0.0.0
> >           inet6 addr: ::1/128 Scope:Host
> >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> >           RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:7521844 (7.1 MiB)  TX bytes:7521844 (7.1 MiB)
> >
> >
> >
> >
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
---end quoted text---

-- 
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/

