[mvapich-discuss] help getting multirail working
Jimmy Tang
jtang at tchpc.tcd.ie
Fri Apr 14 12:18:51 EDT 2006
Hi Abhinav,
I didn't realise that it was possible to have more than one HCA in
a machine, the last time I checked with our IB vendor, the drivers
could only handle one HCA on a machine. I guess my last post was kinda
pointless.
Out of curiosity, is there a list of HCAs that are capable of the
function that James is looking for? (more than one HCA in one machine)
Jim.
On Fri, Apr 14, 2006 at 11:52:30AM -0400, Abhinav Vishnu wrote:
> Hi James,
>
> Thanks for using multirail MVAPICH and reporting the problem.
>
> There are various parameters which are used to define the number of
> ports/HCA and number of HCAs to be used for communication. By default,
> the number of ports is 2. This can be changed using environment variable
> NUM_PORTS. Since you are using one port per HCA and two HCAs, I would
> recommend using NUM_PORTS=1 and NUM_HCAS=2.
>
> Please let me know if the problem persists.
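>
> For example, the two variables can be exported in the shell that
> launches mpirun_rsh (a minimal sketch; it assumes, as with other
> MVAPICH tunables, that the launcher forwards exported environment
> variables to the remote processes):

```shell
# Configure the multirail parameters described above:
# one active port per HCA, two HCAs per node.
export NUM_PORTS=1
export NUM_HCAS=2

# Confirm the settings, then relaunch the benchmark as before, e.g.:
#   ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile /root/OSU-benchmarks/osu_bw
echo "NUM_PORTS=$NUM_PORTS NUM_HCAS=$NUM_HCAS"
```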
>
> With best regards,
>
> -- Abhinav
> -------------------------------
> Abhinav Vishnu,
> Graduate Research Associate,
> Department Of Comp. Sc. & Engg.
> The Ohio State University.
> -------------------------------
>
> On Fri, 14 Apr 2006, James T Klosowski wrote:
>
> > Hi,
> >
> > I'm trying to get the multirail feature working but have not had any
> > success. I have not found much documentation on how to do it. If you can
> > point me to some, I'd appreciate it.
> >
> >
> > My current configuration is simply 2 nodes, each with 2 HCAs (MT23108). I
> > downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> > TopSpin stack (3.1.0-113).
> >
> > I'm running on RHEL 4 U1 machines. In one machine, both HCAs are on
> > different PCI-X 133 MHz buses; in the other machine one HCA is on a 133 MHz bus and
> > the other is on a 100 MHz bus.
> >
> >
> > I first compiled using make.mvapich.vapi and was able to run the OSU
> > benchmarks without any problems.
> >
> > I then compiled successfully using make.mvapich.vapi_multirail, but when I
> > tried to run the OSU benchmarks, I get VAPI_RETRY_EXC_ERR midway through
> > the benchmark, ... presumably when the code is finally trying to use the
> > 2nd rail.
> >
> > Below is the output of my benchmark run. It is consistent in that it will
> > always fail after the 4096 test. Again, using the version compiled
> > without multirail support works just fine (without changing anything other
> > than the version of MVAPICH I'm using).
> >
> > If you have any suggestions on what to try, I'd appreciate it. I'm not
> > exactly sure how I should set up the IP addresses... so I included that
> > information below too. I am using only one port on each of the two HCAs,
> > and all four cables connect to the same TopSpin TS120 switch.
> >
> > I suspect a configuration problem on my part, but short of that, I was
> > also thinking of trying the IBGD code from Mellanox.
> >
> >
> > Thanks in advance!
> >
> > Jim
> >
> >
> >
> > ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> > /root/OSU-benchmarks/osu_bw
> >
> > # OSU MPI Bandwidth Test (Version 2.2)
> > # Size Bandwidth (MB/s)
> > 1 0.284546
> > 2 0.645845
> > 4 1.159683
> > 8 2.591093
> > 16 4.963886
> > 32 10.483747
> > 64 20.685824
> > 128 36.271862
> > 256 78.276241
> > 512 146.724578
> > 1024 237.888853
> > 2048 295.633345
> > 4096 347.127837
> > [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
> > code=VAPI_RETRY_EXC_ERR, vendor code=81
> > at line 2114 in file viacheck.c
> > Timeout alarm signaled
> > Cleaning up all processes ...done.
> >
> >
> > My machine file is just the 2 hostnames:
> >
> > cat /root/hostfile
> > vis460
> > vis30
> >
> >
> >
> >
> > ifconfig
> > eth0 Link encap:Ethernet HWaddr 00:0D:60:98:20:B8
> > inet addr:9.2.12.221 Bcast:9.2.15.255 Mask:255.255.248.0
> > inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
> > TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:926406322 (883.4 MiB) TX bytes:94330491 (89.9 MiB)
> > Interrupt:185
> >
> > ib0 Link encap:Ethernet HWaddr 93:C9:C9:6F:5D:7C
> > inet addr:10.10.5.46 Bcast:10.10.5.255 Mask:255.255.255.0
> > inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> > RX packets:175 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
> > collisions:0 txqueuelen:128
> > RX bytes:11144 (10.8 KiB) TX bytes:11638 (11.3 KiB)
> >
> > ib2 Link encap:Ethernet HWaddr 65:9A:4B:CF:8D:00
> > inet addr:12.12.5.46 Bcast:12.12.5.255 Mask:255.255.255.0
> > inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> > RX packets:257 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
> > collisions:0 txqueuelen:128
> > RX bytes:15180 (14.8 KiB) TX bytes:15071 (14.7 KiB)
> >
> > lo Link encap:Local Loopback
> > inet addr:127.0.0.1 Mask:255.0.0.0
> > inet6 addr: ::1/128 Scope:Host
> > UP LOOPBACK RUNNING MTU:16436 Metric:1
> > RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:7521844 (7.1 MiB) TX bytes:7521844 (7.1 MiB)
> >
> >
> >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
---end quoted text---
--
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/