[mvapich-discuss] help getting multirail working
Abhinav Vishnu
vishnu at cse.ohio-state.edu
Fri Apr 14 13:55:26 EDT 2006
Hi Jimmy,
> Hi James
>
> I've only tested the multirail stuff with Voltaire HCAs and switches,
> but to my knowledge multirail only works with dual-port HCAs, and each
> machine can only have one HCA, or else your system will just get confused
> and fail to work.
AFAIK, there should not be a problem running a combination of multiple
ports and multiple HCAs together. As I mentioned in my previous mail to
mvapich-discuss, we have done in-depth performance evaluation and testing
using combinations of multiple ports and multiple HCAs.
In fact, starting with the 0.9.7 version of MVAPICH, we also support using
multiple queue pairs per port. This can be changed using the
NUM_QP_PER_PORT environment variable.
Please see section 9.3 of MVAPICH user guide at:
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html
We will be more than happy to help you out with running any of these
combinations with MVAPICH.
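As a quick sketch, the knobs mentioned above are plain environment
variables, so they can be exported before launching. The values below are
illustrative assumptions for a 2-HCA, 2-port setup, not tested advice;
the commented launch line reuses the command from the original report:

```shell
# Sketch: multirail tuning knobs for MVAPICH 0.9.7 (values illustrative).
export NUM_HCAS=2          # number of adapters to stripe across
export NUM_PORTS=2         # ports used per adapter (2 is the default)
export NUM_QP_PER_PORT=1   # queue pairs per port (new in 0.9.7)

# A launch would then look something like:
#   ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile ./osu_bw

# Show what will be inherited by the launched processes.
echo "NUM_HCAS=$NUM_HCAS NUM_PORTS=$NUM_PORTS NUM_QP_PER_PORT=$NUM_QP_PER_PORT"
```

See sections 9.3-9.5 of the user guide for the authoritative list of
variables and their defaults.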
>
> also, you do not need to set IPs for the IPoIB interfaces (or ib* in
> your case); mvapich will use your existing TCP/IPv4 address and figure
> out what/where to connect to on the IB fabric. So as long as your
> Ethernet network is working and your HCA diagnostic tools show that
> your IB interface is up, it should be OK.
>
> you should try again using one HCA in each machine and plug both ports
> of each HCA into your switch; you shouldn't need to recompile mvapich,
> it should just work.
>
As mentioned above, there would be no need to remove the IBA cards at all.
The multirail device of MVAPICH works with both ports connected because
the default value of the NUM_PORTS environment variable is 2. Please refer
to sections 9.4 and 9.5, which describe the NUM_HCAS and NUM_PORTS
environment variables.
Please let us know if you face any problems running any of the
combinations mentioned above.
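For reference, a configuration matching the cabling reported below (only
one port connected on each of two HCAs) could be sketched by overriding
the NUM_PORTS default of 2. This is a hedged illustration, not tested
advice, and it assumes one rail per connected HCA-port pair:

```shell
# Sketch: stripe across two HCAs using one port each, matching a setup
# where only port 1 of each HCA is cabled. Values are illustrative.
export NUM_HCAS=2    # both adapters participate
export NUM_PORTS=1   # override the default of 2: only one port is cabled

# Assumption: total rails = HCAs x ports per HCA.
echo "rails = $((NUM_HCAS * NUM_PORTS))"
```

With the default NUM_PORTS=2, the library would also try the uncabled
second port of each HCA, which is worth ruling out when debugging.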
Thanks and best regards,
-- Abhinav
> but someone else who knows more can confirm or clarify the above.
>
>
> Jim.
>
> On Fri, Apr 14, 2006 at 11:29:17AM -0400, James T Klosowski wrote:
> > Hi,
> >
> > I'm trying to get the multirail feature working but have not had any
> > success. I have not found much documentation on how to do it. If you can
> > point me to some, I'd appreciate it.
> >
> >
> > My current configuration is simply 2 nodes, each with 2 HCAs (MT23108). I
> > downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> > TopSpin stack (3.1.0-113).
> >
> > I'm running on RHEL 4 U1 machines. In one machine, both HCAs are on
> > different PCI-X 133 MHz buses; in the other machine, one HCA is on a
> > 133 MHz bus and the other is on a 100 MHz bus.
> >
> >
> > I first compiled using make.mvapich.vapi and was able to run the OSU
> > benchmarks without any problems.
> >
> > I then compiled successfully using make.mvapich.vapi_multirail, but when I
> > tried to run the OSU benchmarks, I get VAPI_RETRY_EXC_ERR midway through
> > the benchmark, ... presumably when the code is finally trying to use the
> > 2nd rail.
> >
> > Below is the output of my benchmark run. It is consistent in that it will
> > always fail after the 4096 test. Again, using the version compiled
> > without multirail support works just fine (without changing anything other
> > than the version of mvapich I'm using).
> >
> > If you have any suggestions on what to try, I'd appreciate it. I'm not
> > exactly sure how I should set up the IP addresses... so I included that
> > information below too. I am using only one port on each of the two HCAs,
> > and all four cables connect to the same TopSpin TS120 switch.
> >
> > I suspect a configuration problem on my part, but short of that, I was
> > also thinking of trying the IBGD code from Mellanox.
> >
> >
> > Thanks in advance!
> >
> > Jim
> >
> >
> >
> > ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> > /root/OSU-benchmarks/osu_bw
> >
> > # OSU MPI Bandwidth Test (Version 2.2)
> > # Size Bandwidth (MB/s)
> > 1 0.284546
> > 2 0.645845
> > 4 1.159683
> > 8 2.591093
> > 16 4.963886
> > 32 10.483747
> > 64 20.685824
> > 128 36.271862
> > 256 78.276241
> > 512 146.724578
> > 1024 237.888853
> > 2048 295.633345
> > 4096 347.127837
> > [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
> > code=VAPI_RETRY_EXC_ERR, vendor code=81
> > at line 2114 in file viacheck.c
> > Timeout alarm signaled
> > Cleaning up all processes ...done.
> >
> >
> > My machine file is just the 2 hostnames:
> >
> > cat /root/hostfile
> > vis460
> > vis30
> >
> >
> >
> >
> > ifconfig
> > eth0 Link encap:Ethernet HWaddr 00:0D:60:98:20:B8
> > inet addr:9.2.12.221 Bcast:9.2.15.255 Mask:255.255.248.0
> > inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
> > TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:926406322 (883.4 MiB) TX bytes:94330491 (89.9 MiB)
> > Interrupt:185
> >
> > ib0 Link encap:Ethernet HWaddr 93:C9:C9:6F:5D:7C
> > inet addr:10.10.5.46 Bcast:10.10.5.255 Mask:255.255.255.0
> > inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> > RX packets:175 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
> > collisions:0 txqueuelen:128
> > RX bytes:11144 (10.8 KiB) TX bytes:11638 (11.3 KiB)
> >
> > ib2 Link encap:Ethernet HWaddr 65:9A:4B:CF:8D:00
> > inet addr:12.12.5.46 Bcast:12.12.5.255 Mask:255.255.255.0
> > inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> > RX packets:257 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
> > collisions:0 txqueuelen:128
> > RX bytes:15180 (14.8 KiB) TX bytes:15071 (14.7 KiB)
> >
> > lo Link encap:Local Loopback
> > inet addr:127.0.0.1 Mask:255.0.0.0
> > inet6 addr: ::1/128 Scope:Host
> > UP LOOPBACK RUNNING MTU:16436 Metric:1
> > RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:7521844 (7.1 MiB) TX bytes:7521844 (7.1 MiB)
> >
> >
> >
>
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
> ---end quoted text---
>
> --
> Jimmy Tang
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin.
> http://www.tchpc.tcd.ie/
>