[mvapich-discuss] help getting multirail working
James T Klosowski
jklosow at us.ibm.com
Fri Apr 14 11:29:17 EDT 2006
Hi,
I'm trying to get the multirail feature working but have not had any
success. I have not found much documentation on how to do it. If you can
point me to some, I'd appreciate it.
My current configuration is simply 2 nodes, each with 2 HCAs (MT23108). I
downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
TopSpin stack (3.1.0-113).
I'm running on RHEL 4 U1 machines. In one machine, both HCAs are on
different PCI-X 133 MHz buses; in the other machine, one HCA is on a
133 MHz bus and the other is on a 100 MHz bus.
I first compiled using make.mvapich.vapi and was able to run the OSU
benchmarks without any problems.
I then compiled successfully using make.mvapich.vapi_multirail, but when I
tried to run the OSU benchmarks, I got VAPI_RETRY_EXC_ERR midway through
the benchmark, presumably when the code is finally trying to use the
2nd rail.
Below is the output of my benchmark run. The failure is consistent: the run
always aborts after the 4096-byte test. Again, the version compiled
without multirail support works just fine (without changing anything other
than the version of mvapich I'm using).
If you have any suggestions on what to try, I'd appreciate it. I'm not
exactly sure how I should set up the IP addresses... so I included that
information below too. I am using only one port on each of the two HCAs,
and all four cables connect to the same TopSpin TS120 switch.
I suspect a configuration problem on my part, but short of that, I was
also thinking of trying the IBGD code from Mellanox.
Thanks in advance!
Jim
./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile /root/OSU-benchmarks/osu_bw
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1 0.284546
2 0.645845
4 1.159683
8 2.591093
16 4.963886
32 10.483747
64 20.685824
128 36.271862
256 78.276241
512 146.724578
1024 237.888853
2048 295.633345
4096 347.127837
[0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
code=VAPI_RETRY_EXC_ERR, vendor code=81
at line 2114 in file viacheck.c
Timeout alarm signaled
Cleaning up all processes ...done.
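
To confirm that the failure always lands at the same message size across runs, the benchmark output can be parsed mechanically. The following is a minimal sketch, assuming the two-column "size bandwidth" layout shown above; the names `parse_osu_bw` and `last_completed_size` are hypothetical helpers, not part of MVAPICH or the OSU benchmarks, and the sample text is just the tail of the run above.

```python
import re

# Tail of the osu_bw run shown above, including the abort message.
SAMPLE = """\
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1024 237.888853
2048 295.633345
4096 347.127837
[0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
code=VAPI_RETRY_EXC_ERR, vendor code=81
at line 2114 in file viacheck.c
Timeout alarm signaled
"""

# A result row is an integer size followed by a decimal bandwidth.
ROW = re.compile(r"^(\d+)\s+(\d+(?:\.\d+)?)$")


def parse_osu_bw(text):
    """Return a list of (size_bytes, bandwidth_mb_s) tuples from osu_bw output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), float(match.group(2))))
    return rows


def last_completed_size(text):
    """Return the largest message size that produced a result row, or None."""
    rows = parse_osu_bw(text)
    return rows[-1][0] if rows else None


if __name__ == "__main__":
    print("last completed size:", last_completed_size(SAMPLE))
```

Running this over several captured logs and comparing the reported size would show whether the abort is tied to one message size (here, the step after 4096 bytes) or varies from run to run.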
My machine file is just the 2 hostnames:
cat /root/hostfile
vis460
vis30
ifconfig
eth0 Link encap:Ethernet HWaddr 00:0D:60:98:20:B8
inet addr:9.2.12.221 Bcast:9.2.15.255 Mask:255.255.248.0
inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:926406322 (883.4 MiB) TX bytes:94330491 (89.9 MiB)
Interrupt:185
ib0 Link encap:Ethernet HWaddr 93:C9:C9:6F:5D:7C
inet addr:10.10.5.46 Bcast:10.10.5.255 Mask:255.255.255.0
inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:175 errors:0 dropped:0 overruns:0 frame:0
TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:11144 (10.8 KiB) TX bytes:11638 (11.3 KiB)
ib2 Link encap:Ethernet HWaddr 65:9A:4B:CF:8D:00
inet addr:12.12.5.46 Bcast:12.12.5.255 Mask:255.255.255.0
inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:257 errors:0 dropped:0 overruns:0 frame:0
TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:15180 (14.8 KiB) TX bytes:15071 (14.7 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:7521844 (7.1 MiB) TX bytes:7521844 (7.1 MiB)
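
As a quick sanity check on the addressing above, the two IPoIB interfaces can be compared to see whether they sit on separate subnets. This is only a sketch using Python's standard `ipaddress` module with the addresses and masks copied from the ifconfig output; the helper name `subnets_overlap` is hypothetical, and whether the IPoIB addressing has any bearing on the multirail failure is an open question here, not something this check establishes.

```python
import ipaddress

# Address/netmask pairs copied from the ifconfig output above.
IB0 = "10.10.5.46/255.255.255.0"
IB2 = "12.12.5.46/255.255.255.0"


def subnets_overlap(iface_a, iface_b):
    """Return True if the two interface addresses fall in overlapping networks."""
    net_a = ipaddress.ip_interface(iface_a).network
    net_b = ipaddress.ip_interface(iface_b).network
    return net_a.overlaps(net_b)


if __name__ == "__main__":
    for name, iface in (("ib0", IB0), ("ib2", IB2)):
        print(name, "network:", ipaddress.ip_interface(iface).network)
    print("overlap:", subnets_overlap(IB0, IB2))
```

With the values shown, ib0 and ib2 resolve to different /24 networks, so at the IP level the two rails are at least not configured onto the same subnet.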