[mvapich-discuss] help getting multirail working
Abhinav Vishnu
vishnu at cse.ohio-state.edu
Fri Apr 14 12:50:48 EDT 2006
Hi James,
Sorry, I forgot to mention the MVAPICH user guide, which provides
configuration examples, debugging information, and a list of the
environment variables that can be used.
Please refer to the user guide at:
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html
Section 7 of the user guide contains a couple of troubleshooting
examples. 7.3.3 is an example in which a user application aborts with
VAPI_RETRY_EXC_ERR.
VAPI provides a utility, vstat, which can be used to check the status
of the IB communication ports. For example:
[vishnu@e8-lustre:~] vstat
hca_id=InfiniHost_III_Ex0
pci_location={BUS=0x04,DEV/FUNC=0x00}
vendor_id=0x02C9
vendor_part_id=0x6282
hw_ver=0xA0
fw_ver=5.1.0
PSID not available -- FW not installed using fail-safe mode
num_phys_ports=2
port=1
port_state=PORT_ACTIVE<-
sm_lid=0x0069
port_lid=0x00a9
port_lmc=0x00
max_mtu=2048
port=2
port_state=PORT_DOWN
sm_lid=0x0000
port_lid=0x00aa
port_lmc=0x00
max_mtu=2048
vstat on your machine(s) should list two HCAs. Please make sure that
the first port on both HCAs is in the PORT_ACTIVE state. If they are
in the PORT_INITIALIZE state instead, the subnet manager can be started
as follows:
[vishnu@e10-lustre:~] sudo opensm -o
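To confirm that the ports came up, the vstat output can be filtered for
the port state. A minimal sketch, where the printf stands in for real
vstat output on a node (run `vstat | grep -c 'PORT_ACTIVE'` there
instead):

```shell
# Sketch: count ACTIVE ports. On a real node, replace the printf with
# `vstat`; with the first port active on each of two HCAs, the count
# should be at least 2.
printf 'port_state=PORT_ACTIVE\nport_state=PORT_DOWN\n' \
  | grep -c 'PORT_ACTIVE'
```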
>
> My current configuration is simply 2 nodes, each with 2 HCAs (MT23108). I
> downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> TopSpin stack (3.1.0-113).
>
> I'm running on RHEL 4 U1 machines. In one machine, both HCAs are on
> different PCI-X 133 MHz buses; in the other machine, one HCA is on a
> 133 MHz bus and the other is on a 100 MHz bus.
Even though the two machines do not have identical configurations, it
should not be a problem to run them together using multirail.
>
>
> I first compiled using make.mvapich.vapi and was able to run the OSU
> benchmarks without any problems.
>
> I then compiled successfully using make.mvapich.vapi_multirail, but when I
> tried to run the OSU benchmarks, I get VAPI_RETRY_EXC_ERR midway through
> the benchmark, ... presumably when the code is finally trying to use the
> 2nd rail.
>
> Below is the output of my benchmark run. It is consistent in that it will
> always fail after the 4096 test. Again, using the version compiled
> without multirail support works just fine (without changing anything other
> than the version of mvapich I'm using).
>
In my previous email, I forgot to mention the environment variable
STRIPING_THRESHOLD. Multirail MVAPICH uses this value to decide whether
a message will be striped across the available paths, which can be a
combination of multiple ports and multiple HCAs.
Sections 9.4 and 9.5 of the user guide describe the environment
variables NUM_PORTS and NUM_HCAS. These can be combined: for example,
on a cluster where each node has 2 HCAs with 2 ports per HCA, setting
NUM_PORTS=2 and NUM_HCAS=2 allows multirail to use all ports on all
HCAs.
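As an illustration, such a run could also set the striping threshold
explicitly; note that the value 8192 below is only an assumption for
the example, not a recommendation from the user guide:

```shell
# Illustrative sketch: use both HCAs (one port each) and stripe
# messages larger than 8 KB across them. The STRIPING_THRESHOLD value
# is an assumed example; see sections 9.4/9.5 of the user guide.
./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile \
    NUM_PORTS=1 NUM_HCAS=2 STRIPING_THRESHOLD=8192 \
    /root/OSU-benchmarks/osu_bw
```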
> If you have any suggestions on what to try, I'd appreciate it. I'm not
> exactly sure how I should set up the IP addresses... so I included that
> information below too. I am using only one port on each of the two HCAs,
> and all four cables connect to the same TopSpin TS120 switch.
>
The following change to the command line should solve the problem for you:
./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
NUM_PORTS=1 NUM_HCAS=2 /root/OSU-benchmarks/osu_bw
Please let us know if the problem persists.
Thanks and best regards,
-- Abhinav
> Thanks in advance!
>
> Jim
>
>
>
> ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> /root/OSU-benchmarks/osu_bw
>
> # OSU MPI Bandwidth Test (Version 2.2)
> # Size Bandwidth (MB/s)
> 1 0.284546
> 2 0.645845
> 4 1.159683
> 8 2.591093
> 16 4.963886
> 32 10.483747
> 64 20.685824
> 128 36.271862
> 256 78.276241
> 512 146.724578
> 1024 237.888853
> 2048 295.633345
> 4096 347.127837
> [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
> code=VAPI_RETRY_EXC_ERR, vendor code=81
> at line 2114 in file viacheck.c
> Timeout alarm signaled
> Cleaning up all processes ...done.
>
>
> My machine file is just the 2 hostnames:
>
> cat /root/hostfile
> vis460
> vis30
>
>
>
>
> ifconfig
> eth0 Link encap:Ethernet HWaddr 00:0D:60:98:20:B8
> inet addr:9.2.12.221 Bcast:9.2.15.255 Mask:255.255.248.0
> inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
> TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:926406322 (883.4 MiB) TX bytes:94330491 (89.9 MiB)
> Interrupt:185
>
> ib0 Link encap:Ethernet HWaddr 93:C9:C9:6F:5D:7C
> inet addr:10.10.5.46 Bcast:10.10.5.255 Mask:255.255.255.0
> inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> RX packets:175 errors:0 dropped:0 overruns:0 frame:0
> TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
> collisions:0 txqueuelen:128
> RX bytes:11144 (10.8 KiB) TX bytes:11638 (11.3 KiB)
>
> ib2 Link encap:Ethernet HWaddr 65:9A:4B:CF:8D:00
> inet addr:12.12.5.46 Bcast:12.12.5.255 Mask:255.255.255.0
> inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> RX packets:257 errors:0 dropped:0 overruns:0 frame:0
> TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
> collisions:0 txqueuelen:128
> RX bytes:15180 (14.8 KiB) TX bytes:15071 (14.7 KiB)
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
> TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:7521844 (7.1 MiB) TX bytes:7521844 (7.1 MiB)
>
>
>
>