[mvapich-discuss] help getting multirail working

Abhinav Vishnu vishnu at cse.ohio-state.edu
Fri Apr 14 12:50:48 EDT 2006


Hi James,

Sorry, I forgot to mention the MVAPICH user guide, which provides a
list of configuration examples, debugging information, and a list of
environment variables that can be used.

Please refer to the user guide at:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html

Section 7 of the user guide contains a couple of troubleshooting
examples. Section 7.3.3 shows an example in which a user application
aborts with VAPI_RETRY_EXC_ERR.

VAPI provides a utility, vstat, which can be used to check the status
of the IB communication ports. For example:

[vishnu at e8-lustre:~] vstat
        hca_id=InfiniHost_III_Ex0
        pci_location={BUS=0x04,DEV/FUNC=0x00}
        vendor_id=0x02C9
        vendor_part_id=0x6282
        hw_ver=0xA0
        fw_ver=5.1.0
        PSID not available -- FW not installed using fail-safe mode
        num_phys_ports=2
                port=1
                port_state=PORT_ACTIVE<-
                sm_lid=0x0069
                port_lid=0x00a9
                port_lmc=0x00
                max_mtu=2048

                port=2
                port_state=PORT_DOWN
                sm_lid=0x0000
                port_lid=0x00aa
                port_lmc=0x00
                max_mtu=2048

vstat on your machine(s) should list two HCAs. Please make sure that
the first port on both HCAs is in the PORT_ACTIVE state. If a port is
stuck in the PORT_INITIALIZE state, the subnet manager can be started
as follows:

[vishnu at e10-lustre:~] sudo opensm -o
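The port check above can also be scripted. Below is a minimal sketch
(assuming the vstat output format shown earlier) that extracts each
port's state; a captured sample is used as input for illustration, and
on a real node you would pipe `vstat` directly instead of the
here-string:

```shell
# Sketch: list the state of each IB port from vstat-style output.
# The sample below mimics the vstat output shown above; replace it
# with the real `vstat` command on an actual node.
vstat_sample='port=1
port_state=PORT_ACTIVE
port=2
port_state=PORT_DOWN'

# Print one state per port. Anything other than PORT_ACTIVE
# (e.g. PORT_INITIALIZE or PORT_DOWN) needs attention before
# multirail runs will work.
printf '%s\n' "$vstat_sample" | awk -F= '/port_state/ { print $2 }'
```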

>
> My current configuration is simply 2 nodes, each with 2 HCAs (MT23108).  I
> downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> TopSpin stack (3.1.0-113).
>
> I'm running on RHEL 4 U1 machines.  In one machine, both HCAs are on
> different PCI-X 133 buses, in the other machine one HCA is on a 133 bus and
> the other is on a 100 MHz bus.

Even though the two machines do not have identical configurations,
that should not be a problem for running them together using multirail.
>
>
> I first compiled using make.mvapich.vapi and was able to run the OSU
> benchmarks without any problems.
>
> I then compiled successfully using make.mvapich.vapi_multirail, but when I
> tried to run the OSU benchmarks, I get VAPI_RETRY_EXC_ERR midway through
> the benchmark, ... presumably when the code is finally trying to use the
> 2nd rail.
>
> Below is the output of my benchmark run.  It is consistent in that it will
> always fail after the 4096 test.  Again, using the version compiled
> without multirail support works just fine (without changing anything other
> than the version of mvapich I'm using).
>

In my previous email, I forgot to mention the environment variable
STRIPING_THRESHOLD. Multirail MVAPICH uses this value to decide whether
a message will be striped across the available paths, which can be a
combination of multiple ports and multiple HCAs.
Sections 9.4 and 9.5 of the user guide describe the environment
variables NUM_PORTS and NUM_HCAS. These values can be combined: for
example, on a cluster where each node has 2 HCAs with 2 ports per HCA,
setting NUM_PORTS=2 and NUM_HCAS=2 allows multirail to use all ports
on all HCAs.
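As an illustration, those variables can be combined on the mpirun_rsh
command line. This is only a sketch for the 2-HCA, 2-port case
described above, reusing the hostfile and benchmark paths from your
run; the STRIPING_THRESHOLD value shown is an arbitrary example, not a
recommended setting:

```shell
# Sketch: stripe large messages across both HCAs and both ports
# per HCA. STRIPING_THRESHOLD=65536 is an illustrative value only;
# messages at or above it would be striped across the rails.
./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile \
    NUM_PORTS=2 NUM_HCAS=2 STRIPING_THRESHOLD=65536 \
    /root/OSU-benchmarks/osu_bw
```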

> If you have any suggestions on what to try, I'd appreciate it.  I'm not
> exactly sure how I should set up the IP addresses... so I included that
> information below too.  I am using only one port on each of the two HCAs,
> and all four cables connect to the same TopSpin TS120 switch.
>

The following change to the command line should solve the problem for you:

./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
NUM_PORTS=1 NUM_HCAS=2 /root/OSU-benchmarks/osu_bw

Please let us know if the problem persists.

Thanks and best regards,

-- Abhinav


> Thanks in advance!
>
> Jim
>
>
>
> ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> /root/OSU-benchmarks/osu_bw
>
> # OSU MPI Bandwidth Test (Version 2.2)
> # Size          Bandwidth (MB/s)
> 1               0.284546
> 2               0.645845
> 4               1.159683
> 8               2.591093
> 16              4.963886
> 32              10.483747
> 64              20.685824
> 128             36.271862
> 256             78.276241
> 512             146.724578
> 1024            237.888853
> 2048            295.633345
> 4096            347.127837
> [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
>         code=VAPI_RETRY_EXC_ERR, vendor code=81
>         at line 2114 in file viacheck.c
>         Timeout alarm signaled
>         Cleaning up all processes ...done.
>
>
> My machine file is just the 2 hostnames:
>
> cat /root/hostfile
> vis460
> vis30
>
>
>
>
> ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:0D:60:98:20:B8
>           inet addr:9.2.12.221  Bcast:9.2.15.255  Mask:255.255.248.0
>           inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
>           TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:926406322 (883.4 MiB)  TX bytes:94330491 (89.9 MiB)
>           Interrupt:185
>
> ib0       Link encap:Ethernet  HWaddr 93:C9:C9:6F:5D:7C
>           inet addr:10.10.5.46  Bcast:10.10.5.255  Mask:255.255.255.0
>           inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:175 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:11144 (10.8 KiB)  TX bytes:11638 (11.3 KiB)
>
> ib2       Link encap:Ethernet  HWaddr 65:9A:4B:CF:8D:00
>           inet addr:12.12.5.46  Bcast:12.12.5.255  Mask:255.255.255.0
>           inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:257 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:15180 (14.8 KiB)  TX bytes:15071 (14.7 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:7521844 (7.1 MiB)  TX bytes:7521844 (7.1 MiB)
>
>
>
>


