[mvapich-discuss] help getting multirail working

Abhinav Vishnu vishnu at cse.ohio-state.edu
Fri Apr 14 11:52:30 EDT 2006


Hi James,

Thanks for using multirail MVAPICH and reporting the problem.

There are several parameters that define the number of ports per HCA and
the number of HCAs to be used for communication. By default, the number of
ports per HCA is 2. This can be changed using the environment variable
NUM_PORTS. Since you are using one port per HCA and two HCAs, I would
recommend setting NUM_PORTS=1 and NUM_HCAS=2.
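
For example, adapting the mpirun_rsh command from your mail (mpirun_rsh
should pick up VAR=value pairs placed before the executable; exporting the
variables in your shell before launching should work as well):

  ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile NUM_PORTS=1 NUM_HCAS=2 \
      /root/OSU-benchmarks/osu_bw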

Please let me know if the problem persists.

With best regards,

-- Abhinav
-------------------------------
Abhinav Vishnu,
Graduate Research Associate,
Department Of Comp. Sc. & Engg.
The Ohio State University.
-------------------------------

On Fri, 14 Apr 2006, James T Klosowski wrote:

> Hi,
>
> I'm trying to get the multirail feature working but have not had any
> success.  I have not found much documentation on how to do it.  If you can
> point me to some, I'd appreciate it.
>
>
> My current configuration is simply 2 nodes, each with 2 HCAs (MT23108).  I
> downloaded the MVAPICH 0.9.7 version (for VAPI) and compiled it using the
> TopSpin stack (3.1.0-113).
>
> I'm running on RHEL 4 U1 machines.  In one machine, both HCAs are on
> different PCI-X 133 MHz buses; in the other machine, one HCA is on a 133 MHz
> bus and the other is on a 100 MHz bus.
>
>
> I first compiled using make.mvapich.vapi and was able to run the OSU
> benchmarks without any problems.
>
> I then compiled successfully using make.mvapich.vapi_multirail, but when I
> tried to run the OSU benchmarks, I get VAPI_RETRY_EXC_ERR midway through
> the benchmark, presumably when the code finally tries to use the
> 2nd rail.
>
> Below is the output of my benchmark run.  The failure is consistent: it
> always occurs after the 4096-byte test.  Again, using the version compiled
> without multirail support works just fine (without changing anything other
> than the version of mvapich I'm using).
>
> If you have any suggestions on what to try, I'd appreciate it.  I'm not
> exactly sure how I should set up the IP addresses... so I included that
> information below too.  I am using only one port on each of the two HCAs,
> and all four cables connect to the same TopSpin TS120 switch.
>
> I suspect a configuration problem on my part, but short of that, I was
> also thinking of trying the IBGD code from Mellanox.
>
>
> Thanks in advance!
>
> Jim
>
>
>
> ./mpirun_rsh -rsh -np 2 -hostfile /root/hostfile
> /root/OSU-benchmarks/osu_bw
>
> # OSU MPI Bandwidth Test (Version 2.2)
> # Size          Bandwidth (MB/s)
> 1               0.284546
> 2               0.645845
> 4               1.159683
> 8               2.591093
> 16              4.963886
> 32              10.483747
> 64              20.685824
> 128             36.271862
> 256             78.276241
> 512             146.724578
> 1024            237.888853
> 2048            295.633345
> 4096            347.127837
> [0] Abort: [vis460.watson.ibm.com:0] Got completion with error,
>         code=VAPI_RETRY_EXC_ERR, vendor code=81
>         at line 2114 in file viacheck.c
>         Timeout alarm signaled
>         Cleaning up all processes ...done.
>
>
> My machine file is just the 2 hostnames:
>
> cat /root/hostfile
> vis460
> vis30
>
>
>
>
> ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:0D:60:98:20:B8
>           inet addr:9.2.12.221  Bcast:9.2.15.255  Mask:255.255.248.0
>           inet6 addr: fe80::20d:60ff:fe98:20b8/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:9787508 errors:841 dropped:0 overruns:0 frame:0
>           TX packets:1131808 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:926406322 (883.4 MiB)  TX bytes:94330491 (89.9 MiB)
>           Interrupt:185
>
> ib0       Link encap:Ethernet  HWaddr 93:C9:C9:6F:5D:7C
>           inet addr:10.10.5.46  Bcast:10.10.5.255  Mask:255.255.255.0
>           inet6 addr: fe80::6bc9:c9ff:fe66:c15b/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:175 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:174 errors:0 dropped:18 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:11144 (10.8 KiB)  TX bytes:11638 (11.3 KiB)
>
> ib2       Link encap:Ethernet  HWaddr 65:9A:4B:CF:8D:00
>           inet addr:12.12.5.46  Bcast:12.12.5.255  Mask:255.255.255.0
>           inet6 addr: fe80::c19a:4bff:fed2:f3a0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:257 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:235 errors:0 dropped:30 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:15180 (14.8 KiB)  TX bytes:15071 (14.7 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:14817 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:14817 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:7521844 (7.1 MiB)  TX bytes:7521844 (7.1 MiB)
>
>
>
>


