[mvapich-discuss] Differing IB interfaces
Lundrigan, Adam
LundriganA at DFO-MPO.GC.CA
Thu Jul 26 22:06:33 EDT 2007
We're using MVAPICH2 with Infiniband on a 5-node Sun/Solaris cluster
(Sun Fire x4100/x4200), and are having a problem with consistency in the
naming of our ibd interfaces. On the x4100 nodes, the IPoIB interface
is ibd0. However, on the head node (x4200), the interface is ibd2.
We've tried everything short of wiping the machine and reinstalling the
OS to force one of the two HCAs to come up as ibd0, but thus far we have
failed. The only instance names Solaris will use are ibd2, ibd3, ibd6 and
ibd7 (we have 2 cards with 4 ports in that node).
In short: is there a way to force each instance of mpd to
connect using a different DAPL_PROVIDER? We've tried setting the
environment variable on each node separately to the proper value, but that
doesn't seem to work. When I compile MVAPICH2 with DAPL_PROVIDER set to
ibd0, everything works fine if we restrict the ring to just the 4 nodes
which use ibd0 as their adapter. However, when we set the DAPL_PROVIDER
variable to ibd2 for the head node (in both ~user/.profile and
/etc/profile), I get the following:
:~/osu_benchmarks> mpiexec -l -np 16
/export/home/noofs/osu_benchmarks/osu_latency.e
4: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
8: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
0: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
12: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
rank 12 in job 2 CNOOFS01_41436 caused collective abort of all ranks
exit status of rank 12: killed by signal 9
rank 8 in job 2 CNOOFS01_41436 caused collective abort of all ranks
exit status of rank 8: killed by signal 9
rank 4 in job 2 CNOOFS01_41436 caused collective abort of all ranks
exit status of rank 4: killed by signal 9
rank 0 in job 2 CNOOFS01_41436 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
I set up the ring to cover 4 nodes (16 processors), and the "Cannot
open IA" message occurs exactly 4 times, once for each processor on
the node which doesn't use ibd0 as its interface. It seems to be a
problem with the profile not being executed on each node when the ranks
are doled out:
:~> for i in 1 3 4 5; do ssh noofs@CNOOFS0${i}-IB "echo
\$DAPL_PROVIDER"; done;
<blank>
<blank>
<blank>
<blank>
The above loop just connects to each node and prints the current value of
DAPL_PROVIDER, which comes back empty because /etc/profile isn't being
sourced for a non-interactive SSH command.
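For what it's worth, sshd can also inject per-user variables into exactly this kind of non-interactive command, but only if the server allows it: each node's sshd_config would need "PermitUserEnvironment yes" (off by default), and then sshd exports whatever is in that user's ~/.ssh/environment before running the remote command. A sketch of the file format, written to a scratch file here so as not to clobber anything real:

```shell
# sshd reads ~/.ssh/environment (one VAR=value per line) before running
# the remote command -- no profile sourcing needed.  Demonstrated
# against a stand-in file; on a real node this would be
# ~/.ssh/environment, with ibd2 on the head node instead of ibd0.
envfile=./ssh_environment.demo
echo 'DAPL_PROVIDER=ibd0' > "$envfile"
cat "$envfile"
```

With that in place on every node, `ssh noofs@CNOOFS01-IB 'echo $DAPL_PROVIDER'` should print the node-local value even though no profile runs.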
Modifying the command slightly to source the profile first gives the
expected result:
:~> for i in 1 3 4 5; do ssh noofs@CNOOFS0${i}-IB "source /etc/profile
&& echo \$DAPL_PROVIDER"; done;
ibd2
ibd0
ibd0
ibd0
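Since sourcing the profile fixes the lookup, one workaround I'm considering is to have mpiexec launch a tiny wrapper instead of the benchmark directly, so each rank sources its own node's /etc/profile before exec'ing the real binary. An untested sketch (the wrapper name is mine):

```shell
# with_profile.sh: source the node-local profile (which sets
# DAPL_PROVIDER to the right value for that node), then replace
# ourselves with the real program.
cat > with_profile.sh <<'EOF'
#!/bin/sh
[ -r /etc/profile ] && . /etc/profile
exec "$@"
EOF
chmod +x with_profile.sh

# Quick sanity check that arguments pass through:
./with_profile.sh /bin/echo ok
```

The run would then become something like `mpiexec -l -np 16 ./with_profile.sh /export/home/noofs/osu_benchmarks/osu_latency.e`, with the wrapper copied to the same path on every node.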
Long story short: Is there a way to force each instance of mpd to
connect using a different DAPL_PROVIDER? Or a way to force Solaris to
rename an interface to ibd0?
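Another thought, if the mpiexec that ships with mpd supports per-argument-set environments the way MPICH2's does (-env plus ':' separators — I haven't verified this on our build, so treat it as an assumption): the head node's ranks could get their own DAPL_PROVIDER on the command line, assuming rank placement puts the head node's 4 processes in the first set (e.g. via mpdboot host ordering):

```shell
# First set: 4 ranks on the head node (ibd2); second set: 12 ranks on
# the x4100 nodes (ibd0).  Path is the same benchmark binary as above.
mpiexec -l \
  -n 4  -env DAPL_PROVIDER ibd2 /export/home/noofs/osu_benchmarks/osu_latency.e : \
  -n 12 -env DAPL_PROVIDER ibd0 /export/home/noofs/osu_benchmarks/osu_latency.e
```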
Thanks in advance,
--
Adam Lundrigan
Computer Systems Programmer
Biological & Physical Oceanography Section
Science, Oceans & Environment Branch
Department of Fisheries and Oceans Canada
Northwest Atlantic Fisheries Centre
St. John's, NL A1C 5X1
Tel: (709) 772-8136
Fax: (709) 772-8138
Cell: (709) 277-4575
Office: G10-117J
Email: LundriganA at dfo-mpo.gc.ca