[mvapich-discuss] Differing IB interfaces

Lundrigan, Adam LundriganA at DFO-MPO.GC.CA
Thu Jul 26 22:06:33 EDT 2007


We're using MVAPICH2 with InfiniBand on a 5-node Sun/Solaris cluster
(Sun Fire x4100/x4200), and are having a problem with consistency in the
naming of our ibd interfaces.  On the x4100 nodes, the IPoIB interface
is ibd0.  However, on the head node (x4200), the interface is ibd2.
We've tried everything short of wiping the machine and reinstalling the
OS to force one of the two HCAs to come up as ibd0, but thus far we have
failed.  The only instance names Solaris seems willing to use are ibd2,
ibd3, ibd6 and ibd7 (we have two cards with four ports in that node).
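
(In case it's relevant, the instance assignments can be inspected with
something like the commands below.  This is just a sketch, with our
actual output omitted, using the stock Solaris 10 tools:)

:~> ifconfig -a | grep ibd            # which ibd interfaces are plumbed on this node
:~> grep '"ibd"' /etc/path_to_inst    # device-path-to-instance mapping for the ibd driver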

 

Long story short:  Is there a way to force each instance of mpd to
connect using a different DAPL_PROVIDER?  We've tried setting the
environment variable separately on each node to the proper value, but
that doesn't seem to work.  When I compile MVAPICH2 with DAPL_PROVIDER
set to ibd0, everything works fine if we restrict the ring to just the
4 nodes which use ibd0 as their adapter.  However, when we set the
DAPL_PROVIDER variable to ibd2 for the head node (in both
~user/.profile and /etc/profile), I get the following:

 

:~/osu_benchmarks> mpiexec -l -np 16 /export/home/noofs/osu_benchmarks/osu_latency.e
4: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
8: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
0: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
12: [rdma_udapl_priv.c:649] error(-2146828288): Cannot open IA
rank 12 in job 2  CNOOFS01_41436   caused collective abort of all ranks
  exit status of rank 12: killed by signal 9
rank 8 in job 2  CNOOFS01_41436   caused collective abort of all ranks
  exit status of rank 8: killed by signal 9
rank 4 in job 2  CNOOFS01_41436   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9
rank 0 in job 2  CNOOFS01_41436   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

 

I set up the ring to cover 4 nodes (16 processors), and the "Cannot
open IA" message occurs exactly 4 times, once for each processor on
the node which doesn't use ibd0 as its interface.  It seems to be a
problem with the profile not being executed on each node when the
ranks are doled out:

 

:~> for i in 1 3 4 5; do ssh noofs@CNOOFS0${i}-IB "echo \$DAPL_PROVIDER"; done;
<blank>
<blank>
<blank>
<blank>

 

The above loop just connects to each node and prints the value of
DAPL_PROVIDER there, which comes back empty because /etc/profile isn't
sourced when we SSH in non-interactively like this.

Modifying the command just slightly to source the profile first gives
the expected result:

 

:~> for i in 1 3 4 5; do ssh noofs@CNOOFS0${i}-IB "source /etc/profile && echo \$DAPL_PROVIDER"; done;
ibd2
ibd0
ibd0
ibd0
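
One workaround I've been toying with (just a sketch; it assumes that
ranks inherit the environment of the mpd daemon on their node, which I
haven't verified, and it leans on mpd's -h/-p/--daemon options) is to
skip the profile entirely and export the variable by hand when bringing
up the ring:

:~> DAPL_PROVIDER=ibd2; export DAPL_PROVIDER; mpd &      # head node's mpd sees ibd2
:~> mpdtrace -l                                          # note the head node's mpd port
:~> for i in 3 4 5; do ssh noofs@CNOOFS0${i}-IB "DAPL_PROVIDER=ibd0; export DAPL_PROVIDER; mpd -h CNOOFS01-IB -p <port> --daemon"; done;

(<port> above is a placeholder for whatever port mpdtrace -l reports on
the head node.)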

 

Long story short:  Is there a way to force each instance of mpd to
connect using a different DAPL_PROVIDER? Or a way to force Solaris to
rename an interface to ibd0?
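
The closest thing I've spotted so far is mpiexec's per-group -env option
with the colon syntax, along the lines of the command below.  It's only
a sketch, though: I don't know whether the first group of 4 ranks is
guaranteed to land on the head node, or whether our build even reads
DAPL_PROVIDER from the process environment at run time rather than only
at configure time.

:~/osu_benchmarks> mpiexec -l -n 4 -env DAPL_PROVIDER ibd2 /export/home/noofs/osu_benchmarks/osu_latency.e : -n 12 -env DAPL_PROVIDER ibd0 /export/home/noofs/osu_benchmarks/osu_latency.e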

 

Thanks in advance,

--

Adam Lundrigan

Computer Systems Programmer

Biological & Physical Oceanography Section

Science, Oceans & Environment Branch

Department of Fisheries and Oceans Canada

Northwest Atlantic Fisheries Centre 

St. John's, NL    A1C 5X1

 

Tel: (709) 772-8136

Fax: (709) 772-8138

Cell: (709) 277-4575

Office:  G10-117J

Email: LundriganA at dfo-mpo.gc.ca

 
