[mvapich-discuss] non-uniform IB connectivity

NiftyCluster Tom Mitchell niftycluster at niftyegg.com
Sat Nov 7 19:26:40 EST 2009


On 11/6/09, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:
> Hi Bron,
>
> Thanks for your note.
>
>> I'm new to this list, so sorry if this has been discussed before.
>> My site has two different clusters, each of which has dual
>> fabrics of connectivity, i.e. the IB cards have 2 ports, and
>> each cluster has 2 separate fabrics: ib0 (connected to the first
>> port), and ib1 (connected to the second port), connecting the
>> nodes within each cluster.
>>
>> We are now connecting the two clusters together, but for various
>> reasons, the inter-cluster connections are only available over the
>> ib1 fabric from each of the sub-clusters.

Do check your TCP/IP Ethernet connectivity: netmasks, host names, etc.
It is possible that your two fabrics were previously isolated in part
by the way that jobs were launched over the TCP/IP connections
on the nodes.
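
As a rough illustration of the netmask check, here is a minimal Python
sketch that tests whether two interface addresses land in the same IP
network. The addresses and prefixes are made up for the example, not
taken from your site, and same_subnet is just a helper name I picked.

```python
import ipaddress

def same_subnet(addr_a, addr_b):
    """Return True if two 'address/prefix' strings are in the same IP network."""
    net_a = ipaddress.ip_interface(addr_a).network
    net_b = ipaddress.ip_interface(addr_b).network
    return net_a == net_b

# Hypothetical addresses: one node from each cluster.
print(same_subnet("10.1.0.5/16", "10.1.3.7/16"))  # True: a /16 covers both
print(same_subnet("10.1.0.5/24", "10.1.3.7/24"))  # False: a /24 separates them
```

If the two clusters only looked separate because of a narrow netmask,
widening it (or fixing host name resolution) changes which nodes a job
launcher can reach.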

It only takes a single IB cable between two fabrics to merge them.
As soon as the subnet manager has eliminated duplicate LIDs,
you have a single IB fabric.  Make sure the subnet manager is healthy.
In some cases an embedded subnet manager chokes when it has too many
nodes to manage.  Merging two fabrics might push the node count past
what it can handle.
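
To make the duplicate-LID point concrete, here is a small Python sketch
that finds LIDs claimed by ports in more than one fabric. It does not
talk to any hardware; the port names and LID values are invented, and in
practice the inventory would come from your own fabric diagnostics.

```python
from collections import defaultdict

def find_duplicate_lids(*fabrics):
    """Map each LID claimed by more than one port to the ports that claim it."""
    owners = defaultdict(list)
    for fabric in fabrics:
        for port, lid in fabric.items():
            owners[lid].append(port)
    return {lid: ports for lid, ports in owners.items() if len(ports) > 1}

# Hypothetical inventories: port name -> LID, one dict per formerly separate fabric.
cluster_a = {"a-node1": 1, "a-node2": 2, "a-node3": 3}
cluster_b = {"b-node1": 2, "b-node2": 3, "b-node3": 7}

print(find_duplicate_lids(cluster_a, cluster_b))
```

Every LID that shows up in the result is one the subnet manager has to
reassign before the merged fabric is usable.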

In other words, the management tools and configurations that once kept
the two clusters healthy or separated may still be getting in the way.


-- 
        NiftyCluster
        T o m   M i t c h e l l

