[mvapich-discuss] Dual port HCA back-to-back woes

Constantinos Evangelinos ce107 at MIT.EDU
Thu Feb 8 14:13:35 EST 2007


Hi - we have two quad-socket Opteron systems, each with a Voltaire HCA 400Ex,
connected directly back to back, which I realize is an unusual configuration.
Since Voltaire will not support back-to-back with OpenFabrics, we are running
a specific earlier, Verbs-based Voltaire GridStack release with the only
firmware level for the cards that Voltaire supports for single-port setups.
Using minism as the subnet manager running on one of the nodes, I have been
able to use this back-to-back setup with a pair of HCA 410Ex-Ds (which I was
initially sent by mistake) as well as with the 400Exs. In this setup one port
is in the PORT_ACTIVE state while the other is in the PORT_INITIALIZE state,
as minism reports "Status: Port not discovered" for the 2nd port. If I start
minism with a "-p 2" argument, the roles are reversed and port 1 is the one
that is not discovered.

The Voltaire-distributed MVAPICH, OpenMPI, and MVAPICH 0.9.8 built for a
single port all work fine with this half-active configuration, though of
course at half the potential speed for large messages.
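
For reference, here is a rough sketch of the nominal arithmetic behind "half
the potential speed". It assumes 4x SDR links on both ports (the link width is
an assumption, not something stated above) and ignores protocol and host-bus
overheads, so it is a back-of-the-envelope figure rather than a measurement.
The function name is made up for illustration.

```python
# Nominal InfiniBand link arithmetic; a sketch, not a measurement.
# Assumption: 4x link width (not stated in the post).
SDR_LANE_SIGNALING_GBPS = 2.5   # SDR signaling rate per lane
LANES = 4                       # assumed 4x link
ENCODING_EFFICIENCY = 8 / 10    # 8b/10b line encoding used by SDR links


def nominal_data_rate_gbps(active_ports: int) -> float:
    """Aggregate nominal data rate across the active SDR ports."""
    per_port = SDR_LANE_SIGNALING_GBPS * LANES * ENCODING_EFFICIENCY
    return active_ports * per_port


if __name__ == "__main__":
    one = nominal_data_rate_gbps(1)   # half-active: one port in PORT_ACTIVE
    two = nominal_data_rate_gbps(2)   # both ports active
    print(f"one active port : {one:.1f} Gb/s")
    print(f"two active ports: {two:.1f} Gb/s")
    print(f"ratio           : {one / two:.2f}")
```

Under these assumptions one active port gives a nominal 8 Gb/s and two give
16 Gb/s; what large messages actually achieve will be lower than either.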

Having recompiled MVAPICH 0.9.8 with support for SDR/dual port, I cannot use
it with this setup (one port active, the other in the initialize state), as I
get the following error:

[-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in file viainit.c
[-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in file viainit.c

If I start minism again with "-p 2" (on the same node or on the other node,
it makes no difference) and trick the system into bringing the second port up
to the active state as well, I still have the same problem. The best result
that a lot of experimentation, with recompilation upon recompilation, got me
was a setup where I could use what appeared to be both ports, but only one
process on either side could be involved in MPI communications. That negates
the usefulness of such an approach: the reason I preferred dual SDR to single
DDR was to have more bandwidth between the nodes when more than one processor
on each side is communicating.

OpenMPI works fine in the half-active configuration, but it fails to initiate
communications and hangs when both ports are tricked into being active
concurrently.

I realize that this is an unusual setup, and it may be that minism simply
cannot support it, in which case no fault lies with either MPI
implementation. Does anyone know whether opensm and OpenFabrics would do any
better (if I were to take the plunge and try a configuration completely
unsupported by Voltaire)?

Thanks in advance for any help.

Constantinos
-- 
Dr. Constantinos Evangelinos                    Room 54-1518, EAPS/MIT
Earth, Atmospheric and Planetary Sciences       77 Massachusetts Avenue
Massachusetts Institute of Technology           Cambridge, MA 02139
+1-617-253-5259/+1-617-253-4464 (fax)           USA


