[mvapich-discuss] Dual port HCA back-to-back woes

Abhinav Vishnu vishnu at cse.ohio-state.edu
Fri Feb 9 10:41:26 EST 2007


Dr. Constantinos,

Thanks for using MVAPICH and reporting the problem to us.

> Hi - we have two quad socket Opteron systems, each with a Voltaire HCA 400Ex  
> connected directly back to back which I realise is an unusual configuration. 
> Since Voltaire will not support back-to-back with OpenFabrics we are running 
> the specific earlier Verbs-based Voltaire GridStack with the only firmware 
> level for the cards that Voltaire supports for single port setups. Using 
> minism as the session manager running on one of the nodes, I have been able 
> to use this back-to-back setup with a pair of HCA 410Ex-Ds (I was initially 
> sent by mistake) as well as the 400Exs. In that case one port is in the 
> PORT_ACTIVE state while the other in the PORT_INITIALIZE state as minism will 
> claim "Status: Port not discovered" for the 2nd port. If I start minism with 
> a "-p 2" argument then the roles are reversed as port 1 is not discovered.
> 
> The Voltaire distributed MVAPICH, OpenMPI and MVAPICH 0.9.8 built for a single 
> port work fine with this half active configuration at half the potential 
> speed for large messages of course.
> 
> Having recompiled MVAPICH 0.9.8 with support for SDR/dual port I cannot use it 
> with this setup (one port active, the other one in the initialize state) as I 
> get the following error:
> 
> [-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in 
> file viainit.c
> [-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in 
> file viainit.c

In our testing with the VAPI multi-rail device, we did not encounter this
problem, and we are not sure why it is occurring on your machines. Could
you please let us know which MPI test you are using?

Also, we have not used minism for quite some time now (about 3 years).
During that time, we have been using the opensm distributed with IB Gold
(from Mellanox) and the opensm distributed with OFED for the OpenFabrics
drivers.

> 
> If I start minism again with "-p 2" (on the same node or even the other node, 
> it does not make any difference) and trick the system to bring the second 
> port up to the active state as well I still have the same problem. The best 
> situation a lot of experimentation with recompilation upon recompilation 
> landed me was a setup where I could use what appeared to be both ports but 
> only one process on either side could be involved in MPI communications, 
> thereby negating any usability of such an approach (the reason I preferred 
> dual SDR to single DDR was to have more bandwidth between the nodes when more 
> than one processor on each side is communicating). 
> 
> OpenMPI will work fine in the half-active configuration but will not initiate 
> communications successfully and hangs when both ports are tricked into 
> becoming active concurrently. 
> 
> I realize that this is an unusual setup and it may be that minism will not be 
> able to support such a setup and no fault lies with either MPI 
> implementation. Do we know whether opensm and OpenFabrics would do any better 
> (if I were to take the plunge and try a completely unsupported by Voltaire 
> configuration)?

In our lab, we have run opensm on a single node, bound to different ports
of the HCA, and it works without problems.
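
Once both ports are meant to be up, it can help to confirm their state
before starting an MPI job. The following is only a minimal sketch, not
part of MVAPICH or OFED. It assumes the OpenFabrics (gen2) kernel stack is
loaded and exposes each port's state under
/sys/class/infiniband/<device>/ports/<port>/state as text such as
"4: ACTIVE" or "2: INIT". The function names and the device name "mthca0"
are hypothetical and chosen for illustration.

```python
import os


def parse_port_state(text):
    """Parse a sysfs port state string such as '4: ACTIVE' into (code, name)."""
    code, _, name = text.strip().partition(":")
    return int(code), name.strip()


def port_states(device, sysfs_root="/sys/class/infiniband"):
    """Return {port_number: (code, name)} for every port of `device`.

    `device` is an HCA name as listed under sysfs_root (for example
    'mthca0', which is an assumption here, not a name from this thread).
    """
    ports_dir = os.path.join(sysfs_root, device, "ports")
    states = {}
    for port in sorted(os.listdir(ports_dir), key=int):
        with open(os.path.join(ports_dir, port, "state")) as f:
            states[int(port)] = parse_port_state(f.read())
    return states


def all_active(states):
    """True only if at least one port was found and every port is ACTIVE."""
    return bool(states) and all(name == "ACTIVE" for _, name in states.values())
```

Calling all_active(port_states("mthca0")) on each node would then tell you
whether the subnet manager has brought both ports to the active state, as
opposed to the one-active, one-initialize situation you describe.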

I would strongly recommend that you download OFED from the OpenFabrics
website. For reference, the link is:

http://www.openfabrics.org/downloads.html

Please use the OFED-1.1 tarball to build the OFED modules and userspace
libraries. Once that step is complete, build MVAPICH with the
make.mvapich.gen2 script in the MVAPICH-0.9.8 top-level directory.
To build the multi-rail version, use the make.mvapich.gen2_multirail
script instead. For more detailed build instructions, please refer to
sections 4.4.1 and 4.4.4 of the user guide at the following URL:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/user_guide.html

Please report back to us any problems you encounter while compiling or
running your MPI programs.

Thanks again,

:- Abhinav



> 
> Thanks for any help in advance.
> 
> Constantinos
> -- 
> Dr. Constantinos Evangelinos                    Room 54-1518, EAPS/MIT
> Earth, Atmospheric and Planetary Sciences       77 Massachusetts Avenue
> Massachusetts Institute of Technology           Cambridge, MA 02139
> +1-617-253-5259/+1-617-253-4464 (fax)           USA
