[mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1 fails in heterogeneous fabrics

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Apr 2 16:10:17 EDT 2012


Thank you for the report.  We are working on a resolution to this issue.

On Mon, Apr 02, 2012 at 06:13:17PM +0000, Heinz, Michael William wrote:
> Basically, the problem is this: In version 1.7 of mvapich2, setting up handling of a mixed fabric was done before initialization of the IB queue pairs. This was done by calling rdma_ring_based_allgather() to collect information about the HCA types and then calling rdma_param_handle_heterogenity(). (See lines 250-270 of rdma_iba_init.c).
> 
> Working this way permitted each rank to correctly determine whether to create a shared receive queue or not.
> 
> Unfortunately, this ordering was removed in 1.7-r5140. In the new version, rdma_param_handle_heterogenity() is not called until *after* the shared receive queue has been created and the QPs have been moved to the ready-to-receive state. When rdma_param_handle_heterogenity() then turns the shared receive queue off, the queue pairs are left in an unusable state.
> 
> This problem affects fabrics using HCAs from IBM, older Tavor-style Mellanox HCAs and QLogic HCAs.
> 
> We've reviewed the changes and, unfortunately, we can't see a way to fix this without going back to using rdma_ring_based_allgather() to collect information about the HCA types before initializing the queue pairs. The workaround is to manually set MV2_USE_SRQ=0 when using mvapich2-1.7-r5140.
> 

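For readers following the thread, the ordering issue described in the quoted report can be sketched roughly as below. This is a minimal illustration and not MVAPICH2 source: hca_type_t, sketch_qp_t, decide_use_srq(), create_qp() and init_rank() are hypothetical names, the array stands in for the per-rank data that an allgather would return, and the rule "any mix of HCA types disables SRQ" is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical HCA type codes; these are not the MVAPICH2 definitions. */
typedef enum { HCA_TYPE_A, HCA_TYPE_B, HCA_TYPE_C } hca_type_t;

/* Hypothetical, simplified queue pair: records only what matters here. */
typedef struct {
    bool attached_to_srq;
    bool ready_to_receive;
} sketch_qp_t;

/*
 * Stand-in for the heterogeneity check. hca_types[] plays the role of the
 * per-rank data an allgather would return. Assumption: any mix of HCA
 * types disables the shared receive queue.
 */
static bool decide_use_srq(const hca_type_t *hca_types, size_t nranks)
{
    assert(hca_types != NULL && nranks > 0);
    for (size_t i = 1; i < nranks; i++) {
        if (hca_types[i] != hca_types[0]) {
            return false;
        }
    }
    return true;
}

/* The SRQ decision is an input to QP creation, so it is fixed from here on. */
static sketch_qp_t create_qp(bool use_srq)
{
    sketch_qp_t qp = { .attached_to_srq = use_srq, .ready_to_receive = true };
    return qp;
}

/* 1.7-style ordering: gather and decide first, then create the QP. */
static sketch_qp_t init_rank(const hca_type_t *hca_types, size_t nranks)
{
    return create_qp(decide_use_srq(hca_types, nranks));
}
```

In this sketch, calling init_rank() with a mixed array of HCA types yields a QP that was never attached to a shared receive queue, which is the behaviour the report attributes to the 1.7 ordering. Making the decision only after create_qp() has run would leave no way to express the change without rebuilding the QP.
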
-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

