[mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1 fails in heterogeneous fabrics
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Mon Apr 2 16:10:17 EDT 2012
Thank you for the report. We are working on a resolution to this issue.
On Mon, Apr 02, 2012 at 06:13:17PM +0000, Heinz, Michael William wrote:
> Basically, the problem is this: In version 1.7 of mvapich2, setting up handling of a mixed fabric was done before initialization of the IB queue pairs. This was done by calling rdma_ring_based_allgather() to collect information about the HCA types and then calling rdma_param_handle_heterogenity(). (See lines 250-270 of rdma_iba_init.c).
>
> Working this way permitted each rank to correctly determine whether to create a shared receive queue or not.
>
> Unfortunately, this was eliminated in 1.7-r5140. In the new version, rdma_param_handle_heterogenity() is not called until *after* the shared receive queue has been created and the QP has been moved to the ready-to-receive state. When rdma_param_handle_heterogenity() then turns the shared receive queue off, the queue pairs are left in an unusable state.
>
> This problem affects fabrics using HCAs from IBM, older Tavor-style Mellanox HCAs and QLogic HCAs.
>
> We've reviewed the changes and, unfortunately, we can't see a way to fix this without going back to using rdma_ring_based_allgather() to collect information about the HCA types before initializing the queue pairs. The workaround is to manually specify MV2_USE_SRQ=0 when using mvapich2-1.7-r5140.
>
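The ordering problem described in the quoted report can be sketched with a toy model in C. This is only an illustration under stated assumptions, not MVAPICH2 code: hca_kind, rank_state, handle_heterogeneity(), create_qp(), qp_usable(), init_check_first() and init_check_last() are all hypothetical names, and "unusable" is modeled simply as a queue pair that was built with a shared receive queue after the SRQ setting has since been turned off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical HCA categories; not MVAPICH2's real HCA type enum. */
typedef enum { HCA_SRQ_CAPABLE, HCA_NO_SRQ } hca_kind;

/* Hypothetical per-rank state for this toy model. */
typedef struct {
    bool use_srq;    /* stands in for the MV2_USE_SRQ setting */
    bool qp_has_srq; /* the QP was created attached to an SRQ */
    bool qp_ready;   /* the QP has been moved to ready-to-receive */
} rank_state;

/* Stand-in for the heterogeneity check: turn SRQ off if any
 * HCA in the gathered list cannot use it. */
void handle_heterogeneity(rank_state *r, const hca_kind *kinds, int n)
{
    for (int i = 0; i < n; i++) {
        if (kinds[i] == HCA_NO_SRQ) {
            r->use_srq = false;
        }
    }
}

/* Stand-in for QP setup: the QP is built according to whatever
 * the SRQ setting is at the moment of creation. */
void create_qp(rank_state *r)
{
    assert(!r->qp_ready); /* guard against creating the QP twice */
    r->qp_has_srq = r->use_srq;
    r->qp_ready = true;
}

/* In this model a QP is usable only if the way it was built still
 * matches the current SRQ setting. */
bool qp_usable(const rank_state *r)
{
    return r->qp_ready && r->qp_has_srq == r->use_srq;
}

/* Ordering described for 1.7: check the fabric, then create the QP. */
bool init_check_first(const hca_kind *kinds, int n)
{
    rank_state r = { .use_srq = true };
    handle_heterogeneity(&r, kinds, n);
    create_qp(&r);
    return qp_usable(&r);
}

/* Ordering described for 1.7-r5140: create the QP, then check. */
bool init_check_last(const hca_kind *kinds, int n)
{
    rank_state r = { .use_srq = true };
    create_qp(&r);
    handle_heterogeneity(&r, kinds, n);
    return qp_usable(&r);
}
```

In this model, a list of HCA kinds that contains HCA_NO_SRQ leaves the queue pair usable under init_check_first() but not under init_check_last(), while a list containing only HCA_SRQ_CAPABLE entries is fine under both orderings. Starting with use_srq already false, which is what the MV2_USE_SRQ=0 workaround amounts to here, would also make both orderings agree.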
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo