[mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1 fails in heterogeneous fabrics

Rimmer, Todd todd.rimmer at intel.com
Thu Apr 12 10:25:34 EDT 2012


Devendar,

I just wanted to follow up, in case Mike Heinz hasn't had a chance to.

Thank you very much for the patch. We will give it a try soon, but it might be a couple more days before we get a chance to do so.

We are looking into a different, high-priority problem observed with mvapich2 1.7-6 as included in OFED 1.5.4.1.

The problem occurs when building mvapich2 with the PGI compiler. In this case, collectives such as barrier and reduce intermittently hang during tests such as IMB. We have narrowed the problem down to intra-node operations via shared memory regions in conjunction with PGI compiler optimizations. When mvapich2 is built with -O2 (the default per the .spec and configure files) the failure occurs; if we turn off optimization, it does not. At least two functions of interest are MPIR_Reduce_shmem_MV2 and MPIR_shmem_barrier_MV2 in barrier_osu.c.

The problem is specific to the PGI compiler; we are using PGI 10.5. The Intel and GNU compilers do not expose this issue.

Our suspicion is that an aggressive optimization in the intra-node code optimizes out some subtle aspect of the shared-memory polling and testing, causing a race.

It seems these functions were not used in mvapich2 1.6, which did not have this issue.

Are you aware of any problems such as this?

Todd Rimmer
IB Product Architect 
Fabric Products Division
Voice: 610-233-4852     Fax: 610-233-4777
Todd.Rimmer at intel.com


> -----Original Message-----
> From: Devendar Bureddy [mailto:bureddy at cse.ohio-state.edu]
> Sent: Monday, April 09, 2012 11:28 AM
> To: Heinz, Michael William
> Cc: mvapich-discuss at cse.ohio-state.edu; Marciniszyn, Mike; Rimmer, Todd
> Subject: Re: [mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1
> fails in heterogeneous fabrics
> 
> Hi Michael
> 
> Can you please try the attached patch with the latest 1.7 nightly tarball and see if
> this issue is resolved with it?
> 
> Please follow below instructions for applying the patch:
> 
> $tar xvf mvapich2-latest.tar.gz
> $cd mvapich2-1.7-r5225
> $patch -p0 < diff.patch
> 
> -Devendar
> 
> On Mon, Apr 2, 2012 at 2:13 PM, Heinz, Michael William
> <michael.william.heinz at intel.com> wrote:
> > Basically, the problem is this: In version 1.7 of mvapich2, setting up handling
> of a mixed fabric was done before initialization of the IB queue pairs. This
> was done by calling rdma_ring_based_allgather() to collect information
> about the HCA types and then calling rdma_param_handle_heterogenity().
> (See lines 250-270 of rdma_iba_init.c).
> >
> > Working this way permitted each rank to correctly determine whether to
> create a shared receive queue or not.
> >
> > Unfortunately, this was eliminated in 1.7-r5140. In the new version,
> rdma_param_handle_heterogenity() is not called until *after* the shared
> receive queue has already been created and the QP has been moved to the
> ready-to-receive state - and when rdma_param_handle_heterogenity()
> turns the shared receive queue off, the queue pairs are left in an unusable
> state.
> >
> > This problem affects fabrics using HCAs from IBM, older Tavor-style
> Mellanox HCAs, and QLogic HCAs.
> >
> > We've reviewed the changes and, unfortunately, we can't see a way to fix
> this without going back to using rdma_ring_based_allgather() to collect
> information about the HCA types before initializing the queue pairs. The
> workaround is to manually specify MV2_USE_SRQ=0 when using mvapich2-
> 1.7-r5140.
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> 
> 
> --
> Devendar
