[mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1 fails in heterogeneous fabrics

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Apr 12 13:00:04 EDT 2012


Hello Todd,

Can you try to reproduce this with the latest PGI compiler?  10.5 is quite
old, and this could have been caused by a compiler bug that has since
been fixed.  I recommend this since we haven't seen this problem (we're
currently using PGI 12.3).

Can you also provide more details about how to reproduce this using IMB?
How many processes are being used?  Is this race condition hit
frequently?

This morning the 1.7 branch was updated with a few fixes we've developed
over the past month.  Perhaps you can try this branch and see if this
error is still reproducible.

If you'd like to try out the patch directly, you can retrieve it using
the following command.

svn diff -c5391 http://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/branches/1.7

On Thu, Apr 12, 2012 at 02:25:34PM +0000, Rimmer, Todd wrote:
> Devendar,
> 
> I just wanted to follow up, in case Mike Heinz hasn't had a chance to.
> 
> Thank you very much for the patch.  We will give it a try soon, but it may be a couple more days before we get a chance to do so.
> 
> We are looking into a different high-priority problem observed with mvapich2 1.7-6 as included in OFED 1.5.4.1.
> 
> The problem occurs when building mvapich2 with the PGI compiler.  In this case, collectives such as barrier and reduce intermittently hang during tests such as IMB.  We have narrowed the problem down to intra-node operations via shared memory regions in conjunction with PGI compiler optimizations.  When mvapich2 is built with -O2 (the default per the .spec and configure files) the failure occurs; if we turn off optimization, the problem does not occur.  At least two functions of interest are MPIR_Reduce_shmem_MV2 and MPIR_shmem_barrier_MV2 in barrier_osu.c.
> 
> The problem is specific to the PGI compiler; we are using PGI 10.5.  The Intel and GNU compilers do not expose this issue.
> 
> Our suspicion is that an aggressive optimization in the intra-node code causes some subtle aspect of the shared memory polling and testing to be optimized out, resulting in a race.
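> 
> As an illustration only (a hypothetical sketch, not the actual MVAPICH2 code), a shared-memory poll of the following shape is exactly the kind of loop an aggressive optimizer can break: if the flag is not qualified volatile (or protected by an equivalent compiler barrier), the compiler may hoist the load out of the loop and spin forever:
> 
> /* Hypothetical sketch of the suspected failure mode.  A peer process
>  * sets 'flag' in a shared-memory region; the 'volatile' qualifier
>  * forces a fresh load on every iteration.  Without it, the compiler
>  * is entitled to read the flag once and turn the poll into an
>  * infinite loop. */
> #include <stdint.h>
> 
> static inline void wait_for_peer(volatile uint32_t *flag, uint32_t expected)
> {
>     while (*flag != expected)
>         ;   /* each iteration re-reads the shared flag */
> }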
> 
> It seems these functions were not previously used in mvapich2 1.6 (which did not have this issue).
> 
> Are you aware of any problems such as this?
> 
> Todd Rimmer
> IB Product Architect 
> Fabric Products Division
> Voice: 610-233-4852     Fax: 610-233-4777
> Todd.Rimmer at intel.com
> 
> 
> > -----Original Message-----
> > From: Devendar Bureddy [mailto:bureddy at cse.ohio-state.edu]
> > Sent: Monday, April 09, 2012 11:28 AM
> > To: Heinz, Michael William
> > Cc: mvapich-discuss at cse.ohio-state.edu; Marciniszyn, Mike; Rimmer, Todd
> > Subject: Re: [mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1
> > fails in heterogeneous fabrics
> > 
> > Hi Michael
> > 
> > Can you please try the attached patch with the latest 1.7 nightly tarball and see
> > if this issue is resolved with it?
> > 
> > Please follow the instructions below to apply the patch:
> > 
> > $tar xvf mvapich2-latest.tar.gz
> > $cd mvapich2-1.7-r5225
> > $patch -p0 < diff.patch
> > 
> > -Devendar
> > 
> > On Mon, Apr 2, 2012 at 2:13 PM, Heinz, Michael William
> > <michael.william.heinz at intel.com> wrote:
> > > Basically, the problem is this: in version 1.7 of mvapich2, setting up handling
> > > of a mixed fabric was done before initialization of the IB queue pairs.  This
> > > was done by calling rdma_ring_based_allgather() to collect information
> > > about the HCA types and then calling rdma_param_handle_heterogenity().
> > > (See lines 250-270 of rdma_iba_init.c.)
> > >
> > > Working this way permitted each rank to correctly determine whether to
> > > create a shared receive queue or not.
> > >
> > > Unfortunately, this was eliminated in 1.7-r5140.  In the new version,
> > > rdma_param_handle_heterogenity() is not called until *after* the shared
> > > receive queue has already been created and the QPs have been moved to the
> > > ready-to-receive state - and when rdma_param_handle_heterogenity()
> > > turns the shared receive queue off, the queue pairs are left in an unusable
> > > state.
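> > >
> > > To see why the ordering matters, here is a minimal ibverbs sketch (hypothetical, not the MVAPICH2 source): a QP's SRQ association is fixed in the init attributes passed to ibv_create_qp(), so the use-SRQ decision must be final before any QP is created.
> > >
> > > #include <string.h>
> > > #include <infiniband/verbs.h>
> > >
> > > /* The use_srq flag must already be final here: once the QP exists,
> > >  * its receive path (SRQ vs. per-QP receive queue) cannot be
> > >  * changed, which is why flipping SRQ off afterwards leaves the
> > >  * queue pairs unusable. */
> > > static struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq,
> > >                                    struct ibv_srq *srq, int use_srq)
> > > {
> > >     struct ibv_qp_init_attr attr;
> > >     memset(&attr, 0, sizeof attr);
> > >     attr.send_cq          = cq;
> > >     attr.recv_cq          = cq;
> > >     attr.qp_type          = IBV_QPT_RC;
> > >     attr.srq              = use_srq ? srq : NULL;
> > >     attr.cap.max_send_wr  = 64;
> > >     attr.cap.max_recv_wr  = use_srq ? 0 : 64;
> > >     attr.cap.max_send_sge = 1;
> > >     attr.cap.max_recv_sge = 1;
> > >     return ibv_create_qp(pd, &attr);
> > > }
> > >
> > > In the 1.7-r5140 flow described above, rdma_param_handle_heterogenity() would have to run before a call like this for use_srq to be correct.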
> > >
> > > This problem affects fabrics using HCAs from IBM, older Tavor-style
> > > Mellanox HCAs, and QLogic HCAs.
> > >
> > > We've reviewed the changes and, unfortunately, we can't see a way to fix
> > > this without going back to using rdma_ring_based_allgather() to collect
> > > information about the HCA types before initializing the queue pairs.  The
> > > workaround is to manually specify MV2_USE_SRQ=0 when using
> > > mvapich2-1.7-r5140.
> > >
> > 
> > 
> > 
> > --
> > Devendar
> 
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

