[mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1 fails in heterogeneous fabrics

Devendar Bureddy bureddy at cse.ohio-state.edu
Thu Apr 12 15:01:15 EDT 2012


Hi Todd

We had carefully used volatile variables here to avoid explicit locking
primitives.  It seems the issue lies with the PGI compiler's handling of
these volatile variables under optimization.  We found a recent bug report (
http://www.pgroup.com/userforum/viewtopic.php?t=3040&sid=772c0c97927453136a0ba6c74d0628b8
) on the PGI user forum involving volatile variables with -O2, and a problem
report (TPR#18530) was filed for it.  We believe this is a similar kind of
issue.
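
For illustration, here is a minimal, hypothetical sketch of the kind of
shared-memory polling pattern at issue (not the actual MVAPICH2 code).
Correctness depends on the compiler honouring the volatile qualifier so that
the flag is re-read on every pass of the loop; an optimizer that mishandles
volatile under -O2 could hoist the load out of the loop and leave a rank
spinning forever:

    /* Hypothetical sketch, not the MVAPICH2 source: one process
     * publishes a value in a shared-memory region and flips a flag;
     * its peers busy-wait on that flag. */
    #include <stdint.h>

    typedef struct {
        volatile uint32_t flag;   /* flipped by the root, polled by peers  */
        volatile uint32_t data;   /* payload written before the flag flips */
    } shmem_slot_t;

    static void slot_publish(shmem_slot_t *slot, uint32_t value)
    {
        slot->data = value;       /* publish the payload ...               */
        slot->flag = 1;           /* ... then release the waiters          */
    }

    static uint32_t slot_wait(shmem_slot_t *slot)
    {
        /* The volatile qualifier is what obliges the compiler to re-read
         * flag on each iteration; if that is not honoured at -O2, the
         * load can be hoisted out of the loop and the rank hangs. */
        while (slot->flag == 0)
            ;
        return slot->data;
    }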

-Devendar

On Thu, Apr 12, 2012 at 1:18 PM, Rimmer, Todd <todd.rimmer at intel.com> wrote:

> Jonathan,
>
> Thanks for the quick reply.
>
> We have a requirement to support older compilers since not all customers
> move their apps forward immediately.
>
> I haven't tried PGI 12.3 yet; the earlier versions of PGI 12 I tried had
> more fundamental issues (things didn't build, etc.).
>
> In studying the code in ch3_shmem_coll.c I can see a number of risk areas,
> such as code which polls shared memory locations without using any locking
> primitives or OS calls.  Such code should be using CPU memory barriers to
> ensure that CPU caching and pipelining don't result in out-of-order reads
> and writes.  I suspect the issue I am seeing is due to such behavior in
> conjunction with the timing changes which optimized code might produce.  It
> is possible that volatile on some compilers handles this situation while
> PGI might not when using its -O2 level of optimization.
>
>
> The problem we observed reproduced with as few as 9 ranks of IMB.  I have
> also reproduced it with an 80-rank IMB job which uses all CPU cores on
> multiple nodes.
>
> I'm doing 2 experiments now:
> 1. using PGI pragma statements to disable optimization of the routines in
> ch3_shmem_coll.c which access the shared memory - so far this experiment
> appears to be successful
>
> 2. adding CPU memory barrier instructions at the key points in these same
> functions to prevent the CPU from pipelining and performing any operations
> out of order.  I suspect this may also provide a hint to the optimizer and
> prevent some reordering of the code by it as well.  I'm about to start this
> experiment.
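
A hypothetical sketch of what the second experiment might look like, assuming
an x86 target and a compiler that accepts GCC-style inline assembly; this is
not a proposed patch, only an illustration of the two kinds of barrier
involved:

    /* Compiler-only barrier: forbids the optimizer from caching or
     * reordering memory accesses across this point. */
    #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

    /* Full hardware fence: orders loads and stores in the CPU pipeline. */
    #define MEMORY_FENCE()     __asm__ __volatile__("mfence" ::: "memory")

    static void wait_for_flag(volatile int *flag)
    {
        while (*flag == 0)
            COMPILER_BARRIER();   /* force the flag to be re-read each pass */
        MEMORY_FENCE();           /* keep later reads after the flag read   */
    }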
>
> Todd Rimmer
> IB Product Architect
> Fabric Products Division
> Voice: 610-233-4852     Fax: 610-233-4777
> Todd.Rimmer at intel.com
>
>
> > -----Original Message-----
> > From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
> > Sent: Thursday, April 12, 2012 1:00 PM
> > To: Rimmer, Todd
> > Cc: Devendar Bureddy; Heinz, Michael William; Marciniszyn, Mike; Tang, CQ;
> > mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1
> > fails in heterogeneous fabrics
> >
> > Hello Todd,
> >
> > Can you try to reproduce this with the latest PGI compiler?  10.5 is quite
> > old, and this could have been caused by a compiler bug that has since been
> > fixed.  I recommend this since we haven't seen this problem (we're
> > currently using PGI 12.3.)
> >
> > Can you also provide more details about how to reproduce this using IMB?
> > How many processes are being used?  Is this race condition hit frequently?
> >
> > This morning the 1.7 branch was updated with a few fixes we've developed
> > over the past month.  Perhaps you can try this branch and see if this
> > error is still reproducible.
> >
> > If you'd like to try out the patch directly you can retrieve it using the
> > following command.
> >
> > svn diff -c5391 http://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/branches/1.7
> >
> > On Thu, Apr 12, 2012 at 02:25:34PM +0000, Rimmer, Todd wrote:
> > > Devendar,
> > >
> > > I just wanted to follow up, in case Mike Heinz hasn't had a chance to.
> > >
> > > Thank you very much for the patch.  We will give it a try soon, but it
> > > might be a couple more days before we get a chance to do so.
> > >
> > > We are looking into a different high-priority problem observed with
> > > mvapich2 1.7-6 as included in OFED 1.5.4.1.
> > >
> > > The problem occurs when building mvapich2 with the PGI compiler.  In this
> > > case, collectives such as barrier and reduce intermittently hang during
> > > tests such as IMB.  We have narrowed the problem down to intra-node
> > > operations via shared memory regions in conjunction with PGI compiler
> > > optimizations.  When mvapich2 is built with -O2 (the default per the
> > > .spec and configure files) the failure occurs.  If we turn off
> > > optimization, the problem does not occur.  At least two functions of
> > > interest are MPIR_Reduce_shmem_MV2 and MPIR_shmem_barrier_MV2 in
> > > barrier_osu.c.
> > >
> > > The problem is specific to the PGI compiler; we are using PGI 10.5.  The
> > > Intel and GNU compilers do not expose this issue.
> > >
> > > Our suspicion is that an aggressive optimization in the intra-node code
> > > causes some subtle aspects of the shared memory polling and testing to be
> > > optimized out, causing a race.
> > >
> > > It seems these functions were not previously used in mvapich2 1.6
> > > (which did not have this issue).
> > >
> > > Are you aware of any problems such as this?
> > >
> > > Todd Rimmer
> > > IB Product Architect
> > > Fabric Products Division
> > > Voice: 610-233-4852     Fax: 610-233-4777 Todd.Rimmer at intel.com
> > >
> > >
> > > > -----Original Message-----
> > > > From: Devendar Bureddy [mailto:bureddy at cse.ohio-state.edu]
> > > > Sent: Monday, April 09, 2012 11:28 AM
> > > > To: Heinz, Michael William
> > > > Cc: mvapich-discuss at cse.ohio-state.edu; Marciniszyn, Mike; Rimmer, Todd
> > > > Subject: Re: [mvapich-discuss] BUG REPORT: MVAPICH2 over OFED 1.5.4.1
> > > > fails in heterogeneous fabrics
> > > >
> > > > Hi Michael
> > > >
> > > > Can you please try the attached patch with the latest 1.7 nightly
> > > > tarball and see if this issue is resolved with it?
> > > >
> > > > Please follow the instructions below to apply the patch:
> > > >
> > > > $tar xvf mvapich2-latest.tar.gz
> > > > $cd mvapich2-1.7-r5225
> > > > $patch -p0 < diff.patch
> > > >
> > > > -Devendar
> > > >
> > > > On Mon, Apr 2, 2012 at 2:13 PM, Heinz, Michael William
> > > > <michael.william.heinz at intel.com> wrote:
> > > > > Basically, the problem is this: In version 1.7 of mvapich2, setting up
> > > > > handling of a mixed fabric was done before initialization of the IB
> > > > > queue pairs.  This was done by calling rdma_ring_based_allgather() to
> > > > > collect information about the HCA types and then calling
> > > > > rdma_param_handle_heterogenity().  (See lines 250-270 of rdma_iba_init.c.)
> > > > >
> > > > > Working this way permitted each rank to correctly determine whether
> > > > > to create a shared receive queue or not.
> > > > >
> > > > > Unfortunately, this was eliminated in 1.7-r5140.  In the new version,
> > > > > rdma_param_handle_heterogenity() is not called until *after* the shared
> > > > > receive queue has already been created and the QP has been moved to the
> > > > > ready-to-receive state - and when rdma_param_handle_heterogenity()
> > > > > turns the shared receive queue off, the queue pairs are left in an
> > > > > unusable state.
> > > > >
> > > > > This problem affects fabrics using HCAs from IBM, older Tavor-style
> > > > > Mellanox HCAs, and QLogic HCAs.
> > > > >
> > > > > We've reviewed the changes and, unfortunately, we can't see a way to
> > > > > fix this without going back to using rdma_ring_based_allgather() to
> > > > > collect information about the HCA types before initializing the queue
> > > > > pairs.  The workaround is to manually specify MV2_USE_SRQ=0 when using
> > > > > mvapich2-1.7-r5140.
> > > > >
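
As an example of applying that workaround, MVAPICH2 run-time parameters can
be passed on the mpirun_rsh command line; the process count, hostfile name,
and benchmark binary below are placeholders:

    $ mpirun_rsh -np 16 -hostfile ./hosts MV2_USE_SRQ=0 ./IMB-MPI1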
> > > > > _______________________________________________
> > > > > mvapich-discuss mailing list
> > > > > mvapich-discuss at cse.ohio-state.edu
> > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > >
> > > >
> > > >
> > > > --
> > > > Devendar
> > >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss at cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > >
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo
>



-- 
Devendar