[mvapich-discuss] MVAPICH2 Error - Assertion 'current_bytes[vc->smp.local_nodes]==0' failed.

Thomas O'Shea THOMAS.T.O'SHEA at saic.com
Wed Jun 6 19:58:52 EDT 2007


Hello,

We just recompiled without the -D_SMP_ flag and the code runs with no
errors. Did this change the way MVAPICH2 communicates between processes on
the same node? How much slower do you think it will be? We're running some
scaling tests now.
Anything else you want me to try?
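
In case it helps, the kind of micro-test we could use for the intra-node
comparison is sketched below (the array size, iteration count, and the
assumption that ranks 0 and 1 land on the same node are placeholders, not
our real setup):

      PROGRAM GET_LATENCY
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER, PARAMETER :: N = 1024, NITER = 1000
      DOUBLE PRECISION WORK(N), WGET(N), T0, T1
      INTEGER IERR, RANK, NPROC, WIN, I
      INTEGER (KIND=MPI_ADDRESS_KIND) WINSIZE, TARG_DISP

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)

C     Expose a plain double precision array through one window
      WORK = DBLE(RANK)
      WINSIZE = 8*N
      CALL MPI_WIN_CREATE(WORK, WINSIZE, 8, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, WIN, IERR)

C     Rank 0 times repeated passive-target gets from rank 1
      IF (RANK .EQ. 0 .AND. NPROC .GT. 1) THEN
         TARG_DISP = 0
         T0 = MPI_WTIME()
         DO I = 1, NITER
            CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, 1, 0, WIN, IERR)
            CALL MPI_GET(WGET, N, MPI_DOUBLE_PRECISION, 1, TARG_DISP,
     &                   N, MPI_DOUBLE_PRECISION, WIN, IERR)
            CALL MPI_WIN_UNLOCK(1, WIN, IERR)
         END DO
         T1 = MPI_WTIME()
         PRINT *, 'avg lock/get/unlock time (s): ', (T1 - T0)/NITER
      END IF

      CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
      CALL MPI_WIN_FREE(WIN, IERR)
      CALL MPI_FINALIZE(IERR)
      END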

Thanks,
Tom

> Hi,
>
> We've been carrying out thorough testing on our code base. So far, we have
> not found any outstanding errors in the MPI one sided code. Can we get
> access to your source code, or a small program showing the problem? That
> would be the easiest way for us to find the problem.
>
> Also, since this is an assertion failure in the SMP part of the code, you
> can try compiling mvapich2 without SMP by removing the -D_SMP_ flag from
> your CFLAGS; the corresponding change can be made in our make.mvapich2.ofa
> script. Let's see if your program runs successfully with that change.
>
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Mon, 4 Jun 2007, Thomas O'Shea wrote:
>
> > We migrated over to gen2 (OpenFabrics) and we are still getting the same
> > errors. I was wondering if you found anything, or have any ideas of what
> > to try next.
> >
> > Thanks,
> > Tom
> > ----- Original Message -----
> > From: "wei huang" <huanwei at cse.ohio-state.edu>
> > To: "Thomas O'Shea" <toshea at trg.saic.com>
> > Cc: <mvapich-discuss at cse.ohio-state.edu>
> > Sent: Friday, May 04, 2007 3:06 PM
> > Subject: Re: [mvapich-discuss] MVAPICH2 Error - Assertion
> > 'current_bytes[vc->smp.local_nodes]==0' failed.
> >
> >
> > > Hi Thomas,
> > >
> > > Thanks for your reply.
> > >
> > > Because the source code of your application is not available to us, we
> > > will do a review of our own code (or do you have a piece of code which
> > > shows the problem that can be sent to us?).
> > >
> > > The reason I ask you to try the gen2 (OpenFabrics) stack is that the
> > > whole InfiniBand community is moving towards it, so most of our effort
> > > is spent on this front (though we still provide necessary maintenance
> > > and bug fixes for the vapi stack). You can find useful information on
> > > installing the OFED stack (OpenFabrics Enterprise Distribution) here:
> > >
> > > http://www.openfabrics.org/downloads.htm
> > >
> > > The information on compiling mvapich2 with the OFED stack is available
> > > through our website.
> > >
> > > Anyway, we will get back to you once we find something.
> > >
> > > Thanks.
> > >
> > > Regards,
> > > Wei Huang
> > >
> > > 774 Dreese Lab, 2015 Neil Ave,
> > > Dept. of Computer Science and Engineering
> > > Ohio State University
> > > OH 43210
> > > Tel: (614)292-8501
> > >
> > >
> > > On Fri, 4 May 2007, Thomas O'Shea wrote:
> > >
> > > > Thanks for the response.
> > > >
> > > > 1) Turns out we are using mvapich2-0.9.8p1 already.
> > > >
> > > > 2) Yes, the standard compiling scripts were used.
> > > >
> > > > 3) You are correct, most of the communication involves one sided
> > > > operations with passive synchronization. The code also uses a few
> > > > other MPI commands.
> > > >
> > > > We define MPI vector types:
> > > >
> > > >       CALL MPI_TYPE_VECTOR(xlen,nguard,iu_bnd,MPI_DOUBLE_PRECISION,
> > > >      &                     xtype,ierr)
> > > >
> > > >       CALL MPI_TYPE_COMMIT(xtype,ierr)
> > > >
> > > >  Create MPI Windows:
> > > >
> > > >       CALL MPI_WIN_CREATE(work,winsize,8,MPI_INFO_NULL,
> > > >      &                    MPI_COMM_WORLD,win,ierr)
> > > >
> > > > Synch our gets with lock and unlock:
> > > >
> > > >         CALL MPI_WIN_LOCK(MPI_LOCK_SHARED,get_pe,0,win,ierr)
> > > >         CALL MPI_GET(wget,1,xtype,get_pe,
> > > >      &             targ_disp,1,xtype,win,ierr)
> > > >         CALL MPI_WIN_UNLOCK(get_pe,win,ierr)
> > > >
> > > > We use one broadcast call:
> > > >
> > > >       call MPI_BCAST(qxyz,3*maxpan,MPI_DOUBLE_PRECISION,0,
> > > >      1               MPI_COMM_WORLD,ierr)
> > > >
> > > > And of course barriers and freeing the windows and vector types.
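> > > >
> > > > Put together, the overall pattern is roughly the self-contained
> > > > sketch below (array names, sizes, and the choice of target rank are
> > > > placeholder assumptions for illustration, not our actual code):
> > > >
> > > >       PROGRAM PATTERN_SKETCH
> > > >       IMPLICIT NONE
> > > >       INCLUDE 'mpif.h'
> > > >       INTEGER, PARAMETER :: IU_BND = 64, XLEN = 16, NGUARD = 4
> > > >       INTEGER, PARAMETER :: MAXPAN = 128
> > > >       DOUBLE PRECISION WORK(IU_BND*IU_BND), WGET(IU_BND*IU_BND)
> > > >       DOUBLE PRECISION QXYZ(3*MAXPAN)
> > > >       INTEGER IERR, RANK, NPROC, WIN, XTYPE, GET_PE
> > > >       INTEGER (KIND=MPI_ADDRESS_KIND) WINSIZE, TARG_DISP
> > > >
> > > >       CALL MPI_INIT(IERR)
> > > >       CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
> > > >       CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
> > > >
> > > > C     Strided vector type used for most (but not all) of the gets
> > > >       CALL MPI_TYPE_VECTOR(XLEN, NGUARD, IU_BND,
> > > >      &                     MPI_DOUBLE_PRECISION, XTYPE, IERR)
> > > >       CALL MPI_TYPE_COMMIT(XTYPE, IERR)
> > > >
> > > > C     One window over the shared work array
> > > >       WORK = DBLE(RANK)
> > > >       WINSIZE = 8*IU_BND*IU_BND
> > > >       CALL MPI_WIN_CREATE(WORK, WINSIZE, 8, MPI_INFO_NULL,
> > > >      &                    MPI_COMM_WORLD, WIN, IERR)
> > > >
> > > > C     Passive-target get from another rank (often on the same node)
> > > >       GET_PE = MOD(RANK+1, NPROC)
> > > >       TARG_DISP = 0
> > > >       CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, GET_PE, 0, WIN, IERR)
> > > >       CALL MPI_GET(WGET, 1, XTYPE, GET_PE,
> > > >      &             TARG_DISP, 1, XTYPE, WIN, IERR)
> > > >       CALL MPI_WIN_UNLOCK(GET_PE, WIN, IERR)
> > > >
> > > > C     Broadcast from rank 0
> > > >       QXYZ = 0.0D0
> > > >       CALL MPI_BCAST(QXYZ, 3*MAXPAN, MPI_DOUBLE_PRECISION, 0,
> > > >      &               MPI_COMM_WORLD, IERR)
> > > >
> > > >       CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
> > > >       CALL MPI_WIN_FREE(WIN, IERR)
> > > >       CALL MPI_TYPE_FREE(XTYPE, IERR)
> > > >       CALL MPI_FINALIZE(IERR)
> > > >       END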
> > > >
> > > > The error we are getting happens on an MPI_WIN_UNLOCK after a GET
> > > > call that does not use the MPI_TYPE_VECTOR we created, though. The
> > > > ierr from the GET call is 0 as well.
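> > > >
> > > > (As an aside: window error handlers default to MPI_ERRORS_ARE_FATAL,
> > > > so a zero ierr mostly means the call did not abort. To make the
> > > > return codes meaningful, one could set MPI_ERRORS_RETURN on the
> > > > window; a sketch reusing the names from the snippets above:)
> > > >
> > > >       CALL MPI_WIN_SET_ERRHANDLER(WIN, MPI_ERRORS_RETURN, IERR)
> > > >       CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, GET_PE, 0, WIN, IERR)
> > > >       CALL MPI_GET(WGET, 1, XTYPE, GET_PE,
> > > >      &             TARG_DISP, 1, XTYPE, WIN, IERR)
> > > >       IF (IERR .NE. MPI_SUCCESS) PRINT *, 'MPI_GET error: ', IERR
> > > >       CALL MPI_WIN_UNLOCK(GET_PE, WIN, IERR)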
> > > >
> > > >
> > > > 4) I talked with the IT person in charge of this cluster, and he
> > > > said that we could try that, but the documentation he found on gen2
> > > > and udapl was somewhat sparse; he wasn't sure exactly how to set them
> > > > up or what the different builds actually do differently. Is there any
> > > > resource you can point us towards?
> > > >
> > > > Thanks,
> > > > Tom
> > > >
> > > >
> > > > > Hi Thomas,
> > > > >
> > > > > We will look into this issue. Would you please let us know the
> > > > > following:
> > > > >
> > > > > 1) We have recently made a couple of bug fixes and released
> > > > > mvapich2-0.9.8p1. Would you first try that version?
> > > > >
> > > > > And if it is not working:
> > > > >
> > > > > 2) Did you use the standard compiling scripts? (You mentioned the
> > > > > IB Gold release; is it on vapi? And did you use make.mvapich2.vapi?)
> > > > >
> > > > > 3) Would you provide us some information on what the communication
> > > > > patterns of your application are? It seems like one sided operations
> > > > > with passive synchronization (lock, get, unlock). Did you use other
> > > > > operations?
> > > > >
> > > > > 4) Will it be possible for you to try gen2 (make.mvapich2.ofa) or
> > > > > udapl on your stack, if they are available on your systems?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Regards,
> > > > > Wei Huang
> > > > >
> > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > Dept. of Computer Science and Engineering
> > > > > Ohio State University
> > > > > OH 43210
> > > > > Tel: (614)292-8501
> > > > >
> > > > >
> > > > > On Thu, 3 May 2007, Thomas O'Shea wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'm running MVAPICH2-0.9.8 using the IB Gold Release. I've got
> > > > > > two 16-processor nodes (each has 8 dual-core AMD Opterons) hooked
> > > > > > up through InfiniBand. I started off running this parallel
> > > > > > Fortran code on just one node with MPICH2 and had no problems. It
> > > > > > scaled decently to 8 processors, but I didn't see much improvement
> > > > > > with the jump to 16 (possibly due to cache coherency or
> > > > > > something). Now, when trying to get it running across the
> > > > > > InfiniBand interconnect, I get this error:
> > > > > >
> > > > > > current bytes 4, total bytes 28, remote id 1
> > > > > > nfa_opt: ch3_smp_progress.c:2075: MPIDI_CH3I_SMP_pull_header:
> > > > > > Assertion 'current_bytes[vc->smp.local_nodes] == 0' failed.
> > > > > > rank 0 in job 1 nessie_32906  caused collective abort of all
> > > > > > ranks
> > > > > >  exit status of rank 0: killed by signal 9
> > > > > >
> > > > > > This happens right after a one sided communication (MPI_GET) but
> > > > > > before the MPI_WIN_UNLOCK call that follows. Also, this only
> > > > > > happens with a process that is on the same node as the calling
> > > > > > process. The MPI_GET call exits with no errors as well.
> > > > > >
> > > > > > All the osu_benchmarks run with no problems. There were also no
> > > > > > problems if I make a local mpd ring (mpd &) on a single node and
> > > > > > run the code with MVAPICH2 on 2, 4, 8, or 16 processors. If I
> > > > > > compile with the MPICH2 libraries, there are no problems on a
> > > > > > single node or running processes spread out on both nodes.
> > > > > >
> > > > > > Ever seen this before? Any help would be greatly appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas O'Shea
> > > > > > SAIC
> > > >
> >



More information about the mvapich-discuss mailing list