[mvapich-discuss] MVAPICH2 Error - Assertion 'current_bytes[vc->smp.local_nodes]==0' failed. (fwd)

Gopal Santhanaraman santhana at cse.ohio-state.edu
Wed Jun 13 10:21:45 EDT 2007


 Hi Thomas,

  Thanks for your feedback.

  I don't know of any upper limit on the size of MPI_TYPE_VECTOR
  datatypes. Can you let us know how large the count and blocklength
  of the vector datatypes you are using are?

  Also, whenever passive synchronization is used, it is recommended
  to allocate the window memory with MPI_ALLOC_MEM.
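
  For reference, a minimal sketch of a window backed by MPI_ALLOC_MEM is
  given below. The variable names and the 1024-double size are only
  illustrative, and the POINTER line is the Cray-pointer extension that
  the MPI-2 Fortran binding of MPI_ALLOC_MEM relies on (gfortran, for
  example, needs -fcray-pointer to accept it):

      PROGRAM winalloc
      INCLUDE 'mpif.h'

      DOUBLE PRECISION work(1)
      POINTER (p, work)
      INTEGER (KIND=MPI_ADDRESS_KIND) winsize
      INTEGER win, ierr

      CALL MPI_INIT(ierr)

C     Ask MPI for the window buffer instead of using an ordinary array
C     (1024 doubles of 8 bytes each, purely as an example size).
      winsize = 8 * 1024
      CALL MPI_ALLOC_MEM(winsize, MPI_INFO_NULL, p, ierr)

C     Expose the MPI-allocated buffer as the RMA window.
      CALL MPI_WIN_CREATE(work, winsize, 8, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, win, ierr)

C     ... passive-target epochs (lock/get/unlock) on win go here ...

      CALL MPI_WIN_FREE(win, ierr)
      CALL MPI_FREE_MEM(work, ierr)
      CALL MPI_FINALIZE(ierr)
      END

  With this layout only the allocation of the window buffer changes; the
  lock/get/unlock code itself stays the same.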

 Thanks
 Gopal

On Tue, 12 Jun 2007, Thomas O'Shea wrote:

> Thanks for taking a look at this. I think we've narrowed it down to using
> MPI_TYPE_VECTOR to make some derived datatypes. When we use the code
> without them it seems to function fine. There are some issues with handing
> out the code, so it may take a while to boil it down to a simple section
> that I can pull out and post here.
>
> Are there any known limits to the size of MPI_TYPE_VECTOR datatypes? We are
> using these in conjunction with one-sided communications, and I know some
> implementations require that memory used for RMA be allocated with
> MPI_ALLOC_MEM when derived datatypes are used, but I didn't think MPICH2
> was one of them.
>
> Thanks,
> Tom
>
>
> >
> > Hi Thomas
> >
> >    Thanks for your reply.
> >
> >    We have tried out the communication patterns that you had reported
> >    and they run fine even with the case where the two processes are
> >    running on the same node.
> >
> >    I have attached the tests to this mail. These tests are
> >    from the mpich2 test suite (test4.c, test4_am.c, transpose.c).
> >    You can also try out these tests on your system.
> >
> >    We have not been able to reproduce the error that you are reporting.
> >    Can you give us more insight into the application code that you are
> >    running or, if it is possible, send us the application code?
> >
> > Thanks
> > Gopal
> >
> > > Date: Mon, 4 Jun 2007 08:37:23 -0700
> > > From: Thomas O'Shea <THOMAS.T.O'SHEA at saic.com>
> > > Reply-To: Thomas O'Shea <toshea at trg.saic.com>
> > > To: wei huang <huanwei at cse.ohio-state.edu>
> > > Cc: mvapich-discuss at cse.ohio-state.edu
> > > Subject: Re: [mvapich-discuss] MVAPICH2 Error - Assertion
> > >     'current_bytes[vc->smp.local_nodes]==0' failed.
> > >
> > > We migrated over to gen2 (OpenFabrics) and we are still getting the
> > > same errors. I was wondering if you found anything, or have any ideas
> > > of what to try next.
> > >
> > > Thanks,
> > > Tom
> > > ----- Original Message -----
> > > From: "wei huang" <huanwei at cse.ohio-state.edu>
> > > To: "Thomas O'Shea" <toshea at trg.saic.com>
> > > Cc: <mvapich-discuss at cse.ohio-state.edu>
> > > Sent: Friday, May 04, 2007 3:06 PM
> > > Subject: Re: [mvapich-discuss] MVAPICH2 Error - Assertion
> > > 'current_bytes[vc->smp.local_nodes]==0' failed.
> > >
> > >
> > > > Hi Thomas,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Because the source code of your application is not available to us,
> > > > we will do a code review of our code (or do you have a piece of code
> > > > which shows the problem that can be sent to us?).
> > > >
> > > > The reason I ask you to try the gen2 (OpenFabrics) stack is that the
> > > > whole InfiniBand community is moving towards it, so most of our
> > > > effort is spent on that front (though we still provide necessary
> > > > maintenance and bug fixes for the vapi stack). You can find useful
> > > > information on installing the OFED stack (OpenFabrics Enterprise
> > > > Distribution) here:
> > > >
> > > > http://www.openfabrics.org/downloads.htm
> > > >
> > > > And the information to compile mvapich2 with the OFED stack is available
> > > > through our website.
> > > >
> > > > Anyway, we will get back to you once we find something.
> > > >
> > > > Thanks.
> > > >
> > > > Regards,
> > > > Wei Huang
> > > >
> > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > Dept. of Computer Science and Engineering
> > > > Ohio State University
> > > > OH 43210
> > > > Tel: (614)292-8501
> > > >
> > > >
> > > > On Fri, 4 May 2007, Thomas O'Shea wrote:
> > > >
> > > > > Thanks for the response.
> > > > >
> > > > > 1) Turns out we are using mvapich2-0.9.8p1 already.
> > > > >
> > > > > 2) Yes, the standard compiling scripts were used.
> > > > >
> > > > > 3) You are correct, most of the communication involves one-sided
> > > > > operations with passive synchronization. The code also uses a few
> > > > > other MPI commands.
> > > > >
> > > > > We define MPI vector types:
> > > > >
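> > > > > C     xlen blocks of nguard elements each, with a stride of iu_bnd
> > > > > C     elements (in units of MPI_DOUBLE_PRECISION) between block starts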
> > > > >       CALL MPI_TYPE_VECTOR(xlen,nguard,iu_bnd,MPI_DOUBLE_PRECISION,
> > > > >      &                     xtype,ierr)
> > > > >
> > > > >       CALL MPI_TYPE_COMMIT(xtype,ierr)
> > > > >
> > > > >  Create MPI Windows:
> > > > >
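> > > > > C     expose 'work' as a window of winsize bytes; the displacement
> > > > > C     unit of 8 bytes corresponds to one MPI_DOUBLE_PRECISION element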
> > > > >       CALL MPI_WIN_CREATE(work,winsize,8,MPI_INFO_NULL,
> > > > >      &                    MPI_COMM_WORLD,win,ierr)
> > > > >
> > > > > Synch our gets with lock and unlock:
> > > > >
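> > > > > C     passive-target epoch: take a shared lock on get_pe's window,
> > > > > C     issue the get, and let the unlock complete the transfer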
> > > > >         CALL MPI_WIN_LOCK(MPI_LOCK_SHARED,get_pe,0,win,ierr)
> > > > >         CALL MPI_GET(wget,1,xtype,get_pe,
> > > > >      &             targ_disp,1,xtype,win,ierr)
> > > > >         CALL MPI_WIN_UNLOCK(get_pe,win,ierr)
> > > > >
> > > > > We use one broadcast call
> > > > >
> > > > >       call MPI_BCAST(qxyz,3*maxpan,MPI_DOUBLE_PRECISION,0,
> > > > >      1               MPI_COMM_WORLD,ierr)
> > > > >
> > > > > And of course barriers and freeing the windows and vector types.
> > > > >
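> > > > > For completeness, the cleanup is essentially the following (same
> > > > > win, xtype and ierr as in the snippets above):
> > > > >
> > > > >       CALL MPI_BARRIER(MPI_COMM_WORLD,ierr)
> > > > >       CALL MPI_WIN_FREE(win,ierr)
> > > > >       CALL MPI_TYPE_FREE(xtype,ierr)
> > > > >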
> > > > > The error we are getting happens on an MPI_WIN_UNLOCK after a GET
> > > > > call that does not use the MPI_TYPE_VECTOR that we created, though.
> > > > > The ierr from the GET call is 0 as well.
> > > > >
> > > > >
> > > > > 4) I talked with the IT person in charge of this cluster and he
> > > > > said that we could try that, but the documentation he found on gen2
> > > > > and udapl was somewhat sparse, in that he wasn't sure exactly how to
> > > > > set that up or what the different compilations actually do
> > > > > differently. Is there any resource you can point us towards?
> > > > >
> > > > > Thanks,
> > > > > Tom
> > > > >
> > > > >
> > > > > > Hi Thomas,
> > > > > >
> > > > > > We will look into this issue. Would you please let us know the
> > > > > > following:
> > > > > >
> > > > > > 1) We have recently made a couple of bug fixes and released
> > > > > > mvapich2-0.9.8p1. Would you first try that version?
> > > > > >
> > > > > > And if it is not working:
> > > > > >
> > > > > > 2) Did you use the standard compiling scripts? (You mentioned the
> > > > > > IB Gold release; is it on vapi, and did you use make.mvapich2.vapi?)
> > > > > >
> > > > > > 3) Would you provide us some information on what the
> > > > > > communication patterns of your application are? It seems like
> > > > > > one-sided operations with passive synchronization (lock, get,
> > > > > > unlock). Did you use other operations?
> > > > > >
> > > > > > 4) Will it be possible for you to try gen2 (make.mvapich2.ofa) or
> > > > > > udapl on your stack, if they are available on your systems?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Regards,
> > > > > > Wei Huang
> > > > > >
> > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > Dept. of Computer Science and Engineering
> > > > > > Ohio State University
> > > > > > OH 43210
> > > > > > Tel: (614)292-8501
> > > > > >
> > > > > >
> > > > > > On Thu, 3 May 2007, Thomas O'Shea wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'm running MVAPICH2-0.9.8 using the IB Gold Release. I've got
> > > > > > > two 16-processor nodes (each has 8 dual-core AMD Opterons)
> > > > > > > hooked up through InfiniBand. I started off running this
> > > > > > > parallel Fortran code on just one node with MPICH2 and had no
> > > > > > > problems. It scaled decently to 8 processors but didn't see much
> > > > > > > improvement with the jump to 16 (possibly due to cache coherency
> > > > > > > or something). Now, when trying to get it running across the
> > > > > > > InfiniBand interconnect I get this error:
> > > > > > >
> > > > > > > current bytes 4, total bytes 28, remote id 1
> > > > > > > nfa_opt: ch3_smp_progress.c:2075: MPIDI_CH3I_SMP_pull_header:
> > > > > > > Assertion 'current_bytes[vc->smp.local_nodes] == 0' failed.
> > > > > > > rank 0 in job 1 nessie_32906 caused collective abort of all ranks
> > > > > > >  exit status of rank 0: killed by signal 9
> > > > > > >
> > > > > > > This happens right after a one-sided communication (MPI_GET) but
> > > > > > > before the MPI_WIN_UNLOCK call that follows. Also, this only
> > > > > > > happens with a process that is on the same node as the calling
> > > > > > > process. The MPI_GET call itself exits with no errors.
> > > > > > >
> > > > > > > All the osu_benchmarks run with no problems. There were also no
> > > > > > > problems if I make a local mpd (mpd &) ring on a single node and
> > > > > > > run the code with MVAPICH2 with 2, 4, 8, or 16 processors. If I
> > > > > > > compile with the MPICH2 libraries there are no problems on a
> > > > > > > single node or running processes spread out on both nodes.
> > > > > > >
> > > > > > > Ever seen this before? Any help would be greatly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Thomas O'Shea
> > > > > > > SAIC
> > > > >
> > >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss at cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > >
> >
>
>
>



More information about the mvapich-discuss mailing list