[mvapich-discuss] MVAPICH2 Error - Assertion 'current_bytes[vc->smp.local_nodes]==0' failed.

Thomas O'Shea THOMAS.T.O'SHEA at saic.com
Wed Jun 6 19:05:02 EDT 2007


Did you ever find a work-around?

Thanks,
Tom
----- Original Message ----- 
From: "Sylvain Jeaugey" <sylvain.jeaugey at bull.net>
To: "Thomas O'Shea" <toshea at trg.saic.com>
Cc: "wei huang" <huanwei at cse.ohio-state.edu>;
<mvapich-discuss at cse.ohio-state.edu>
Sent: Monday, June 04, 2007 9:29 AM
Subject: Re: [mvapich-discuss] MVAPICH2 Error - Assertion
'current_bytes[vc->smp.local_nodes]==0' failed.


> Hi all,
>
> For the record, this is an error I already encountered. [I didn't report
> it since I'm still using an old mvapich tree.]
> Unfortunately, we also don't have a simple way to reproduce it.
>
> Sylvain
>
> On Mon, 4 Jun 2007, Thomas O'Shea wrote:
>
> > We migrated over to gen2 (OpenFabrics) and we are still getting the same
> > errors. I was wondering if you found anything, or have any ideas of what
> > to try next.
> >
> > Thanks,
> > Tom
> > ----- Original Message -----
> > From: "wei huang" <huanwei at cse.ohio-state.edu>
> > To: "Thomas O'Shea" <toshea at trg.saic.com>
> > Cc: <mvapich-discuss at cse.ohio-state.edu>
> > Sent: Friday, May 04, 2007 3:06 PM
> > Subject: Re: [mvapich-discuss] MVAPICH2 Error - Assertion
> > 'current_bytes[vc->smp.local_nodes]==0' failed.
> >
> >
> >> Hi Thomas,
> >>
> >> Thanks for your reply.
> >>
> >> Because the source code of your application is not available to us, we
> >> will do a review of our own code (or do you have a piece of code that
> >> shows the problem and could be sent to us?).
> >>
> >> The reason I ask you to try the gen2 (OpenFabrics) stack is that the
> >> whole InfiniBand community is moving towards it, so most of our effort
> >> is actually spent on that front (though we still do the necessary
> >> maintenance and bug fixes for the VAPI stack). You can find useful
> >> information on installing the OFED stack (OpenFabrics Enterprise
> >> Distribution) here:
> >>
> >> http://www.openfabrics.org/downloads.htm
> >>
> >> And the information on compiling mvapich2 with the OFED stack is
> >> available through our website.
> >>
> >> Anyway, we will get back to you once we find something.
> >>
> >> Thanks.
> >>
> >> Regards,
> >> Wei Huang
> >>
> >> 774 Dreese Lab, 2015 Neil Ave,
> >> Dept. of Computer Science and Engineering
> >> Ohio State University
> >> OH 43210
> >> Tel: (614)292-8501
> >>
> >>
> >> On Fri, 4 May 2007, Thomas O'Shea wrote:
> >>
> >>> Thanks for the response.
> >>>
> >>> 1) Turns out we are using mvapich2-0.9.8p1 already.
> >>>
> >>> 2) Yes, the standard compiling scripts were used.
> >>>
> >>> 3) You are correct, most of the communication involves one-sided
> >>> operations with passive synchronization. The code also uses a few
> >>> other MPI commands.
> >>>
> >>> We define MPI vector types:
> >>>
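> >>> C      xlen blocks of nguard elements each, with a stride of iu_bnd
> >>> C      elements between the starts of consecutive blocks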
> >>>       CALL MPI_TYPE_VECTOR(xlen,nguard,iu_bnd,MPI_DOUBLE_PRECISION,
> >>>      &                     xtype,ierr)
> >>>
> >>>       CALL MPI_TYPE_COMMIT(xtype,ierr)
> >>>
> >>> We create MPI windows:
> >>>
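> >>> C      expose 'work' (winsize bytes) with a displacement unit of
> >>> C      8 bytes, i.e. one DOUBLE PRECISION element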
> >>>       CALL MPI_WIN_CREATE(work,winsize,8,MPI_INFO_NULL,
> >>>      &                    MPI_COMM_WORLD,win,ierr)
> >>>
> >>> We synchronize our gets with lock and unlock:
> >>>
> >>>         CALL MPI_WIN_LOCK(MPI_LOCK_SHARED,get_pe,0,win,ierr)
> >>>         CALL MPI_GET(wget,1,xtype,get_pe,
> >>>      &             targ_disp,1,xtype,win,ierr)
> >>>         CALL MPI_WIN_UNLOCK(get_pe,win,ierr)
> >>>
> >>> We use one broadcast call:
> >>>
> >>>       call MPI_BCAST(qxyz,3*maxpan,MPI_DOUBLE_PRECISION,0,
> >>>      1               MPI_COMM_WORLD,ierr)
> >>>
> >>> And of course barriers and freeing the windows and vector types.
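> >>>
> >>> For reference, the whole pattern boils down to something like the
> >>> following self-contained sketch (the buffer size, the ring-neighbour
> >>> choice of get_pe, and the names here are just placeholders, not the
> >>> values from the real application):
> >>>
> >>>       PROGRAM getdemo
> >>> C     Minimal sketch of the passive-target get pattern described
> >>> C     above: each rank exposes a small DOUBLE PRECISION buffer in a
> >>> C     window and reads its neighbour's buffer with lock/get/unlock.
> >>>       IMPLICIT NONE
> >>>       INCLUDE 'mpif.h'
> >>>       INTEGER nb
> >>>       PARAMETER (nb = 8)
> >>>       DOUBLE PRECISION work(nb), wget(nb)
> >>>       INTEGER ierr, mype, npes, get_pe, win, i
> >>>       INTEGER (KIND=MPI_ADDRESS_KIND) winsize, targ_disp
> >>>
> >>>       CALL MPI_INIT(ierr)
> >>>       CALL MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
> >>>       CALL MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
> >>>
> >>> C     Fill the exposed buffer with rank-dependent values.
> >>>       DO i = 1, nb
> >>>          work(i) = DBLE(mype*100 + i)
> >>>       END DO
> >>>
> >>> C     Expose work in a window; the displacement unit is 8 bytes,
> >>> C     one DOUBLE PRECISION element.
> >>>       winsize = 8*nb
> >>>       CALL MPI_WIN_CREATE(work, winsize, 8, MPI_INFO_NULL,
> >>>      &                    MPI_COMM_WORLD, win, ierr)
> >>>
> >>> C     Passive-target read of the next rank's buffer.
> >>>       get_pe = MOD(mype+1, npes)
> >>>       targ_disp = 0
> >>>       CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, get_pe, 0, win, ierr)
> >>>       CALL MPI_GET(wget, nb, MPI_DOUBLE_PRECISION, get_pe,
> >>>      &             targ_disp, nb, MPI_DOUBLE_PRECISION, win, ierr)
> >>>       CALL MPI_WIN_UNLOCK(get_pe, win, ierr)
> >>>
> >>>       PRINT *, 'rank', mype, 'read', wget(1), 'from rank', get_pe
> >>>
> >>>       CALL MPI_WIN_FREE(win, ierr)
> >>>       CALL MPI_FINALIZE(ierr)
> >>>       END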
> >>>
> >>> The error we are getting happens on an MPI_WIN_UNLOCK after a GET call
> >>> that does not use the MPI_TYPE_VECTOR that we created, though. The ierr
> >>> from the GET call is 0 as well.
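> >>>
> >>> (As a side note, here is a hypothetical fragment, not our actual code,
> >>> showing how those ierr values could be made meaningful: attaching
> >>> MPI_ERRORS_RETURN to the window makes one-sided calls report errors
> >>> through ierr instead of aborting, since the default handler is
> >>> MPI_ERRORS_ARE_FATAL. The internal assertion above is raised inside
> >>> the library itself and would not be caught this way.)
> >>>
> >>> C     Declarations for decoding an MPI error code into readable text.
> >>>       CHARACTER*(MPI_MAX_ERROR_STRING) errmsg
> >>>       INTEGER msglen, ierr2
> >>>
> >>> C     Report RMA errors on this window through ierr instead of
> >>> C     aborting (the default handler is MPI_ERRORS_ARE_FATAL).
> >>>       CALL MPI_WIN_SET_ERRHANDLER(win, MPI_ERRORS_RETURN, ierr)
> >>>
> >>> C     Then check after each one-sided call, e.g. the unlock:
> >>>       CALL MPI_WIN_UNLOCK(get_pe, win, ierr)
> >>>       IF (ierr .NE. MPI_SUCCESS) THEN
> >>>          CALL MPI_ERROR_STRING(ierr, errmsg, msglen, ierr2)
> >>>          PRINT *, 'MPI_WIN_UNLOCK: ', errmsg(1:msglen)
> >>>       END IF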
> >>>
> >>>
> >>> 4) I talked with the IT person in charge of this cluster and he said
> >>> that we could try that, but he said the documentation he found on gen2
> >>> and udapl was somewhat sparse, in that he wasn't sure exactly how to
> >>> set that up and what the different compilations actually do
> >>> differently. Is there any resource you can point us towards?
> >>>
> >>> Thanks,
> >>> Tom
> >>>
> >>>
> >>>> Hi Thomas,
> >>>>
> >>>> We will look into this issue. Would you please let us know the
> >>>> following:
> >>>>
> >>>> 1) We have recently made a couple of bug fixes and released
> >>>> mvapich2-0.9.8p1. Would you first try that version?
> >>>>
> >>>> And if it is not working:
> >>>>
> >>>> 2) Did you use the standard compiling scripts? (You mentioned the IB
> >>>> Gold release; is it on VAPI, and did you use make.mvapich2.vapi?)
> >>>>
> >>>> 3) Would you provide us some information on what the communication
> >>>> patterns of your application are? It seems like one-sided operations
> >>>> with passive synchronization (lock, get, unlock). Did you use other
> >>>> operations?
> >>>>
> >>>> 4) Would it be possible for you to try gen2 (make.mvapich2.ofa) or
> >>>> udapl on your stack, if they are available on your systems?
> >>>>
> >>>> Thanks.
> >>>>
> >>>> Regards,
> >>>> Wei Huang
> >>>>
> >>>> 774 Dreese Lab, 2015 Neil Ave,
> >>>> Dept. of Computer Science and Engineering
> >>>> Ohio State University
> >>>> OH 43210
> >>>> Tel: (614)292-8501
> >>>>
> >>>>
> >>>> On Thu, 3 May 2007, Thomas O'Shea wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I'm running MVAPICH2-0.9.8 using the IB Gold release. I've got two
> >>>>> 16-processor nodes (each has 8 dual-core AMD Opterons) hooked up
> >>>>> through InfiniBand. I started off running this parallel Fortran code
> >>>>> on just one node with MPICH2 and had no problems. It scaled decently
> >>>>> to 8 processors but I didn't see much improvement with the jump to 16
> >>>>> (possibly due to cache coherency or something). Now, when trying to
> >>>>> get it running across the InfiniBand interconnect I get this error:
> >>>>>
> >>>>> current bytes 4, total bytes 28, remote id 1
> >>>>> nfa_opt: ch3_smp_progress.c:2075: MPIDI_CH3I_SMP_pull_header:
> >>>>> Assertion 'current_bytes[vc->smp.local_nodes] == 0' failed.
> >>>>> rank 0 in job 1 nessie_32906  caused collective abort of all ranks
> >>>>>  exit status of rank 0: killed by signal 9
> >>>>>
> >>>>> This happens right after a one-sided communication (MPI_GET) but
> >>>>> before the MPI_WIN_UNLOCK call that follows, and only with a process
> >>>>> that is on the same node as the calling process. The MPI_GET call
> >>>>> itself exits with no errors.
> >>>>>
> >>>>> All the osu_benchmarks run with no problems. There were also no
> >>>>> problems if I make a local mpd ring (mpd &) on a single node and run
> >>>>> the code with MVAPICH2 on 2, 4, 8, or 16 processors. If I compile
> >>>>> with the MPICH2 libraries there are no problems on a single node or
> >>>>> running processes spread out on both nodes.
> >>>>>
> >>>>> Ever seen this before? Any help would be greatly appreciated.
> >>>>>
> >>>>> Thanks,
> >>>>> Thomas O'Shea
> >>>>> SAIC
> >>>
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >


