[mvapich-discuss] MVAPICH2 Error - Assertion 'current_bytes[vc->smp.local_nodes]==0' failed.

Sylvain Jeaugey sylvain.jeaugey at bull.net
Mon Jun 4 12:29:58 EDT 2007


Hi all,

For the record, this is an error I already encountered. [I didn't report 
it since I'm still using an old mvapich tree.]
Unfortunately, we also don't have a simple way to reproduce it.

Sylvain

On Mon, 4 Jun 2007, Thomas O'Shea wrote:

> We migrated over to gen2 (OpenFabrics) and we are still getting the same
> errors. I was wondering if you found anything, or have any ideas of what to
> try next.
>
> Thanks,
> Tom
> ----- Original Message -----
> From: "wei huang" <huanwei at cse.ohio-state.edu>
> To: "Thomas O'Shea" <toshea at trg.saic.com>
> Cc: <mvapich-discuss at cse.ohio-state.edu>
> Sent: Friday, May 04, 2007 3:06 PM
> Subject: Re: [mvapich-discuss] MVAPICH2 Error - Assertion
> 'current_bytes[vc->smp.local_nodes]==0' failed.
>
>
>> Hi Thomas,
>>
>> Thanks for your reply.
>>
>> Because the source code of your application is not available to us, we
>> will review our own code (or do you have a small piece of code that
>> reproduces the problem and can be sent to us?).
>>
>> The reason I ask you to try the gen2 (OpenFabrics) stack is that the whole
>> InfiniBand community is moving towards it, so most of our effort is spent
>> on that front (though we still provide necessary maintenance and bug fixes
>> for the vapi stack). You can find useful information on installing the
>> OFED stack (OpenFabrics Enterprise Distribution) here:
>>
>> http://www.openfabrics.org/downloads.htm
>>
>> The information on how to compile mvapich2 with the OFED stack is
>> available through our website.
>>
>> Anyway, we will get back to you once we find something.
>>
>> Thanks.
>>
>> Regards,
>> Wei Huang
>>
>> 774 Dreese Lab, 2015 Neil Ave,
>> Dept. of Computer Science and Engineering
>> Ohio State University
>> OH 43210
>> Tel: (614)292-8501
>>
>>
>> On Fri, 4 May 2007, Thomas O'Shea wrote:
>>
>>> Thanks for the response.
>>>
>>> 1) Turns out we are using mvapich2-0.9.8p1 already.
>>>
>>> 2) Yes, the standard compiling scripts were used.
>>>
>>> 3) You are correct, most of the communication involves one-sided
>>> operations with passive synchronization. The code also uses a few other
>>> MPI calls.
>>>
>>> We define MPI vector types:
>>>
>>>       CALL MPI_TYPE_VECTOR(xlen,nguard,iu_bnd,MPI_DOUBLE_PRECISION,
>>>      &                     xtype,ierr)
>>>
>>>       CALL MPI_TYPE_COMMIT(xtype,ierr)
>>>
>>> We create MPI windows:
>>>
>>>       CALL MPI_WIN_CREATE(work,winsize,8,MPI_INFO_NULL,
>>>      &                    MPI_COMM_WORLD,win,ierr)
>>>
>>> Synch our gets with lock and unlock:
>>>
>>>         CALL MPI_WIN_LOCK(MPI_LOCK_SHARED,get_pe,0,win,ierr)
>>>         CALL MPI_GET(wget,1,xtype,get_pe,
>>>      &             targ_disp,1,xtype,win,ierr)
>>>         CALL MPI_WIN_UNLOCK(get_pe,win,ierr)
>>>
>>> We use one broadcast call
>>>
>>>       call MPI_BCAST(qxyz,3*maxpan,MPI_DOUBLE_PRECISION,0,
>>>      1               MPI_COMM_WORLD,ierr)
>>>
>>> And of course barriers and freeing the windows and vector types.
>>>
>>> The error we are getting happens on an MPI_WIN_UNLOCK after a GET call
>>> that does not use the MPI_TYPE_VECTOR we created, though. The ierr from
>>> the GET call is 0 as well.
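>>>
>>> In case it helps, below is a boiled-down, self-contained sketch of the
>>> pattern described above (just an illustration, not a cut-and-paste from
>>> our code; the sizes, the stand-in vector type, and the neighbour choice
>>> are made up):
>>>
>>>       PROGRAM repro
>>>       IMPLICIT NONE
>>>       INCLUDE 'mpif.h'
>>>       INTEGER nx
>>>       PARAMETER (nx = 64)
>>>       INTEGER ierr, rank, nprocs, win, xtype, get_pe, i
>>>       DOUBLE PRECISION work(nx), wget(nx)
>>>       INTEGER (KIND=MPI_ADDRESS_KIND) winsize, targ_disp
>>>
>>>       CALL MPI_INIT(ierr)
>>>       CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>>>       CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
>>>
>>>       DO i = 1, nx
>>>          work(i) = DBLE(rank)
>>>       END DO
>>>
>>> C     stand-in vector type (8 blocks of 8 doubles, stride 8)
>>>       CALL MPI_TYPE_VECTOR(8, 8, 8, MPI_DOUBLE_PRECISION, xtype, ierr)
>>>       CALL MPI_TYPE_COMMIT(xtype, ierr)
>>>
>>> C     expose the whole work array, displacement unit = 8 bytes
>>>       winsize = 8 * nx
>>>       CALL MPI_WIN_CREATE(work, winsize, 8, MPI_INFO_NULL,
>>>      &                    MPI_COMM_WORLD, win, ierr)
>>>
>>> C     each rank gets from its right neighbour (same node or not)
>>>       get_pe = MOD(rank + 1, nprocs)
>>>       targ_disp = 0
>>>       CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, get_pe, 0, win, ierr)
>>>       CALL MPI_GET(wget, 1, xtype, get_pe,
>>>      &             targ_disp, 1, xtype, win, ierr)
>>>       CALL MPI_WIN_UNLOCK(get_pe, win, ierr)
>>>
>>>       CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
>>>       CALL MPI_WIN_FREE(win, ierr)
>>>       CALL MPI_TYPE_FREE(xtype, ierr)
>>>       CALL MPI_FINALIZE(ierr)
>>>       END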
>>>
>>>
>>> 4) I talked with the IT person in charge of this cluster and he said that
>>> we could try that, but the documentation he found on gen2 and udapl was
>>> sparse enough that he wasn't sure exactly how to set it up or what the
>>> different compilations actually do differently. Is there any resource
>>> you can point us towards?
>>>
>>> Thanks,
>>> Tom
>>>
>>>
>>>> Hi Thomas,
>>>>
>>>> We will look into this issue. Would you please let us know the
>>>> following:
>>>>
>>>> 1) We have recently made a couple of bug fixes and released
>>>> mvapich2-0.9.8p1. Would you first try that version?
>>>>
>>>> And if it is not working:
>>>>
>>>> 2) Did you use the standard compilation scripts? (You mentioned the IB
>>>> Gold release; is it on vapi, and did you use make.mvapich2.vapi?)
>>>>
>>>> 3) Would you provide us some information on the communication patterns
>>>> of your application? It seems like one-sided operations with passive
>>>> synchronization (lock, get, unlock). Did you use other operations?
>>>>
>>>> 4) Would it be possible for you to try the gen2 (make.mvapich2.ofa) or
>>>> udapl stack, if they are available on your systems?
>>>>
>>>> Thanks.
>>>>
>>>> Regards,
>>>> Wei Huang
>>>>
>>>> 774 Dreese Lab, 2015 Neil Ave,
>>>> Dept. of Computer Science and Engineering
>>>> Ohio State University
>>>> OH 43210
>>>> Tel: (614)292-8501
>>>>
>>>>
>>>> On Thu, 3 May 2007, Thomas O'Shea wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm running MVAPICH2-0.9.8 using the IB Gold release. I've got two
>>>>> 16-processor nodes (each has 8 dual-core AMD Opterons) hooked up
>>>>> through InfiniBand. I started off running this parallel Fortran code
>>>>> on just one node with MPICH2 and had no problems. It scaled decently
>>>>> to 8 processors but I didn't see much improvement with the jump to 16
>>>>> (possibly due to cache coherency or something). Now, when trying to
>>>>> get it running across the InfiniBand interconnect I get this error:
>>>>>
>>>>> current bytes 4, total bytes 28, remote id 1
>>>>> nfa_opt: ch3_smp_progress.c:2075: MPIDI_CH3I_SMP_pull_header:
>>>>> Assertion 'current_bytes[vc->smp.local_nodes] == 0' failed.
>>>>> rank 0 in job 1 nessie_32906  caused collective abort of all ranks
>>>>>  exit status of rank 0: killed by signal 9
>>>>>
>>>>> This happens right after a one-sided communication (MPI_GET) but
>>>>> before the MPI_WIN_UNLOCK call that follows, and only with a
>>>>> process that is on the same node as the calling process. The MPI_GET
>>>>> call itself exits with no errors.
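>>>>>
>>>>> (A rough sketch of the kind of helper we use to confirm which ranks
>>>>> actually share a node; the names are placeholders and it assumes at
>>>>> most 1024 ranks:)
>>>>>
>>>>>       SUBROUTINE print_node_map
>>>>>       IMPLICIT NONE
>>>>>       INCLUDE 'mpif.h'
>>>>>       INTEGER maxp
>>>>>       PARAMETER (maxp = 1024)
>>>>>       INTEGER ierr, rank, nprocs, namelen, i
>>>>>       CHARACTER*(MPI_MAX_PROCESSOR_NAME) myname
>>>>>       CHARACTER*(MPI_MAX_PROCESSOR_NAME) allnames(maxp)
>>>>>
>>>>>       CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>>>>>       CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
>>>>>       CALL MPI_GET_PROCESSOR_NAME(myname, namelen, ierr)
>>>>>
>>>>> C     gather every rank's host name so rank 0 can print the mapping
>>>>>       CALL MPI_GATHER(myname, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER,
>>>>>      &                allnames, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER,
>>>>>      &                0, MPI_COMM_WORLD, ierr)
>>>>>
>>>>>       IF (rank .EQ. 0) THEN
>>>>>          DO i = 1, nprocs
>>>>>             WRITE(*,*) 'rank', i-1, 'runs on', allnames(i)
>>>>>          END DO
>>>>>       END IF
>>>>>       END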
>>>>>
>>>>> All the osu_benchmarks run with no problems. There were also no
>>>>> problems if I make a local mpd ring (mpd &) on a single node and run
>>>>> the code with MVAPICH2 with 2, 4, 8, or 16 processors. If I compile
>>>>> with the MPICH2 libraries there are no problems on a single node or
>>>>> when running processes spread out across both nodes.
>>>>>
>>>>> Ever seen this before? Any help would be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Thomas O'Shea
>>>>> SAIC
>>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

