[mvapich-discuss] MVAPICH2 Error - Assertion 'current_bytes[vc->smp.local_nodes]==0' failed.

wei huang huanwei at cse.ohio-state.edu
Fri May 4 10:40:36 EDT 2007


Hi Thomas,

We will look into this issue. Would you please let us know the following:

1) We have recently made a couple of bug fixes and released
mvapich2-0.9.8p1. Would you first try that version?

And if it is not working:

2) Did you use the standard build scripts? (You mentioned the IB Gold
release; is it on VAPI? And did you build with make.mvapich2.vapi?)

3) Could you provide us some information on your application's
communication patterns? It seems to use one sided operations with passive
synchronization (lock, get, unlock). Did you use other operations as well?

4) Would it be possible for you to try gen2 (make.mvapich2.ofa) or uDAPL
on your stack, if they are available on your systems?
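
For reference, the passive-target pattern mentioned in question 3 looks
roughly like the following in Fortran. This is a minimal sketch, not code
from your application: the buffer names, counts, and target rank are
illustrative assumptions.

```fortran
! Minimal sketch of passive-target one-sided communication
! (lock / get / unlock). Names, counts, and the target rank are
! illustrative assumptions, not taken from the original application.
program one_sided_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, win, intsize
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
  integer :: winbuf(8), rbuf(8)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Expose an 8-integer buffer on every rank.
  call MPI_Type_size(MPI_INTEGER, intsize, ierr)
  winbuf = rank
  winsize = 8 * intsize
  call MPI_Win_create(winbuf, winsize, intsize, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, ierr)

  if (rank == 0 .and. nprocs > 1) then
     ! Passive synchronization: the reported assertion fires after
     ! MPI_GET returns but before MPI_WIN_UNLOCK completes.
     disp = 0
     call MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win, ierr)
     call MPI_Get(rbuf, 8, MPI_INTEGER, 1, disp, 8, MPI_INTEGER, &
                  win, ierr)
     call MPI_Win_unlock(1, win, ierr)
  end if

  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program one_sided_sketch
```

If your code also uses other one sided operations (MPI_PUT,
MPI_ACCUMULATE) or other synchronization modes (fence,
post/start/complete/wait), that would be useful to know.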

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Thu, 3 May 2007, Thomas O'Shea wrote:

> Hello,
>
> I'm running MVAPICH2-0.9.8 with the IB Gold release. I've got 2
> 16-processor nodes (each has 8 dual-core AMD Opterons) hooked up
> through InfiniBand. I started off running this parallel Fortran code
> on just one node with MPICH2 and had no problems. It scaled decently
> to 8 processors but didn't show much improvement with the jump to 16
> (possibly due to cache coherency or something). Now, when trying to
> get it running across the InfiniBand interconnect, I get this error:
>
> current bytes 4, total bytes 28, remote id 1
> nfa_opt: ch3_smp_progress.c:2075: MPIDI_CH3I_SMP_pull_header: Assertion 'current_bytes[vc->smp.local_nodes] == 0' failed.
> rank 0 in job 1 nessie_32906  caused collective abort of all ranks
>  exit status of rank 0: killed by signal 9
>
> This happens right after a one sided communication (MPI_GET) but
> before the MPI_WIN_UNLOCK call that follows. It only happens with a
> process that is on the same node as the calling process, and the
> MPI_GET call itself exits with no errors.
>
> All the osu_benchmarks run with no problems. There are also no
> problems if I make a local mpd ring (mpd &) on a single node and run
> the code with MVAPICH2 on 2, 4, 8, or 16 processors. If I compile with
> the MPICH2 libraries there are no problems on a single node or running
> processes spread out across both nodes.
>
> Ever seen this before? Any help would be greatly appreciated.
>
> Thanks,
> Thomas O'Shea
> SAIC
