[mvapich-discuss] MVAPICH2 Error - Assertion 'current_bytes[vc->smp.local_nodes]==0' failed.

Thomas O'Shea THOMAS.T.O'SHEA at saic.com
Thu May 3 20:45:57 EDT 2007


Hello,

I'm running MVAPICH2-0.9.8 with the IB Gold release. I've got two 16-processor nodes (each with 8 dual-core AMD Opterons) connected over InfiniBand. I started off running this parallel Fortran code on just one node with MPICH2 and had no problems. It scaled decently to 8 processors but didn't see much improvement with the jump to 16 (possibly due to cache coherency or something similar). Now, when trying to run it across the InfiniBand connection, I get this error:

current bytes 4, total bytes 28, remote id 1
nfa_opt: ch3_smp_progress.c:2075: MPIDI_CH3I_SMP_pull_header: Assertion 'current_bytes[vc->smp.local_nodes] == 0' failed.
rank 0 in job 1 nessie_32906  caused collective abort of all ranks
 exit status of rank 0: killed by signal 9

This happens right after a one-sided communication (MPI_GET) but before the MPI_WIN_UNLOCK call that follows. It only occurs when the target process is on the same node as the calling process. The MPI_GET call itself returns with no errors.
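For reference, the communication pattern in question boils down to something like the sketch below. This is not our actual code; the window contents, lock type, ranks, and datatypes are just placeholders:

  program get_sketch
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nprocs, win
    integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
    double precision :: local_val, remote_val

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    local_val = dble(rank)
    winsize = 8                      ! one double precision value
    call MPI_WIN_CREATE(local_val, winsize, 8, MPI_INFO_NULL, &
                        MPI_COMM_WORLD, win, ierr)

    if (rank == 0 .and. nprocs > 1) then
       disp = 0
       call MPI_WIN_LOCK(MPI_LOCK_SHARED, 1, 0, win, ierr)
       ! the MPI_GET returns without error...
       call MPI_GET(remote_val, 1, MPI_DOUBLE_PRECISION, 1, disp, 1, &
                    MPI_DOUBLE_PRECISION, win, ierr)
       ! ...but the assertion fires before MPI_WIN_UNLOCK completes,
       ! and only when the target rank is on the same node as rank 0
       call MPI_WIN_UNLOCK(1, win, ierr)
    end if

    call MPI_WIN_FREE(win, ierr)
    call MPI_FINALIZE(ierr)
  end program get_sketch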

All the osu_benchmarks run with no problems. There are also no problems if I start a local mpd ring (mpd &) on a single node and run the code with MVAPICH2 on 2, 4, 8, or 16 processors. If I compile against the MPICH2 libraries, there are no problems either on a single node or with processes spread across both nodes.

Ever seen this before? Any help would be greatly appreciated.

Thanks,
Thomas O'Shea
SAIC

