[mvapich-discuss] MPI_BCAST appears to work incorrectly in vasp code - mvapich2 2.1rc1

Gardiner, Judith judithg at osc.edu
Thu Jan 15 14:27:23 EST 2015


We are running VASP successfully with mvapich2 1.9, but it fails with 2.1a and 2.1rc1.  It only happens when we use at least 40 processes.  We have 20 cores per node, so I've tried it with nodes=2:ppn=20, nodes=4:ppn=10, and nodes=8:ppn=5.  It fails on all of them.  The problem is repeatable.

I've narrowed it down to a particular call to MPI_BCAST.  Rank 5 is broadcasting a single integer value.  The correct value (128) is received by all ranks running on the first node.  An incorrect value (2) is received by all ranks running on the other nodes, including the root (rank 5) when it is not on the first node.  The return code is 0 on every rank.

The program loops over the ranks, with each rank in turn broadcasting a vector length and then the vector itself.  The failure occurs when rank 5 broadcasts its vector length; the program then hangs on the next broadcast because of the mismatched lengths.

I was unable to reproduce the problem in a toy program.
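
For reference, the pattern in the affected code is roughly the following.  This is a simplified C sketch of the broadcast loop described above, not the actual VASP Fortran source; the variable names and the length of 128 are only illustrative.

/* Sketch of the communication pattern described above (illustrative only).
 * Each rank in turn broadcasts the length of its local vector, then the
 * vector itself.  A wrong length received on some ranks desynchronizes
 * the following MPI_Bcast and hangs the program, as observed. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mylen = 128;   /* hypothetical local vector length */
    double *myvec = malloc(mylen * sizeof(double));
    for (int i = 0; i < mylen; i++) myvec[i] = rank;

    for (int root = 0; root < size; root++) {
        int len = (rank == root) ? mylen : 0;

        /* Broadcast the vector length from the current root.  This is the
         * step that reportedly delivers 128 on the first node but 2 on the
         * other nodes when the root is rank 5. */
        MPI_Bcast(&len, 1, MPI_INT, root, MPI_COMM_WORLD);
        printf("rank %d: root %d announced length %d\n", rank, root, len);

        /* Broadcast the vector itself using the length just received.  If
         * ranks disagree on len, the counts mismatch and the program hangs
         * here. */
        double *buf = (rank == root) ? myvec : malloc(len * sizeof(double));
        MPI_Bcast(buf, len, MPI_DOUBLE, root, MPI_COMM_WORLD);
        if (rank != root) free(buf);
    }

    free(myvec);
    MPI_Finalize();
    return 0;
}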

Here's our version information.

[r0111]$ mpiname -a
MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail

Compilation
CC: icc    -DNDEBUG -DNVALGRIND -g -O2
CXX: icpc   -DNDEBUG -DNVALGRIND -g -O2
F77: ifort -L/lib -L/lib   -g -O2
FC: ifort   -g -O2

Configuration
--prefix=/usr/local/mvapich2/intel/15/2.1rc1-debug --enable-shared --with-mpe --enable-romio --with-file-system=ufs+nfs --enable-debuginfo --enable-g=dbg --enable-mpit-pvars=mv2

Any suggestions?

Judy

--
Judith D. Gardiner, Ph.D.
Ohio Supercomputer Center
614-292-9623
judithg at osc.edu
