[mvapich-discuss] MPI_BCAST appears to work incorrectly in vasp code - mvapich2 2.1rc1

Gardiner, Judith judithg at osc.edu
Thu Jan 15 15:36:52 EST 2015


That fixed it!  Is this the solution or just a data point for you?

Judy

From: hari.subramoni at gmail.com On Behalf Of Hari Subramoni
Sent: Thursday, January 15, 2015 3:20 PM
To: Gardiner, Judith
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] MPI_BCAST appears to work incorrectly in vasp code - mvapich2 2.1rc1

Hello Judith,
Thanks for the report. We've not seen any data validation issues like this in our internal testing.
Could you please try with MV2_USE_ZCOPY_BCAST=0 and see if the issue persists?
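
For example, if you launch with mpirun_rsh, the variable can be passed on the command line (the process count, hostfile, and binary below are placeholders for your actual job):

    mpirun_rsh -np 40 -hostfile hosts MV2_USE_ZCOPY_BCAST=0 ./vasp

With the Hydra launcher (mpiexec) you can instead export MV2_USE_ZCOPY_BCAST=0 in the environment before the run.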

Regards,
Hari.

On Thu, Jan 15, 2015 at 2:27 PM, Gardiner, Judith <judithg at osc.edu> wrote:
We are running VASP successfully with mvapich2 1.9, but it fails with 2.1a and 2.1rc1.  The failure only occurs when we use at least 40 processes.  We have 20 cores per node, so I've tried nodes=2:ppn=20, nodes=4:ppn=10, and nodes=8:ppn=5; it fails in all three configurations.  The problem is repeatable.

I've narrowed it down to a particular call to MPI_BCAST.  Rank 5 broadcasts a single integer value.  All ranks running on the first node receive the correct value (128); all ranks running on other nodes receive an incorrect value (2), including the root (rank 5) itself when it is not on the first node.  The return code is 0 on all ranks.

The program loops over the ranks, with each rank in turn broadcasting a vector length and then the vector itself.  The failure occurs when rank 5 broadcasts its vector length.  The program then hangs on the next broadcast because the receive counts no longer match.
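
The pattern looks roughly like this (a minimal C sketch for illustration only; VASP itself is Fortran, and the names and lengths here are invented):

    #include <mpi.h>
    #include <stdio.h>

    #define MAXLEN 1024

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double vec[MAXLEN];
        for (int root = 0; root < nprocs; root++) {
            /* each rank in turn broadcasts its vector length ... */
            int len = (rank == root) ? 128 : -1;  /* 128 stands in for the real length */
            MPI_Bcast(&len, 1, MPI_INT, root, MPI_COMM_WORLD);

            /* ... and then the vector itself; if any rank received the
               wrong length above, the counts here disagree across ranks
               and the broadcast hangs, which is what we observe */
            if (rank == root)
                for (int i = 0; i < len; i++)
                    vec[i] = (double)i;
            MPI_Bcast(vec, len, MPI_DOUBLE, root, MPI_COMM_WORLD);

            if (rank == 0)
                printf("root %d: len = %d\n", root, len);
        }

        MPI_Finalize();
        return 0;
    }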

I was unable to reproduce the problem in a toy program.

Here's our version information.

[r0111]$ mpiname -a
MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail

Compilation
CC: icc    -DNDEBUG -DNVALGRIND -g -O2
CXX: icpc   -DNDEBUG -DNVALGRIND -g -O2
F77: ifort -L/lib -L/lib   -g -O2
FC: ifort   -g -O2

Configuration
--prefix=/usr/local/mvapich2/intel/15/2.1rc1-debug --enable-shared --with-mpe --enable-romio --with-file-system=ufs+nfs --enable-debuginfo --enable-g=dbg --enable-mpit-pvars=mv2

Any suggestions?

Judy

--
Judith D. Gardiner, Ph.D.
Ohio Supercomputer Center
614-292-9623
judithg at osc.edu

