[mvapich-discuss] MPI_BCAST appears to work incorrectly in vasp code - mvapich2 2.1rc1

Hari Subramoni subramoni.1 at osu.edu
Thu Jan 15 16:47:59 EST 2015


Hello Judith,

Good to know that it fixed the issue.

Unfortunately, this is not the solution, only a data point that helps us find
the fix. For performance reasons, we do not recommend leaving the feature
disabled unless it is actually causing problems.

Is it possible for you to share a reproducer so that we can debug it
locally?

We will try to come up with a more complete solution for this soon.

Regards,
Hari.

On Thu, Jan 15, 2015 at 3:36 PM, Gardiner, Judith <judithg at osc.edu> wrote:

>  That fixed it!  Is this the solution or just a data point for you?
>
>
>
> Judy
>
>
>
> From: Hari Subramoni <hari.subramoni at gmail.com>
> Sent: Thursday, January 15, 2015 3:20 PM
> To: Gardiner, Judith
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] MPI_BCAST appears to work incorrectly in vasp code - mvapich2 2.1rc1
>
>
>
> Hello Judith,
>
> Thanks for the report. We've not seen any data validation issues like this
> in our internal testing.
>
> Could you please try with MV2_USE_ZCOPY_BCAST=0 and see if the issue
> persists?
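>
> (Depending on how the job is launched, this can be set either by exporting
> MV2_USE_ZCOPY_BCAST=0 in the job script before the mpiexec line, or, with
> mpirun_rsh, by giving MV2_USE_ZCOPY_BCAST=0 on the command line just before
> the executable name.)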
>
>
>
> Regards,
> Hari.
>
>
>
> On Thu, Jan 15, 2015 at 2:27 PM, Gardiner, Judith <judithg at osc.edu> wrote:
>
> We are running VASP successfully with mvapich2 1.9, but it fails with 2.1a
> and 2.1rc1.  It only happens when we use at least 40 processes.  We have 20
> cores per node, so I've tried it with nodes=2:ppn=20, nodes=4:ppn=10, and
> nodes=8:ppn=5.  It fails on all of them.  The problem is repeatable.
>
>
>
> I've narrowed it down to a particular call to MPI_BCAST.  Rank 5 is
> broadcasting a single integer value.  The correct value (128) is received
> by all ranks running on the first node.  An incorrect value (2) is received
> by all ranks running on the other nodes, including the root (rank 5) itself
> when it is not on the first node.  The return code is 0 on all ranks.
>
>
>
> The program loops through the ranks, with each rank broadcasting a vector
> length and then a vector.  The failure occurs when rank 5 broadcasts its
> vector length.  The program hangs on the next broadcast because of the
> incorrect lengths.
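>
> In outline, the pattern looks something like the sketch below (a simplified
> illustration only, not the actual VASP source, which is Fortran; the names
> and sizes here are made up):
>
> /* Each rank in turn broadcasts the length of its vector, then the
>  * vector itself.  The failure is in the first, single-integer bcast. */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     for (int root = 0; root < size; root++) {
>         int len = (rank == root) ? 128 : -1;  /* 128 is the value that arrived as 2 */
>
>         /* Broadcast of a single integer: this is where the wrong value shows up. */
>         MPI_Bcast(&len, 1, MPI_INT, root, MPI_COMM_WORLD);
>
>         double *vec = malloc(len * sizeof(double));
>         if (rank == root)
>             for (int i = 0; i < len; i++)
>                 vec[i] = (double) i;
>
>         /* With mismatched lengths this broadcast hangs. */
>         MPI_Bcast(vec, len, MPI_DOUBLE, root, MPI_COMM_WORLD);
>         free(vec);
>     }
>
>     if (rank == 0)
>         printf("all broadcasts completed\n");
>     MPI_Finalize();
>     return 0;
> }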
>
>
>
> I was unable to reproduce the problem in a toy program.
>
>
>
> Here's our version information.
>
>
>
> [r0111]$ mpiname -a
>
> MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
>
>
>
> Compilation
>
> CC: icc    -DNDEBUG -DNVALGRIND -g -O2
>
> CXX: icpc   -DNDEBUG -DNVALGRIND -g -O2
>
> F77: ifort -L/lib -L/lib   -g -O2
>
> FC: ifort   -g -O2
>
>
>
> Configuration
>
> --prefix=/usr/local/mvapich2/intel/15/2.1rc1-debug --enable-shared
> --with-mpe --enable-romio --with-file-system=ufs+nfs --enable-debuginfo
> --enable-g=dbg --enable-mpit-pvars=mv2
>
>
>
> Any suggestions?
>
>
>
> Judy
>
>
>
> --
>
> Judith D. Gardiner, Ph.D.
>
> Ohio Supercomputer Center
>
> 614-292-9623
>
> judithg at osc.edu