[mvapich-discuss] MPI_BCAST appears to work incorrectly in vasp code - mvapich2 2.1rc1
Hari Subramoni
subramoni.1 at osu.edu
Thu Jan 15 16:47:59 EST 2015
Hello Judith,
Good to know that it fixed the issue.
Unfortunately, this is not the solution, just a data point to help us find
the fix. For performance reasons, we do not recommend disabling the feature
unless it is leading to issues.
Is it possible for you to share a reproducer so that we can debug it
locally?
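Something along the lines of the sketch below would be a good starting
point. It mirrors the pattern you describe (each rank in turn broadcasting
a vector length, then the vector itself); the lengths, root, and datatype
are hypothetical stand-ins, not VASP's actual call sequence.

/*
 * bcast_repro.c -- minimal sketch of the reported pattern.
 * Hypothetical sizes; compile with mpicc and run on 40+ processes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int root = 0; root < size; root++) {
        /* The root knows the length; the others learn it here. This is
           the call that reportedly delivers 2 instead of 128 to ranks
           on remote nodes when the root is rank 5. */
        int len = (rank == root) ? 128 : -1;
        MPI_Bcast(&len, 1, MPI_INT, root, MPI_COMM_WORLD);

        double *vec = malloc(len * sizeof(double));
        if (rank == root)
            for (int i = 0; i < len; i++)
                vec[i] = (double) i;

        /* A length mismatch between ranks here would explain the hang
           on the broadcast that follows the bad length. */
        MPI_Bcast(vec, len, MPI_DOUBLE, root, MPI_COMM_WORLD);
        free(vec);
    }

    if (rank == 0)
        printf("all broadcasts completed\n");

    MPI_Finalize();
    return 0;
}

Since your toy program did not trigger the problem, a cut-down version of
the actual VASP call path may be what is needed.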
We will try to come up with a more complete solution for this soon.
Regards,
Hari.
On Thu, Jan 15, 2015 at 3:36 PM, Gardiner, Judith <judithg at osc.edu> wrote:
> That fixed it! Is this the solution or just a data point for you?
>
> Judy
>
> From: hari.subramoni at gmail.com [mailto:hari.subramoni at gmail.com]
> On Behalf Of Hari Subramoni
> Sent: Thursday, January 15, 2015 3:20 PM
> To: Gardiner, Judith
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] MPI_BCAST appears to work incorrectly in
> vasp code - mvapich2 2.1rc1
>
> Hello Judith,
>
> Thanks for the report. We've not seen any data validation issues like this
> in our internal testing.
>
> Could you please try with MV2_USE_ZCOPY_BCAST=0 and see if the issue
> persists?
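>
> For example, when launching with mpirun_rsh the variable can be passed
> on the command line (the hostfile name here is just a placeholder):
>
>     mpirun_rsh -np 40 -hostfile hosts MV2_USE_ZCOPY_BCAST=0 ./vasp
>
> With the hydra launcher (mpiexec), exporting the variable in the job
> environment before launch should have the same effect.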
>
> Regards,
> Hari.
>
> On Thu, Jan 15, 2015 at 2:27 PM, Gardiner, Judith <judithg at osc.edu> wrote:
>
> We are running VASP successfully with mvapich2 1.9, but it fails with 2.1a
> and 2.1rc1. It only happens when we use at least 40 processes. We have 20
> cores per node, so I've tried it with nodes=2:ppn=20, nodes=4:ppn=10, and
> nodes=8:ppn=5. It fails on all of them. The problem is repeatable.
>
> I've narrowed it down to a particular call to MPI_BCAST. Rank 5 is
> broadcasting a single integer value. The correct value (128) is received
> by all ranks running on the first node. An incorrect value (2) is received
> by all ranks running on other nodes, including the root (rank 5) when it
> is not on the first node. The return code is 0 on all ranks.
>
> The program loops through the ranks, with each rank broadcasting a vector
> length and then a vector. The failure occurs when rank 5 broadcasts its
> vector length. The program then hangs on the next broadcast because of the
> mismatched lengths.
>
> I was unable to reproduce the problem in a toy program.
>
> Here's our version information.
>
> [r0111]$ mpiname -a
>
> MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
>
> Compilation
>
> CC: icc -DNDEBUG -DNVALGRIND -g -O2
> CXX: icpc -DNDEBUG -DNVALGRIND -g -O2
> F77: ifort -L/lib -L/lib -g -O2
> FC: ifort -g -O2
>
> Configuration
>
> --prefix=/usr/local/mvapich2/intel/15/2.1rc1-debug --enable-shared
> --with-mpe --enable-romio --with-file-system=ufs+nfs --enable-debuginfo
> --enable-g=dbg --enable-mpit-pvars=mv2
>
> Any suggestions?
>
> Judy
>
> --
> Judith D. Gardiner, Ph.D.
> Ohio Supercomputer Center
> 614-292-9623
> judithg at osc.edu
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss