[mvapich-discuss] Different results from different workers on MPI_Allreduce

Javier Delgado - NOAA Affiliate javier.delgado at noaa.gov
Thu Oct 22 17:48:38 EDT 2015


Hi all,

After compiling and running with v15.0.3 of the Intel compiler and v2.1 of
MVAPICH2, I no longer have this problem. The model fails shortly afterward,
but I'll need to dig deeper to find the cause.

Thanks for all the help!

-Javier

On Tue, Oct 20, 2015 at 9:34 PM, Panda, Dhabaleswar <
panda at cse.ohio-state.edu> wrote:

> Hi Javier,
>
> You are using a very old version of MVAPICH2. Version 1.8 was released in
> April 2012, more than 3.5 years ago. The latest GA version is 2.1, the
> latest release is 2.2a, and 2.2b will be coming out soon.
>
> Many new features (including conformance to the latest MPI standard),
> performance enhancements, and bug fixes go into every new release.
>
> I would suggest you upgrade to the latest 2.1 GA version. Otherwise, it
> will be very hard to provide support for such an old version.
>
> Thanks,
>
> DK
>
>
>
>
>
> ________________________________
> From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Javier
> Delgado - NOAA Affiliate [javier.delgado at noaa.gov]
> Sent: Tuesday, October 20, 2015 9:20 PM
> To: Subramoni, Hari
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Different results from different workers on
> MPI_Allreduce
>
> Hi Hari,
>
> Here is the output of "mpiname -a":
>
> MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
>
> Compilation
> CC: icc    -DNDEBUG -DNVALGRIND -O2
> CXX: icpc   -DNDEBUG -DNVALGRIND -O2
> F77: ifort   -O2 -L/usr/lib64
> FC: ifort   -O2
>
> Configuration
> CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/apps/mvapich2/1.8-r5609-intel
> --with-rdma=gen2 --with-ib-libpath=/usr/lib64 --enable-romio=yes
> --with-file-system=lustre+panfs --enable-shared
>
>
> I am using version 12.1.4 of the Intel compiler. I normally use MVAPICH2
> version 1.8, although I also tried with 1.9 and got the same result. I
> have not tried compiling WRF with 1.9 and rerunning, but I can try that
> next. Version 1.9 is the newest available on the system.
>
> The CPU type is an Intel Xeon E5-2650 v2 @ 2.60GHz.
> The interconnect is QDR InfiniBand. Does that answer your question
> about HCA type, or did you need something else?
>
> Please let me know if you have any other questions.
>
>
> Thanks,
> Javier
>
>
> On Tue, Oct 20, 2015 at 8:38 PM, Hari Subramoni <subramoni.1 at osu.edu>
> wrote:
> Hello Javier,
>
> This is a little surprising. Could you please send us the output of
> mpiname -a and the version of Intel compilers you're using? Could you also
> let us know the CPU and HCA type of the system you're running on? Are you
> trying with the latest release of MVAPICH2? If not, can you please try with
> that?
>
> Thx,
> Hari.
>
> On Tue, Oct 20, 2015 at 7:17 PM, Javier Delgado - NOAA Affiliate <
> javier.delgado at noaa.gov> wrote:
> Hi all,
>
> I am running a program that performs an MPI_Allreduce with the MPI_MAXLOC
> operation to determine a global maximum value and rank for 3
> variables by passing in a 6-element array wherein the odd-numbered indices
> contain the values and the even-numbered indices the rank. When run with
> 180 workers, 175 of them produce one value for the maxloc index, 4 produce
> another, and 1 produces yet another (i.e. I have three unique results put
> into recvbuf among all the workers). This results in the application later
> hanging since some workers are expecting the corresponding global maximum
> value to be broadcast from a different rank (e.g. task N determines that
> task X contains the maximum, so it waits for a broadcast from task X, which
> never arrives because task X determines that task Z contains the maximum).
> One thing worth noting is that one of the tasks calculates NaN as the
> global max (and itself as the worker containing it), which is odd since my
> understanding is that NaNs should be ignored in MINVAL/MAXVAL as long as
> not all elements of the array are NaN.
>
> My question is: is this indicative of an issue in MVAPICH2, the (Intel)
> compiler, or the program itself?
> If NaNs are not ignored by MPI_MAXLOC, I guess the code would need to be
> modified to deal with this.
> This is occurring with a WRF model run, so it is difficult to provide a
> simple case that reproduces the problem. Here is an excerpt of the code in
> question:
>
> call MPI_Comm_rank(local_communicator,myrank,ierr)
> ! pack (value, rank) pairs: odd indices hold values, even indices the rank
> comm(1)=have_cen
> comm(2)=myrank
> comm(3)=-mingbl_mslp   ! scalar; negated so MAXLOC finds the minimum MSLP
> comm(4)=myrank
> comm(5)=maxgbl_wind
> comm(6)=myrank
> call MPI_Allreduce(comm,reduced,3,MPI_2REAL,MPI_MAXLOC,local_communicator,ierr)
> mingbl_mslp=-reduced(3)
> grank=reduced(4)        ! rank that owns the global minimum MSLP
> if(myrank==grank) then
>        bcast=(/ plat,plon,real(imslp),real(jmslp) /)
> endif
> call MPI_Bcast(bcast,4,MPI_REAL,grank,local_communicator,ierr)
> if(myrank/=grank) then
>        plat=bcast(1)
>        plon=bcast(2)
>        imslp=bcast(3)
>        jmslp=bcast(4)
> endif
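>
> In case it helps, here is a minimal standalone sketch of the same
> MPI_2REAL/MPI_MAXLOC pattern, with a NaN guard added before the reduction.
> The per-rank values are made up for illustration; this is not the actual
> WRF code, just the shape of the exchange:
>
> program maxloc_sketch
>   use mpi
>   use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
>   implicit none
>   integer :: ierr, myrank, grank
>   real    :: have_cen, mingbl_mslp, maxgbl_wind
>   real    :: comm(6), reduced(6), bcast(4)
>
>   call MPI_Init(ierr)
>   call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
>
>   ! made-up per-rank values standing in for the model state
>   have_cen    = real(mod(myrank, 2))
>   mingbl_mslp = 101300.0 - real(myrank)
>   maxgbl_wind = 10.0 + real(myrank)
>
>   ! guard: a NaN on any rank can poison the MAXLOC comparison, so replace
>   ! it with a sentinel that can never win the reduction
>   if (ieee_is_nan(maxgbl_wind)) maxgbl_wind = -huge(maxgbl_wind)
>   if (ieee_is_nan(mingbl_mslp)) mingbl_mslp =  huge(mingbl_mslp)
>
>   ! pack (value, rank) pairs; MSLP is negated so MAXLOC finds the minimum
>   comm(1) = have_cen;      comm(2) = real(myrank)
>   comm(3) = -mingbl_mslp;  comm(4) = real(myrank)
>   comm(5) = maxgbl_wind;   comm(6) = real(myrank)
>
>   call MPI_Allreduce(comm, reduced, 3, MPI_2REAL, MPI_MAXLOC, &
>                      MPI_COMM_WORLD, ierr)
>
>   mingbl_mslp = -reduced(3)
>   grank = nint(reduced(4))   ! every rank must agree on this root...
>
>   bcast = 0.0
>   if (myrank == grank) bcast = (/ 1.0, 2.0, 3.0, 4.0 /)  ! placeholder payload
>   ! ...otherwise this broadcast hangs, which is the symptom I am seeing
>   call MPI_Bcast(bcast, 4, MPI_REAL, grank, MPI_COMM_WORLD, ierr)
>
>   call MPI_Finalize(ierr)
> end program maxloc_sketch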
>
>
>
> Thanks much,
> Javier
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>