[mvapich-discuss] NaNs from non-blocking comms

Dan Kokron daniel.kokron at nasa.gov
Thu Apr 14 12:59:26 EDT 2011


Yes, resolved in user code.

Thank you.
Dan

On Thu, 2011-04-14 at 11:52 -0500, Sayantan Sur wrote:
> Just closing this issue out on the list.
> 
> We exchanged email, and it turns out that other MPI stacks also show
> this problem. The root cause is a minor bug in the example code, which
> assumes temporary buffering: MPI_Isend does not guarantee that the MPI
> library has buffered the message or sent it out before the call returns.
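> 
> A minimal sketch of the pattern at issue (illustrative only, not the
> attached code): the send buffer must not be reused until the Isend has
> been completed with a Wait.
> 
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank, size;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>         double sendbuf = (double) rank;
>         double recvbuf = 0.0;
>         MPI_Request req;
>         int dest = (rank + 1) % size;
>         int src  = (rank + size - 1) % size;
> 
>         MPI_Isend(&sendbuf, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
>         MPI_Recv(&recvbuf, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
> 
>         /* Overwriting sendbuf here would assume the library buffered the
>            message; complete the send first, then reuse the buffer. */
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>         sendbuf = -1.0;   /* safe to modify only after the Wait */
> 
>         MPI_Finalize();
>         return 0;
>     }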
> 
> Thanks.
> 
> On Thu, Apr 7, 2011 at 6:09 PM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
> > Sayantan,
> >
> > Hope the workshop talk went well.
> >
> > Some more data points: mpich2-1.2.1p1 and the latest available MPICH2
> > trunk (r8363) don't give NaNs when configured as follows.
> >
> > http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.2.1p1/mpich2-1.2.1p1.tar.gz
> >
> > ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-fpic
> > CXXFLAGS=-fpic FFLAGS=-fpic FCFLAGS=-fpic
> > --prefix=/discover/nobackup/projects/gmao/share/dkokron/play/MPICH2/mpich2-1.2.1p1/install --enable-f77 --enable-f90 --enable-cxx --enable-romio --enable-smpcoll --without-mpe
> >
> > http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/mpich2-trunk-r8363.tar.gz
> >
> > ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-fpic
> > CXXFLAGS=-fpic FFLAGS=-fpic FCFLAGS=-fpic --prefix=$PWD/install
> > --enable-f77 --enable-fc --enable-cxx --enable-romio --with-pm=hydra
> > --enable-smpcoll --without-mpe
> >
> > Dan
> >
> > On Tue, 2011-04-05 at 15:32 -0500, Sayantan Sur wrote:
> >> Hi Dan,
> >>
> >> Thanks for the updated code. I will ask someone on our end to run it
> >> to see if we can reproduce this. I am at the OpenFabrics workshop, and
> >> our talk is coming up soon.
> >>
> >> Thanks again.
> >>
> >> On Tue, Apr 5, 2011 at 12:50 PM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
> >> > Updated code is attached.  I think I put the Wait in the proper place,
> >> > but I'm still getting NaNs.
> >> >
> >> > mpiexec.hydra -prepend-rank -launcher-exec /usr/bin/sshmpi -np 72 ./a.out
> >> > [3]  NaN found           13          10         660
> >> > [69]  NaN found           13           9         588
> >> >
> >> > Dan
> >> >
> >> > On Tue, 2011-04-05 at 14:11 -0500, Sayantan Sur wrote:
> >> >> Hi Dan,
> >> >>
> >> >> Thanks for posting this example. I took a quick look, and I think
> >> >> there is a bug in the application code: the MPI standard requires that
> >> >> all non-blocking communications be (locally) completed before calling
> >> >> MPI_Finalize, and MPI_Barrier doesn't guarantee this. Let me know if
> >> >> you think I am mistaken.
> >> >>
> >> >> http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
> >> >>
> >> >> Page 291, line 36
> >> >>
> >> >> "This routine cleans up all MPI state. Each process must call
> >> >> MPI_FINALIZE before
> >> >> it exits. Unless there has been a call to MPI_ABORT, each process must
> >> >> ensure that all
> >> >> pending nonblocking communications are (locally) complete before
> >> >> calling MPI_FINALIZE."
> >> >>
> >> >> Can you try inserting MPI_Wait / MPI_Waitall in your example to see if
> >> >> this works?
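> >> >> 
> >> >> A minimal sketch of that suggestion (illustrative only, not the
> >> >> attached code): complete every outstanding request with MPI_Waitall
> >> >> before MPI_Finalize; a Barrier alone does not complete nonblocking
> >> >> operations.
> >> >> 
> >> >>     #include <mpi.h>
> >> >> 
> >> >>     int main(int argc, char **argv)
> >> >>     {
> >> >>         MPI_Init(&argc, &argv);
> >> >>         int rank, size;
> >> >>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >> >>         MPI_Comm_size(MPI_COMM_WORLD, &size);
> >> >> 
> >> >>         double out = (double) rank, in = 0.0;
> >> >>         MPI_Request reqs[2];
> >> >> 
> >> >>         MPI_Irecv(&in, 1, MPI_DOUBLE, (rank + size - 1) % size, 0,
> >> >>                   MPI_COMM_WORLD, &reqs[0]);
> >> >>         MPI_Isend(&out, 1, MPI_DOUBLE, (rank + 1) % size, 0,
> >> >>                   MPI_COMM_WORLD, &reqs[1]);
> >> >> 
> >> >>         MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* locally complete */
> >> >>         MPI_Barrier(MPI_COMM_WORLD);   /* not a substitute for Waitall */
> >> >>         MPI_Finalize();
> >> >>         return 0;
> >> >>     }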
> >> >>
> >> >> Thanks!
> >> >>
> >> >> On Tue, Apr 5, 2011 at 10:59 AM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
> >> >> > Using mvapich2-1.6 configured and built under x86_64 Linux with
> >> >> >
> >> >> > Intel-11.0.083 suite of compilers
> >> >> >
> >> >> > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS=-fpic
> >> >> > CXXFLAGS=-fpic FFLAGS=-fpic F90FLAGS=-fpic
> >> >> > --prefix=/home/dkokron/play/mvapich2-1.6/install/intel --enable-f77
> >> >> > --enable-f90 --enable-cxx --enable-romio --with-hwloc
> >> >> >
> >> >> > The attached example code gives NaNs as output from the MPI_Recv if
> >> >> > MV2_ON_DEMAND_THRESHOLD is set lower than the number of processes
> >> >> > used.
> >> >> >
> >> >> > The example also gives NaNs with IntelMPI-4.0.1.002 if
> >> >> > I_MPI_USE_DYNAMIC_CONNECTIONS=enable is set.
> >> >> >
> >> >> > See the 'commands' file in the tarball for more information.
> >> >> > --
> >> >> > Dan Kokron
> >> >> > Global Modeling and Assimilation Office
> >> >> > NASA Goddard Space Flight Center
> >> >> > Greenbelt, MD 20771
> >> >> > Daniel.S.Kokron at nasa.gov
> >> >> > Phone: (301) 614-5192
> >> >> > Fax:   (301) 614-5304
> >> >> >
> >> >> > _______________________________________________
> >> >> > mvapich-discuss mailing list
> >> >> > mvapich-discuss at cse.ohio-state.edu
> >> >> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> > --
> >> > Dan Kokron
> >> > Global Modeling and Assimilation Office
> >> > NASA Goddard Space Flight Center
> >> > Greenbelt, MD 20771
> >> > Daniel.S.Kokron at nasa.gov
> >> > Phone: (301) 614-5192
> >> > Fax:   (301) 614-5304
> >> >
> >>
> >>
> >>
> > --
> > Dan Kokron
> > Global Modeling and Assimilation Office
> > NASA Goddard Space Flight Center
> > Greenbelt, MD 20771
> > Daniel.S.Kokron at nasa.gov
> > Phone: (301) 614-5192
> > Fax:   (301) 614-5304
> >
> >
> 
> 
> 
-- 
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax:   (301) 614-5304


