[mvapich-discuss] NaNs from non-blocking comms

Sayantan Sur surs at cse.ohio-state.edu
Thu Apr 7 18:25:57 EDT 2011


Thanks, Dan. We have been able to reproduce it and we are taking a
close look at it.
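
In the meantime, for reference, here is a minimal sketch of the
completion pattern suggested further down in this thread: complete every
pending nonblocking request with MPI_Wait / MPI_Waitall before calling
MPI_Finalize. The ring exchange and buffer names below are only
illustrative; they are not taken from the attached reproducer.

/*
 * Minimal sketch (not the attached reproducer): each rank posts a
 * nonblocking send to its right neighbour, blocks on a receive from
 * its left neighbour, and completes the pending send request before
 * MPI_Finalize (MPI-2.2, page 291).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf, recvbuf = 0.0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = (double) rank;
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Isend(&sendbuf, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(&recvbuf, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &status);

    if (recvbuf != recvbuf)          /* NaN check */
        printf("[%d] NaN found\n", rank);

    /* Complete the pending nonblocking send before finalizing;
       an MPI_Barrier alone does not guarantee local completion. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}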

On Thu, Apr 7, 2011 at 10:09 PM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
> Sayantan,
>
> Hope the workshop talk went well.
>
> Some more data points: mpich2-1.2.1p1 and the latest available MPICH2
> (r8363) don't give NaNs when configured as follows.
>
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.2.1p1/mpich2-1.2.1p1.tar.gz
>
> ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-fpic
> CXXFLAGS=-fpic FFLAGS=-fpic FCFLAGS=-fpic
> --prefix=/discover/nobackup/projects/gmao/share/dkokron/play/MPICH2/mpich2-1.2.1p1/install --enable-f77 --enable-f90 --enable-cxx --enable-romio --enable-smpcoll --without-mpe
>
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/mpich2-trunk-r8363.tar.gz
>
> ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-fpic
> CXXFLAGS=-fpic FFLAGS=-fpic FCFLAGS=-fpic --prefix=$PWD/install
> --enable-f77 --enable-fc --enable-cxx --enable-romio --with-pm=hydra
> --enable-smpcoll --without-mpe
>
> Dan
>
> On Tue, 2011-04-05 at 15:32 -0500, Sayantan Sur wrote:
>> Hi Dan,
>>
>> Thanks for the updated code. I will ask someone on our end to run it
>> to see if we can reproduce this. I am at the OpenFabrics workshop,
>> and our talk is coming up soon.
>>
>> Thanks again.
>>
>> On Tue, Apr 5, 2011 at 12:50 PM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
>> > Updated code is attached. I think I put the Wait in the proper place.
>> > Still getting NaNs.
>> >
>> > mpiexec.hydra -prepend-rank -launcher-exec /usr/bin/sshmpi -np 72 ./a.out
>> > [3]  NaN found           13          10         660
>> > [69]  NaN found           13           9         588
>> >
>> > Dan
>> >
>> > On Tue, 2011-04-05 at 14:11 -0500, Sayantan Sur wrote:
>> >> Hi Dan,
>> >>
>> >> Thanks for posting this example. I took a quick look at it, and I
>> >> think there is a bug in the application code. The MPI standard
>> >> requires that all non-blocking communications be (locally) completed
>> >> before calling MPI_Finalize. MPI_Barrier doesn't guarantee this. Let
>> >> me know if you think I am mistaken.
>> >>
>> >> http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
>> >>
>> >> Page 291, line 36
>> >>
>> >> "This routine cleans up all MPI state. Each process must call
>> >> MPI_FINALIZE before
>> >> it exits. Unless there has been a call to MPI_ABORT, each process must
>> >> ensure that all
>> >> pending nonblocking communications are (locally) complete before
>> >> calling MPI_FINALIZE."
>> >>
>> >> Can you try inserting MPI_Wait / MPI_Waitall in your example to see if
>> >> this works?
>> >>
>> >> Thanks!
>> >>
>> >> On Tue, Apr 5, 2011 at 10:59 AM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
>> >> > Using mvapich2-1.6, configured and built under x86_64 Linux with the
>> >> > Intel 11.0.083 compiler suite:
>> >> >
>> >> > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS=-fpic
>> >> > CXXFLAGS=-fpic FFLAGS=-fpic F90FLAGS=-fpic
>> >> > --prefix=/home/dkokron/play/mvapich2-1.6/install/intel --enable-f77
>> >> > --enable-f90 --enable-cxx --enable-romio --with-hwloc
>> >> >
>> >> > The attached example code gives NaNs as output from the MPI_Recv if
>> >> > MV2_ON_DEMAND_THRESHOLD is set to less than the number of processes
>> >> > used.
>> >> >
>> >> > The example also gives NaNs with IntelMPI-4.0.1.002 if
>> >> > I_MPI_USE_DYNAMIC_CONNECTIONS is set to enable.
>> >> >
>> >> > See the 'commands' file in the tarball for more information.
>> >> > --
>> >> > Dan Kokron
>> >> > Global Modeling and Assimilation Office
>> >> > NASA Goddard Space Flight Center
>> >> > Greenbelt, MD 20771
>> >> > Daniel.S.Kokron at nasa.gov
>> >> > Phone: (301) 614-5192
>> >> > Fax:   (301) 614-5304
>> >> >
>> >> > _______________________________________________
>> >> > mvapich-discuss mailing list
>> >> > mvapich-discuss at cse.ohio-state.edu
>> >> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> > --
>> > Dan Kokron
>> > Global Modeling and Assimilation Office
>> > NASA Goddard Space Flight Center
>> > Greenbelt, MD 20771
>> > Daniel.S.Kokron at nasa.gov
>> > Phone: (301) 614-5192
>> > Fax:   (301) 614-5304
>> >
>>
>>
>>
> --
> Dan Kokron
> Global Modeling and Assimilation Office
> NASA Goddard Space Flight Center
> Greenbelt, MD 20771
> Daniel.S.Kokron at nasa.gov
> Phone: (301) 614-5192
> Fax:   (301) 614-5304
>
>



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
http://www.cse.ohio-state.edu/~surs


