[mvapich-discuss] mvapich hang with many sends to root

Hari Subramoni subramoni.1 at osu.edu
Tue Aug 5 12:11:54 EDT 2014


Hello Ben,

Sorry to hear that the code is hanging at scale. There shouldn't be any
internal limit on the number of back-to-back sends.

Have you tried the code with the latest MVAPICH2-2.0GA release? We made
several performance enhancements and bug fixes between 2.0a2 and 2.0GA, so
it is possible that the code will work with the 2.0GA release.

In the meantime, it would be great if you could give us a reproducer that
we can try out locally. Could you also let us know which environment
variables you are setting when you run your application?
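
In case it is useful while you put your tester together: the kind of pattern
you describe below (every rank posting many sends of a 2-D array to rank 0)
can usually be reproduced with a small standalone program along the lines of
the sketch below. The array size, message count, and program name here are
placeholders, not values taken from GEOS-5 or from your tester.

  program many_sends_to_root
    ! Sketch of the pattern described below: every non-root rank posts a
    ! large number of non-blocking sends of a 2-D array to rank 0, which
    ! receives them one by one. Sizes and counts are placeholders.
    use mpi
    implicit none
    integer, parameter :: nx = 360, ny = 180   ! placeholder array size
    integer, parameter :: nmsg = 1000          ! placeholder sends per rank
    real, allocatable :: sendbuf(:,:), recvbuf(:,:)
    integer :: req(nmsg)
    integer :: ierr, rank, nproc, i

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

    allocate(sendbuf(nx,ny), recvbuf(nx,ny))
    sendbuf = real(rank)

    if (rank == 0) then
       ! Root drains nmsg messages from every other rank.
       do i = 1, nmsg*(nproc-1)
          call MPI_Recv(recvbuf, nx*ny, MPI_REAL, MPI_ANY_SOURCE, 0, &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
       end do
       print *, 'root received all messages'
    else
       ! Every other rank posts nmsg non-blocking sends to root, then waits.
       do i = 1, nmsg
          call MPI_Isend(sendbuf, nx*ny, MPI_REAL, 0, 0, &
                         MPI_COMM_WORLD, req(i), ierr)
       end do
       call MPI_Waitall(nmsg, req, MPI_STATUSES_IGNORE, ierr)
    end if

    call MPI_Finalize(ierr)
  end program many_sends_to_root

If your tester is structured similarly, please send it along as-is so that we
can run exactly what you are running.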

Regards,
Hari.


On Tue, Aug 5, 2014 at 11:07 AM, Ben <Benjamin.M.Auer at nasa.gov> wrote:

> We recently found that some code in the GEOS-5 model hangs when using
> mvapich. The code in question is essentially our own version of a
> non-blocking gather built from sends, receives, and waits. We take a number
> of distributed 2-D arrays, gather each array to a particular processor, do
> some work on it, then send the processed, now global-size array to a "root"
> process where it is written. Most of the time each 2-D array is sent to a
> different rank for this work, but recently the code was run with every
> array being gathered to root, because the target ranks were never set
> otherwise. Obviously this is not efficient, but I'm just wondering why it
> hangs. The code works up to a certain number of processors and then simply
> hangs: for example, the same code works on 216 processors but hangs on 864.
> If we fix it so that the arrays are gathered across all available
> processors in some smarter fashion, the code runs fine. Is there a limit to
> the number of sends that can go to one process? Are we simply overwhelming
> mvapich, or some internal limit, with over 800,000 sends all going to root
> in the 864-processor case I was looking at?
>
> I have a tester that duplicates our gather process and exhibits the same
> behaviour; so far I have seen it with both mvapich 1.8.1 and mvapich 2.0a2,
> using Intel Fortran 13.
>
> mpiname -a shows:
>
> MVAPICH2 2.0a Fri Aug 23 13:38:52 EDT 2013 ch3:mrail
>
> Compilation
> CC: icc -fpic -m64   -DNDEBUG -DNVALGRIND -O2
> CXX: icpc -fpic -m64  -DNDEBUG -DNVALGRIND -O2
> F77: ifort -L/lib -L/lib -m64 -fpic  -O2
> FC: ifort -m64 -fpic  -O2
>
> Configuration
> --with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort
> FC=ifort CFLAGS=-fpic -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic
> FCFLAGS=-m64 -fpic --enable-f77 --enable-fc --enable-cxx --enable-romio
> --enable-threads=default --with-hwloc -disable-multi-aliases
> -enable-xrc=yes -enable-hybrid --prefix=/usr/local/other/
> SLES11.1/mvapich2/2.0a/intel-13.1.3.192
>
> --
> Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
> NASA GSFC,  Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
> Phone: 301-286-9176               Fax: 301-614-6246
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

