[mvapich-discuss] mvapich hang with many sends to root
Ben
Benjamin.M.Auer at nasa.gov
Tue Aug 5 11:07:25 EDT 2014
We recently found that some code in the GEOS-5 model hangs when using
mvapich. The code in question is essentially our own version of a
non-blocking gather built from sends, receives, and waits. We take a number
of distributed 2-D arrays, gather each array to a particular
processor, do some work on it, then send the processed array, now of
global size, to a "root" process where it is written. Most of the time
each 2-D array is sent to a different rank to do this work, but recently
the code was used with every array being gathered to root
because the target rank was never set otherwise. Obviously this is not
efficient, but I'm just wondering why it hangs. The code works up to a
certain number of processors and beyond that it just hangs. For example,
the same code works on 216 processors but hangs on 864. If I fix it so
that the arrays get gathered across all available processors in some smart
fashion, the code runs fine. Is there a limit to the number of sends that
can go to one process? Are we simply overwhelming mvapich, or hitting some
internal limit, with over 800,000 sends all going to root in the
864-processor case I was looking at?
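
To put rough numbers on the two layouts, here is a small counting sketch in Python. It is not the GEOS-5 code and does not call MPI; it only tallies how many point-to-point messages each rank would receive. It assumes every other rank sends one piece of each array to the gathering rank, and that the gatherer then forwards the assembled array to rank 0. The array count of 927 is a made-up value chosen only so that the total lands just above 800,000 on 864 ranks, and the names incoming_messages and gather_target are hypothetical.

```python
from collections import Counter

def incoming_messages(n_ranks, n_arrays, gather_target):
    """Count point-to-point messages received by each rank.

    gather_target(i) returns the rank that gathers array i (hypothetical
    helper). Assumed pattern: every other rank sends its piece of array i
    to that rank; the gatherer then sends the assembled array to rank 0
    (the writer) unless it already is rank 0.
    """
    received = Counter()
    for i in range(n_arrays):
        target = gather_target(i)
        received[target] += n_ranks - 1   # one piece from every other rank
        if target != 0:
            received[0] += 1              # assembled array forwarded to root
    return received

# n_ranks is from the post; n_arrays is an assumed value, not from the post.
n_ranks, n_arrays = 864, 927

# Every array gathered on rank 0 (the case that hangs).
all_to_root = incoming_messages(n_ranks, n_arrays, lambda i: 0)

# Arrays spread over the ranks round-robin (one possible "smart" layout).
round_robin = incoming_messages(n_ranks, n_arrays, lambda i: i % n_ranks)

print("all to root, messages into rank 0:", all_to_root[0])
print("round robin, busiest rank receives:", max(round_robin.values()))
```

Under these assumptions the round-robin layout cuts the load on the busiest rank by more than two orders of magnitude. That fits the observation that spreading the gathers makes the hang go away, but it does not by itself show which internal limit, if any, is being hit.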
I have a tester that duplicates our gather process, and it exhibits the
same behaviour with both mvapich 1.8.1 and mvapich 2.0a2 so far, using
Intel Fortran 13.
mpiname -a shows:
MVAPICH2 2.0a Fri Aug 23 13:38:52 EDT 2013 ch3:mrail
Compilation
CC: icc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
CXX: icpc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
F77: ifort -L/lib -L/lib -m64 -fpic -O2
FC: ifort -m64 -fpic -O2
Configuration
--with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort
FC=ifort CFLAGS=-fpic -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic
FCFLAGS=-m64 -fpic --enable-f77 --enable-fc --enable-cxx --enable-romio
--enable-threads=default --with-hwloc -disable-multi-aliases
-enable-xrc=yes -enable-hybrid
--prefix=/usr/local/other/SLES11.1/mvapich2/2.0a/intel-13.1.3.192
--
Ben Auer, PhD SSAI, Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-286-9176 Fax: 301-614-6246