[mvapich-discuss] mvapich hang with many sends to root

Ben Benjamin.M.Auer at nasa.gov
Tue Aug 5 11:07:25 EDT 2014


We recently found that some code in the GEOS-5 model hangs when using 
mvapich. The code in question is essentially our own version of a 
non-blocking gather using sends, receives, and waits. We take a number 
of distributed 2-D arrays, gather each array to a particular 
processor, do some work on it, then send the now-processed array of the 
global size to a "root" process where it is written. Most of the time 
each 2-D array is sent to a different rank to do this work, but recently 
the code in question was used with every array being gathered to root 
because the gather rank was never set otherwise. Obviously this is not 
efficient, but I'm just wondering why it hangs. The code works, but beyond a 
certain number of processors it just hangs. For example, the same code 
works on 216 processors but hangs on 864. After fixing it so that the 
arrays get gathered across all available processors in some smart 
fashion, the code runs fine. Is there a limit to the number of sends that 
can go to one process? Are we simply overwhelming mvapich, or hitting some 
internal limit, with over 800,000 sends all going to root in the 
864-processor case I was looking at?
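
To make the difference in load concrete, here is a rough Python sketch 
that only counts messages per destination rank. It is not the GEOS-5 
code or the tester, and it does no MPI at all. The function names are 
made up for illustration, NUM_ARRAYS = 930 is a hypothetical value 
chosen only so the total lands just above the 800,000 figure, and 
round-robin is just one possible "smart" assignment, not necessarily the 
one used in the fix. It also assumes the gather rank copies its own 
piece locally and ignores the second-stage send of the processed array 
to root.

```python
from collections import Counter


def incoming_messages(num_ranks, num_arrays, pick_gather_rank):
    """Count the messages each rank receives when every rank sends its
    local piece of every array to that array's gather rank."""
    received = Counter()
    for array_index in range(num_arrays):
        dest = pick_gather_rank(array_index, num_ranks)
        # Assumption: the gather rank copies its own piece locally,
        # so it receives one message from each of the other ranks.
        received[dest] += num_ranks - 1
    return received


def all_to_root(array_index, num_ranks):
    # The accidental configuration: every array is gathered on rank 0.
    return 0


def round_robin(array_index, num_ranks):
    # One simple way to spread the gather ranks (illustrative only).
    return array_index % num_ranks


NUM_RANKS = 864   # processor count from the hanging case
NUM_ARRAYS = 930  # hypothetical; not taken from the actual run

hot = incoming_messages(NUM_RANKS, NUM_ARRAYS, all_to_root)
spread = incoming_messages(NUM_RANKS, NUM_ARRAYS, round_robin)

print(max(hot.values()))     # 802590 messages, all landing on rank 0
print(max(spread.values()))  # 1726 messages on the busiest rank
```

The total number of messages is the same in both cases; only the 
number landing on any single rank changes.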

I have a tester that duplicates our gather process and exhibits the 
same behaviour, so far with both mvapich 1.8.1 and mvapich 2.0a2, using 
Intel Fortran 13.

mpiname -a shows:

MVAPICH2 2.0a Fri Aug 23 13:38:52 EDT 2013 ch3:mrail

Compilation
CC: icc -fpic -m64   -DNDEBUG -DNVALGRIND -O2
CXX: icpc -fpic -m64  -DNDEBUG -DNVALGRIND -O2
F77: ifort -L/lib -L/lib -m64 -fpic  -O2
FC: ifort -m64 -fpic  -O2

Configuration
--with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort 
FC=ifort CFLAGS=-fpic -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic 
FCFLAGS=-m64 -fpic --enable-f77 --enable-fc --enable-cxx --enable-romio 
--enable-threads=default --with-hwloc -disable-multi-aliases 
-enable-xrc=yes -enable-hybrid 
--prefix=/usr/local/other/SLES11.1/mvapich2/2.0a/intel-13.1.3.192

-- 
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246


