[mvapich-discuss] MPI send/receive hanging, how to diagnose?

Ben Benjamin.M.Auer at nasa.gov
Fri Jul 5 15:15:50 EDT 2013


Hi,
I'm currently having what seems to be an issue with mvapich.
I'm part of a team that maintains a global climate model written mostly
in Fortran 90/95. At one point in the code there is a large number of
MPI_Isend/MPI_Recv calls (anywhere from thousands to hundreds of
thousands), when data that is distributed across all MPI processes has
to be collected on a particular processor to be transformed to a
different resolution before being written.
Above a certain resolution/number of MPI processes, the model simply
hangs at the receive after the send.
The strange thing is that at the same resolution but a lower processor
count it works fine.
For example, at the troublesome resolution the model runs on 864
processors but hangs on 1536 processors.
However, at a lower resolution the same code runs fine on 1536
processors and above.
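One classic cause of this symptom (a hang that appears only above a certain
process count, and only during a large many-to-one burst of sends) is
exhaustion of the receiving rank's buffering for eagerly-sent "unexpected"
messages. A toy back-of-the-envelope sketch of why such a limit is
count-dependent; the message size and buffer limit below are entirely
hypothetical illustrations, not measured MVAPICH values:

```python
# Toy model: every non-root rank sends one eager message to the root.
# If the total eager payload exceeds the root's buffer pool, progress
# can stall. All constants here are hypothetical, for illustration only.

EAGER_MSG_BYTES = 64 * 1024          # assumed per-message eager payload
BUFFER_LIMIT_BYTES = 64 * 1024**2    # assumed buffer pool on the root rank

def fits_in_eager_buffers(nprocs: int) -> bool:
    """True if (nprocs - 1) simultaneous eager messages fit in the
    root's assumed unexpected-message buffer pool."""
    demand = (nprocs - 1) * EAGER_MSG_BYTES
    return demand <= BUFFER_LIMIT_BYTES

for n in (864, 1536):
    print(n, fits_in_eager_buffers(n))
```

With these made-up numbers, 864 senders fit but 1536 do not, which mirrors
the reported behavior: the same code works at the lower count and hangs at
the higher one.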
We are currently using the Intel 13 Fortran compiler and had been using
mvapich 1.8.1, although mvapich 1.9 also exhibits this behaviour. Does
anyone have any suggestions on how to diagnose what is going on, or
some parameters that we could play with that might help? This was
perhaps a bit hand-wavy, but we are rather stumped at this point about
how to proceed. Interestingly, we have gotten the code to run with
other MPI stacks at the resolution/processor count where mvapich hangs.
I can provide more details if needed.
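As a first diagnostic step, attaching a debugger to a hung rank shows
whether it is spinning inside the MPI progress engine or blocked in the
application, and the MVAPICH2 user guide documents runtime parameters for
the eager/rendezvous switch-over point and internal buffer sizes that are
worth experimenting with. A sketch (the parameter names come from the
MVAPICH2 user guide; the values are illustrative guesses, not
recommendations, and the PID is a placeholder):

```shell
# Get a backtrace from one hung rank (replace <PID> with a real rank PID):
#   gdb -p <PID> -batch -ex 'thread apply all bt'

# Retry the run with more conservative MVAPICH2 settings:
export MV2_IBA_EAGER_THRESHOLD=1024   # switch to rendezvous for smaller messages
export MV2_VBUF_TOTAL_SIZE=16384      # size of each internal communication buffer
echo "eager threshold: $MV2_IBA_EAGER_THRESHOLD"
```

Lowering the eager threshold forces more traffic through the rendezvous
protocol, which throttles senders until the receiver posts a matching
receive; if that makes the hang go away, it points at buffer exhaustion
rather than a lost message.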
Thanks

-- 
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246
