[mvapich-discuss] MPI send/receive hanging, how to diagnose?
Ben
Benjamin.M.Auer at nasa.gov
Fri Jul 5 15:15:50 EDT 2013
Hi,
I'm currently having what seems to be an issue with mvapich.
I'm part of a team that maintains a global climate model written mostly
in Fortran 90/95. At one point in the code there is a large number of
MPI_Isend/MPI_Recv calls (anywhere from thousands to hundreds of
thousands), when data that is distributed across all MPI processes has
to be collected on a particular process to be transformed to a
different resolution before being written.
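To make the pattern concrete, here is a simplified sketch of what the code does (this is not our actual code; the subroutine name, array names, sizes, and tag are made up for illustration):

```fortran
! Simplified sketch of the gather pattern (hypothetical names and sizes).
! Every rank posts a non-blocking send of its local tile to rank 0;
! rank 0 then posts blocking receives, one per sending rank.
subroutine gather_tiles(local_tile, n, comm)
   use mpi
   implicit none
   integer, intent(in) :: n, comm
   real,    intent(in) :: local_tile(n)
   real, allocatable   :: global(:,:)
   integer :: rank, nprocs, ierr, src, sreq
   integer :: stat(MPI_STATUS_SIZE)

   call MPI_Comm_rank(comm, rank, ierr)
   call MPI_Comm_size(comm, nprocs, ierr)

   ! every rank (including rank 0 itself) sends its tile to rank 0
   call MPI_Isend(local_tile, n, MPI_REAL, 0, 99, comm, sreq, ierr)

   if (rank == 0) then
      allocate(global(n, nprocs))
      do src = 0, nprocs - 1
         ! this is roughly where the model hangs at high process counts
         call MPI_Recv(global(:, src+1), n, MPI_REAL, src, 99, comm, &
                       stat, ierr)
      end do
   end if

   call MPI_Wait(sreq, MPI_STATUS_IGNORE, ierr)
   if (rank == 0) deallocate(global)
end subroutine gather_tiles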
Above a certain resolution/number of MPI processes the model simply
hangs at the receive after the send. The strange thing is that at the
same resolution, a lower processor count works fine.
For example at the troublesome resolution the model runs on 864
processors but hangs with 1536 processors.
However, at a lower resolution the same code runs fine on 1536
processors and above.
We are currently using the Intel 13 Fortran compiler and had been using
mvapich 1.8.1, although mvapich 1.9 also exhibits this behaviour. Does
anyone have any suggestions on how to diagnose what is going on, or
some parameters we could play with that might help? This is perhaps a
bit hand-wavy, but we are rather stumped at this point about how to proceed.
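For what it's worth, among the runtime parameters in the MVAPICH2 user guide we noticed the eager/rendezvous threshold and the internal buffer size, but we have no idea whether they are relevant here; the values below are guesses, not something we have verified:

```shell
# Hypothetical values, purely illustrative -- parameter names are from
# the MVAPICH2 user guide, values are guesses we have not tested.
export MV2_IBA_EAGER_THRESHOLD=131072   # eager-vs-rendezvous switchover (bytes)
export MV2_VBUF_TOTAL_SIZE=131072       # size of each internal vbuf buffer
```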
Interestingly, we have gotten the code to run with other MPI stacks at
the resolution/processor count where mvapich hangs. I can provide more
details if needed.
Thanks
--
Ben Auer, PhD SSAI, Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-286-9176 Fax: 301-614-6246