[mvapich-discuss] VALGRIND on MVAPICH2 F90 code

David Stuebe dstuebe at umassd.edu
Fri Jan 18 13:53:49 EST 2008


Hello MVAPICH and VALGRIND

I am a research associate at UMASSD. I work on a numerical ocean model,
fvcom, written in F90.

We have recently run into the following runtime failure:

forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
mpiexec: Warning: tasks 0-1,3 exited with status 1.
mpiexec: Warning: task 2 died with signal 11 (Segmentation fault).


The error depends on the problem size.
The error depends on the compiler optimization settings.
The error only occurs when running on more than one node (in the example
above, I used 2 processes per node on 2 nodes).
If I run four processes on a single node, the code passes!

The only clue I have is that the problem seems to be related to
subroutines which use explicit-shape arrays, but I have checked all the
upper and lower bounds. Strangely, running under valgrind or compiling with
'-check all' in ifort allows the routine to pass.

It seems my only hope for tracing this mess is valgrind, but I am
having trouble using it on our cluster. It does run, but I am concerned
that it is not running properly. The MPI_Init call alone produces hundreds
of errors in the MPI and VAPI libraries, including leaks, use of
uninitialized memory, conditionals on uninitialized values, and invalid
reads/writes. Has anyone had success using valgrind with mvapich2?

Valgrind also found problems in the fvcom Fortran code, but most of these
seemed to go away when I increased --max-stackframe. None of the remaining
errors seem to be related to whatever causes the SIGSEGV when I run without
valgrind.
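
As a rough sketch of the kind of invocation described above: the process
count, the executable path, and the 8 MB limit are illustrative assumptions,
not the values actually used on this cluster. By default the script only
prints the command.

```shell
#!/bin/sh
# Sketch only: NPROCS, EXE, and the 8 MB limit are illustrative assumptions.
NPROCS=4
EXE=./fvcom

# --max-stackframe raises the frame size valgrind accepts before it
# assumes the stack pointer has switched to a different stack.
# --gen-suppressions=all prints a suppression entry for each reported error.
VG_OPTS="--max-stackframe=8000000 --gen-suppressions=all"

# Each MPI rank is started under its own valgrind instance.
CMD="mpiexec -n $NPROCS valgrind $VG_OPTS $EXE"

# Print the command; execute it only when RUN=1 is set in the environment.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
    $CMD
fi
```

The suppression entries printed for the MPI and VAPI libraries could be
saved to a file and passed back with --suppressions=<file>, so that only
reports from the application code remain.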

Selected system info:
Nodes are Dell 1850, Intel Xeon EM64T
Network is Infiniband PCI-EX 4X
System is Rocks 4.2

Thread model: posix
gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)

ifort Version 9.1

mpif90 for mvapich2-1.0

valgrind-3.2.3

mpiexec-0.82


Again, all of these tools/libraries seem to work fine under normal tests,
but this particular combination of code and model case is causing a real
mess!

Thanks for any help you can offer!

David