[mvapich-discuss] non-deterministic crashes in mvapich-1.1
Noam Bernstein
noam.bernstein at nrl.navy.mil
Wed Dec 10 11:13:43 EST 2008
We have a system with dual-Opteron nodes and InfiniBand InfiniHost III Lx
cards, running Rocks 5.1 (CentOS 5.2), OFED 1.3.1, and mvapich 1.1. My
code is crashing in odd, non-deterministic ways. The same code on the
same hardware worked fine under Rocks 4.1 (CentOS 4.3, OFED 1.2.4.?, and
mvapich-0.99?), and it works on this cluster when I use OpenMPI 1.2.8
instead of mvapich. It also works fine on other platforms.
The code is in Fortran 90, compiled with Intel Fortran 10.1.021 (the same
compiler used to build MPI, together with gcc). I'm using ACML 3.6.0,
because 4.2.0 causes problems with the Intel-compiled code.
I always run with exactly the same input, and there should be no
randomness involved. There are several things that I have observed to
happen:
1. some numbers become infinities (results of LAPACK routines, which are
   then combined using mpi_allreduce, but I'm not sure at what point they
   become infinity - the non-reproducibility of the symptoms makes it
   hard to determine)
2. LAPACK zhegv complains that the B matrix is not positive definite,
   despite the fact that it should be exactly the same as on the previous
   call to zhegv
3. mpi_allreduce complains that the cookie on the communicator is invalid
4. segmentation fault in mpi_finalize()
Symptoms 1-3 usually occur not on the first call to the problem routine,
but on a later one: symptoms 1 and 2 usually after a few calls (1-4),
symptom 3 usually after tens of calls (about 40 iterations of the code;
I'm not sure exactly how many calls to mpi_allreduce that is).
Right now the problem seems fairly reproducible - usually symptom 3,
with symptom 1 or 2 appearing infrequently. Symptom 3 always occurs on
task number 16, regardless of which node that task happens to run on.
The allreduce is doing a sum of a smallish (dim=1116) array of reals.
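One way to narrow down where the infinities in symptom 1 first appear
would be to scan the buffer for non-finite values immediately before and
immediately after each mpi_allreduce, and to log the task number whenever
any are found. The sketch below is in C rather than Fortran 90, and the
helper names (count_nonfinite, report_nonfinite) are hypothetical; it
shows only the check itself, not the MPI call it would bracket.

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: count the entries of buf that are NaN or +/-Inf. */
size_t count_nonfinite(const double *buf, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(buf[i])) {
            ++bad;
        }
    }
    return bad;
}

/* Hypothetical helper: print a warning to stderr if buf contains any
 * non-finite values, tagged with a caller-supplied label and task rank.
 * Returns the number of non-finite entries found. */
size_t report_nonfinite(const char *label, int rank,
                        const double *buf, size_t n)
{
    size_t bad = count_nonfinite(buf, n);
    if (bad > 0) {
        fprintf(stderr, "[rank %d] %s: %zu of %zu values are not finite\n",
                rank, label, bad, n);
    }
    return bad;
}
```

Called once with a label such as "before allreduce" and once with "after
allreduce" on the 1116-element array, a check like this would indicate
whether the bad values come out of the LAPACK step on one task or only
show up after the reduction.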
Given that the code behaves fine on other machines and on this machine
with OpenMPI, I tend to suspect mvapich (or perhaps how mvapich interacts
with OFED).
I know this is a relatively unhelpful description of the problem, but I
haven't been able to isolate it or make it more reproducible. Has anyone
seen anything like this before? Does anyone have any ideas how to go
about finding/fixing the problem?
thanks,
Noam