[mvapich-discuss] non-deterministic crashes in mvapich-1.1

Noam Bernstein noam.bernstein at nrl.navy.mil
Wed Dec 10 11:13:43 EST 2008


We have a system with dual Opteron nodes and InfiniBand InfiniHost III Lx cards,
running Rocks 5.1 (CentOS 5.2), OFED 1.3.1, and mvapich 1.1.  My code is crashing
in odd, non-deterministic ways.  The same code on the same hardware worked fine
under Rocks 4.1 (CentOS 4.3, OFED 1.2.4.?, and mvapich-0.99?), and it works on
this cluster when I use OpenMPI 1.2.8 instead of mvapich.  It also works fine on
other platforms.

The code is in Fortran 90, compiled with Intel Fortran 10.1.021 (the same
compiler used to build MPI, together with gcc).  I'm using ACML 3.6.0, because
4.2.0 leads to problems with the Intel-compiled code.

I always run with exactly the same input, and there should be no randomness
involved.  There are several things I have observed happening:
1. some numbers become infinities (results of LAPACK routines, which are then
   combined using mpi_allreduce, but I'm not sure at what point they become
   infinite - the non-reproducibility of the symptoms makes it hard to determine)
2. LAPACK zhegv complains that the B matrix is not positive definite, despite
   the fact that it should be exactly the same as on the previous call to zhegv
   (a rough sketch of this call pattern follows the list)
3. mpi_allreduce complains that the cookie on the communicator is invalid
4. segmentation fault in mpi_finalize()
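
For what it's worth, the zhegv usage is essentially the pattern below.  This is
a simplified, stand-alone sketch, not the actual code - the routine name, kinds,
and dimensions are illustrative:

    subroutine solve_gevp(n, a, b, w, info)
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, intent(in) :: n
      complex(dp), intent(inout) :: a(n,n), b(n,n)   ! A and B are overwritten
      real(dp), intent(out) :: w(n)                  ! eigenvalues
      integer, intent(out) :: info
      complex(dp), allocatable :: work(:)
      real(dp), allocatable :: rwork(:)
      integer :: lwork

      lwork = max(1, 2*n-1)
      allocate(work(lwork), rwork(max(1, 3*n-2)))
      ! itype=1: solve A x = lambda B x; B must be Hermitian positive definite
      call zhegv(1, 'V', 'U', n, a, n, b, n, w, work, lwork, rwork, info)
      ! info > n means the leading minor of order info-n of B is not
      ! positive definite - the error reported in symptom 2
      deallocate(work, rwork)
    end subroutine solve_gevp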

Symptoms 1-3 usually occur not on the first call to the problem routine, but
only after many calls.  Symptoms 1 and 2 usually show up after a few calls
(1-4); symptom 3 usually after tens of calls (about 40 iterations of the code -
I'm not sure exactly how many calls to mpi_allreduce that corresponds to).

Right now the problem seems fairly reproducible - usually symptom 3, with the
occasional symptom 1 or 2.  Symptom 3 always occurs on task number 16,
regardless of which node that task happens to be running on.  The allreduce is
doing a sum of a smallish (dim=1116) array of reals.
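
By itself that call is essentially the following (a minimal sketch, assuming
double precision data summed over MPI_COMM_WORLD; the names and the stand-alone
program structure are just for illustration):

    program allreduce_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1116
      double precision :: local(n), global(n)
      integer :: ierr

      call mpi_init(ierr)
      local = 1.0d0
      ! every task contributes "local"; "global" receives the elementwise sum
      call mpi_allreduce(local, global, n, mpi_double_precision, &
                         mpi_sum, mpi_comm_world, ierr)
      call mpi_finalize(ierr)
    end program allreduce_sketch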

Given that the code behaves fine on other machines, and on this machine with
OpenMPI, I tend to suspect mvapich (or perhaps how mvapich interacts with OFED).

I know this is a relatively unhelpful description of the problem, but I haven't
been able to isolate it or make it more reproducible.  Has anyone seen anything
like this before?  Does anyone have ideas on how to go about finding and fixing
the problem?

													thanks,
													Noam

