[mvapich-discuss] non-deterministic crashes in mvapich-1.1

Matthew Koop koop at cse.ohio-state.edu
Wed Dec 10 12:14:44 EST 2008


Noam,

I suspect this may have something to do with the shared-memory allreduce
optimization. Can you try turning it off and seeing whether the problem
still occurs, to help us narrow it down?

e.g.
mpirun_rsh -np 128 -hostfile ./h VIADEV_USE_SHMEM_ALLREDUCE=0 ./exec
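
If that makes the failures disappear, a small standalone reproducer that
just repeats the allreduce you describe below would also help us confirm
it. A minimal sketch along these lines (the array size, datatype, and
iteration count are just assumptions based on your description):

  program allreduce_test
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1116     ! roughly the size you mention
    integer :: ierr, rank, nprocs, iter
    double precision :: sendbuf(n), recvbuf(n)

    call mpi_init(ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
    call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

    sendbuf = dble(rank + 1)

    ! the failures show up only after many calls, so repeat the
    ! reduction and check against the known expected sum
    do iter = 1, 10000
       call mpi_allreduce(sendbuf, recvbuf, n, MPI_DOUBLE_PRECISION, &
                          MPI_SUM, MPI_COMM_WORLD, ierr)
       if (abs(recvbuf(1) - dble(nprocs)*dble(nprocs+1)/2.0d0) > 1.0d-10) then
          print *, 'rank', rank, 'iter', iter, 'bad result', recvbuf(1)
       end if
    end do

    call mpi_finalize(ierr)
  end program allreduce_test

Running that with and without VIADEV_USE_SHMEM_ALLREDUCE=0 on the same
128 processes would tell us whether the shared-memory allreduce path by
itself is enough to trigger the corruption.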

Thanks,

Matt

On Wed, 10 Dec 2008, Noam Bernstein wrote:

> We have a system with dual Opteron nodes, InfiniBand InfiniHost III Lx
> cards, running Rocks 5.1 (CentOS 5.2), OFED 1.3.1, and mvapich 1.1.
> My code is crashing in odd, non-deterministic ways.  The same code on
> the same hardware worked fine under Rocks 4.1 (CentOS 4.3, OFED
> 1.2.4.?, and mvapich-0.99?), and on this cluster when I use OpenMPI
> 1.2.8 instead of mvapich.  It also works fine on other platforms.
>
> The code is in Fortran 90, compiled with Intel Fortran 10.1.021 (the
> same compiler used to build MPI, together with gcc).  I'm using ACML
> 3.6.0, because 4.2.0 leads to problems with the Intel-compiled code.
>
> I always run with exactly the same input, and there should be no
> randomness involved.  There are several things that I have observed
> to happen:
> 1. some numbers become infinities (results of LAPACK routines, which
>      are then combined using mpi_allreduce, but I'm not sure at what
>      point they become infinity - the non-reproducibility of the
>      symptoms makes it hard to determine)
> 2. LAPACK zhegv complains that the B matrix is not positive definite,
>      despite the fact that it should be exactly the same as on the
>      previous call to zhegv
> 3. mpi_allreduce complains that the cookie on the communicator is
>      invalid
> 4. segmentation fault in mpi_finalize()
>
> Symptoms 1-3 usually occur not on the first call to the problem
> routine, but after many calls.  Symptoms 1 and 2 usually appear after
> a few calls (1-4), symptom 3 usually after tens of calls (about 40
> iterations of the code; I'm not sure exactly how many calls to
> mpi_allreduce that is).
>
> Right now the problem seems fairly reproducible - usually symptom 3,
> with infrequent symptom 1 or 2.  Symptom 3 always occurs on task
> number 16, regardless of which node it happens to be.  The allreduce
> is doing a sum of a smallish (dim=1116) array of reals.
>
> Given that the code behaves fine on other machines, and on this
> machine with OpenMPI, I tend to suspect mvapich (or perhaps how
> mvapich interacts with OFED).
>
> I know this is a relatively unhelpful description of the problem, but
> I haven't been able to isolate it or make it more reproducible.  Has
> anyone seen anything like this before?  Does anyone have ideas about
> how to go about finding/fixing the problem?
>
>                                                     thanks,
>                                                     Noam


