[mvapich-discuss] deadlock problem

Mehmet mbelgin at gmail.com
Fri Oct 21 15:25:44 EDT 2011


Jennifer,

We also experienced deadlock issues in several cases. Using OSC's
mpiexec <http://www.osc.edu/~djohnson/mpiexec/index.php> solved it for
us. For some reason, it behaves differently from the mpiexec/mpirun
that comes with mvapich2. Hope this helps.
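One more thought on the symptom below (the ascii file staying empty until the job is killed): that is classic stdio buffering, where output sits in a full buffer until the process exits or flushes. Forcing line-buffered output makes the file reflect how far the run actually got before the hang. A minimal sketch, assuming GNU coreutils stdbuf is available on the compute nodes:

```shell
# stdbuf forces the child's stdio to line buffering, so each line is
# written out as soon as it is printed instead of waiting for the
# buffer to fill or the process to exit.
stdbuf -oL printf 'progress: step 1\n'

# Applied to the srun invocation from the post below (lmparbin is the
# model binary), the same idea would look like:
#   srun -n 4 --kill-on-bad-exit stdbuf -oL -eL ./lmparbin
```

This does not fix a deadlock, but it can show which output statement the model last reached.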

-Mehmet

On Fri, Oct 21, 2011 at 2:15 PM, Jonathan Perkins <
perkinjo at cse.ohio-state.edu> wrote:

> Hello Jennifer, can you run `mpiname -a' to show the exact settings
> used when mvapich2 was compiled?  Also, does your cluster use a
> distributed filesystem, or just a local filesystem or NFS?
>
> Are you able to run any simple benchmarks such as the osu-micro-benchmarks?
>
> http://mvapich.cse.ohio-state.edu/benchmarks/osu-micro-benchmarks-3.4.tar.gz
>
>
> On Fri, Oct 21, 2011 at 5:23 AM, Jennifer Brauch <jbrauch at senckenberg.de>
> wrote:
> > Hello,
> >
> > I'm running an atmospheric model on a high-performance computer where
> > the scheduler was recently switched from torque to slurm. Under slurm,
> > we have problems with openmpi: it does not perform well with our model,
> > so our admins suggested using mvapich2.
> > We now compile the model with ifort and mvapich2, and it runs into a
> > deadlock which I don't understand.
> > The deadlock occurs at a point where the model should open and write to
> > an ascii file: I can see that the file is created but empty. When the
> > time runs out and the kill signal is sent, output suddenly appears in
> > that ascii file, presumably flushed from the buffer.
> >
> >
> > srun -n 4 --kill-on-bad-exit ./lmparbin
> > + srun -n 4 --kill-on-bad-exit ./lmparbin
> > Got unknown event 17 ... continuing ...
> > Got unknown event 17 ... continuing ...
> > Got unknown event 17 ... continuing ...
> > Got unknown event 17 ... continuing ...
> > slurmd[node9-081]: *** STEP 426573.0 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
> > slurmd[node9-081]: *** JOB 426573 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
> >
> > Another run (I've tried to tweak the buffer/send parameters, to no avail):
> > srun -n 4 --kill-on-bad-exit ./lmparbin
> > + srun -n 4 --kill-on-bad-exit ./lmparbin
> > slurmd[node9-082]: *** JOB 426729 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
> > slurmd[node9-082]: *** STEP 426729.0 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
> >
> >
> > These are my modules:
> >
> > Currently Loaded Modulefiles:
> >  1) slurm/2.3.0-2
> >  2) mpi/mvapich2/intel/64/1.6_slurm
> >  3) netcdf/intel/64/4.1.1
> >  4) intel/compiler/64/12.1/2011_sp1.6.233
> >
> >
> > Is there any way to get more information from mvapich2 about where
> > the problem could be?
> > I'm a bit desperate...
> >
> > Many thanks in advance,
> >
> > Jennifer Brauch
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> >
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>



-- 
=========================================
Mehmet Belgin, Ph.D. (mehmet.belgin at oit.gatech.edu)
Scientific Computing Consultant | OIT - Academic and Research Technologies
Georgia Institute of Technology
258 Fourth Street, Rich Building, Room 326
Atlanta, GA  30332-0700
Office: (404) 385-0665

