[mvapich-discuss] deadlock problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Oct 21 14:15:08 EDT 2011


Hello Jennifer, can you run `mpiname -a' to show the exact settings
used when mvapich2 was compiled?  Also, does your cluster use a
distributed file system, or is this just a local filesystem or NFS?

Are you able to run any simple benchmarks such as the osu-micro-benchmarks?
http://mvapich.cse.ohio-state.edu/benchmarks/osu-micro-benchmarks-3.4.tar.gz
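For reference, a typical way to fetch, build, and run one of those benchmarks under slurm might look like the following (the benchmark name and compiler wrapper are examples; adjust for your site):

```shell
# Fetch and build the OSU micro-benchmarks from the link above, then run
# the two-process latency test through slurm.  CC=mpicc assumes mpicc is
# the MVAPICH2 compiler wrapper on your system.
wget http://mvapich.cse.ohio-state.edu/benchmarks/osu-micro-benchmarks-3.4.tar.gz
tar xzf osu-micro-benchmarks-3.4.tar.gz
cd osu-micro-benchmarks-3.4
./configure CC=mpicc
make
srun -n 2 ./osu_latency
```

If osu_latency runs cleanly between two nodes, that at least rules out a broken MPI installation as the cause.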


On Fri, Oct 21, 2011 at 5:23 AM, Jennifer Brauch <jbrauch at senckenberg.de> wrote:
> Hello,
>
> I'm running an atmospheric model on a high-performance computer where the
> scheduler was recently switched from torque to slurm. Under slurm, we have
> problems with openmpi. It does not perform well with our model, so our
> admins suggested using mvapich2.
> We now compile the model with ifort and mvapich2, and it runs into a
> deadlock that I don't understand.
> When the deadlock occurs, the model should be opening and writing to an
> ASCII file. I can see that the file is created, but it stays empty. When
> the time limit runs out and the kill signal is sent, output suddenly
> appears in that ASCII file, presumably flushed from the buffer.
>
>
> srun -n 4 --kill-on-bad-exit ./lmparbin
> + srun -n 4 --kill-on-bad-exit ./lmparbin
> Got unknown event 17 ... continuing ...
> Got unknown event 17 ... continuing ...
> Got unknown event 17 ... continuing ...
> Got unknown event 17 ... continuing ...
> slurmd[node9-081]: *** STEP 426573.0 CANCELLED AT 2011-10-21T10:56:07 DUE TO
> TIME LIMIT ***
> slurmd[node9-081]: *** JOB 426573 CANCELLED AT 2011-10-21T10:56:07 DUE TO
> TIME LIMIT ***
>
> Other run (I've tried to tweak the buffer/send parameters to no avail):
> srun -n 4 --kill-on-bad-exit ./lmparbin
> + srun -n 4 --kill-on-bad-exit ./lmparbin
> slurmd[node9-082]: *** JOB 426729 CANCELLED AT 2011-10-21T11:19:48 DUE TO
> TIME LIMIT ***
> slurmd[node9-082]: *** STEP 426729.0 CANCELLED AT 2011-10-21T11:19:48 DUE TO
> TIME LIMIT ***
>
>
> These are my modules:
>
> Currently Loaded Modulefiles:
>  1) slurm/2.3.0-2
>  2) mpi/mvapich2/intel/64/1.6_slurm
>  3) netcdf/intel/64/4.1.1
>  4) intel/compiler/64/12.1/2011_sp1.6.233
>
>
> Is there any way to get more information from mvapich2 about where the
> problem could be?
> I'm a bit desperate...
>
> Many thanks in advance,
>
> Jennifer Brauch
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


