[mvapich-discuss] deadlock problem
Jennifer Brauch
jbrauch at senckenberg.de
Fri Oct 21 05:23:40 EDT 2011
Hello,
I'm running an atmospheric model on a high-performance computer where
the scheduler was recently switched from torque to slurm. Under slurm,
we have problems with openmpi: it does not perform well with our model,
so our admins suggested using mvapich2.
We now compile the model with ifort and mvapich2, and it runs into a
deadlock that I don't understand.
At the point where the deadlock occurs, the model should open an ASCII
file and write to it. I can see that the file is created, but it stays
empty. When the time limit is reached and the kill signal is sent,
output suddenly appears in that file, presumably flushed from the buffer.
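To illustrate the buffering effect I suspect here, this is a minimal sketch in Python. It is not our model code (which is Fortran), and the function name and file name are made up for illustration. It only shows that a file can exist on disk and still be empty until the buffered data is flushed.

```python
import os
import tempfile


def buffered_write_sizes(text="model output line\n"):
    """Return (size_before_flush, size_after_flush) for a buffered write."""
    # Hypothetical file name; stands in for the model's ASCII output file.
    path = os.path.join(tempfile.mkdtemp(), "output.txt")
    # Use a large buffer so a short write stays in memory.
    f = open(path, "w", buffering=1 << 16)
    try:
        f.write(text)
        # The file already exists, but nothing has reached it yet.
        size_before = os.path.getsize(path)
        # Flushing hands the buffered data over to the operating system.
        f.flush()
        size_after = os.path.getsize(path)
    finally:
        f.close()
    return size_before, size_after


if __name__ == "__main__":
    before, after = buffered_write_sizes()
    print("bytes on disk before flush:", before)
    print("bytes on disk after flush: ", after)
```

If the same thing happens in the model, the write itself may already have completed, and the processes may be hanging somewhere after it rather than at the file open.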
srun -n 4 --kill-on-bad-exit ./lmparbin
+ srun -n 4 --kill-on-bad-exit ./lmparbin
Got unknown event 17 ... continuing ...
Got unknown event 17 ... continuing ...
Got unknown event 17 ... continuing ...
Got unknown event 17 ... continuing ...
slurmd[node9-081]: *** STEP 426573.0 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
slurmd[node9-081]: *** JOB 426573 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
Another run (I've tried to tweak the buffer/send parameters, to no avail):
srun -n 4 --kill-on-bad-exit ./lmparbin
+ srun -n 4 --kill-on-bad-exit ./lmparbin
slurmd[node9-082]: *** JOB 426729 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
slurmd[node9-082]: *** STEP 426729.0 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
These are my modules:
Currently Loaded Modulefiles:
1) slurm/2.3.0-2
2) mpi/mvapich2/intel/64/1.6_slurm
3) netcdf/intel/64/4.1.1
4) intel/compiler/64/12.1/2011_sp1.6.233
Is there any way to get more information from mvapich2 about where the
problem might be?
I'm a bit desperate...
Many thanks in advance,
Jennifer Brauch