[mvapich-discuss] deadlock problem

Jennifer Brauch jbrauch at senckenberg.de
Fri Oct 21 05:23:40 EDT 2011


Hello,

I'm running an atmospheric model on a high-performance computer where 
the scheduler was recently switched from torque to slurm. Under slurm, 
we have problems with openmpi: it does not perform well with our model, 
so our admins suggested using mvapich2.
We now compile the model with ifort and mvapich2, and it runs into a 
deadlock that I don't understand.
At the point where the deadlock occurs, the model should open an ASCII 
file and write to it. I can see that the file is created, but it stays 
empty. When the time limit runs out and the kill signal is sent, output 
suddenly appears in that ASCII file, presumably from the buffer.
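
An empty file that fills only at termination is consistent with ordinary 
output buffering: written data can sit in a buffer until it is flushed 
or the file is closed. The sketch below illustrates that effect in Python 
rather than in the model's Fortran; the function name, file name, and 
buffer size are hypothetical and chosen only for this illustration.

```python
import os
import tempfile


def buffered_write_sizes(path, text):
    """Return the on-disk size of `path` before and after an explicit flush."""
    # Request a large buffer so that a short write stays in memory.
    with open(path, "w", buffering=64 * 1024) as f:
        f.write(text)
        size_before = os.path.getsize(path)  # data is still in the buffer
        f.flush()                            # hand the buffered data to the OS
        size_after = os.path.getsize(path)
    return size_before, size_after


# Hypothetical file name, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "model_output.txt")
before, after = buffered_write_sizes(demo_path, "step 1 reached\n")
print(before, after)
```

In the Fortran model, the analogous step would presumably be a FLUSH 
statement on the output unit right after the WRITE, so that the file 
shows how far each process actually got before the hang.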


srun -n 4 --kill-on-bad-exit ./lmparbin
+ srun -n 4 --kill-on-bad-exit ./lmparbin
Got unknown event 17 ... continuing ...
Got unknown event 17 ... continuing ...
Got unknown event 17 ... continuing ...
Got unknown event 17 ... continuing ...
slurmd[node9-081]: *** STEP 426573.0 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
slurmd[node9-081]: *** JOB 426573 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***

Another run (I've tried to tweak the buffer/send parameters, to no avail):
srun -n 4 --kill-on-bad-exit ./lmparbin
+ srun -n 4 --kill-on-bad-exit ./lmparbin
slurmd[node9-082]: *** JOB 426729 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
slurmd[node9-082]: *** STEP 426729.0 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***


These are my modules:

Currently Loaded Modulefiles:
   1) slurm/2.3.0-2
   2) mpi/mvapich2/intel/64/1.6_slurm
   3) netcdf/intel/64/4.1.1
   4) intel/compiler/64/12.1/2011_sp1.6.233


Is there any way to get more information from mvapich2 about where the 
problem could be?
I'm a bit desperate...

Many thanks in advance,

Jennifer Brauch



