[mvapich-discuss] deadlock problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Oct 26 07:21:15 EDT 2011


Hi Jennifer, sorry for the delay.  I got a bit sidetracked by another
issue, but I'm revisiting yours and will get back to you sometime
today.

On Wed, Oct 26, 2011 at 6:43 AM, Jennifer Brauch <jbrauch at senckenberg.de> wrote:
> Hello Jonathan,
>
> I recompiled the model with pgi:
>
> module load mpi/mvapich2/pgi/64/1.6_2 pgi/11.8 netcdf/pgi/64/4.1.1
>
> The result was the same, so it seems to have something to do with mvapich2
> or the computer architecture itself.
>
> Do you have any conclusions concerning the benchmarks, or do you have any
> ideas what else I could try?
>
> Many thanks in advance,
>
> Jennifer Brauch
>
> On 10/21/11 8:15 PM, Jonathan Perkins wrote:
>>
>> Hello Jennifer, can you run `mpiname -a' to show the exact settings
>> used when mvapich2 was compiled?  Also, does your cluster use a
>> distributed file system, or just a local filesystem or NFS?
>>
>> Are you able to run any simple benchmarks such as the
>> osu-micro-benchmarks?
>>
>> http://mvapich.cse.ohio-state.edu/benchmarks/osu-micro-benchmarks-3.4.tar.gz
>>
>>
>> On Fri, Oct 21, 2011 at 5:23 AM, Jennifer Brauch<jbrauch at senckenberg.de>
>>  wrote:
>>>
>>> Hello,
>>>
>>> I'm running an atmospheric model on a high-performance computer where the
>>> scheduler was recently switched from torque to slurm. Under slurm we have
>>> problems with openmpi: it does not perform well with our model, so our
>>> admins suggested using mvapich2.
>>> We now compile the model with ifort and mvapich2, and it runs into a
>>> deadlock which I don't understand.
>>> At the point where the deadlock occurs, the model should open and write to
>>> an ASCII file. I can see that the file is created, but it stays empty. When
>>> the time limit runs out and the kill signal is sent, output suddenly appears
>>> in that ASCII file, presumably from the buffer.
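For illustration, a minimal C sketch of the buffering effect described above; this is not the model's own code, and the file name, message, and barrier are made up. fprintf() output sits in a stdio buffer until a flush, close, or normal exit, so a rank that later blocks inside an MPI call leaves the file on disk but empty. An explicit flush right after each write (the Fortran analogue is the FLUSH statement) makes the output visible immediately and helps narrow down where the hang actually occurs.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        FILE *fp = fopen("diag.txt", "w");     /* file is created right away      */
        fprintf(fp, "reached output step\n");  /* but the text sits in the buffer */
        fflush(fp);                            /* until this explicit flush       */
    }

    /* In the deadlock case a collective like this would not be entered by
       every rank, so the others would sit here until the time limit kills
       the job; without the flush above, the buffered text would stay
       invisible for that whole time. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched the same way (srun -n 4), the line appears in diag.txt immediately; adding a flush after each WRITE to that unit in the model would show how far it gets before the hang.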
>>>
>>>
>>> srun -n 4 --kill-on-bad-exit ./lmparbin
>>> + srun -n 4 --kill-on-bad-exit ./lmparbin
>>> Got unknown event 17 ... continuing ...
>>> Got unknown event 17 ... continuing ...
>>> Got unknown event 17 ... continuing ...
>>> Got unknown event 17 ... continuing ...
>>> slurmd[node9-081]: *** STEP 426573.0 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
>>> slurmd[node9-081]: *** JOB 426573 CANCELLED AT 2011-10-21T10:56:07 DUE TO TIME LIMIT ***
>>>
>>> Another run (I tried to tweak the buffer/send parameters, to no avail):
>>> srun -n 4 --kill-on-bad-exit ./lmparbin
>>> + srun -n 4 --kill-on-bad-exit ./lmparbin
>>> slurmd[node9-082]: *** JOB 426729 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
>>> slurmd[node9-082]: *** STEP 426729.0 CANCELLED AT 2011-10-21T11:19:48 DUE TO TIME LIMIT ***
>>>
>>>
>>> These are my modules:
>>>
>>> Currently Loaded Modulefiles:
>>>  1) slurm/2.3.0-2
>>>  2) mpi/mvapich2/intel/64/1.6_slurm
>>>  3) netcdf/intel/64/4.1.1
>>>  4) intel/compiler/64/12.1/2011_sp1.6.233
>>>
>>>
>>> Is there any way to get more information from mvapich2 about where the
>>> problem could be?
>>> I'm a bit desperate...
>>>
>>> Many thanks in advance,
>>>
>>> Jennifer Brauch
>>>
>>>
>>>
>>>
>>
>>
>>
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


