[mvapich-discuss] slurm+mvapich2 -> cannot create cq ?

Henderson, Brent brent.henderson at hp.com
Wed Oct 31 16:25:05 EDT 2012


Add the change below to /etc/init.d/slurm on your nodes, right before the slurm daemon is started.  Essentially, the init process (which starts everything on the nodes) does not read or honor memlock.conf, so you need to raise the limit manually so that children of the slurm daemon also see the increased value.

If you ssh into the node, your shell is started by sshd, which does honor the setting in memlock.conf.
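
For reference, a limits.d drop-in that grants unlimited locked memory normally looks something like the sketch below (the actual memlock.conf from this cluster was not posted in the thread, so treat this as an illustration of standard limits.conf syntax only):

    # /etc/security/limits.d/memlock.conf  (illustrative contents)
    # Applied by pam_limits for PAM sessions such as sshd logins,
    # but not by init -- hence the ulimit call added to the init script below.
    *    soft    memlock    unlimited
    *    hard    memlock    unlimited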

Brent

-bash-4.1$ diff -C 3 -w slurm.orig slurm 
*** slurm.orig  2012-10-31 15:21:16.207026911 -0500
--- slurm.new   2012-10-31 15:21:10.549035405 -0500
***************
*** 89,94 ****
--- 89,95 ----
      shift
      echo -n "starting $prog: "
      unset HOME MAIL USER USERNAME
+     ulimit -l unlimited
      $STARTPROC $SBINDIR/$prog $*
      rc_status -v
      echo
-bash-4.1$
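
After restarting the daemon with that change, a quick way to confirm that slurmd (and therefore the job steps it spawns) really picked up the higher limit is to inspect the running daemon's limits in /proc.  A hypothetical check on a compute node, with the output you would expect once the fix is in place:

    -bash-4.1# grep 'locked memory' /proc/$(pgrep -o slurmd)/limits
    Max locked memory         unlimited            unlimited            bytes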

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Evren Yurtesen IB
Sent: Thursday, September 06, 2012 2:40 PM
To: Jonathan Perkins
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] slurm+mvapich2 -> cannot create cq ?

Hi Jonathan,

You are right; I found out that I get a memlock limit of 64 when I run a process with srun, which is strange, but still...

I have created /etc/sysconfig/slurm with a reasonable limit and restarted the slurm processes.

It is strange, since I had unlimited in /etc/security/limits.d/memlock.conf: when I logged in as root I got unlimited as the limit, and if I restarted slurm manually it also got unlimited (instead of 64). But after a reboot, when the slurm process was started automatically, it got a limit of 64 (I wonder where it found that). Anyway, your advice fixed the issue.

Thanks!
Evren

On Thu, 6 Sep 2012, Jonathan Perkins wrote:

> Hi, it looks like you checked ulimit -l outside of srun.  It's 
> possible that you're getting a lower memory limit because of the 
> limits imposed on slurm when the service started.
>
> Can you run the following?
>
>    [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
>    test2: unlimited
>    test1: unlimited
>    [perkinjo at nowlab ~]$ cat ulimit.sh
>    #!/bin/sh
>
>    echo $(hostname): $(ulimit -l)
>
> If it doesn't show unlimited (or some other number much higher than 
> 64) then you'll need to do something to update the limits slurm is using.
> On redhat systems you can put the following in /etc/sysconfig/slurm.
>
>    ulimit -l unlimited
>
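
The reason the /etc/sysconfig/slurm approach works is that the Red Hat style init script normally sources that file before launching the daemons, so anything set there is inherited by slurmd and every job step it spawns.  Roughly the idiom involved, shown here only as an illustration since the exact packaging varies:

    # excerpt of the usual pattern near the top of /etc/init.d/slurm (illustrative)
    [ -f /etc/sysconfig/slurm ] && . /etc/sysconfig/slurm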
>
> On Thu, Sep 06, 2012 at 05:56:05PM +0300, Evren Yurtesen IB wrote:
>> Hello,
>> I am trying to run simple hello.c program with slurm+mvapich2.
>> I have searched a lot but could not find the solution to my problem.
>> The closest match was:
>> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-December/002075.html
>> and
>> http://www.mail-archive.com/ewg@lists.openfabrics.org/msg11087.html
>> but I already have the memlock limits set to unlimited...
>>
>> Any ideas on what might be going wrong?
>>
>> running on 2 processes on same node works fine...
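
The hello.c source itself was not posted in the thread; a minimal MPI program consistent with the output shown below, offered here only as a hypothetical reconstruction, would look roughly like this:

    /* hello.c -- hypothetical reconstruction; the original was not posted.
     * It only needs to reach MPI_Init(), which is where the
     * "cannot create cq" failure occurs. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);               /* fails when memlock is too low */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);

        printf("Hello World from process %d running on %s\n", rank, name);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("Ready\n");                /* matches the trailing "Ready" below */

        MPI_Finalize();
        return 0;
    }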
>>
>>
>> -bash-4.1$ mpich2version
>> MVAPICH2 Version:     	1.8
>> MVAPICH2 Release date:	Mon Apr 30 14:50:19 EDT 2012
>> MVAPICH2 Device:      	ch3:mrail
>> MVAPICH2 configure:
>> 	--prefix=/export/modules/apps/mvapich2/1.8/gnu
>> --with-valgrind=/export/modules/tools/valgrind/3.8.0/include/valgrind
>> --enable-fast=O3 --enable-shared --enable-mpe --with-pmi=slurm 
>> --with-pm=no
>> MVAPICH2 CC:  	gcc -O3 -march=corei7 -mtune=corei7   -O3
>> MVAPICH2 CXX: 	c++ -O3 -march=corei7 -mtune=corei7  -O3
>> MVAPICH2 F77: 	gfortran -O3 -march=corei7 -mtune=corei7  -O3
>> MVAPICH2 FC:  	gfortran -O3 -march=corei7 -mtune=corei7  -O3
>>
>> -bash-4.1$ mpicc hello.c -lpmi -lslurm
>>
>> -bash-4.1$ srun -n 2  ./a.out
>> srun: job 23458 queued and waiting for resources
>> srun: job 23458 has been allocated resources
>> Hello World from process 0 running on asg1
>> Hello World from process 1 running on asg1
>> Ready
>>
>> -bash-4.1$ srun -N 2  ./a.out
>> srun: job 23454 queued and waiting for resources
>> srun: job 23454 has been allocated resources
>> In: PMI_Abort(1, Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(296)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(172)...:
>> rdma_setup_startup_ring(431): cannot create cq
>> )
>> In: PMI_Abort(1, Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(296)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(172)...:
>> rdma_setup_startup_ring(431): cannot create cq
>> )
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> srun: error: asg1: task 0: Exited with exit code 1
>> srun: Terminating job step 23454.0
>> srun: error: asg2: task 1: Exited with exit code 1
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH SIGNAL 9 ***
>>
>> -bash-4.1$ ulimit -l
>> unlimited
>> -bash-4.1$ ssh asg1 -C 'ulimit -l'
>> unlimited
>> -bash-4.1$ ssh asg2 -C 'ulimit -l'
>> unlimited
>> -bash-4.1$
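
For anyone debugging the same symptom: the "cannot create cq" text comes from the InfiniBand verbs layer.  Creating a completion queue has to pin memory, which counts against the memlock (ulimit -l) limit, so a 64 kB limit inside the job step is enough to make it fail even though the interactive shells above report unlimited.  A small standalone probe, not part of the original thread and shown only as a sketch, that reproduces the same class of failure under a low memlock limit:

    /* cq_probe.c -- illustrative only, not from the thread.
     * Build with: gcc cq_probe.c -o cq_probe -libverbs
     * Run with a low limit (e.g. ulimit -l 64) to see ibv_create_cq() fail,
     * the same failure MVAPICH2 hits in rdma_setup_startup_ring(). */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx) { perror("ibv_open_device"); return 1; }

        /* 1024 CQ entries is an arbitrary size; the CQ memory must be pinned */
        struct ibv_cq *cq = ibv_create_cq(ctx, 1024, NULL, NULL, 0);
        if (!cq)
            perror("ibv_create_cq");   /* typically fails when memlock is too small */
        else
            printf("cq created ok\n");

        if (cq) ibv_destroy_cq(cq);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }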
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


