[mvapich-discuss] slurm+mvapich2 -> cannot create cq ?

Evren Yurtesen IB eyurtese at abo.fi
Thu Sep 6 15:40:28 EDT 2012


Hi Jonathan,

You are right, I found out that I get a memlock limit of 64 when I run a 
process with srun, which is strange, but still...

I have created /etc/sysconfig/slurm with a reasonable limit and 
restarted the slurm processes.
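
For the record, that file is just a shell fragment that the RedHat init 
script sources before starting slurmd, so it contains essentially the 
line you suggested:

    ulimit -l unlimited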

It is strange, since I have unlimited set in 
/etc/security/limits.d/memlock.conf: when I log in as root I get 
unlimited as the limit, and if I restart slurm manually it also picks up 
unlimited (instead of 64). But after a reboot and an automatic start of 
the slurm process, it gets 64 as the limit (I wonder where it finds 
that). Anyway, your advice fixed the issue.

Thanks!
Evren

On Thu, 6 Sep 2012, Jonathan Perkins wrote:

> Hi, it looks like you checked ulimit -l outside of srun.  It's possible
> that you're getting a lower memory limit because of the limits imposed
> on slurm when the service started.
>
> Can you run the following?
>
>    [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
>    test2: unlimited
>    test1: unlimited
>    [perkinjo at nowlab ~]$ cat ulimit.sh
>    #!/bin/sh
>
>    echo $(hostname): $(ulimit -l)
>
> If it doesn't show unlimited (or some other number much higher than 64)
> then you'll need to do something to update the limits slurm is using.
> On redhat systems you can put the following in /etc/sysconfig/slurm.
>
>    ulimit -l unlimited
>
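>    As a side note, slurm can also propagate the resource limits of the
>    shell that invoked srun; that's the PropagateResourceLimits option
>    in slurm.conf, for example:
>
>        PropagateResourceLimits=ALL
>
>    Whether it kicks in depends on your configuration, but raising the
>    limit the slurm service itself starts with, as above, is the usual
>    fix.
>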
>
> On Thu, Sep 06, 2012 at 05:56:05PM +0300, Evren Yurtesen IB wrote:
>> Hello,
>> I am trying to run a simple hello.c program with slurm+mvapich2.
>> I have searched a lot but could not find the solution to my problem.
>> The closest matches were:
>> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-December/002075.html
>> and
>> http://www.mail-archive.com/ewg@lists.openfabrics.org/msg11087.html
>> but I already have the memlock limits set to unlimited...
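>>
>> For reference, that file sets the usual pair of pam_limits entries,
>> roughly:
>>
>>     *  soft  memlock  unlimited
>>     *  hard  memlock  unlimited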
>>
>> Any ideas on what might be going wrong?
>>
>> Running 2 processes on the same node works fine...
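>>
>> hello.c here is just the textbook MPI hello world, roughly:
>>
>>     #include <mpi.h>
>>     #include <stdio.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         int rank, len;
>>         char host[MPI_MAX_PROCESSOR_NAME];
>>
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         MPI_Get_processor_name(host, &len);
>>         printf("Hello World from process %d running on %s\n", rank, host);
>>         MPI_Barrier(MPI_COMM_WORLD);   /* wait so "Ready" prints last */
>>         if (rank == 0)
>>             printf("Ready\n");
>>         MPI_Finalize();
>>         return 0;
>>     }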
>>
>>
>> -bash-4.1$ mpich2version
>> MVAPICH2 Version:     	1.8
>> MVAPICH2 Release date:	Mon Apr 30 14:50:19 EDT 2012
>> MVAPICH2 Device:      	ch3:mrail
>> MVAPICH2 configure:
>> 	--prefix=/export/modules/apps/mvapich2/1.8/gnu
>> --with-valgrind=/export/modules/tools/valgrind/3.8.0/include/valgrind
>> --enable-fast=O3 --enable-shared --enable-mpe --with-pmi=slurm
>> --with-pm=no
>> MVAPICH2 CC:  	gcc -O3 -march=corei7 -mtune=corei7   -O3
>> MVAPICH2 CXX: 	c++ -O3 -march=corei7 -mtune=corei7  -O3
>> MVAPICH2 F77: 	gfortran -O3 -march=corei7 -mtune=corei7  -O3
>> MVAPICH2 FC:  	gfortran -O3 -march=corei7 -mtune=corei7  -O3
>>
>> -bash-4.1$ mpicc hello.c -lpmi -lslurm
>>
>> -bash-4.1$ srun -n 2  ./a.out
>> srun: job 23458 queued and waiting for resources
>> srun: job 23458 has been allocated resources
>> Hello World from process 0 running on asg1
>> Hello World from process 1 running on asg1
>> Ready
>>
>> -bash-4.1$ srun -N 2  ./a.out
>> srun: job 23454 queued and waiting for resources
>> srun: job 23454 has been allocated resources
>> In: PMI_Abort(1, Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(296)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(172)...:
>> rdma_setup_startup_ring(431): cannot create cq
>> )
>> In: PMI_Abort(1, Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(296)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(172)...:
>> rdma_setup_startup_ring(431): cannot create cq
>> )
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> srun: error: asg1: task 0: Exited with exit code 1
>> srun: Terminating job step 23454.0
>> srun: error: asg2: task 1: Exited with exit code 1
>> slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>> slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06
>> WITH SIGNAL 9 ***
>>
>> -bash-4.1$ ulimit -l
>> unlimited
>> -bash-4.1$ ssh asg1 -C 'ulimit -l'
>> unlimited
>> -bash-4.1$ ssh asg2 -C 'ulimit -l'
>> unlimited
>> -bash-4.1$
>
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>

