[mvapich-discuss] Help! Problems Slurm and MVAPICH2

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Sep 14 08:23:49 EDT 2012


You're hitting a failure to create a completion queue (CQ).  Take a look
at the following section of our userguide.  Since you're also using
slurm, I'm posting a link to their FAQ as well.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
https://computing.llnl.gov/linux/slurm/faq.html#memlock

Basically you'll want to make sure memlock is set to unlimited in
/etc/security/limits.conf and that slurm respects this as well.  On
our systems we have added `ulimit -l unlimited' to
/etc/sysconfig/slurm (RedHat systems).
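For reference, the relevant settings look roughly like this (a sketch, not verbatim from any particular system; the /etc/sysconfig path is a RedHat convention, so on Ubuntu you may need to adjust it or the slurmd init script instead):

```shell
# /etc/security/limits.conf -- allow all users to lock unlimited memory.
# InfiniBand needs locked (pinned) memory to register CQs/QPs, and the
# default of 64 KB is far too small.
*  soft  memlock  unlimited
*  hard  memlock  unlimited

# /etc/sysconfig/slurm -- raise the limit for the slurmd daemon itself,
# so job steps launched through srun inherit it.
ulimit -l unlimited
```

After changing these, restart slurmd on each compute node and re-check with the ulimit.sh script from the earlier message; it should report "unlimited" rather than 64.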

On Fri, Sep 14, 2012 at 01:13:07PM +0200, José Manuel Molero wrote:
> The output after rebuild:
> 
> 
>  srun -N 2 helloworld 
> 
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......: 
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........: 
> MPIDI_CH3I_RDMA_init(171)...: 
> rdma_setup_startup_ring(434): cannot create cq
> )
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......: 
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........: 
> MPIDI_CH3I_RDMA_init(171)...: 
> rdma_setup_startup_ring(434): cannot create cq
> )
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> slurmd[node17]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:12 WITH SIGNAL 9 ***
> srun: error: bullxual17: task 0: Exited with exit code 1
> slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
> slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
> srun: error: node18: task 1: Exited with exit code 1
> 
> 
> 
> 
> From: jmlero at hotmail.com
> To: perkinjo at cse.ohio-state.edu; mvapich-discuss at cse.ohio-state.edu
> Subject: RE: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> Date: Fri, 14 Sep 2012 11:01:04 +0200
> 
> 
> 
> 
> Thanks for your response.
> 
> Compiling without the extra flags, the result is the same.
> 
> The result when I execute ulimit.sh:
> 
> :~$ srun -N2 ulimit.sh 
> node18: 64
> node17: 64
> 
> 
> 
> 
> Now I'm rebuilding with `--enable-g=dbg --disable-fast' added.
> 
> 
> Thanks!
> 
> 
> > Date: Thu, 13 Sep 2012 07:59:12 -0400
> > From: perkinjo at cse.ohio-state.edu
> > To: jmlero at hotmail.com
> > CC: mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> > 
> > On Thu, Sep 13, 2012 at 09:58:59AM +0200, José Manuel Molero wrote:
> > > Hello,
> > 
> > Hi, my reply is inline.
> > 
> > > We have a new cluster with an Infiniband network, and I think that Slurm and MVAPICH2 would be the best option in this case.
> > > 
> > > I have configured SLURM 2.3.2 on Ubuntu Server and its works.
> > > 
> > > Now I tried to install MVAPICH2 1.8, with the following:
> > > 
> > > ./configure --with-pm=none --with-pmi=slurm ;  make ; make install  (in the front end and all the compute nodes)
> > 
> > Looks good so far.
> > 
> > > 
> > > But it doesn't work.
> > > 
> > > I compile using :
> > > 
> > > mpicc file.c -o file -lpmi -L/usr/include/slurm/
> > 
> > This step should be unnecessary.  Try just using:
> > 
> >     mpicc file.c -o file
> > 
> > > 
> > > and then:
> > > 
> > >  srun -N2 file
> > > 
> > > And the result is:
> > > 
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > srun: error: node17: task 1: Exited with exit code 1
> > > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > > srun: error: node16: task 0: Exited with exit code 1
> > > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > > 
> > > 
> > > 
> > > And the output of mpiname -a
> > > 
> > > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> > > 
> > > Compilation
> > > CC: gcc    -DNDEBUG -DNVALGRIND -O2
> > > CXX: c++   -DNDEBUG -DNVALGRIND -O2
> > > F77: gfortran   -O2 
> > > FC: gfortran   -O2
> > > 
> > > Configuration
> > > --with-pm=none --with-pmi=slurm
> > > 
> > > 
> > > 
> > > What am I doing wrong?
> > 
> > I think the only thing that is getting tripped up is the direct
> > linking to slurm's PMI library.  Let us know how it goes when you try
> > the command without those linking options.
> > 
> > Another thing that you may want to check is that `ulimit -l' returns
> > unlimited (or some other value much higher than 64) on each host when
> > using slurm.
> > 
> >     [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
> >     test2: unlimited
> >     test1: unlimited
> >     [perkinjo at nowlab ~]$ cat ulimit.sh
> >     #!/bin/sh
> > 
> >     echo $(hostname): $(ulimit -l)
> > 
> > For more debugging information you may want to rebuild mvapich2 with
> > `--enable-g=dbg --disable-fast' added to the configure line.
> > Hope this info helps.
> > 
> > -- 
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
