[mvapich-discuss] Help! Problems Slurm and MVAPICH2

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Jan 23 14:24:07 EST 2013


FYI, this issue turned out to be related to his OFED installation.
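
A quick sanity check for this class of failure (a suggested diagnostic,
not part of the original exchange) is to exercise the verbs layer
directly as the affected non-root user, with no MPI or Slurm involved:

    $ ibv_devinfo    # should list the HCA and its ports without errors

If this works as root but fails for a normal user, the OFED
installation or the permissions on the /dev/infiniband devices are the
likely culprit rather than MVAPICH2 or Slurm.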

On Tue, Jan 22, 2013 at 09:56:27AM +0100, José Manuel Molero wrote:
> Hi,
> 
> Thanks for the response. I'm still having the same problem.
> When I execute a program as root, everything works fine.
> 
> When I execute a program as a normal user, I get the same error:
> 
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......: 
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........: 
> MPIDI_CH3I_RDMA_init(171)...: 
> rdma_setup_startup_ring(434): cannot create cq
> )
> 
> 
> On the front end and all compute nodes, the result of ulimit -l is:
> unlimited
> 
> What is the problem?
> 
> Thanks.
> 
> 
> 
> > Date: Fri, 14 Sep 2012 08:23:49 -0400
> > From: perkinjo at cse.ohio-state.edu
> > To: jmlero at hotmail.com
> > CC: mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> > 
> > You're hitting a CQ (completion queue) creation failure.  Take a look
> > at the following section of our userguide.  You're also using Slurm,
> > so I'm posting a link to their FAQ as well.
> > 
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
> > https://computing.llnl.gov/linux/slurm/faq.html#memlock
> > 
> > Basically you'll want to make sure memlock is set to unlimited in
> > /etc/security/limits.conf and that slurm is respecting this as well.  On
> > our systems we have added `ulimit -l unlimited' into
> > /etc/sysconfig/slurm (redhat systems).
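> > 
> > As a concrete sketch (the exact values are site policy; the wildcard
> > applies the limit to all users):
> > 
> >     # /etc/security/limits.conf
> >     *    soft    memlock    unlimited
> >     *    hard    memlock    unlimited
> > 
> >     # /etc/sysconfig/slurm (Red Hat systems)
> >     ulimit -l unlimited
> > 
> > Note that slurmd inherits the limits in effect when it starts, so
> > restart the daemons on the compute nodes after changing these.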
> > 
> > On Fri, Sep 14, 2012 at 01:13:07PM +0200, José Manuel Molero wrote:
> > > The output after rebuild:
> > > 
> > > 
> > >  srun -N 2 helloworld 
> > > 
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error, error stack:
> > > MPIR_Init_thread(408).......: 
> > > MPID_Init(296)..............: channel initialization failed
> > > MPIDI_CH3_Init(283).........: 
> > > MPIDI_CH3I_RDMA_init(171)...: 
> > > rdma_setup_startup_ring(434): cannot create cq
> > > )
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error, error stack:
> > > MPIR_Init_thread(408).......: 
> > > MPID_Init(296)..............: channel initialization failed
> > > MPIDI_CH3_Init(283).........: 
> > > MPIDI_CH3I_RDMA_init(171)...: 
> > > rdma_setup_startup_ring(434): cannot create cq
> > > )
> > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > slurmd[node17]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:12 WITH SIGNAL 9 ***
> > > srun: error: node17: task 0: Exited with exit code 1
> > > slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
> > > slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
> > > srun: error: node18: task 1: Exited with exit code 1
> > > 
> > > 
> > > 
> > > 
> > > From: jmlero at hotmail.com
> > > To: perkinjo at cse.ohio-state.edu; mvapich-discuss at cse.ohio-state.edu
> > > Subject: RE: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> > > Date: Fri, 14 Sep 2012 11:01:04 +0200
> > > 
> > > 
> > > 
> > > 
> > > Thanks for your response.
> > > 
> > > Compiling without those linking flags, the result is the same.
> > > 
> > > The result when I execute ulimit.sh:
> > > 
> > > :~$ srun -N2 ulimit.sh 
> > > node18: 64
> > > node17: 64
> > > 
> > > 
> > > 
> > > 
> > > Now I'm rebuilding, adding `--enable-g=dbg --disable-fast' to the
> > > configure line.
> > > 
> > > 
> > > Thanks!
> > > 
> > > 
> > > > Date: Thu, 13 Sep 2012 07:59:12 -0400
> > > > From: perkinjo at cse.ohio-state.edu
> > > > To: jmlero at hotmail.com
> > > > CC: mvapich-discuss at cse.ohio-state.edu
> > > > Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> > > > 
> > > > On Thu, Sep 13, 2012 at 09:58:59AM +0200, José Manuel Molero wrote:
> > > > > Hello,
> > > > 
> > > > Hi, my reply is inline.
> > > > 
> > > > > We have a new cluster with an InfiniBand network, and I think that Slurm and MVAPICH2 would be the best option in this case.
> > > > > 
> > > > > I have configured SLURM 2.3.2 on Ubuntu Server and it works.
> > > > > 
> > > > > Now I tried to install MVAPICH2 1.8 with the following (on the front end and all the compute nodes):
> > > > > 
> > > > > ./configure --with-pm=none --with-pmi=slurm ; make ; make install
> > > > 
> > > > Looks good so far.
> > > > 
> > > > > 
> > > > > But it doesn't work.
> > > > > 
> > > > > I compile using:
> > > > > 
> > > > > mpicc file.c -o file -lpmi -L/usr/include/slurm/
> > > > 
> > > > This step should be unnecessary.  Try just using:
> > > > 
> > > >     mpicc file.c -o file
> > > > 
> > > > > 
> > > > > and then:
> > > > > 
> > > > >  srun -N2 file
> > > > > 
> > > > > And the result is:
> > > > > 
> > > > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > > > Other MPI error
> > > > > )
> > > > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > > > Other MPI error
> > > > > )
> > > > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > > > srun: error: node17: task 1: Exited with exit code 1
> > > > > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > > > > srun: error: node16: task 0: Exited with exit code 1
> > > > > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > > > > 
> > > > > 
> > > > > 
> > > > > And the output of mpiname -a
> > > > > 
> > > > > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> > > > > 
> > > > > Compilation
> > > > > CC: gcc    -DNDEBUG -DNVALGRIND -O2
> > > > > CXX: c++   -DNDEBUG -DNVALGRIND -O2
> > > > > F77: gfortran   -O2 
> > > > > FC: gfortran   -O2
> > > > > 
> > > > > Configuration
> > > > > --with-pm=none --with-pmi=slurm
> > > > > 
> > > > > 
> > > > > 
> > > > > What am I doing wrong?
> > > > 
> > > > I think the only thing getting tripped up is the direct linking to
> > > > Slurm's PMI library.  Let us know how it goes when you try the
> > > > command without those linking options.
> > > > 
> > > > Another thing that you may want to check is that `ulimit -l' returns
> > > > unlimited (or some other value much higher than 64) on each host when
> > > > using slurm.
> > > > 
> > > >     [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
> > > >     test2: unlimited
> > > >     test1: unlimited
> > > >     [perkinjo at nowlab ~]$ cat ulimit.sh
> > > >     #!/bin/sh
> > > > 
> > > >     echo $(hostname): $(ulimit -l)
> > > > 
> > > > For more debugging information you may want to rebuild mvapich2
> > > > with the addition of `--enable-g=dbg --disable-fast' to the
> > > > configure line.
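> > > > 
> > > > For example, keeping your original options, the rebuild would look
> > > > something like this:
> > > > 
> > > >     ./configure --with-pm=none --with-pmi=slurm \
> > > >         --enable-g=dbg --disable-fast
> > > >     make && make install
> > > > 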
> > > > Hope this info helps.
> > > > 
> > > > -- 
> > > > Jonathan Perkins
> > > > http://www.cse.ohio-state.edu/~perkinjo
> > 
> > -- 
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


