[mvapich-discuss] Help! Problems Slurm and MVAPICH2

José Manuel Molero jmlero at hotmail.com
Tue Jan 22 03:56:27 EST 2013


Hi,

Thanks for the response. I'm still seeing the same problem.
When I run a program as root, everything works fine.

When I run a program as a normal user, I get this error:

In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......: 
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........: 
MPIDI_CH3I_RDMA_init(171)...: 
rdma_setup_startup_ring(434): cannot create cq
)


On the front end and on all compute nodes, the result of `ulimit -l' is:
unlimited
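
One thing worth double-checking: the limit reported by an interactive shell
can differ from the one applied to job steps, because slurmd hands its own
limits (those in effect when the daemon was started) down to the tasks it
launches.  A sketch of a one-liner to see the in-job value (equivalent to
the ulimit.sh script quoted below):

    srun -N 2 bash -c 'echo "$(hostname): $(ulimit -l)"'

If this prints 64 while the login shells say unlimited, slurmd needs to be
restarted after raising the limit.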

What is the problem?

Thanks.



> Date: Fri, 14 Sep 2012 08:23:49 -0400
> From: perkinjo at cse.ohio-state.edu
> To: jmlero at hotmail.com
> CC: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> 
> You're hitting a CQ (completion queue) creation failure.  Take a look at
> the following section of our user guide.  Since you're also using Slurm,
> I'm posting a link to their FAQ as well.
> 
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
> https://computing.llnl.gov/linux/slurm/faq.html#memlock
> 
> Basically you'll want to make sure the memlock limit is set to unlimited
> in /etc/security/limits.conf and that Slurm respects it as well.  On our
> systems we have added `ulimit -l unlimited' to /etc/sysconfig/slurm
> (Red Hat systems).
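> 
> A minimal sketch of those two settings (paths assume a Red Hat-style
> layout; adjust for your distribution):
> 
>     # /etc/security/limits.conf -- allow unlimited locked memory
>     *   soft   memlock   unlimited
>     *   hard   memlock   unlimited
> 
>     # /etc/sysconfig/slurm -- raise the limit for slurmd itself
>     ulimit -l unlimited
> 
> Note that slurmd must be restarted for the new limit to take effect.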
> 
> On Fri, Sep 14, 2012 at 01:13:07PM +0200, José Manuel Molero wrote:
> > The output after rebuild:
> > 
> > 
> >  srun -N 2 helloworld 
> > 
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error, error stack:
> > MPIR_Init_thread(408).......: 
> > MPID_Init(296)..............: channel initialization failed
> > MPIDI_CH3_Init(283).........: 
> > MPIDI_CH3I_RDMA_init(171)...: 
> > rdma_setup_startup_ring(434): cannot create cq
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error, error stack:
> > MPIR_Init_thread(408).......: 
> > MPID_Init(296)..............: channel initialization failed
> > MPIDI_CH3_Init(283).........: 
> > MPIDI_CH3I_RDMA_init(171)...: 
> > rdma_setup_startup_ring(434): cannot create cq
> > )
> > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > slurmd[node17]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:12 WITH SIGNAL 9 ***
> > srun: error: bullxual17: task 0: Exited with exit code 1
> > slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
> > slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
> > srun: error: node18: task 1: Exited with exit code 1
> > 
> > 
> > 
> > 
> > From: jmlero at hotmail.com
> > To: perkinjo at cse.ohio-state.edu; mvapich-discuss at cse.ohio-state.edu
> > Subject: RE: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> > Date: Fri, 14 Sep 2012 11:01:04 +0200
> > 
> > 
> > 
> > 
> > Thanks for your response.
> > 
> > Compiling without those flags, the result is the same.
> > 
> > The result when I execute ulimit.sh:
> > 
> > :~$ srun -N2 ulimit.sh 
> > node18: 64
> > node17: 64
> > 
> > 
> > 
> > 
> > Now I'm rebuilding with `--enable-g=dbg --disable-fast' added.
> > 
> > 
> > Thanks!
> > 
> > 
> > > Date: Thu, 13 Sep 2012 07:59:12 -0400
> > > From: perkinjo at cse.ohio-state.edu
> > > To: jmlero at hotmail.com
> > > CC: mvapich-discuss at cse.ohio-state.edu
> > > Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> > > 
> > > On Thu, Sep 13, 2012 at 09:58:59AM +0200, José Manuel Molero wrote:
> > > > Hello,
> > > 
> > > Hi, my reply is inline.
> > > 
> > > > We have a new cluster with an InfiniBand network, and I think that Slurm and MVAPICH2 would be the best option in this case.
> > > > 
> > > > I have configured Slurm 2.3.2 on Ubuntu Server and it works.
> > > > 
> > > > Now I have tried to install MVAPICH2 1.8 as follows:
> > > > 
> > > > ./configure --with-pm=none --with-pmi=slurm ;  make ; make install  (on the front end and all compute nodes)
> > > 
> > > Looks good so far.
> > > 
> > > > 
> > > > But it doesn't work.
> > > > 
> > > > I compile using:
> > > > 
> > > > mpicc file.c -o file -lpmi -L/usr/include/slurm/
> > > 
> > > This step should be unnecessary.  Try just using:
> > > 
> > >     mpicc file.c -o file
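> > > 
> > > Any small MPI program is enough for this kind of smoke test; a
> > > minimal sketch (not the original file.c from this thread):
> > > 
> > >     #include <mpi.h>
> > >     #include <stdio.h>
> > > 
> > >     int main(int argc, char **argv)
> > >     {
> > >         int rank, size;
> > >         MPI_Init(&argc, &argv);  /* the reported failure occurs here */
> > >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >         MPI_Comm_size(MPI_COMM_WORLD, &size);
> > >         printf("Hello from rank %d of %d\n", rank, size);
> > >         MPI_Finalize();
> > >         return 0;
> > >     }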
> > > 
> > > > 
> > > > and then:
> > > > 
> > > >  srun -N2 file
> > > > 
> > > > And the result is:
> > > > 
> > > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > > Other MPI error
> > > > )
> > > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > > Other MPI error
> > > > )
> > > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > > srun: error: node17: task 1: Exited with exit code 1
> > > > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > > > srun: error: node16: task 0: Exited with exit code 1
> > > > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > > > 
> > > > 
> > > > 
> > > > And the output of mpiname -a
> > > > 
> > > > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> > > > 
> > > > Compilation
> > > > CC: gcc    -DNDEBUG -DNVALGRIND -O2
> > > > CXX: c++   -DNDEBUG -DNVALGRIND -O2
> > > > F77: gfortran   -O2 
> > > > FC: gfortran   -O2
> > > > 
> > > > Configuration
> > > > --with-pm=none --with-pmi=slurm
> > > > 
> > > > 
> > > > 
> > > > What am I doing wrong?
> > > 
> > > I think the only thing tripping this up is the direct link against
> > > Slurm's PMI library.  Let us know how it goes when you try the
> > > command without those linking options.
> > > 
> > > Another thing that you may want to check is that `ulimit -l' returns
> > > unlimited (or some other value much higher than 64) on each host when
> > > using slurm.
> > > 
> > >     [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
> > >     test2: unlimited
> > >     test1: unlimited
> > >     [perkinjo at nowlab ~]$ cat ulimit.sh
> > >     #!/bin/sh
> > > 
> > >     echo $(hostname): $(ulimit -l)
> > > 
> > > For more debugging information you may want to rebuild MVAPICH2 with
> > > `--enable-g=dbg --disable-fast' added to the configure line.
> > > Hope this info helps.
> > > 
> > > -- 
> > > Jonathan Perkins
> > > http://www.cse.ohio-state.edu/~perkinjo
> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo