[mvapich-discuss] Help! Problems Slurm and MVAPICH2

José Manuel Molero jmlero at hotmail.com
Fri Sep 14 07:13:07 EDT 2012


The output after rebuild:


 srun -N 2 helloworld 

In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......: 
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........: 
MPIDI_CH3I_RDMA_init(171)...: 
rdma_setup_startup_ring(434): cannot create cq
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......: 
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........: 
MPIDI_CH3I_RDMA_init(171)...: 
rdma_setup_startup_ring(434): cannot create cq
)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[node17]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:12 WITH SIGNAL 9 ***
srun: error: bullxual17: task 0: Exited with exit code 1
slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
srun: error: node18: task 1: Exited with exit code 1




From: jmlero at hotmail.com
To: perkinjo at cse.ohio-state.edu; mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
Date: Fri, 14 Sep 2012 11:01:04 +0200




Thanks for your response.

Compiling without the extra linking flags, the result is the same.

The result when I execute ulimit.sh:

:~$ srun -N2 ulimit.sh 
node18: 64
node17: 64
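
A value of 64 is the low limit that the quoted reply below asks about. As a
rough sketch, a variant of ulimit.sh could label each host's limit directly.
The function name memlock_status, the script name check_memlock.sh and the
65536 KiB threshold are illustrative assumptions, not anything prescribed by
MVAPICH2 or Slurm:

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration): classify a
# locked-memory limit as reported by `ulimit -l`, which is either a
# number of KiB or the word "unlimited".
# The default threshold of 65536 KiB is an arbitrary example value.
memlock_status() {
    limit=$1
    min_kib=${2:-65536}
    case $limit in
        unlimited)
            echo ok ;;
        ''|*[!0-9]*)
            # Empty or non-numeric input: cannot be judged.
            echo unknown ;;
        *)
            if [ "$limit" -ge "$min_kib" ]; then
                echo ok
            else
                echo too_low
            fi ;;
    esac
}

# Print one line per host: name, raw limit, and the classification.
current=$(ulimit -l)
echo "$(hostname): $current ($(memlock_status "$current"))"
```

Launched as `srun -N2 check_memlock.sh', this would print one line per node;
with the limits shown above, each line would end in "(too_low)".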




Now I'm rebuilding with `--enable-g=dbg --disable-fast' added.


Thanks!


> Date: Thu, 13 Sep 2012 07:59:12 -0400
> From: perkinjo at cse.ohio-state.edu
> To: jmlero at hotmail.com
> CC: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
> 
> On Thu, Sep 13, 2012 at 09:58:59AM +0200, José Manuel Molero wrote:
> > Hello,
> 
> Hi, my reply is inline.
> 
> > We have a new cluster with an Infiniband network, and I think that Slurm and MVAPICH2 would be the best option in this case.
> > 
> > I have configured SLURM 2.3.2 on Ubuntu Server and it works.
> > 
> > Now I tried to install MVAPICH2 1.8, with the following:
> > 
> > ./configure --with-pm=none --with-pmi=slurm ;  make ; make install  (in the front end and all the compute nodes)
> 
> Looks good so far.
> 
> > 
> > But it doesn't work.
> > 
> > I compile using :
> > 
> > mpicc file.c -o file -lpmi -L/usr/include/slurm/
> 
> This step should be unnecessary.  Try just using:
> 
>     mpicc file.c -o file
> 
> > 
> > and then:
> > 
> >  srun -N2 file
> > 
> > And the result is:
> > 
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > srun: error: node17: task 1: Exited with exit code 1
> > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > srun: error: node16: task 0: Exited with exit code 1
> > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > 
> > 
> > 
> > And the output of mpiname -a
> > 
> > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> > 
> > Compilation
> > CC: gcc    -DNDEBUG -DNVALGRIND -O2
> > CXX: c++   -DNDEBUG -DNVALGRIND -O2
> > F77: gfortran   -O2 
> > FC: gfortran   -O2
> > 
> > Configuration
> > --with-pm=none --with-pmi=slurm
> > 
> > 
> > 
> > What am I doing wrong?
> 
> I think the only thing that is getting tripped up is the direct linking
> to Slurm's PMI library.  Let us know how it goes when you try the command
> without those linking options.
> 
> Another thing that you may want to check is that `ulimit -l' returns
> unlimited (or some other value much higher than 64) on each host when
> using slurm.
> 
>     [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
>     test2: unlimited
>     test1: unlimited
>     [perkinjo at nowlab ~]$ cat ulimit.sh
>     #!/bin/sh
> 
>     echo $(hostname): $(ulimit -l)
> 
> For more debugging information you may want to rebuild mvapich2 with
> the addition of `--enable-g=dbg --disable-fast' to the configure line.
> Hope this info helps.
> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
 		 	   		   		 	   		  

