[mvapich-discuss] Help! Problems Slurm and MVAPICH2
José Manuel Molero
jmlero at hotmail.com
Fri Sep 14 07:13:07 EDT 2012
The output after the rebuild:
srun -N 2 helloworld
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(171)...:
rdma_setup_startup_ring(434): cannot create cq
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(171)...:
rdma_setup_startup_ring(434): cannot create cq
)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[node17]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:12 WITH SIGNAL 9 ***
srun: error: bullxual17: task 0: Exited with exit code 1
slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
slurmd[node18]: *** STEP 127.0 KILLED AT 2012-09-14T13:00:16 WITH SIGNAL 9 ***
srun: error: node18: task 1: Exited with exit code 1
From: jmlero at hotmail.com
To: perkinjo at cse.ohio-state.edu; mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
Date: Fri, 14 Sep 2012 11:01:04 +0200
Thanks for your response.
Compiling without the extra linking flags gives the same result.
This is the output when I run ulimit.sh:
:~$ srun -N2 ulimit.sh
node18: 64
node17: 64
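
A value of 64 is exactly the low locked-memory limit that the quoted reply below warns about, and a low limit of this kind is a plausible (though not confirmed here) cause of the "cannot create cq" failure. As a minimal sketch, ulimit.sh could be extended to flag such hosts. The function name check_memlock, the MIN_KB variable, and the cut-off of 64 are assumptions made for illustration; they do not come from MVAPICH2 or Slurm.

```shell
#!/bin/sh
# Hypothetical variant of ulimit.sh: print the locked-memory limit of the
# host and a rough verdict. MIN_KB is an illustrative cut-off only; the
# thread just says the limit should be "much higher than 64".
MIN_KB=${MIN_KB:-64}

# check_memlock VALUE
# VALUE is what `ulimit -l` printed: a number or "unlimited".
check_memlock() {
    case "$1" in
        unlimited)
            echo "ok" ;;
        ''|*[!0-9]*)
            # Empty or non-numeric: do not guess.
            echo "unknown" ;;
        *)
            if [ "$1" -gt "$MIN_KB" ]; then
                echo "ok"
            else
                echo "low"
            fi ;;
    esac
}

limit=$(ulimit -l)
echo "$(hostname): $limit ($(check_memlock "$limit"))"
```

Run it the same way as the original script, for example `srun -N2 ./ulimit.sh`, so that the limit is read inside the Slurm job step rather than in a login shell.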
I'm now rebuilding with `--enable-g=dbg --disable-fast' added to the configure line.
Thanks!
> Date: Thu, 13 Sep 2012 07:59:12 -0400
> From: perkinjo at cse.ohio-state.edu
> To: jmlero at hotmail.com
> CC: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Help! Problems Slurm and MVAPICH2
>
> On Thu, Sep 13, 2012 at 09:58:59AM +0200, José Manuel Molero wrote:
> > Hello,
>
> Hi, my reply is inline.
>
> > We have a new cluster with an Infiniband network, and I think that Slurm and MVAPICH2 would be the best option in this case.
> >
> > I have configured SLURM 2.3.2 on Ubuntu Server and it works.
> >
> > Now I tried to install MVAPICH2 1.8, with the following:
> >
> > ./configure --with-pm=none --with-pmi=slurm ; make ; make install (on the front end and all the compute nodes)
>
> Looks good so far.
>
> >
> > But it doesn't work.
> >
> > I compile using:
> >
> > mpicc file.c -o file -lpmi -L/usr/include/slurm/
>
> This step should be unnecessary. Try just using:
>
> mpicc file.c -o file
>
> >
> > and then:
> >
> > srun -N2 file
> >
> > And the result is:
> >
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > srun: error: node17: task 1: Exited with exit code 1
> > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> > srun: error: node16: task 0: Exited with exit code 1
> > slurmd[node16]: *** STEP 102.0 KILLED AT 2012-09-13T09:56:18 WITH SIGNAL 9 ***
> >
> >
> >
> > And the output of mpiname -a
> >
> > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> >
> > Compilation
> > CC: gcc -DNDEBUG -DNVALGRIND -O2
> > CXX: c++ -DNDEBUG -DNVALGRIND -O2
> > F77: gfortran -O2
> > FC: gfortran -O2
> >
> > Configuration
> > --with-pm=none --with-pmi=slurm
> >
> >
> >
> > What am I doing wrong?
>
> I think the only thing that is getting tripped up is the direct linking
> to Slurm's PMI library. Let us know how it goes when you try the command
> without those linking options.
>
> Another thing that you may want to check is that `ulimit -l' returns
> unlimited (or some other value much higher than 64) on each host when
> using slurm.
>
> [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
> test2: unlimited
> test1: unlimited
> [perkinjo at nowlab ~]$ cat ulimit.sh
> #!/bin/sh
>
> echo $(hostname): $(ulimit -l)
>
> For more debugging information you may want to rebuild MVAPICH2 with
> the addition of `--enable-g=dbg --disable-fast' to the configure line.
> Hope this info helps.
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo