[mvapich-discuss] Issues running mvapich2 with slurm
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Mon Oct 22 14:11:10 EDT 2012
On Mon, Oct 22, 2012 at 12:26:41PM -0400, Matthew Russell wrote:
> Hi,
Hello, I'll respond inline.
> I'm not sure whether I should be posting this here or to a slurm mailing
> list, figured I'd try here though.
>
> I can't seem to run even simple hello_world executables with mvapich2 with
> slurm.
>
> I built mvapich2 to use slurm as its PMI, i.e.
>
> ./configure --prefix=/cm/shared/apps/mvapich2/pgi/64/1.8 --with-pmi=slurm \
> --with-pm=no CPPFLAGS=-I/cm/shared/apps/slurm/2.4.2/slurm-2.4.2/ \
> LDFLAGS=-L/cm/shared/apps/slurm/2.4.2/lib/
For more debugging information you may want to rebuild mvapich2 with
`--enable-g=dbg --disable-fast' added to the configure line.
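For reference, a debug rebuild might look like the following sketch. It
reuses the prefix and slurm paths from your quoted configure line; adjust
them if your layout differs:

```shell
# Reconfigure with debugging symbols and without the fast-path
# optimizations (assumes the same prefix/slurm paths quoted above).
./configure --prefix=/cm/shared/apps/mvapich2/pgi/64/1.8 \
    --with-pmi=slurm --with-pm=no \
    --enable-g=dbg --disable-fast \
    CPPFLAGS=-I/cm/shared/apps/slurm/2.4.2/slurm-2.4.2/ \
    LDFLAGS=-L/cm/shared/apps/slurm/2.4.2/lib/
make && make install
```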
> When I try to run apps though, I get:
> [matt at dena]~/cluster_tests% srun -n16 --mpi=none hello_mvapich2_slurm
> srun: error: Unable to confirm allocation for job 14: Invalid job id
> specified
> srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
I believe the above issue is related to slurm. I can help you with the
issue you have noted below.
> When I first allocate cores, I get:
> [matt at dena]~/cluster_tests% salloc -N 2
> salloc: Granted job allocation 16
> [matt at dena]~/cluster_tests% srun
> /home/matt/cluster_tests/hello_mvapich2_slurm
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> slurmd[dena2]: *** STEP 16.0 KILLED AT 2012-10-19T12:45:22 WITH SIGNAL 9 ***
> srun: error: dena2: task 1: Exited with exit code 1
> srun: error: dena1: task 0: Exited with exit code 1
> slurmd[dena1]: *** STEP 16.0 KILLED AT 2012-10-19T12:45:22 WITH SIGNAL 9 ***
> slurmd[dena2]: *** STEP 16.0 KILLED AT 2012-10-19T12:45:22 WITH SIGNAL 9 ***
>
> I've searched around, but have had no luck finding solutions to either
> problem. I have no idea how to proceed.
One thing that you may want to check is that `ulimit -l' returns
unlimited (or some other value much higher than 64) on each host when
running under slurm.
[perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
test2: unlimited
test1: unlimited
[perkinjo at nowlab ~]$ cat ulimit.sh
#!/bin/sh
echo $(hostname): $(ulimit -l)
If the output is not unlimited you will probably hit a CQ (completion
queue) creation failure. Take a look at the following section of our
userguide. Since you're also using slurm, I'm posting a link to their
FAQ as well.
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-125000+9.4.3
https://computing.llnl.gov/linux/slurm/faq.html#memlock
Basically you'll want to make sure memlock is set to unlimited in
/etc/security/limits.conf and that slurm is respecting this as well. On
our systems we have added `ulimit -l unlimited' into
/etc/sysconfig/slurm (redhat systems).
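As an illustration, the two settings described above could look like
this (example entries, not a definitive configuration; file locations
vary by distribution):

```shell
# /etc/security/limits.conf -- allow unlimited locked memory for all
# users (needed for InfiniBand memory registration):
*  soft  memlock  unlimited
*  hard  memlock  unlimited

# /etc/sysconfig/slurm (Red Hat systems) -- raise the limit for the
# slurmd daemon so job steps inherit it:
ulimit -l unlimited
```

After changing these, restart slurmd on the compute nodes and re-check
with the ulimit.sh script above.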
Hope this info helps.
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo