[mvapich-discuss] Issues running mvapich2 with slurm

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Oct 22 14:11:10 EDT 2012


On Mon, Oct 22, 2012 at 12:26:41PM -0400, Matthew Russell wrote:
> Hi,

Hello, I'll respond inline.

> I'm not sure whether I should be posting this here or to a slurm mailing
> list, figured I'd try here though.
> 
> I can't seem to run even simple hello_world executables with mvapich2 with
> slurm.
> 
> I built mvapich2 to use slurm as it's PMI, i.e.
> 
> ./configure --prefix=/cm/shared/apps/mvapich2/pgi/64/1.8 --with-pmi=slurm \
>    --with-pm=no CPPFLAGS=-I/cm/shared/apps/slurm/2.4.2/slurm-2.4.2/ \
>    LDFLAGS=-L/cm/shared/apps/slurm/2.4.2/lib/

For more debugging information you may want to rebuild mvapich2 with
`--enable-g=dbg --disable-fast' added to the configure line.
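For reference, the rebuild could look roughly like the following (this is
just a sketch reusing the paths from your original configure line; adjust
them to your installation):

    ./configure --prefix=/cm/shared/apps/mvapich2/pgi/64/1.8 --with-pmi=slurm \
        --with-pm=no --enable-g=dbg --disable-fast \
        CPPFLAGS=-I/cm/shared/apps/slurm/2.4.2/slurm-2.4.2/ \
        LDFLAGS=-L/cm/shared/apps/slurm/2.4.2/lib/
    make && make install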

> When I try to run apps though, I get:
> [matt at dena]~/cluster_tests% srun -n16 --mpi=none hello_mvapich2_slurm
> srun: error: Unable to confirm allocation for job 14: Invalid job id
> specified
> srun: Check SLURM_JOB_ID environment variable for expired or invalid job.

I believe the above issue is related to slurm.  I can help you with the
issue you have noted below.
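
That said, since that error message points at SLURM_JOB_ID, one quick thing
to try (just a guess on my end) is to check whether a stale job id from an
earlier allocation is still set in your shell:

    # check for a leftover SLURM_JOB_ID from an expired allocation
    echo $SLURM_JOB_ID
    # if it refers to an old job, clear it and request a fresh allocation
    unset SLURM_JOB_ID
    salloc -N 2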

> When I first allocate cores, I get:
> [matt at dena]~/cluster_tests% salloc -N 2
> salloc: Granted job allocation 16
> [matt at dena]~/cluster_tests% srun
>  /home/matt/cluster_tests/hello_mvapich2_slurm
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> slurmd[dena2]: *** STEP 16.0 KILLED AT 2012-10-19T12:45:22 WITH SIGNAL 9 ***
> srun: error: dena2: task 1: Exited with exit code 1
> srun: error: dena1: task 0: Exited with exit code 1
> slurmd[dena1]: *** STEP 16.0 KILLED AT 2012-10-19T12:45:22 WITH SIGNAL 9 ***
> slurmd[dena2]: *** STEP 16.0 KILLED AT 2012-10-19T12:45:22 WITH SIGNAL 9 ***
> 
> I've searched around, but have had no luck finding solutions to either
> problem.  I have no idea how to proceed.

One thing that you may want to check is that `ulimit -l' returns
unlimited (or some other value much higher than 64) on each host when
using slurm.

    [perkinjo at nowlab ~]$ srun -N 2 ulimit.sh
    test2: unlimited
    test1: unlimited
    [perkinjo at nowlab ~]$ cat ulimit.sh
    #!/bin/sh

    echo $(hostname): $(ulimit -l)

If the output is not unlimited you will probably hit a CQ (completion
queue) creation failure.  Take a look at the following section of our
userguide.  Since you're also using slurm, I'm including a link to their
FAQ as well.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-125000+9.4.3
https://computing.llnl.gov/linux/slurm/faq.html#memlock

Basically you'll want to make sure memlock is set to unlimited in
/etc/security/limits.conf and that slurm is respecting this as well.  On
our systems (Red Hat based) we have added `ulimit -l unlimited' to
/etc/sysconfig/slurm.
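
For example, the relevant settings might look roughly like this (a rough
sketch; exact values and file locations may differ on your distribution):

    # /etc/security/limits.conf -- allow unlimited locked memory
    *    soft    memlock    unlimited
    *    hard    memlock    unlimited

    # /etc/sysconfig/slurm (Red Hat style) -- make sure slurmd itself
    # runs with the raised limit
    ulimit -l unlimited

You'll probably need to restart slurmd on the compute nodes after changing
these for the new limit to take effect.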

Hope this info helps.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


