[mvapich-discuss] mvapich2+slurm+blcr Oh, My!

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Jun 24 00:07:08 EDT 2010


Hi,

Thanks for reporting this issue. The following combinations (among
others, such as hydra) have been tested:

   - mvapich2 + slurm
   - mvapich2 + mpirun_rsh
   - mvapich2 + mpirun_rsh + blcr

However, the combination you are trying to use (mvapich2 + slurm + blcr)
has not been tested. We will take a look at it.

In the meantime, I suggest you try the `mvapich2 + mpirun_rsh +
blcr' combination.

Sections 5.2.1 and 6.3 of the latest MVAPICH2 1.5 user guide describe
the detailed steps for using mpirun_rsh and checkpoint-restart with
BLCR, respectively.

The User Guide can be obtained from the following URL:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc2.html
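
As a rough illustration of the mpirun_rsh + blcr combination, here is a
minimal dry-run sketch. It only prints the command line; it does not
launch anything. The process count, hostfile name, checkpoint path and
binary are placeholder assumptions, not values from the guide, and the
MV2_CKPT_FILE setting and cr_checkpoint invocation should be checked
against Section 6.3 for your version.

```shell
#!/bin/sh
# Dry-run sketch: print an mpirun_rsh command line instead of running it.
# All four values below are placeholders (assumptions for illustration).
NP=8                          # number of MPI processes
HOSTFILE=./hosts              # hypothetical hostfile, one node name per line
CKPT_FILE=/lustre/ckpt/ior    # hypothetical checkpoint file prefix
APP=./IOR                     # application binary

# mpirun_rsh takes environment settings as NAME=value before the executable.
# MV2_CKPT_FILE is assumed here to set the checkpoint file prefix.
build_cmd() {
    printf 'mpirun_rsh -np %s -hostfile %s MV2_CKPT_FILE=%s %s\n' \
        "$NP" "$HOSTFILE" "$CKPT_FILE" "$APP"
}

build_cmd

# Once the job is running, a checkpoint would typically be requested with
# BLCR against the mpirun_rsh process (verify against the user guide):
#   cr_checkpoint -p <PID of mpirun_rsh>
```

Printing the command first makes it easy to review the settings before
launching the job for real.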

Thanks,

DK


On Wed, 23 Jun 2010, David Brown wrote:

> So I'm not sure who to ask this question to and what setup I'm doing
> wrong, but something isn't working right.
>
> I built mvapich2 1.5rc2 this way:
>
> ./configure \
>                 --prefix=%{mpidir} \
>                 --mandir=%{mpidir}/man \
>                 --enable-error-checking=runtime \
>                 --enable-timing=none \
>                 --enable-g=mem,dbg,meminit \
>                 --enable-sharedlibs=gcc \
>                 --with-rdma=gen2 \
>                 --enable-romio \
>                 --with-file-system=lustre+nfs \
>                 --with-slurm=/usr \
>                 --with-pmi=slurm \
>                 --with-pm=no \
>                 --enable-threads=multiple \
>                 --with-thread-package=pthreads \
>                 --disable-mpe \
>                 --without-mpe \
>                 --disable-nmpi-as-mpi \
>                 --enable-f77 \
>                 --enable-f90 \
>                 --enable-cxx \
>                 --enable-blcr
>
> I built slurm this way:
>
> rpmbuild -ta --with blcr --with postgresql slurm-2.1.9.tar.bz2
>
> I configured slurm with:
>
> CheckpointType=checkpoint/blcr
>
> And I built IOR and launched it this way:
>
> $ srun --checkpoint-dir=/lustre -n 8 -N 4  ./IOR -i 4 -b 32g -T 600 -E
> -k -e -t 1m -o /lustre/testFile
> [Rank 1][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 3][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 0][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 2][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 5][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 4][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 7][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> [Rank 6][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> srun: error: x11: tasks 2-3: Exited with exit code 255
> srun: error: x10: tasks 0-1: Exited with exit code 255
> srun: error: x12: tasks 4-5: Exited with exit code 255
> srun: error: x13: tasks 6-7: Exited with exit code 255
>
> So I'm confused; this is obviously not working. Do I need to use some
> other mechanism to launch? Am I missing configuration somewhere?
>
> Thanks,
> - David Brown
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


