[mvapich-discuss] mvapich2+slurm+blcr Oh, My! (fwd)

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Jun 24 15:04:44 EDT 2010


Hi David,

> Thanks, I rebuilt mvapich2 to use mpirun_rsh and now it's running fine.

Good to know that it is running fine.

> Question about the environment variables.
>
> MV2_CKPT_INTERVAL is this in minutes? seconds? hours?

It is in `minutes'. The parameter is described in detail in the user guide
here:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc2.html#x1-9000011.2
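
As a rough illustration of the unit (a sketch, not taken from the user
guide): if you think of the checkpoint period in seconds, divide by 60
before setting MV2_CKPT_INTERVAL. The hostfile path and the 1800-second
period below are hypothetical, the IOR arguments are copied from your srun
line, and the command is only printed, not run. Other checkpoint
parameters from the user guide may also be needed.

```shell
#!/bin/sh
# Hypothetical checkpoint period, expressed in seconds for illustration.
interval_seconds=1800

# MV2_CKPT_INTERVAL is given in minutes, so convert before passing it on.
interval_minutes=$(( interval_seconds / 60 ))

# Hypothetical hostfile; replace with your own list of nodes.
hostfile=./hosts

# mpirun_rsh takes NAME=VALUE environment settings before the executable.
cmd="mpirun_rsh -np 8 -hostfile $hostfile MV2_CKPT_INTERVAL=$interval_minutes ./IOR -i 4 -b 32g -T 600 -E -k -e -t 1m -o /lustre/testFile"

# Print the command line instead of running it, so it can be reviewed first.
echo "$cmd"
```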

> Also, what would need to be done to get this working? Is slurm
> supposed to setup the controlling daemon and open all the ports to the
> processes for communication?
>
> Just curious, Thanks.

We did some initial analysis on this. Some parameters to the checkpoint
restart environment are not being passed when Slurm is used. They are
passed when mpirun_rsh is used. We are taking a further look at this
problem.
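
In the meantime, one way to see the difference yourself is to compare the
environment each launcher hands to the processes. The helper below is only
a sketch: report_missing is a hypothetical function name, and the two
variables listed are simply the ones mentioned in this thread, not a
complete list of what the checkpoint code expects.

```shell
#!/bin/sh
# report_missing VAR...: print the names of the given variables that are
# unset or empty in the current environment, separated by spaces.
report_missing() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Drop the leading space left by the concatenation above.
    echo "${missing# }"
}

# Variables named in this thread; extend the list as needed.
report_missing MV2_CKPT_INTERVAL MV2_CKPT_MPD_BASE_PORT
```

Saving this as a small script and launching it once through srun and once
through mpirun_rsh in place of the application should show which of the
listed variables each launcher leaves unset.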

Thanks,

DK

>  - David Brown
>
> On Wed, Jun 23, 2010 at 9:29 PM, Dhabaleswar Panda
> <panda at cse.ohio-state.edu> wrote:
> > Hi,
> >
> > Thanks for reporting this issue. This is to let you know that the
> > following combinations (in addition to other combinations like hydra,
> > etc.) have been tested:
> >
> >   - mvapich2 + slurm
> >   - mvapich2 + mpirun_rsh
> >   - mvapich2 + mpirun_rsh + blcr
> >
> > However, the combination you are trying to use (mvapich2 + slurm + blcr)
> > has not been tested. We will take a look at it.
> >
> > In the meantime, I suggest you try the `mvapich2 + mpirun_rsh +
> > blcr' combination.
> >
> > Sections 5.2.1 and 6.3 of the latest MVAPICH2 1.5 user guide describe
> > the detailed steps for using mpirun_rsh and checkpoint-restart with
> > BLCR, respectively.
> >
> > The User Guide can be obtained from the following URL:
> >
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc2.html
> >
> > Thanks,
> >
> > DK
> >
> >
> > On Wed, 23 Jun 2010, David Brown wrote:
> >
> >> So I'm not sure who to ask or which part of my setup is wrong, but
> >> something isn't working right.
> >>
> >> I built mvapich2 1.5rc2 this way:
> >>
> >> ./configure \
> >>                 --prefix=%{mpidir} \
> >>                 --mandir=%{mpidir}/man \
> >>                 --enable-error-checking=runtime \
> >>                 --enable-timing=none \
> >>                 --enable-g=mem,dbg,meminit \
> >>                 --enable-sharedlibs=gcc \
> >>                 --with-rdma=gen2 \
> >>                 --enable-romio \
> >>                 --with-file-system=lustre+nfs \
> >>                 --with-slurm=/usr \
> >>                 --with-pmi=slurm \
> >>                 --with-pm=no \
> >>                 --enable-threads=multiple \
> >>                 --with-thread-package=pthreads \
> >>                 --disable-mpe \
> >>                 --without-mpe \
> >>                 --disable-nmpi-as-mpi \
> >>                 --enable-f77 \
> >>                 --enable-f90 \
> >>                 --enable-cxx \
> >>                 --enable-blcr
> >>
> >> I built slurm this way:
> >>
> >> rpmbuild -ta --with blcr --with postgresql slurm-2.1.9.tar.bz2
> >>
> >> I configured slurm with:
> >>
> >> CheckpointType=checkpoint/blcr
> >>
> >> And I built IOR and launched it this way:
> >>
> >> $ srun --checkpoint-dir=/lustre -n 8 -N 4  ./IOR -i 4 -b 32g -T 600 -E
> >> -k -e -t 1m -o /lustre/testFile
> >> [Rank 1][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 3][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 0][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 2][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 5][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 4][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 7][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> [Rank 6][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
> >> srun: error: x11: tasks 2-3: Exited with exit code 255
> >> srun: error: x10: tasks 0-1: Exited with exit code 255
> >> srun: error: x12: tasks 4-5: Exited with exit code 255
> >> srun: error: x13: tasks 6-7: Exited with exit code 255
> >>
> >> So I'm confused; this is obviously not working. Do I need to use some
> >> other mechanism to launch? Am I missing configuration somewhere?
> >>
> >> Thanks,
> >> - David Brown
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
>



