[mvapich-discuss] [MVAPICH2] Suspend / Resume

wei huang huanwei at cse.ohio-state.edu
Mon Feb 19 21:37:53 EST 2007


Hi Yann,

Thanks for letting us know your detailed requirements for the
suspend/resume feature. The closest match to your requirements in the
current MVAPICH2 releases is our checkpoint/restart (CR) support, which
writes the application's memory footprint to disk and restarts from it
later.

However, we are working on the feature you mentioned (suspend and resume),
and it will be available in the next MVAPICH2 release.
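
For reference, the plain SIGSTOP/SIGCONT mechanism raised in this thread
can be sketched as below. This is only generic POSIX signalling applied to
a child process, not MVAPICH2 or OFED code, and it says nothing about how
pinned or registered memory behaves while the process is stopped. The
helper name suspend_and_resume and the sleeping child are hypothetical,
chosen for illustration.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume(pid):
    """Stop a child process with SIGSTOP, then resume it with SIGCONT.

    Returns (was_stopped, was_continued) as reported by waitpid.
    Assumes `pid` is a direct child of the calling process.
    """
    # SIGSTOP cannot be caught or ignored; the kernel stops the process.
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report the "stopped" state change.
    _, status = os.waitpid(pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # SIGCONT lets the stopped process run again.
    os.kill(pid, signal.SIGCONT)
    # WCONTINUED makes waitpid report the "continued" state change.
    _, status = os.waitpid(pid, os.WCONTINUED)
    was_continued = os.WIFCONTINUED(status)
    return was_stopped, was_continued


if __name__ == "__main__":
    # Hypothetical stand-in for a running job: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        print(suspend_and_resume(child.pid))
    finally:
        child.kill()
        child.wait()
```

A batch scheduler that suspends a job does something comparable from the
outside, which is why the library itself needs no handler for the stop
signal: SIGSTOP never reaches user code.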

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Mon, 19 Feb 2007, Yann K. wrote:

> Wei,
>
> Thanks for answering this one. To clarify my point: over time, some jobs
> can become more important than others and be scheduled to replace jobs
> that are already running. LSF allows this. The currently running job
> must then be stopped. How does this work?
> + Does it happen painlessly with mvapich2?
> + How does the pinned memory behave?
> + Are the memory pages swapped out? How do they come back?
> + How does the OFED memory registration, which sets up virtual/physical
> associations, behave then?
> + What happens technically when jobs are stopped by a batch scheduler?
> + Will the second job have the benefit of all the RAM, or will the pinned
> memory stay in place somehow?
>
> Of course, I don't want to spend time checkpointing and restarting my
> job. I just want to suspend it (like a suspend to disk), let its pages be
> swapped out, let the other job run and do its work, and then put my first
> job back to work.
>
> Y
>
>
> wei huang wrote:
> > Hi Yann,
> >
> > Thanks for using mvapich2.
> >
> > Could you clarify your question a bit more? Typically, SIGSTOP pauses a
> > program and SIGCONT resumes it. Is this what you want?
> >
> > If you want to suspend an MPI job and restart it later, may I suggest
> > using the checkpoint/restart function of the latest mvapich2 release?
> > Detailed instructions can be found at:
> >
> > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html
> >
> > Please note that you need BLCR installed on your systems.
> >
> > Let us know if we understand your question correctly.
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Fri, 16 Feb 2007, Yann K. wrote:
> >
> >
> >> Hello everybody,
> >>
> >> While looking at the mvapich2 gen2 code, I searched for routines that
> >> handle SIGSTOP and SIGCONT and couldn't find any. I work with an OFED
> >> stack and couldn't find anything that handles those signals at that
> >> level either. What happens to MPI processes that receive a SIGSTOP
> >> signal from LSF, mpd, or slurmd, especially if RDMA memory is pinned
> >> and already registered on the board?
> >>
> >> Thanks for any ideas.
> >>
> >> Yann K.
> >>
> >>
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >>
> >
> >
> >
>
> --
> Yann Kalemkarian
> HPC Software Engineer
> Open Software R&D
> Bull, Architect of an Open World TM
> Phone: +33 4 7629 7393
> www.bull.com
>




