[mvapich-discuss] [MVAPICH2] Suspend / Resume

Yann K. yann.kalemkarian at bull.net
Mon Feb 19 04:04:40 EST 2007


Wei,

Thanks for answering this one. To clarify my point: over time, some jobs
can become more important than others and be scheduled to replace jobs
that are already running. LSF allows this. The currently running job must
then be stopped. How does this work?
+ Does it happen painlessly with mvapich2?
+ How does the pinned memory behave?
+ Are the memory pages swapped out? How do they come back?
+ How does the OFED memory registration, which makes the virtual/physical
associations, behave then?
+ What happens technically when jobs are stopped by a batch scheduler?
+ Will the second job have the benefit of all the RAM, or will the pinned
memory stay somehow?
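
One way to look into the memory questions above on a live system is to compare a process's memory counters before and after it is stopped. The following is only a minimal sketch, assuming a Linux node where /proc/<pid>/status exposes the VmRSS, VmLck, VmPin and VmSwap fields (VmPin and VmSwap only exist on newer kernels). Which of these counters reflects memory registered through OFED depends on the kernel version and is an assumption to verify on the target system. The names parse_vm_fields and read_vm_fields are hypothetical helpers and not part of mvapich2 or OFED.

```python
def parse_vm_fields(status_text, fields=("VmRSS", "VmLck", "VmPin", "VmSwap")):
    """Return {field: size_in_kB} for the requested fields found in status_text."""
    sizes = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            # Lines look like "VmRSS:      1234 kB"; keep the number only.
            sizes[name] = int(rest.split()[0])
    return sizes


def read_vm_fields(pid):
    """Read the memory counters of a live process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_fields(handle.read())


# Made-up sample text, so the parser can be checked without a running job.
sample = (
    "Name:\tmpi_rank\n"
    "VmLck:\t      64 kB\n"
    "VmPin:\t    2048 kB\n"
    "VmRSS:\t    4096 kB\n"
    "VmSwap:\t       0 kB\n"
)
sample_sizes = parse_vm_fields(sample)
```

Calling read_vm_fields(pid) on each MPI rank before the SIGSTOP, while it is stopped, and after the SIGCONT would show whether the resident and swapped sizes actually change while the job is suspended.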

Of course, I don't want to spend time checkpointing and restarting my
job. I just want to suspend it (like a suspend to disk), let its pages be
swapped out, let the other job run, and then put my first job back to
work.
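
For reference, the suspend and resume I have in mind boils down to the signal sequence below. This is a minimal sketch on a POSIX system, using a sleeping child process as a stand-in for one MPI rank. It only shows the signal mechanics and says nothing about how pinned or registered memory behaves. The function name suspend_and_resume is hypothetical, and how LSF, mpd or slurmd actually deliver the signals is not shown here.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume(pid):
    """Stop then continue a child process; return (was_stopped, was_continued)."""
    # SIGSTOP cannot be caught, blocked, or ignored by the target process.
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report a child that has stopped.
    _, status = os.waitpid(pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    os.kill(pid, signal.SIGCONT)
    # WCONTINUED makes waitpid report a stopped child that has resumed.
    _, status = os.waitpid(pid, os.WCONTINUED)
    was_continued = os.WIFCONTINUED(status)
    return was_stopped, was_continued


# Stand-in for one MPI rank: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
try:
    stopped, continued = suspend_and_resume(child.pid)
finally:
    # Clean up the stand-in process whatever happened above.
    child.kill()
    child.wait()
```

Since SIGSTOP never reaches a handler in the stopped process, the MPI library gets no chance to run any code before the process is stopped, which is why the questions about registered memory matter.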

Y


wei huang wrote:
> Hi Yann,
>
> Thanks for using mvapich2.
>
> May I ask you to clarify your question a bit more? Typically SIGSTOP is
> used to pause a program and SIGCONT to resume it. Is this what you
> want?
>
> If you want to suspend an MPI job and restart it later, may I suggest
> that you use the checkpoint/restart function of the latest mvapich2 release?
> Detailed instructions can be found at:
>
> http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html
>
> Please note that you need BLCR installed on your systems.
>
> Let us know if we understand your question correctly.
>
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Fri, 16 Feb 2007, Yann K. wrote:
>
>   
>> Hello everybody,
>>
>> While reading the mvapich2 gen2 code, I looked for routines
>> handling SIGSTOP and SIGCONT and couldn't find any. I work with an OFED
>> stack and couldn't find anything handling those signals at that level
>> either. What happens to MPI processes that are sent a SIGSTOP signal by
>> LSF, mpd, or slurmd, especially if RDMA memory is pinned and
>> already registered on the board?
>>
>> Thanks for any ideas
>>
>> Yann K.
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>     
>
>
>   

-- 
Yann Kalemkarian
HPC Software Engineer
Open Software R&D
Bull, Architect of an Open World TM
Phone: +33 4 7629 7393
www.bull.com


