[mvapich-discuss] SCR+BLCR usage

Raghu rajachan at cse.ohio-state.edu
Wed Jan 29 17:01:37 EST 2014


Hi Arjun,

Thanks for that information. I tested this combination and there is
indeed an issue when triggering checkpoints with srun. We will work on
a fix and make it available in the next release. Meanwhile, you can
use mpirun_rsh as the launcher to checkpoint your applications.
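
A minimal sketch of such a launch is given here. It assumes a hostfile named ./hosts that lists the two nodes, and a one-minute MV2_CKPT_INTERVAL to mirror the "--checkpoint 1" in the srun command quoted below. Both values are placeholders rather than settings confirmed in this thread, and the SCR variables are assumed to be picked up from scr.conf as before.

```shell
#!/bin/sh
# Hypothetical defaults, not taken from the thread: a hostfile named ./hosts
# listing the two nodes, and a 1-minute automatic checkpoint interval.
HOSTFILE=${HOSTFILE:-./hosts}
CKPT_INTERVAL=${CKPT_INTERVAL:-1}

# Assemble the mpirun_rsh command line. Environment variables for the MPI
# processes are passed as NAME=VALUE arguments before the executable.
build_cmd() {
    echo "mpirun_rsh -np 4 -hostfile $HOSTFILE MV2_CKPT_INTERVAL=$CKPT_INTERVAL ./MPIExecutable"
}

# Dry run by default: only print the command. Set RUN=1 to launch it.
if [ "${RUN:-0}" = "1" ]; then
    eval "$(build_cmd)"
else
    build_cmd
fi
```

Run as is, the script only prints the command so it can be reviewed first; RUN=1 executes it on a system where mpirun_rsh is on the PATH.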

Feel free to email us if you have any further questions.


Raghu


On Tue, Jan 28, 2014 at 7:24 AM, Arjun J Rao <rectangle.king at gmail.com> wrote:
> I issue my srun commands as:
> srun -N2 -n4 --checkpoint 1 --checkpoint-dir /home/username/Checkpoint MPIExecutable
>
>
> On Tue, Jan 28, 2014 at 5:49 PM, Arjun J Rao <rectangle.king at gmail.com>
> wrote:
>>
>> After installing SLURM, I installed MVAPICH2 using the parameters
>> --enable-ckpt --with-scr --with-pm=no --with-pmi=slurm
>> Inserted the BLCR module into the kernel using insmod
>> Set the following variables both in the environment and in the
>> /etc/scr.conf file:
>> SCR_FLUSH=2
>> SCR_CACHE_BASE=/home/username/Cache
>> SCR_CNTL_BASE=/home/username/Control
>> SCR_HALT_SECONDS=3600
>> SCR_PREFIX=/home/username/Checkpoint
>> SCR_RUNS=3
>>
>> In the scr.conf file, I have
>> CNTLDIR=/home/username/Control  BYTES=1000000000  [i.e. 1 GB]
>> SCR_CNTL_BASE=/home/username/Control
>>
>> CACHEDIR=/home/arjun/Cache BYTES=12000000000 [i.e. 12 GB]
>> SCR_CACHE_BASE=/home/arjun/Cache
>>
>> SCR_DB_ENABLE=0
>>
>> I run my MPI jobs using
>> salloc -N2 -n4 bash (to create a job allocation of 2 nodes and 4
>> processes)
>> srun -N2 -n4 MPIExecutable
>>
>>
>> On Tue, Jan 28, 2014 at 10:45 AM, Raghu <rajachan at cse.ohio-state.edu>
>> wrote:
>>>
>>> Hi Arjun,
>>>
>>> Thanks for your note. The capability to transparently checkpoint MPI
>>> applications launched using SLURM's 'srun' was added to
>>> MVAPICH2 in the latest release (2.0-beta). While it has been tested
>>> extensively with BLCR, I have not tested all possible
>>> configurations with the BLCR+SLURM+SCR setup.  Can you send me
>>> your MVAPICH2 (and SCR) configuration parameters, and I'll try to see
>>> if I can reproduce your issue?
>>>
>>> Raghu
>>>
>>>
>>> On Mon, Jan 27, 2014 at 11:59 PM, Arjun J Rao <rectangle.king at gmail.com>
>>> wrote:
>>> > The manual mentions in passing in Section 6.15.2.2 (Transparent
>>> > Multilevel
>>> > Checkpointing) that SCR+BLCR can be run with the same level of ease as
>>> > running BLCR alone. My job launcher is SLURM (the manual mentions using
>>> > mpirun_rsh or mpiexec as job launchers)
>>> >
>>> > Does anybody have experience using this? I ask because while I am able
>>> > to run SCR on its own without problems, using the SCR installed with
>>> > MVAPICH2 causes weird errors:
>>> > "The variable SCR_CNTL_BASE cannot be set in the environment or
>>> > configuration file"
>>> >
>>> > _______________________________________________
>>> > mvapich-discuss mailing list
>>> > mvapich-discuss at cse.ohio-state.edu
>>> > http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>> >
>>
>>
>
>
