[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Raghunath rajachan at cse.ohio-state.edu
Thu Feb 21 13:59:43 EST 2013


Hi Suja,

For the convenience of users, MVAPICH provides a wrapper to
cr_checkpoint that lists the different MVAPICH jobs you are running,
from which you can select the one you would like to take a checkpoint
of. This wrapper, named "mv2_checkpoint", is placed in the
"$PREFIX/bin" directory where prefix is /share/apps/mvapich2-1.9a2 in
your case.
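
As a rough sketch of locating and launching the wrapper described above (the variable names MV2_PREFIX and MV2_CKPT_BIN are only illustrative, and the prefix is the one from this thread):

```shell
#!/bin/sh
# Install prefix mentioned in this thread; adjust for your own installation.
MV2_PREFIX=/share/apps/mvapich2-1.9a2
MV2_CKPT_BIN="$MV2_PREFIX/bin/mv2_checkpoint"

if [ -x "$MV2_CKPT_BIN" ]; then
    # The wrapper lists the running MVAPICH jobs so one can be selected.
    "$MV2_CKPT_BIN"
else
    # Report a missing wrapper instead of failing silently.
    echo "mv2_checkpoint not found at $MV2_CKPT_BIN" >&2
fi
```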

Alternatively, you can also use the MV2_CKPT_INTERVAL environment
variable to set a desired automatic checkpointing interval. More
details about this parameter can be found here:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9a2.html#x1-15400011.5
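
A minimal sketch of such a launch line, based on the mpirun_rsh command quoted further down in this thread. The interval value (30) and the checkpoint file path are assumptions chosen for illustration, and the unit is minutes as far as the user guide linked above describes it. The script only prints the command (a dry run); drop the echo and run the command directly once the values look right.

```shell
#!/bin/sh
# Dry run: build and print the mpirun_rsh command instead of executing it.
MV2_PREFIX=/share/apps/mvapich2-1.9a2   # install prefix from this thread
CKPT_INTERVAL=30                        # assumed value; check the unit in the user guide
CKPT_FILE=/tmp/ckpt/vector              # hypothetical checkpoint file path

# mpirun_rsh takes MV2_* settings as NAME=VALUE pairs before the executable.
CMD="$MV2_PREFIX/bin/mpirun_rsh -np 8 -hostfile ./hostfile MV2_CKPT_FILE=$CKPT_FILE MV2_CKPT_INTERVAL=$CKPT_INTERVAL ./vector"
echo "$CMD"
```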

--
Raghu


On Thu, Feb 21, 2013 at 7:30 AM, Suja Ramachandran <sujaram at igcar.gov.in> wrote:
> Hi,
>
> I was checkpointing the MPI application using the cr_checkpoint --p <pid>
> command, where 'pid' is the process ID of the mpirun_rsh process. Should I use
> any other option for checkpointing, such as '--tree' or '--pgid'? (These are
> giving errors.)
>
> thanks & regards,
> suja
>
>
> On Friday 15 February 2013 05:26 PM, Suja Ramachandran wrote:
>>
>> Hi,
>>
>> Unfortunately, the same error repeats in mvapich2-1.9a2 too. Interestingly,
>> the job is able to complete its execution, and the error occurs only after
>> job completion (in both versions). Also, once the checkpoint context file is
>> restarted with cr_restart, it is able to restart and complete the execution,
>> and only then does the error appear. Maybe I can ignore the error messages!
>>
>> thanks and regards,
>> Suja
>>
>> On Friday 15 February 2013 03:21 PM, Raghunath wrote:
>>>
>>> Suja,
>>>
>>> There is a known bug in the 1.9-alpha2 version that causes the
>>> configure script to look for FUSE when Checkpoint-Restart support is
>>> enabled, even after adding the "--disable-ckpt-aggregation" flag. The
>>> fix for this will be available as part of the next release.
>>>
>>> Meanwhile, you can work around this bug by using the alternative
>>> "--without-fuse" flag in place of "--disable-ckpt-aggregation".
>>> --
>>> Raghu
>>>
>>>
>>> On Fri, Feb 15, 2013 at 4:45 AM, Suja Ramachandran <sujaram at igcar.gov.in>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have trouble building mvapich2-1.9a2. I am configuring it with
>>>>
>>>> ./configure --with-device=ch3:mrail --with-rdma=gen2
>>>> --disable-ckpt-aggregation  --disable-rdma-cm --enable-ckpt
>>>> --with-blcr=/usr/local --enable-g=all --enable-error-messages=all
>>>> --enable-shared --with-file-system=nfs --enable-xrc
>>>> --prefix=/share/apps/mvapich2-1.9a2
>>>>
>>>> It's giving errors:
>>>> configure: checking checkpoint aggregation components
>>>> checking for library containing fuse_new... no
>>>> configure: error: fuse library not found
>>>>
>>>> I don't have the FUSE library installed. Why is it checking for checkpoint
>>>> aggregation components even after I give the option
>>>> --disable-ckpt-aggregation?
>>>>
>>>> thanks and regards,
>>>> suja
>>>>
>>>>
>>>> On Friday 15 February 2013 01:51 PM, Raghunath wrote:
>>>>>
>>>>> Suja,
>>>>>
>>>>>> /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile
>>>>>> MV2_IBA_HCA=mlx4_0  MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1
>>>>>> MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt
>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1 ./vector
>>>>>>
>>>>>> Also, I have tried it without the options MV2_IBA_HCA=mlx4_0
>>>>>> MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1.
>>>>>>
>>>>>> The environment variables set are
>>>>>> export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
>>>>>> export LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH
>>>>>
>>>>> Thanks for sending this information. I see nothing strange here.
>>>>>
>>>>>> I have not yet installed the fault tolerance backplane (FTB). Is that
>>>>>> mandatory for checkpoint/restart?
>>>>>
>>>>> No, FTB is not mandatory to get Checkpoint/Restart working with
>>>>> MVAPICH. The only external library that is "mandatory" for the CR
>>>>> mechanism to work is BLCR.
>>>>>
>>>>>> I will also try with mvapich2-1.9
>>>>>>
>>>>>> (FYI,'vector' is  the executable of the vector addition program given
>>>>>> here:
>>>>>> http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )
>>>>>
>>>>> Thanks for this pointer. I was able to checkpoint this sample
>>>>> application successfully as well (again, I compiled MVAPICH using the
>>>>> same config flags that you have used).
>>>>>
>>>>> Do let us know if you continue facing issues even after upgrading to
>>>>> MVAPICH2-1.9
>>>>>
>>>>> --
>>>>> Raghu
>>>>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>