[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Suja Ramachandran sujaram at igcar.gov.in
Fri Feb 22 07:01:15 EST 2013


Hi,

Yes, I have used the mv2_checkpoint tool too. One more problem I have
noticed: the application is able to complete its execution only when it
is checkpointed for the first time. If I checkpoint an application a
second time, or checkpoint an application that was restarted using
cr_restart, the same errors cause the program to stop execution. Now
that's a real problem for me!
(By the way, is any option required when configuring BLCR to make it
work with MVAPICH?)

thanks and regards,
Suja

On Friday 22 February 2013 12:29 AM, Raghunath wrote:
> Hi Suja,
>
> For the convenience of users, MVAPICH provides a wrapper to
> cr_checkpoint that lists the different MVAPICH jobs you are running,
> from which you can select the one you would like to take a checkpoint
> of. This wrapper, named "mv2_checkpoint", is placed in the
> "$PREFIX/bin" directory where prefix is /share/apps/mvapich2-1.9a2 in
> your case.
>
> Alternatively, you can also use the MV2_CKPT_INTERVAL environment
> variable to set a desired automatic checkpointing interval. More
> details about this parameter can be found here:
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9a2.html#x1-15400011.5
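
To make the automatic-checkpoint option above concrete, here is a minimal sketch that only composes and prints an mpirun_rsh command line, passing the variables in the same VAR=value style used elsewhere in this thread. The checkpoint file path /tmp/ckpt/vector and the interval value 30 are hypothetical, and I am assuming the interval is expressed in minutes; please check the user guide linked above before relying on it.

```shell
#!/bin/sh
# Sketch only: builds the command line, does not launch anything.
# MPIRUN uses the install prefix mentioned in this thread.
MPIRUN=/share/apps/mvapich2-1.9a2/bin/mpirun_rsh
# Hypothetical checkpoint file prefix; replace with your own path.
CKPT_FILE=/tmp/ckpt/vector

ckpt_run_cmd() {
    # $1 = number of processes, $2 = hostfile,
    # $3 = checkpoint interval (assumed to be minutes), $4 = executable
    echo "$MPIRUN -np $1 -hostfile $2 MV2_CKPT_FILE=$CKPT_FILE MV2_CKPT_INTERVAL=$3 $4"
}

# Dry run: print the command that would be executed.
ckpt_run_cmd 8 ./hostfile 30 ./vector
```

Running the printed line by hand (or replacing the echo with eval) would start the job with periodic checkpoints, so no manual cr_checkpoint or mv2_checkpoint call should be needed in that mode.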
>
> --
> Raghu
>
>
> On Thu, Feb 21, 2013 at 7:30 AM, Suja Ramachandran <sujaram at igcar.gov.in> wrote:
>> Hi,
>>
>> I was checkpointing the MPI application using the cr_checkpoint --p <pid>
>> command, where 'pid' is the process ID of the mpirun_rsh process. Should I
>> use any other option for checkpointing, such as '--tree' or '--pgid'? (These
>> are giving errors.)
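
For reference, the manual checkpoint step described above might be scripted roughly as follows. This is a sketch under assumptions: it uses the short -p form of BLCR's PID option, looks up mpirun_rsh processes with pgrep, and only prints the cr_checkpoint command rather than running it. The helper name ckpt_cmd is made up for illustration.

```shell
#!/bin/sh
# Sketch only: prints the cr_checkpoint command for each mpirun_rsh process.

ckpt_cmd() {
    # $1 = PID of the mpirun_rsh process; reject anything non-numeric.
    case "$1" in
        ''|*[!0-9]*) echo "error: PID must be a number" >&2; return 1 ;;
    esac
    echo "cr_checkpoint -p $1"
}

# Dry run: find mpirun_rsh processes owned by the current user.
# If none are running (or pgrep is unavailable), nothing is printed.
for pid in $(pgrep -u "$(id -u)" -x mpirun_rsh 2>/dev/null); do
    ckpt_cmd "$pid"
done
```

The point of the lookup is that the PID passed to cr_checkpoint is that of mpirun_rsh, not of an individual MPI rank.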
>>
>> thanks & regards,
>> suja
>>
>>
>> On Friday 15 February 2013 05:26 PM, Suja Ramachandran wrote:
>>> Hi,
>>>
>>> Unfortunately, the same error is repeating in mvapich2-1.9a2 too. The
>>> interesting thing is that the job is able to complete its execution, and
>>> the error occurs only after job completion (in both versions). Also, once
>>> the checkpoint context file is restarted with cr_restart, the job is able
>>> to restart and complete the execution, and only then does the error
>>> appear. Maybe I can ignore the error messages!
>>>
>>> thanks and regards,
>>> Suja
>>>
>>> On Friday 15 February 2013 03:21 PM, Raghunath wrote:
>>>> Suja,
>>>>
>>>> There is a known bug in the 1.9-alpha2 version that causes the
>>>> configure script to look for FUSE when Checkpoint-Restart support is
>>>> enabled, even after adding the "--disable-ckpt-aggregation" flag. The
>>>> fix for this will be available as part of the next release.
>>>>
>>>> Meanwhile, you can work around this bug by using the alternative
>>>> "--without-fuse" flag in place of "--disable-ckpt-aggregation".
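
Applied to the configure line quoted further down in this thread, the workaround would look roughly like the sketch below. All flags and paths are copied from that quoted command, with only "--disable-ckpt-aggregation" swapped for "--without-fuse"; the script prints the command instead of running it, and the helper name configure_args is made up for illustration.

```shell
#!/bin/sh
# Sketch only: prints the configure command with the --without-fuse workaround.

configure_args() {
    # One flag per line; taken from the configure command in this thread.
    printf '%s\n' \
        --with-device=ch3:mrail \
        --with-rdma=gen2 \
        --without-fuse \
        --disable-rdma-cm \
        --enable-ckpt \
        --with-blcr=/usr/local \
        --enable-g=all \
        --enable-error-messages=all \
        --enable-shared \
        --with-file-system=nfs \
        --enable-xrc \
        --prefix=/share/apps/mvapich2-1.9a2
}

# Dry run: drop the leading "echo" to run it from the source directory.
echo ./configure $(configure_args)
```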
>>>> --
>>>> Raghu
>>>>
>>>>
>>>> On Fri, Feb 15, 2013 at 4:45 AM, Suja Ramachandran <sujaram at igcar.gov.in>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> I am having trouble building mvapich2-1.9a2. I am configuring it with:
>>>>>
>>>>> ./configure --with-device=ch3:mrail --with-rdma=gen2
>>>>> --disable-ckpt-aggregation  --disable-rdma-cm --enable-ckpt
>>>>> --with-blcr=/usr/local --enable-g=all --enable-error-messages=all
>>>>> --enable-shared --with-file-system=nfs --enable-xrc
>>>>> --prefix=/share/apps/mvapich2-1.9a2
>>>>>
>>>>> It's giving errors:
>>>>> configure: checking checkpoint aggregation components
>>>>> checking for library containing fuse_new... no
>>>>> configure: error: fuse library not found
>>>>>
>>>>> I don't have the FUSE library installed. Why is it checking for checkpoint
>>>>> aggregation components even after giving the option
>>>>> --disable-ckpt-aggregation ?
>>>>>
>>>>> thanks and regards,
>>>>> suja
>>>>>
>>>>>
>>>>> On Friday 15 February 2013 01:51 PM, Raghunath wrote:
>>>>>> Suja,
>>>>>>
>>>>>>> /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile
>>>>>>> MV2_IBA_HCA=mlx4_0  MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1
>>>>>>> MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt
>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1 ./vector
>>>>>>>
>>>>>>> Also, I have tried it without the options MV2_IBA_HCA=mlx4_0
>>>>>>> MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1.
>>>>>>>
>>>>>>> The environment variables set are
>>>>>>> export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
>>>>>>> export LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH
>>>>>> Thanks for sending this information. I see nothing strange here.
>>>>>>
>>>>>>> I have not yet installed the fault tolerance backplane(FTB). Is that
>>>>>>> mandatory for checkpoint/restart?
>>>>>> No, FTB is not mandatory to get Checkpoint/Restart working with
>>>>>> MVAPICH. The only external library that is "mandatory" for the CR
>>>>>> mechanism to work is BLCR.
>>>>>>
>>>>>>> I will also try with mvapich2-1.9
>>>>>>>
>>>>>>> (FYI, 'vector' is the executable of the vector addition program given
>>>>>>> here:
>>>>>>> http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )
>>>>>> Thanks for this pointer. I was able to checkpoint this sample
>>>>>> application successfully as well (again, I compiled MVAPICH using the
>>>>>> same config flags that you have used).
>>>>>>
>>>>>> Do let us know if you continue facing issues even after upgrading to
>>>>>> MVAPICH2-1.9
>>>>>>
>>>>>> --
>>>>>> Raghu
>>>>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>


