[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Suja Ramachandran sujaram at igcar.gov.in
Thu Feb 21 07:30:40 EST 2013


Hi,

I was checkpointing the MPI application with the command
cr_checkpoint -p <pid>, where 'pid' is the process ID of the mpirun_rsh
process. Should I use another checkpointing option instead, such as
'--tree' or '--pgid'? (Both of these give errors.)
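
For reference, the invocation described above can be sketched as a small POSIX shell helper. This is only a dry-run sketch: it prints the command line instead of running it, and the helper name ckpt_cmd is hypothetical (it is not part of BLCR or MVAPICH2). The -p option is cr_checkpoint's short form of --pid.

```shell
#!/bin/sh
# Dry-run sketch: print the checkpoint command for a given launcher PID.
# ckpt_cmd is a hypothetical helper name, not part of BLCR or MVAPICH2.
ckpt_cmd() {
    pid=$1
    # Reject anything that is not a plain decimal process ID.
    case $pid in
        ''|*[!0-9]*)
            echo "usage: ckpt_cmd <pid of mpirun_rsh>" >&2
            return 1
            ;;
    esac
    # -p tells cr_checkpoint that the argument is a single process ID.
    printf '%s\n' "cr_checkpoint -p $pid"
}

# Possible usage (left commented out): look up the oldest mpirun_rsh
# process owned by the current user and show what would be run.
# pid=$(pgrep -u "$USER" -o mpirun_rsh) && ckpt_cmd "$pid"
```

Printing the command first makes it easy to confirm that the PID really belongs to mpirun_rsh before anything is checkpointed.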

thanks & regards,
suja

On Friday 15 February 2013 05:26 PM, Suja Ramachandran wrote:
> Hi,
>
> Unfortunately, the same error occurs with mvapich2-1.9a2 too.
> Interestingly, the job runs to completion, and the error appears only
> after the job has finished (in both versions). Likewise, when the
> checkpoint context file is restarted with cr_restart, the job restarts
> and completes, and only then does the error appear. Maybe I can ignore
> the error messages!
>
> thanks and regards,
> Suja
>
> On Friday 15 February 2013 03:21 PM, Raghunath wrote:
>> Suja,
>>
>> There is a known bug in the 1.9-alpha2 version that causes the
>> configure script to look for FUSE when Checkpoint-Restart support is
>> enabled, even after adding the "--disable-ckpt-aggregation" flag. The
>> fix for this will be available as part of the next release.
>>
>> Meanwhile, you can work around this bug by using the alternative
>> "--without-fuse" flag in place of "--disable-ckpt-aggregation".
>> -- 
>> Raghu
>>
>>
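
The workaround above amounts to swapping one flag in the original configure line. A minimal sketch, assuming a POSIX shell: swap_fuse_flag is a hypothetical helper name, and it only prints the rewritten command rather than running configure. The flags in the example call are the ones quoted in this thread.

```shell
#!/bin/sh
# Dry-run sketch: rewrite a configure argument list, replacing
# --disable-ckpt-aggregation with --without-fuse.
# swap_fuse_flag is a hypothetical helper name.
swap_fuse_flag() {
    out=
    for arg in "$@"; do
        case $arg in
            --disable-ckpt-aggregation) arg=--without-fuse ;;
        esac
        # Join arguments with single spaces, without a leading space.
        out="${out:+$out }$arg"
    done
    printf '%s\n' "$out"
}

# Print the adjusted configure line for the build described in the thread.
swap_fuse_flag ./configure --with-device=ch3:mrail --with-rdma=gen2 \
    --disable-ckpt-aggregation --disable-rdma-cm --enable-ckpt \
    --with-blcr=/usr/local --enable-g=all --enable-error-messages=all \
    --enable-shared --with-file-system=nfs --enable-xrc \
    --prefix=/share/apps/mvapich2-1.9a2
```

All other flags pass through unchanged, so the printed line can be compared against the original before it is run.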
>> On Fri, Feb 15, 2013 at 4:45 AM, Suja Ramachandran 
>> <sujaram at igcar.gov.in> wrote:
>>> Hi,
>>>
>>> I am having trouble building mvapich2-1.9a2. I am configuring it with
>>>
>>> ./configure --with-device=ch3:mrail --with-rdma=gen2
>>> --disable-ckpt-aggregation  --disable-rdma-cm --enable-ckpt
>>> --with-blcr=/usr/local --enable-g=all --enable-error-messages=all
>>> --enable-shared --with-file-system=nfs --enable-xrc
>>> --prefix=/share/apps/mvapich2-1.9a2
>>>
>>> It's giving errors:
>>> configure: checking checkpoint aggregation components
>>> checking for library containing fuse_new... no
>>> configure: error: fuse library not found
>>>
>>> I don't have the FUSE library installed. Why is it checking for
>>> checkpoint aggregation components even though I passed the option
>>> --disable-ckpt-aggregation?
>>>
>>> thanks and regards,
>>> suja
>>>
>>>
>>> On Friday 15 February 2013 01:51 PM, Raghunath wrote:
>>>> Suja,
>>>>
>>>>> /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile
>>>>> MV2_IBA_HCA=mlx4_0  MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1
>>>>> MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt
>>>>> MV2_DEBUG_SHOW_BACKTRACE=1 ./vector
>>>>>
>>>>> Also, I have tried it by avoiding the options MV2_IBA_HCA=mlx4_0
>>>>> MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1.
>>>>>
>>>>> The environment variables set are
>>>>> export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
>>>>> export LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH
>>>>>
>>>> Thanks for sending this information. I see nothing strange here.
>>>>
>>>>> I have not yet installed the Fault Tolerance Backplane (FTB). Is that
>>>>> mandatory for checkpoint/restart?
>>>> No, FTB is not mandatory to get Checkpoint/Restart working with
>>>> MVAPICH. The only external library that is "mandatory" for the CR
>>>> mechanism to work is BLCR.
>>>>
>>>>> I will also try with mvapich2-1.9.
>>>>>
>>>>> (FYI, 'vector' is the executable of the vector addition program given
>>>>> here:
>>>>> http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )
>>>> Thanks for this pointer. I was able to checkpoint this sample
>>>> application successfully as well (again, I compiled MVAPICH using the
>>>> same config flags that you have used).
>>>>
>>>> Do let us know if you continue facing issues even after upgrading to
>>>> MVAPICH2-1.9
>>>>
>>>> -- 
>>>> Raghu
>>>>