[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Suja Ramachandran sujaram at igcar.gov.in
Fri Feb 15 06:56:57 EST 2013


Hi,

Unfortunately, the same error occurs with mvapich2-1.9a2 too. Interestingly,
the job completes its execution, and the error appears only after job
completion (in both versions). Also, when the checkpoint context file is
restarted with cr_restart, the job restarts and runs to completion, and only
then does the error appear. Maybe I can ignore the error messages!

thanks and regards,
Suja
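
One way to judge whether messages printed after completion are harmless is to
compare them against the exit status of the run. The helper below is only a
rough sketch: run_and_report and the log file name are hypothetical names
chosen for illustration, not part of MVAPICH2 or BLCR.

```shell
# Hypothetical helper: run a command, keep its stderr in a log file,
# and report the exit status together with the number of stderr lines.
run_and_report() {
    log_file=$1
    shift
    "$@" 2>"$log_file"
    status=$?
    echo "exit status: $status, stderr lines: $(wc -l < "$log_file" | tr -d ' ')"
    return "$status"
}

# Example usage (CONTEXT_FILE is a placeholder for the checkpoint context file):
# run_and_report restart.err cr_restart "$CONTEXT_FILE"
```

An exit status of 0 with the messages confined to the log would suggest the
computation itself finished, though it does not explain the messages.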

On Friday 15 February 2013 03:21 PM, Raghunath wrote:
> Suja,
>
> There is a known bug in the 1.9-alpha2 version that causes the
> configure script to look for FUSE when Checkpoint-Restart support is
> enabled, even after adding the "--disable-ckpt-aggregation" flag. The
> fix for this will be available as part of the next release.
>
> Meanwhile, you can work around this bug by using the alternative
> "--without-fuse" flag in place of "--disable-ckpt-aggregation".
> --
> Raghu
>
>
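
Applied to the configure line quoted in this thread, the suggested workaround
would look roughly like the sketch below. All flags are copied from that
configure line; the only change is swapping the aggregation flag for
--without-fuse. The variable name CONFIGURE_FLAGS is just for illustration,
and the command is printed rather than executed.

```shell
# Flags copied from the configure line quoted in this thread; only the
# aggregation flag is replaced with the suggested --without-fuse.
CONFIGURE_FLAGS="--with-device=ch3:mrail --with-rdma=gen2 \
--without-fuse --disable-rdma-cm --enable-ckpt \
--with-blcr=/usr/local --enable-g=all --enable-error-messages=all \
--enable-shared --with-file-system=nfs --enable-xrc \
--prefix=/share/apps/mvapich2-1.9a2"

# Dry run: print the command. Remove "echo" to actually run it from the
# mvapich2-1.9a2 source directory.
echo ./configure $CONFIGURE_FLAGS
```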
> On Fri, Feb 15, 2013 at 4:45 AM, Suja Ramachandran <sujaram at igcar.gov.in> wrote:
>> Hi,
>>
>> I am having trouble building mvapich2-1.9a2. I am configuring it with
>>
>> ./configure --with-device=ch3:mrail --with-rdma=gen2
>> --disable-ckpt-aggregation  --disable-rdma-cm --enable-ckpt
>> --with-blcr=/usr/local --enable-g=all --enable-error-messages=all
>> --enable-shared --with-file-system=nfs --enable-xrc
>> --prefix=/share/apps/mvapich2-1.9a2
>>
>> It's giving errors:
>> configure: checking checkpoint aggregation components
>> checking for library containing fuse_new... no
>> configure: error: fuse library not found
>>
>> I don't have the FUSE library installed. Why is it checking for checkpoint
>> aggregation components even after I gave the option
>> --disable-ckpt-aggregation?
>>
>> thanks and regards,
>> suja
>>
>>
>> On Friday 15 February 2013 01:51 PM, Raghunath wrote:
>>> Suja,
>>>
>>>>    /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile
>>>> MV2_IBA_HCA=mlx4_0  MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1
>>>> MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt
>>>> MV2_DEBUG_SHOW_BACKTRACE=1 ./vector
>>>>
>>>> I have also tried it without the options MV2_IBA_HCA=mlx4_0
>>>> MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1.
>>>>
>>>> The environment variables set are
>>>> export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
>>>> export LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH
>>> Thanks for sending this information. I see nothing strange here.
>>>
>>>> I have not yet installed the Fault Tolerance Backplane (FTB). Is that
>>>> mandatory for checkpoint/restart?
>>> No, FTB is not mandatory to get Checkpoint/Restart working with
>>> MVAPICH. The only external library that is "mandatory" for the CR
>>> mechanism to work is BLCR.
>>>
>>>> I will also try with mvapich2-1.9
>>>>
>>>> (FYI, 'vector' is the executable of the vector addition program given
>>>> here:
>>>> http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )
>>> Thanks for this pointer. I was able to checkpoint this sample
>>> application successfully as well (again, I compiled MVAPICH using the
>>> same config flags that you have used).
>>>
>>> Do let us know if you continue facing issues even after upgrading to
>>> MVAPICH2-1.9
>>>
>>> --
>>> Raghu
>>>


