[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR
Suja Ramachandran
sujaram at igcar.gov.in
Fri Feb 15 06:56:57 EST 2013
Hi,
Unfortunately, the same error occurs with mvapich2-1.9a2 too. Interestingly,
the job runs to completion, and the error appears only after the job has
finished (in both versions). Also, when the checkpoint context file is
restarted with cr_restart, the job restarts and completes, and only then
does the error appear. Maybe I can ignore the error messages!
thanks and regards,
Suja
On Friday 15 February 2013 03:21 PM, Raghunath wrote:
> Suja,
>
> There is a known bug in the 1.9-alpha2 version that causes the
> configure script to look for FUSE when Checkpoint-Restart support is
> enabled, even after adding the "--disable-ckpt-aggregation" flag. The
> fix for this will be available as part of the next release.
>
> Meanwhile, you can work around this bug by using the alternative
> "--without-fuse" flag in place of "--disable-ckpt-aggregation".
> --
> Raghu
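
A minimal sketch of the workaround described above, assuming every other flag stays exactly as in the configure line quoted further down and that only "--disable-ckpt-aggregation" is swapped for "--without-fuse". The variable name CONFIGURE_ARGS and the guard around ./configure are illustrative only and do not come from the MVAPICH2 documentation.

```shell
# Flags copied from the configure line quoted later in this thread; the only
# change is --without-fuse in place of --disable-ckpt-aggregation.
CONFIGURE_ARGS="--with-device=ch3:mrail --with-rdma=gen2 \
--without-fuse --disable-rdma-cm --enable-ckpt \
--with-blcr=/usr/local --enable-g=all --enable-error-messages=all \
--enable-shared --with-file-system=nfs --enable-xrc \
--prefix=/share/apps/mvapich2-1.9a2"

# Run configure only when invoked from the unpacked source tree; otherwise
# just print the command so it can be reviewed first.
# $CONFIGURE_ARGS is left unquoted on purpose so it splits into separate flags.
if [ -x ./configure ]; then
    ./configure $CONFIGURE_ARGS
else
    echo "./configure $CONFIGURE_ARGS"
fi
```

If configure still stops at the fuse_new check with these flags, the workaround has not taken effect for that build and the output is worth posting back to the list.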
>
>
> On Fri, Feb 15, 2013 at 4:45 AM, Suja Ramachandran <sujaram at igcar.gov.in> wrote:
>> Hi,
>>
>> I have trouble building mvapich2-1.9a2. I am configuring it with
>>
>> ./configure --with-device=ch3:mrail --with-rdma=gen2
>> --disable-ckpt-aggregation --disable-rdma-cm --enable-ckpt
>> --with-blcr=/usr/local --enable-g=all --enable-error-messages=all
>> --enable-shared --with-file-system=nfs --enable-xrc
>> --prefix=/share/apps/mvapich2-1.9a2
>>
>> It's giving errors:
>> configure: checking checkpoint aggregation components
>> checking for library containing fuse_new... no
>> configure: error: fuse library not found
>>
>> I don't have the FUSE library installed. Why is it checking for checkpoint
>> aggregation components even after I passed the option
>> --disable-ckpt-aggregation?
>>
>> thanks and regards,
>> suja
>>
>>
>> On Friday 15 February 2013 01:51 PM, Raghunath wrote:
>>> Suja,
>>>
>>>> /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile
>>>> MV2_IBA_HCA=mlx4_0 MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1
>>>> MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt
>>>> MV2_DEBUG_SHOW_BACKTRACE=1 ./vector
>>>>
>>>> Also, I have tried it by avoiding the options MV2_IBA_HCA=mlx4_0
>>>> MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1.
>>>>
>>>> The environment variables set are
>>>> export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
>>>> export LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH
>>> Thanks for sending this information. I see nothing strange here.
>>>
>>>> I have not yet installed the Fault Tolerance Backplane (FTB). Is that
>>>> mandatory for checkpoint/restart?
>>> No, FTB is not mandatory to get Checkpoint/Restart working with
>>> MVAPICH. The only external library that is "mandatory" for the CR
>>> mechanism to work is BLCR.
>>>
>>>> I will also try with mvapich2-1.9
>>>>
>>>> (FYI, 'vector' is the executable of the vector addition program given
>>>> here:
>>>> http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )
>>> Thanks for this pointer. I was able to checkpoint this sample
>>> application successfully as well (again, I compiled MVAPICH using the
>>> same config flags that you have used).
>>>
>>> Do let us know if you continue facing issues even after upgrading to
>>> MVAPICH2-1.9
>>>
>>> --
>>> Raghu
>>>