[mvapich-discuss] [Checkpoint] BLCR call cr_poll_checkpoint() failed with error 2363: Request is invalid across a restart

Raghunath rajachan at cse.ohio-state.edu
Wed Jul 20 20:42:28 EDT 2011


Hi Alexandr,

We just made a release of MVAPICH2-1.7RC1. You can download a tarball of
this version from the following link:

http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.7rc1.tgz

This version fixes the bug that you were seeing with the automatic
checkpointing feature.
Feel free to try it out and let us know if you have any other problems.

Thanks,
--
Raghu <http://www.cse.ohio-state.edu/%7Erajachan/>


2011/7/17 Raghunath <rajachan at cse.ohio-state.edu>

> Hi Alexandr,
>
> Thanks for posting this issue to the list.
>
> Which version of MVAPICH2 are you using for these runs?
> We were able to reproduce this error with the "Automated Checkpointing
> mode" that you are using, and we are taking a look at it. The "Manual
> Checkpointing mode" works as expected with the latest version,
> MVAPICH2-1.7a2.
>
> You can download a tarball of this version here:
> http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.7a2.tar.gz
>
> While we look into this bug, I suggest using the manual mode to
> checkpoint your application. The commands to trigger a manual checkpoint
> and to restart from a given checkpoint are described in the following
> section of the MVAPICH2 user guide:
>
>
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha2.html#x1-530006.9.1
>
>
> Thanks,
> --
> Raghu <http://www.cse.ohio-state.edu/%7Erajachan/>
>
>
> 2011/7/17 Александр Твеленев <santvel at mail.ru>
>
>> Hello group,
>>
>> I installed BLCR, configured with
>>
>>  --prefix=/home/santvel/BLCR_MVAPICH2/BLCR --enable-static --enable-testsuite
>>
>> and MVAPICH2, configured with
>>
>>  --with-rdma=gen2 --enable-romio --with-file-system=lustre+nfs --with-blcr=/home/santvel/BLCR_MVAPICH2/BLCR --with-blcr-include=/home/santvel/BLCR_MVAPICH2/BLCR/include --with-blcr-libpath=/home/santvel/BLCR_MVAPICH2/BLCR/library
>>
>> on one node of the cluster.
>>
>> I then ran my test program:
>>
>> /home/santvel/BLCR_MVAPICH2/MVAPICH2/bin/mpirun_rsh -np 2 --hostfile /home/santvel/BLCR_MVAPICH2/MVAPICH2/bin/hosts MV2_CKPT_FILE=/home/santvel/TEST/avt/temp MV2_CKPT_INTERVAL=1 MV2_CKPT_MAX_SAVE_CKPTS=3 /home/santvel/TEST/TEST
>>
>> During execution, several automatic checkpoints were taken. After the
>> program crashed, I tried to restart it from one of the checkpoints:
>>
>> cr_restart /home/santvel/TEST/avt/temp.1.auto
>>
>> and got an error message:
>>
>> [mpirun_ckpt.c:680] BLCR call cr_poll_checkpoint() failed with error 2363: Request is invalid across a restart
>>
>> When I try to restart the program from the other checkpoints, I also
>> get an error message.
>>
>> How can I fix this error and restart the program after a crash?
>>
>> Kind regards, Alexandr.
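
The manual checkpoint workflow suggested earlier in this thread can be sketched as a small shell helper. This is only a sketch under stated assumptions, not the procedure from the user guide verbatim: it assumes BLCR's `cr_checkpoint` and `cr_restart` are on the PATH and that the checkpoint is requested against the PID of the `mpirun_rsh` launcher. The function names and the `DRY_RUN` switch are hypothetical and exist only for illustration. By default the helper prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: the helper names and DRY_RUN are illustrative and are not
# part of MVAPICH2 or BLCR.
DRY_RUN="${DRY_RUN:-1}"

# Print a command; execute it only when DRY_RUN=0.
run() {
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Request a checkpoint of the job.
# $1: PID of the mpirun_rsh launcher (assumption: the launcher is the
#     process to checkpoint, as in the manual mode of the user guide).
manual_checkpoint() {
    run cr_checkpoint -p "$1"
}

# Restart the job from a previously written checkpoint file.
# $1: path to the checkpoint file.
restart_from() {
    run cr_restart "$1"
}
```

Calling `manual_checkpoint` with the launcher's PID and later `restart_from` with a checkpoint file prints the two commands; setting `DRY_RUN=0` executes them. The checkpoint file name depends on `MV2_CKPT_FILE`, so check the user guide section linked above for the exact naming before restarting.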