[mvapich-discuss] [Checkpoint] BLCR call cr_poll_checkpoint() failed with error 2363: Request is invalid across a restart

Raghunath rajachan at cse.ohio-state.edu
Sun Jul 17 13:09:36 EDT 2011


Hi Alexandr,

Thanks for posting this issue to the list.

Which version of MVAPICH2 are you using for these runs?
We were able to reproduce this error with the "Automated Checkpointing
mode" that you are using, and we are taking a look at it. The "Manual
Checkpointing mode" works as expected with the latest version,
MVAPICH2-1.7a2.

You can download a tarball of this version here:
http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.7a2.tar.gz

While we look into this bug, I suggest that you use the manual mode to take
a checkpoint of your application. You can find the commands to trigger a
manual checkpoint and to restart from a given checkpoint in the following
section of the MVAPICH2 user guide:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha2.html#x1-530006.9.1
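
For orientation, the manual mode comes down to two BLCR commands:
cr_checkpoint against the PID of the mpirun_rsh process, and cr_restart on
the resulting context file. The sketch below wraps them in two helper
functions. The function names, the DRY_RUN switch, and the BLCR_BIN default
(guessed from the --prefix quoted below) are illustrative assumptions, not
part of MVAPICH2 or BLCR, so please check the user guide section above for
the exact procedure in your version.

```shell
#!/bin/sh
# Sketch only. BLCR_BIN is an assumed location (the BLCR --prefix plus /bin);
# adjust it to your installation.
BLCR_BIN="${BLCR_BIN:-/home/santvel/BLCR_MVAPICH2/BLCR/bin}"

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take a manual checkpoint. $1 is the PID of the running mpirun_rsh process.
manual_checkpoint() {
    run "$BLCR_BIN/cr_checkpoint" -p "$1"
}

# Restart a job. $1 is the checkpoint (context) file saved for mpirun_rsh.
manual_restart() {
    run "$BLCR_BIN/cr_restart" "$1"
}
```

With DRY_RUN left at its default, the functions only print the command line
they would run, which makes it easy to check paths before touching a live
job. Set DRY_RUN=0 to actually execute the commands.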


Thanks,
--
Raghu <http://www.cse.ohio-state.edu/%7Erajachan/>


2011/7/17 Александр Твеленев <santvel at mail.ru>

> Hello group,
> I installed BLCR (configured with
> "--prefix=/home/santvel/BLCR_MVAPICH2/BLCR --enable-static
> --enable-testsuite") and MVAPICH2 (configured with
>
>  --with-rdma=gen2 --enable-romio --with-file-system=lustre+nfs --with-blcr=/home/santvel/BLCR_MVAPICH2/BLCR --with-blcr-include=/home/santvel/BLCR_MVAPICH2/BLCR/include --with-blcr-libpath=/home/santvel/BLCR_MVAPICH2/BLCR/library
>
> ) on one node of the cluster.
>
> I then ran my test program:
>
> /home/santvel/BLCR_MVAPICH2/MVAPICH2/bin/mpirun_rsh -np 2 --hostfile /home/santvel/BLCR_MVAPICH2/MVAPICH2/bin/hosts MV2_CKPT_FILE=/home/santvel/TEST/avt/temp MV2_CKPT_INTERVAL=1 MV2_CKPT_MAX_SAVE_CKPTS=3 /home/santvel/TEST/TEST
>
> During execution, several automatic checkpoints were taken.
> After the crash, I tried to restore the program from one of
> the checkpoints:
>
> cr_restart /home/santvel/TEST/avt/temp.1.auto
>
> and got an error message:
>
> [mpirun_ckpt.c:680] BLCR call cr_poll_checkpoint() failed with error 2363: Request is invalid across a restart
>
> When I attempted to restore the program from the other checkpoints, I got
> an error message as well.
>
> How can I fix this error and restore the program after a crash?
>
> Kind regards, Alexandr.