[mvapich-discuss] cr_restart does not work when you change the hosts file

Xavier Besseron besseron at cse.ohio-state.edu
Wed Jul 20 10:56:27 EDT 2011


Hi Alexandr,

Please make sure that you have followed all the configuration steps in the
"Basic Checkpoint/Restart" section of the user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha2.html#x1-530006.9.1

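Because you want to restart on a different node, you probably need to
disable prelinking on your nodes. With prelinking enabled, shared
libraries may be mapped at different addresses on opt06 than on opt02,
which can make the restarted processes crash.
You can find further details about this here: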
https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
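
As a rough sketch of how to check this on each node: the script below
reports whether prelinking is switched on. It assumes a Red Hat-style
setup where the setting is kept in /etc/sysconfig/prelink as
PRELINKING=yes or PRELINKING=no; the function name prelink_enabled is
only illustrative, and other distributions may keep the setting elsewhere.

```shell
#!/bin/sh
# Sketch: report whether prelinking is enabled on this node.
# Assumption: the setting lives in /etc/sysconfig/prelink (Red Hat style).
prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    # No readable config file: treat prelinking as not enabled.
    [ -r "$conf" ] || return 1
    # Take the last uncommented PRELINKING= assignment, since that is the
    # one that takes effect when the file is sourced.
    value=$(sed -n 's/^[[:space:]]*PRELINKING=//p' "$conf" | tail -n 1 | tr -d "\"'[:space:]")
    [ "$value" = "yes" ]
}

if prelink_enabled; then
    echo "prelinking is enabled on $(uname -n)"
    # To disable it (as root, on every node that may run or restart the job):
    #   1. set PRELINKING=no in /etc/sysconfig/prelink
    #   2. undo the existing prelinking: prelink --undo --all
else
    echo "prelinking is not enabled on $(uname -n)"
fi
```

A checkpoint taken while the libraries were still prelinked may not be
restartable elsewhere, so it is safest to take a fresh checkpoint after
the change.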

Let us know whether this solves your issue.

Thanks,

Xavier


2011/7/20 Александр Твеленев <santvel at mail.ru>:
> Hello group,
> I run my application with the following command:
> "/home/santvel/BLCR_MVAPICH2/MVAPICH2-1.7/bin/mpirun_rsh -np 2 --hostfile
> /home/santvel/BLCR_MVAPICH2/MVAPICH2-1.7/bin/hosts
> MV2_CKPT_FILE=/home/santvel/TEST2/hand/temp MV2_CKPT_MAX_SAVE_CKPTS=10
> /home/santvel/TEST2/TEST"
> Hosts file content:
> opt02
> opt02
> I created a checkpoint using the command "mv2_checkpoint 4784".
> After a failure, I decided to run on node opt06 instead. To do this, I
> changed the hosts file.
> New hosts file content:
> opt06
> opt06
> I then ran cr_restart and got this error message:
> "
> [santvel at opt06 bin]$ cr_restart context.4784
> Restarting...
> [opt06:mpispawn_0][child_handler] MPI process (rank: 1, pid: 5082)
> terminated with signal 11 -> abort job
> [opt06:mpirun_rsh][wait_for_mpispawn] mpispawn_0 from node opt06 aborted:
> MPI process error (1)
> "
> How can I fix this error and restore the program after a crash?
> Kind regards, Alexandr.
> 
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>


