[mvapich-discuss] cr_restart does not work when you change the hosts file

Александр Твеленев santvel at mail.ru
Wed Jul 20 04:30:28 EDT 2011


Hello group, 

I run my application with the following command: 
"/home/santvel/BLCR_MVAPICH2/MVAPICH2-1.7/bin/mpirun_rsh -np 2 --hostfile /home/santvel/BLCR_MVAPICH2/MVAPICH2-1.7/bin/hosts MV2_CKPT_FILE=/home/santvel/TEST2/hand/temp MV2_CKPT_MAX_SAVE_CKPTS=10 /home/santvel/TEST2/TEST" 

Contents of the hosts file: 
>opt02 
>opt02 

I then create a checkpoint with the command "mv2_checkpoint 4784". 

After a failure, I decided to restart the job on node opt06. To do this, I changed the hosts file. 

New contents of the hosts file: 
>opt06 
>opt06 

I then ran cr_restart and got this error message: 

" 
[santvel at opt06 bin]$ cr_restart context.4784  
Restarting... 
[opt06:mpispawn_0][child_handler] MPI process (rank: 1, pid: 5082) terminated with signal 11 -> abort job 
[opt06:mpirun_rsh][wait_for_mpispawn] mpispawn_0 from node opt06 aborted: MPI process error (1) 
" 

How can I fix this error and restore the program after a crash? 

Kind regards, Alexandr.
