[mvapich-discuss] cr_restart problem when use mpirun_rsh

Sonya Marcarelli smarcare at cse.ohio-state.edu
Thu Dec 3 17:32:44 EST 2009


Hi Cai,
You could try testing with blcr-0.8.0. With that release we can 
checkpoint and restart the job successfully.

Sonya Marcarelli

Jonathan Perkins wrote:
> On Thu, Dec 03, 2009 at 03:29:29PM +0800, cai jingnan wrote:
>   
>> Hi,
>> I ran into a problem while testing the checkpoint function of
>> mvapich2.
>>
>> blcr version: blcr-0.8.1
>> mvapich2 version: mvapich2-1.4-2009-12-01      (or previous version)
>> linux kernel: 2.6.9
>> --------------------------------
>> export CFLAGS=-D_GNU_SOURCE
>> ./configure --prefix=/home/jn/bin/mvapich2 --with-rdma=gen2 --with-pm=mpd \
>>     --enable-blcr --with-blcr-libpath=/home/jn/bin/blcr/lib \
>>     --with-blcr-include=/home/jn/bin/blcr/include
>> make
>> make install
>> --------------------------------
>>
>> When I use mpd, I can checkpoint and restart the job successfully.
>>
>> When I use mpirun_rsh as follows:
>>
>> mpirun_rsh -ssh -np 4 -hostfile ./hostfile \
>>     MV2_CKPT_FILE=/home/jn/nouse/mpickt ./cg.B.4
>>
>> I can also checkpoint it successfully with mv2_checkpoint,
>> but the restart fails. The output of cr_restart is as follows:
>> --------------------------------
>> $cr_restart ./context.7104
>> Restarting...
>> Exit code -5 signaled from node60
>> [CR Restart] execv: No such file or directory
>> [CR Restart] execv: No such file or directory
>> MPI process (rank: 0) terminated unexpectedly on node60
>> --------------------------------
>>
>> So what should I do now?
>> Thank you!
>>     
>
> Thanks for the report, we're looking into this issue and will get back
> to you with our findings.


