[mvapich-discuss] cr_restart problem when use mpirun_rsh

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Dec 3 15:03:01 EST 2009


On Thu, Dec 03, 2009 at 03:29:29PM +0800, cai jingnan wrote:
> Hi:
> I ran into a problem while testing the checkpoint feature of
> mvapich2.
> 
> blcr version: blcr-0.8.1
> mvapich2 version: mvapich2-1.4-2009-12-01      (or previous version)
> linux kernel: 2.6.9
> --------------------------------
> export CFLAGS=-D_GNU_SOURCE
> ./configure --prefix=/home/jn/bin/mvapich2 --with-rdma=gen2 --with-pm=mpd \
>     --enable-blcr --with-blcr-libpath=/home/jn/bin/blcr/lib \
>     --with-blcr-include=/home/jn/bin/blcr/include
> make
> make install
> --------------------------------
> 
> When I use mpd, I can checkpoint and restart the job successfully.
> 
> When I use mpirun_rsh as follows:
> 
> mpirun_rsh -ssh -np 4 -hostfile ./hostfile \
>     MV2_CKPT_FILE=/home/jn/nouse/mpickt ./cg.B.4
> 
> I can also checkpoint it successfully with mv2_checkpoint,
> but the restart fails. The output of cr_restart is as follows:
> --------------------------------
> $cr_restart ./context.7104
> Restarting...
> Exit code -5 signaled from node60
> [CR Restart] execv: No such file or directory
> [CR Restart] execv: No such file or directory
> MPI process (rank: 0) terminated unexpectedly on node60
> --------------------------------
> 
> What should I do now?
> Thank you!

Thanks for the report, we're looking into this issue and will get back
to you with our findings.
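
In the meantime, one thing that may be worth ruling out. This is only an
assumption, not a confirmed diagnosis: "execv: No such file or directory"
means that some program being exec'd could not be found at the path used,
and since BLCR is installed under a non-default prefix here, cr_restart
may not resolve in the PATH of a non-interactive ssh shell on every node.
The sketch below checks that for each host in a hostfile. The function
name check_hosts and the REMOTE_SH variable are hypothetical helpers made
up for this illustration, not part of mvapich2 or BLCR.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch: report whether a command resolves in the
# non-interactive remote shell of each host listed in a hostfile.
# REMOTE_SH is an illustrative override; it defaults to ssh.
REMOTE_SH=${REMOTE_SH:-ssh}

check_hosts() {
    hostfile=$1   # one host name per line; text after the name is ignored
    cmd=$2        # command to look up remotely, e.g. cr_restart
    status=0
    while read -r host _; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        # </dev/null keeps the remote shell from consuming the hostfile.
        if $REMOTE_SH "$host" "command -v $cmd" >/dev/null 2>&1 </dev/null; then
            echo "$host: $cmd found"
        else
            echo "$host: $cmd NOT found"
            status=1
        fi
    done < "$hostfile"
    return $status
}
```

For example, running "check_hosts ./hostfile cr_restart" prints one line
per host and returns non-zero if any host cannot find the command. The
same check can be repeated for the application binary (./cg.B.4 is given
as a relative path, so the working directory on the remote side matters).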

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo