[mvapich-discuss] cr_restart problem when use mpirun_rsh

cai jingnan keepeppy at gmail.com
Thu Dec 3 02:29:29 EST 2009


Hi,
I have run into a problem while testing the checkpoint function of
MVAPICH2.

blcr version: blcr-0.8.1
mvapich2 version: mvapich2-1.4-2009-12-01 (the same happens with earlier versions)
linux kernel: 2.6.9
--------------------------------
export CFLAGS=-D_GNU_SOURCE
./configure --prefix=/home/jn/bin/mvapich2 --with-rdma=gen2 --with-pm=mpd \
    --enable-blcr --with-blcr-libpath=/home/jn/bin/blcr/lib \
    --with-blcr-include=/home/jn/bin/blcr/include
make
make install
--------------------------------

When I use mpd, I can checkpoint and restart the job successfully.

When I use mpirun_rsh as follows:

mpirun_rsh -ssh -np 4 -hostfile ./hostfile \
    MV2_CKPT_FILE=/home/jn/nouse/mpickt ./cg.B.4

I can also checkpoint it successfully with mv2_checkpoint,
but restarting it fails. The output of cr_restart is:
--------------------------------
$cr_restart ./context.7104
Restarting...
Exit code -5 signaled from node60
[CR Restart] execv: No such file or directory
[CR Restart] execv: No such file or directory
MPI process (rank: 0) terminated unexpectedly on node60
--------------------------------
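The log does not say which path the failing execv() call was given. Since execv(), unlike execvp(), does not search PATH, my unconfirmed guess is that the restart step tries to run a program that does not exist at the expected path on one of the nodes. Below is a rough sketch of how I could check this. It assumes the program involved is one of the BLCR tools (cr_restart, cr_checkpoint) and that ./hostfile lists one host per line; the script and its names (HOSTFILE, TOOLS, report_tools) are my own and are not part of MVAPICH2 or BLCR.

```shell
#!/bin/sh
# Sketch only: report where the BLCR tools resolve, locally and on each host.
# Assumption: the failing execv() target is one of these tools; the log does
# not actually say which path was passed to execv().

HOSTFILE=${HOSTFILE:-./hostfile}   # same hostfile given to mpirun_rsh
TOOLS="cr_restart cr_checkpoint"   # BLCR command-line tools to look for

# Print "<tool>: <path>" or "<tool>: NOT FOUND" for each tool in this shell's PATH.
report_tools() {
  for t in $TOOLS; do
    echo "$t: $(command -v "$t" || echo NOT FOUND)"
  done
}

echo "== local ($(uname -n)) =="
report_tools

# Repeat the check through a non-interactive ssh session on every listed host,
# which is closer to the environment the remote processes see with -ssh.
if [ -r "$HOSTFILE" ]; then
  while read -r host _rest; do
    host=${host%%:*}                        # drop a ":count" suffix, if any
    case $host in ''|\#*) continue ;; esac  # skip blank and comment lines
    echo "== $host =="
    ssh -n -o BatchMode=yes "$host" \
      "for t in $TOOLS; do echo \"\$t: \$(command -v \"\$t\" || echo NOT FOUND)\"; done"
  done < "$HOSTFILE"
fi
```

Saved as, say, check_blcr.sh and run with `sh check_blcr.sh` (or `HOSTFILE=/path/to/hostfile sh check_blcr.sh`), it prints one line per tool for each node. A NOT FOUND, or a different path on some node, would fit the execv error above, but I have not confirmed that this is the cause.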

So what should I do now?
Thank you!