[mvapich-discuss] mvapich2-1.7rc1 with BLCR
尹万旺 (Yin Wanwang)
yinwanwang at gmail.com
Tue Sep 27 11:10:21 EDT 2011
Hi,
I have a problem with checkpoint/restart in mvapich2-1.7rc1.
I configured mvapich2-1.7rc1 like this:
PREFIX=${PREFIX:-$HOME/mv2-normal}
export CC=${CC:-icc}
export CXX=${CXX:-icpc}
export F77=${F77:-ifort}
export FC=${FC:-ifort}
./configure --with-arch=LINUX -prefix=${PREFIX} --enable-a=all --enable-error-message=all --with-device=ch3:mrail --with-rdma=gen2 --with-ib-include=/usr/local/ib_hpc/include --with-ib-libpath=/usr/local/ib_hpc/lib64 --disable-rdma-cm --enable-romio --enable-ckpt --with-blcr=/opt/blcr --enable-ckpt-migration --with-ftb=/opt/ftb --with-hydra-ckpointlib=blcr
I run the program like this:
mpirun_rsh -np 4 -hostfile ./host MV2_CKPT_FILE=/tmp/app.ckpt MV2_CKPT_USE_AGGREGATION=0 ./bt.A.4
My hostfile:
cn01
cn01
cn02
cn02
That means bt.A.4 runs on two nodes, with two processes on each.
I took a checkpoint with "mv2_checkpoint" and it completed successfully, like this:
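To make the placement explicit, here is a small Python sketch. It is not part of MVAPICH2, and the function name ranks_by_host is made up for illustration. It assumes rank i is placed on hostfile line i (wrapping around if there are more processes than lines), which is consistent with the rank numbers that cn02 reports in the error output in this message.

```python
from collections import OrderedDict


def ranks_by_host(hostfile_text, nprocs):
    """Group ranks by host, ASSUMING rank i goes to hostfile line i (wrapping)."""
    # Keep every non-empty line; a repeated host means one more slot on it.
    hosts = [line.strip() for line in hostfile_text.splitlines() if line.strip()]
    placement = OrderedDict()
    for rank in range(nprocs):
        host = hosts[rank % len(hosts)]
        placement.setdefault(host, []).append(rank)
    return placement


# The hostfile from this message, with -np 4.
HOSTFILE = """cn01
cn01
cn02
cn02
"""

placement = ranks_by_host(HOSTFILE, 4)
for host, ranks in placement.items():
    print(host, ranks)
```

Under that assumption, ranks 0 and 1 land on cn01 and ranks 2 and 3 on cn02.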
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 64x 64x 64
Iterations: 200 dt: 0.0008000
Number of active processes: 4
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
[3]: CR completed...
[0]: CR completed...
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
When I want to restart the program, I use "cr_restart context.5826".
But errors like these occur:
[cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or directory (2)
[cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or directory (2)
[cn02:mpispawn_1][child_handler] MPI process (rank: 3, pid: 2587) exited with status 1
[cn02:mpispawn_1][child_handler] MPI process (rank: 2, pid: 2586) exited with status 1
open: No such file or directory
Rank 0 cannot open /tmp/cr.session.2692250545826
open: No such file or directory
Rank 1 cannot open /tmp/cr.session.2692250545826
CR_Callback: Rank[1] CR_FTB_Init() Failed
[cli_1]: connect failed with connection refused
[cli_1]: Unable to connect to cn01 on 34880
[Rank 1][cr.c:918] PMI_Init failed