[mvapich-discuss] mvapich2-1.7rc1 with BLCR
Xavier Besseron
besseron at cse.ohio-state.edu
Tue Sep 27 12:07:12 EDT 2011
Hi,
Based on your error message (execvp() failed: No such file or
directory), it looks like the cr_restart command cannot be found on
cn02. This command is provided by BLCR and needs to be in your PATH on
all the nodes. You can check this by running "which cr_restart" on
each node.
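The per-node check can be scripted. The following is only a sketch: it assumes passwordless ssh to every host in the hostfile, and both the function name check_cr_restart and the REMOTE_SHELL override are hypothetical helpers introduced for illustration, not part of MVAPICH2 or BLCR.

```shell
# Sketch: report where cr_restart resolves on each distinct host in a hostfile.
# Assumptions: passwordless ssh works; REMOTE_SHELL (hypothetical) can replace
# ssh with another launcher, e.g. for a dry run.
check_cr_restart() {
    local hostfile="$1" host path missing=0
    # Hosts listed several times (one line per process) are checked once.
    for host in $(sort -u "$hostfile"); do
        # stdin is redirected so the remote shell cannot consume our input.
        if path=$(${REMOTE_SHELL:-ssh} "$host" 'command -v cr_restart' < /dev/null); then
            echo "$host: $path"
        else
            echo "$host: cr_restart not found in PATH"
            missing=1
        fi
    done
    # Non-zero exit status if at least one host lacks cr_restart.
    return "$missing"
}
```

With the hostfile from the original message this would be called as "check_cr_restart ./host". The remote command runs in a non-interactive shell, which is close to the environment a launcher started over ssh would see, so the result may differ from what an interactive login on the same node reports.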
Can you try after adding the following line to your ~/.bashrc?
export PATH="${PATH}:/opt/blcr/bin"
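Because ~/.bashrc may be read more than once (for example by nested shells), a guarded variant avoids appending the same directory repeatedly. This is a minimal sketch; path_append is a hypothetical helper name, and /opt/blcr/bin is taken from the --with-blcr=/opt/blcr option in the original message.

```shell
# Sketch: append a directory to PATH only if it is not already listed.
path_append() {
    case ":${PATH}:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="${PATH}:$1" ;;   # otherwise append at the end
    esac
}

path_append /opt/blcr/bin
export PATH
```

On some distributions the default ~/.bashrc returns early for non-interactive shells; in that case these lines probably need to sit above that early return to take effect for processes launched remotely.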
Let us know if this solves your issue.
Thank you.
Xavier
2011/9/27 尹万旺 <yinwanwang at gmail.com>:
> Hi
> I have a problem with mvapich2-1.7rc1!
> I configured mvapich2-1.7rc1 like this:
>
> PREFIX=${PREFIX:-$HOME/mv2-normal}
> export CC=${CC:-icc}
> export CXX=${CXX:-icpc}
> export F77=${F77:-ifort}
> export FC=${FC:-ifort}
>
> ./configure --with-arch=LINUX -prefix=${PREFIX} --enable-a=all
> --enable-error-message=all --with-device=ch3:mrail --with-rdma=gen2
> --with-ib-include=/usr/local/ib_hpc/include
> --with-ib-libpath=/usr/local/ib_hpc/lib64 --disable-rdma-cm --enable-romio
> --enable-ckpt --with-blcr=/opt/blcr --enable-ckpt-migration
> --with-ftb=/opt/ftb --with-hydra-ckpointlib=blcr
>
> I run the program like this:
>
> mpirun_rsh -np 4 -hostfile ./host MV2_CKPT_FILE=/tmp/app.ckpt
> MV2_CKPT_USE_AGGREGATION=0 ./bt.A.4
>
> my hostfile:
>
> cn01
> cn01
> cn02
> cn02
> That means bt.A.4 will run on two nodes.
> I use "mv2_checkpoint" and the checkpointing is successful, like this:
>
> NAS Parallel Benchmarks 3.3 -- BT Benchmark
>
>
>
> No input file inputbt.data. Using compiled defaults
>
> Size: 64x 64x 64
>
> Iterations: 200 dt: 0.0008000
>
> Number of active processes: 4
>
>
>
> Time step 1
>
> Time step 20
>
> Time step 40
>
> Time step 60
>
> Time step 80
>
> [3]: CR completed...
>
> [0]: CR completed...
>
> Time step 100
>
> Time step 120
>
> Time step 140
>
> Time step 160
>
> Time step 180
>
> Time step 200
>
> When I want to restart the program, I use "cr_restart context.5826".
>
> But some errors occur, like this:
>
>
>
> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or
> directory (2)
>
> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or
> directory (2)
>
> [cn02:mpispawn_1][child_handler] MPI process (rank: 3, pid: 2587) exited
> with status 1
>
> [cn02:mpispawn_1][child_handler] MPI process (rank: 2, pid: 2586) exited
> with status 1
>
> open: No such file or directory
>
> Rank 0 cannot open /tmp/cr.session.2692250545826
>
> open: No such file or directory
>
> Rank 1 cannot open /tmp/cr.session.2692250545826
>
> CR_Callback: Rank[1] CR_FTB_Init() Failed
>
> [cli_1]: connect failed with connection refused
>
> [cli_1]: Unable to connect to cn01 on 34880
>
> [Rank 1][cr.c:918] PMI_Init failed
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>