[mvapich-discuss] mvapich2-1.7rc1 with BLCR

Xavier Besseron besseron at cse.ohio-state.edu
Tue Sep 27 12:07:12 EDT 2011


Hi,

Based on your error message (execvp() failed: No such file or
directory), it looks like the cr_restart command cannot be found on
cn02. This command is provided by BLCR and needs to be in your PATH
on all the nodes. You can check this by running "which cr_restart" on
each node.
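
To run that check on every node at once, something like the following
might help. It is only a sketch: it assumes passwordless ssh to the
nodes and a hostfile like the ./host file from your message. The
function name check_cmd_on_hosts and the REMOTE_SH override are
hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Sketch: report the hosts on which a command is missing from the PATH
# seen by a non-interactive remote shell.
# REMOTE_SH is a hypothetical hook (defaults to ssh) so the loop can be
# tried without a cluster.
check_cmd_on_hosts() {
    hostfile=$1
    cmd=$2
    missing=0
    # The hostfile may list a node several times (one line per process),
    # so each node is checked only once.
    for host in $(sort -u "$hostfile"); do
        if ${REMOTE_SH:-ssh} "$host" "command -v $cmd" </dev/null >/dev/null 2>&1; then
            echo "$host: $cmd found"
        else
            echo "$host: $cmd NOT found in PATH"
            missing=1
        fi
    done
    # Non-zero exit status if at least one host lacks the command.
    return $missing
}
```

Usage would be "check_cmd_on_hosts ./host cr_restart". The check runs
through a non-interactive shell, which is probably closer to the
environment the launcher sees than an interactive login.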

Can you try again after adding the following line to your ~/.bashrc?

export PATH="${PATH}:/opt/blcr/bin"
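
If your ~/.bashrc gets sourced more than once, a guarded variant keeps
PATH from accumulating duplicate entries. This is just a sketch; the
helper name path_append is made up for illustration, and /opt/blcr/bin
is taken from the --with-blcr=/opt/blcr option in your configure line.

```shell
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":${PATH}:" in
        *":$1:"*) ;;                          # already present, nothing to do
        *) PATH="${PATH}:$1"; export PATH ;;  # append and export
    esac
}

path_append /opt/blcr/bin
```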

Let us know if this solves your issue.
Thank you.


Xavier



2011/9/27 尹万旺 <yinwanwang at gmail.com>:
> Hi,
> I have a problem with mvapich2-1.7rc1!
> I configure mvapich2-1.7rc1 like this:
>
> PREFIX=${PREFIX:-$HOME/mv2-normal}
>
> export CC=${CC:-icc}
>
> export CXX=${CXX:-icpc}
>
> export F77=${F77:-ifort}
>
> export FC=${FC:-ifort}
>
> ./configure --with-arch=LINUX -prefix=${PREFIX} --enable-a=all
> --enable-error-message=all --with-device=ch3:mrail --with-rdma=gen2
> --with-ib-include=/usr/local/ib_hpc/include
> --with-ib-libpath=/usr/local/ib_hpc/lib64 --disable-rdma-cm --enable-romio
> --enable-ckpt --with-blcr=/opt/blcr --enable-ckpt-migration
> --with-ftb=/opt/ftb --with-hydra-ckpointlib=blcr
>
> I run the program like this:
>
> mpirun_rsh  -np 4 -hostfile ./host MV2_CKPT_FILE=/tmp/app.ckpt
> MV2_CKPT_USE_AGGREGATION=0 ./bt.A.4
>
> My hostfile:
>
> cn01
> cn01
> cn02
> cn02
> That means bt.A.4 will run on two nodes.
> I use "mv2_checkpoint" and the checkpointing is successful, like this:
>
> NAS Parallel Benchmarks 3.3 -- BT Benchmark
>
>
>
>  No input file inputbt.data. Using compiled defaults
>
>  Size:   64x  64x  64
>
>  Iterations:  200    dt:   0.0008000
>
>  Number of active processes:     4
>
>
>
>  Time step    1
>
>  Time step   20
>
>  Time step   40
>
>  Time step   60
>
>  Time step   80
>
> [3]:  CR completed...
>
> [0]:  CR completed...
>
>  Time step  100
>
>  Time step  120
>
>  Time step  140
>
>  Time step  160
>
>  Time step  180
>
>  Time step  200
>
> When I want to restart the program, I use "cr_restart context.5826".
>
> But some errors occur, like this:
>
>
>
> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or
> directory (2)
>
> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or
> directory (2)
>
> [cn02:mpispawn_1][child_handler] MPI process (rank: 3, pid: 2587) exited
> with status 1
>
> [cn02:mpispawn_1][child_handler] MPI process (rank: 2, pid: 2586) exited
> with status 1
>
> open: No such file or directory
>
> Rank 0 cannot open /tmp/cr.session.2692250545826
>
> open: No such file or directory
>
> Rank 1 cannot open /tmp/cr.session.2692250545826
>
> CR_Callback: Rank[1] CR_FTB_Init() Failed
>
> [cli_1]: connect failed with connection refused
>
> [cli_1]: Unable to connect to cn01 on 34880
>
> [Rank 1][cr.c:918] PMI_Init failed
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>


