[mvapich-discuss] mvapich2-1.7rc1 with BLCR
Xavier Besseron
besseron at cse.ohio-state.edu
Wed Sep 28 09:55:14 EDT 2011
For everybody's information,
the problem has been solved by adding the BLCR bin directory to the PATH in ~/.bashrc.
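In case it helps others hitting the same error, here is a minimal sketch (not part of MVAPICH2 or BLCR) for checking that cr_restart is found on every node listed in a hostfile. It assumes one host name per line, passwordless ssh to each node, and a remote shell that understands "command -v". The function names and the RSH override variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether a command (default: cr_restart) is found on the
# PATH of every distinct host listed in a hostfile (one host name per line).

# Print each distinct host once, skipping blank lines and "#" comments.
unique_hosts() {
    awk '{ sub(/#.*/, ""); if (NF) print $1 }' "$1" | sort -u
}

# Run "command -v <cmd>" on one host through a remote shell.
# RSH is a hypothetical override (default: ssh), handy for dry runs.
check_host() {
    host=$1
    cmd=${2:-cr_restart}
    if ${RSH:-ssh} "$host" "command -v $cmd" >/dev/null 2>&1; then
        echo "$host: $cmd found"
    else
        echo "$host: $cmd NOT found"
        return 1
    fi
}

# Check every host in the hostfile; return non-zero if any host fails.
check_all_hosts() {
    hostfile=$1
    cmd=${2:-cr_restart}
    status=0
    for host in $(unique_hosts "$hostfile"); do
        # stdin is redirected so the remote shell cannot swallow input
        check_host "$host" "$cmd" </dev/null || status=1
    done
    return $status
}
```

After sourcing these functions, something like "check_all_hosts ./host cr_restart" prints one line per node. It only reports the PATH seen by a non-interactive ssh session, so treat the result as a hint rather than proof of what the MPI launcher sees.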
Xavier
2011/9/27 Xavier Besseron <besseron at cse.ohio-state.edu>:
> Hi,
>
> Based on your error message, it looks like the cr_restart command is
> not found. This command is provided by BLCR. It needs to be in your
> PATH on all the nodes. You can check this using the command "which
> cr_restart".
>
> Can you try after adding the following line to your ~/.bashrc?
>
> export PATH="${PATH}:/opt/blcr/bin"
>
> Let us know if this solves your issue.
> Thank you.
>
>
> Xavier
>
>
>
> 2011/9/27 尹万旺 <yinwanwang at gmail.com>:
>> Hi,
>> I have a problem with mvapich2-1.7rc1.
>> I configured mvapich2-1.7rc1 like this:
>>
>> PREFIX=${PREFIX:-$HOME/mv2-normal}
>> export CC=${CC:-icc}
>> export CXX=${CXX:-icpc}
>> export F77=${F77:-ifort}
>> export FC=${FC:-ifort}
>>
>> ./configure --with-arch=LINUX -prefix=${PREFIX} --enable-a=all \
>>     --enable-error-message=all --with-device=ch3:mrail --with-rdma=gen2 \
>>     --with-ib-include=/usr/local/ib_hpc/include \
>>     --with-ib-libpath=/usr/local/ib_hpc/lib64 --disable-rdma-cm --enable-romio \
>>     --enable-ckpt --with-blcr=/opt/blcr --enable-ckpt-migration \
>>     --with-ftb=/opt/ftb --with-hydra-ckpointlib=blcr
>>
>> I run the program like this:
>>
>> mpirun_rsh -np 4 -hostfile ./host MV2_CKPT_FILE=/tmp/app.ckpt \
>>     MV2_CKPT_USE_AGGREGATION=0 ./bt.A.4
>>
>> My hostfile:
>>
>> cn01
>> cn01
>> cn02
>> cn02
>> That means bt.A.4 runs on two nodes.
>> I use "mv2_checkpoint" and the checkpoint succeeds, like this:
>>
>> NAS Parallel Benchmarks 3.3 -- BT Benchmark
>>
>> No input file inputbt.data. Using compiled defaults
>> Size: 64x 64x 64
>> Iterations: 200 dt: 0.0008000
>> Number of active processes: 4
>>
>> Time step 1
>> Time step 20
>> Time step 40
>> Time step 60
>> Time step 80
>> [3]: CR completed...
>> [0]: CR completed...
>> Time step 100
>> Time step 120
>> Time step 140
>> Time step 160
>> Time step 180
>> Time step 200
>>
>> When I want to restart the program, I use "cr_restart context.5826".
>>
>> But some errors occur, like this:
>>
>>
>>
>> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or directory (2)
>> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or directory (2)
>> [cn02:mpispawn_1][child_handler] MPI process (rank: 3, pid: 2587) exited with status 1
>> [cn02:mpispawn_1][child_handler] MPI process (rank: 2, pid: 2586) exited with status 1
>> open: No such file or directory
>> Rank 0 cannot open /tmp/cr.session.2692250545826
>> open: No such file or directory
>> Rank 1 cannot open /tmp/cr.session.2692250545826
>> CR_Callback: Rank[1] CR_FTB_Init() Failed
>> [cli_1]: connect failed with connection refused
>> [cli_1]: Unable to connect to cn01 on 34880
>> [Rank 1][cr.c:918] PMI_Init failed
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>