[mvapich-discuss] mvapich2-1.7rc1 with BLCR

Xavier Besseron besseron at cse.ohio-state.edu
Wed Sep 28 09:55:14 EDT 2011


For everybody's information,
the problem has been solved by adding the BLCR bin directory to PATH in ~/.bashrc.
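
For anyone hitting the same execvp() error, here is a minimal sketch of how the fix could be applied and checked on every node. The helper names (add_blcr_to_rc, unique_hosts, check_cr_restart) and the BLCR_BIN variable are made up for illustration. /opt/blcr/bin is the install path mentioned in this thread, and the check assumes passwordless ssh to each host in the hostfile.

```shell
#!/bin/sh
# Sketch only: BLCR_BIN and the helper names are illustrative assumptions,
# not part of MVAPICH2 or BLCR.
BLCR_BIN="${BLCR_BIN:-/opt/blcr/bin}"

# Append the PATH export to an rc file only if it is not already there.
add_blcr_to_rc() {
    rc_file="$1"
    line="export PATH=\"\${PATH}:${BLCR_BIN}\""
    touch "$rc_file"
    grep -qxF "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}

# Print each host name from an mpirun_rsh-style hostfile once.
unique_hosts() {
    sort -u "$1"
}

# Report, for each host, whether a non-interactive shell can find cr_restart.
check_cr_restart() {
    for host in $(unique_hosts "$1"); do
        if ssh "$host" 'command -v cr_restart' >/dev/null 2>&1; then
            echo "$host: cr_restart found"
        else
            echo "$host: cr_restart NOT found"
        fi
    done
}

# Example usage (run on a node where ~/.bashrc is shared or copied to all nodes):
#   add_blcr_to_rc "$HOME/.bashrc"
#   check_cr_restart ./host
```

Note that some default ~/.bashrc files return early for non-interactive shells. If yours does, the export line may need to go above that check so that processes started through ssh also see it.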


Xavier


2011/9/27 Xavier Besseron <besseron at cse.ohio-state.edu>:
> Hi,
>
> Based on your error message, it looks like the cr_restart command is
> not found. This command is provided by BLCR. It needs to be in your
> PATH on all the nodes. You can check this using the command "which
> cr_restart".
>
> Can you try after adding the following line to your ~/.bashrc?
>
> export PATH="${PATH}:/opt/blcr/bin"
>
> Let us know if this solves your issue.
> Thank you.
>
>
> Xavier
>
>
>
> 2011/9/27 尹万旺 <yinwanwang at gmail.com>:
>> Hi,
>> I have a problem with mvapich2-1.7rc1.
>> I configured mvapich2-1.7rc1 like this:
>>
>> PREFIX=${PREFIX:-$HOME/mv2-normal}
>> export CC=${CC:-icc}
>> export CXX=${CXX:-icpc}
>> export F77=${F77:-ifort}
>> export FC=${FC:-ifort}
>>
>> ./configure --with-arch=LINUX -prefix=${PREFIX} --enable-a=all
>> --enable-error-message=all --with-device=ch3:mrail --with-rdma=gen2
>> --with-ib-include=/usr/local/ib_hpc/include
>> --with-ib-libpath=/usr/local/ib_hpc/lib64 --disable-rdma-cm --enable-romio
>> --enable-ckpt --with-blcr=/opt/blcr --enable-ckpt-migration
>> --with-ftb=/opt/ftb --with-hydra-ckpointlib=blcr
>>
>> I run the program like this:
>>
>> mpirun_rsh  -np 4 -hostfile ./host MV2_CKPT_FILE=/tmp/app.ckpt
>> MV2_CKPT_USE_AGGREGATION=0 ./bt.A.4
>>
>> My hostfile:
>>
>> cn01
>> cn01
>> cn02
>> cn02
>> That means bt.A.4 runs on two nodes.
>> I use "mv2_checkpoint", and the checkpoint succeeds, like this:
>>
>> NAS Parallel Benchmarks 3.3 -- BT Benchmark
>>
>>
>>
>>  No input file inputbt.data. Using compiled defaults
>>
>>  Size:   64x  64x  64
>>
>>  Iterations:  200    dt:   0.0008000
>>
>>  Number of active processes:     4
>>
>>
>>
>>  Time step    1
>>
>>  Time step   20
>>
>>  Time step   40
>>
>>  Time step   60
>>
>>  Time step   80
>>
>> [3]:  CR completed...
>>
>> [0]:  CR completed...
>>
>>  Time step  100
>>
>>  Time step  120
>>
>>  Time step  140
>>
>>  Time step  160
>>
>>  Time step  180
>>
>>  Time step  200
>>
>> When I want to restart the program, I use "cr_restart context.5826".
>>
>> But errors occur, like this:
>>
>>
>>
>> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or
>> directory (2)
>>
>> [cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or
>> directory (2)
>>
>> [cn02:mpispawn_1][child_handler] MPI process (rank: 3, pid: 2587) exited
>> with status 1
>>
>> [cn02:mpispawn_1][child_handler] MPI process (rank: 2, pid: 2586) exited
>> with status 1
>>
>> open: No such file or directory
>>
>> Rank 0 cannot open /tmp/cr.session.2692250545826
>>
>> open: No such file or directory
>>
>> Rank 1 cannot open /tmp/cr.session.2692250545826
>>
>> CR_Callback: Rank[1] CR_FTB_Init() Failed
>>
>> [cli_1]: connect failed with connection refused
>>
>> [cli_1]: Unable to connect to cn01 on 34880
>>
>> [Rank 1][cr.c:918] PMI_Init failed
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>


