[mvapich-discuss] mvapich2-1.7rc1 with BLCR

尹万旺 yinwanwang at gmail.com
Tue Sep 27 11:10:21 EDT 2011


Hi,
I have a problem with mvapich2-1.7rc1.
I configured mvapich2-1.7rc1 like this:

PREFIX=${PREFIX:-$HOME/mv2-normal}

export CC=${CC:-icc}

export CXX=${CXX:-icpc}

export F77=${F77:-ifort}

export FC=${FC:-ifort}

./configure --with-arch=LINUX -prefix=${PREFIX} --enable-a=all --enable-error-message=all --with-device=ch3:mrail --with-rdma=gen2 --with-ib-include=/usr/local/ib_hpc/include --with-ib-libpath=/usr/local/ib_hpc/lib64 --disable-rdma-cm --enable-romio --enable-ckpt --with-blcr=/opt/blcr --enable-ckpt-migration --with-ftb=/opt/ftb --with-hydra-ckpointlib=blcr

I run the program like this:

mpirun_rsh  -np 4 -hostfile ./host MV2_CKPT_FILE=/tmp/app.ckpt  MV2_CKPT_USE_AGGREGATION=0 ./bt.A.4

My hostfile:

cn01
cn01
cn02
cn02

That means bt.A.4 runs on two nodes, with two processes on each.
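
For reference, the per-node process count implied by a hostfile like this can be checked with standard tools. This is a minimal sketch; `ranks_per_node` is a hypothetical helper name, not part of MVAPICH2, and it assumes one host name per line.

```shell
# Count how many ranks a hostfile assigns to each node.
# ranks_per_node is a hypothetical helper, not an MVAPICH2 command.
ranks_per_node() {
    # Drop blank lines, group identical host names, then print "host count".
    grep -v '^[[:space:]]*$' "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Called as `ranks_per_node ./host` on the hostfile above, it prints one line per node (`cn01 2` and `cn02 2`).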

I take a checkpoint with "mv2_checkpoint", and the checkpoint succeeds, like this:

NAS Parallel Benchmarks 3.3 -- BT Benchmark

 

 No input file inputbt.data. Using compiled defaults

 Size:   64x  64x  64

 Iterations:  200    dt:   0.0008000

 Number of active processes:     4

 

 Time step    1

 Time step   20

 Time step   40

 Time step   60

 Time step   80

[3]:  CR completed...

[0]:  CR completed...

 Time step  100

 Time step  120

 Time step  140

 Time step  160

 Time step  180

 Time step  200



When I want to restart the program, I use "cr_restart context.5826".

But some errors occur, like this:

 

[cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or directory (2)

[cn02:mpispawn_1][restart_mpi_process] execvp() failed: No such file or directory (2)

[cn02:mpispawn_1][child_handler] MPI process (rank: 3, pid: 2587) exited with status 1

[cn02:mpispawn_1][child_handler] MPI process (rank: 2, pid: 2586) exited with status 1

open: No such file or directory

Rank 0 cannot open /tmp/cr.session.2692250545826

open: No such file or directory

Rank 1 cannot open /tmp/cr.session.2692250545826

CR_Callback: Rank[1] CR_FTB_Init() Failed

[cli_1]: connect failed with connection refused

[cli_1]: Unable to connect to cn01 on 34880

[Rank 1][cr.c:918] PMI_Init failed
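
Since the messages above complain about files that cannot be found, one thing that may be worth checking is whether the checkpoint files are present on both nodes, in case /tmp is node-local. The following is only a diagnostic sketch, not a confirmed cause or fix. It assumes password-less ssh to each host and that the checkpoint file names start with the MV2_CKPT_FILE value; `list_ckpt_files` and `REMOTE_SH` are hypothetical names introduced here.

```shell
# List files matching a checkpoint prefix on every distinct host in a hostfile.
# list_ckpt_files and REMOTE_SH are hypothetical; REMOTE_SH defaults to ssh (an assumption).
list_ckpt_files() {
    hostfile=$1
    prefix=$2
    # sort -u visits each node once, even if it is listed several times.
    for host in $(sort -u "$hostfile"); do
        echo "== $host =="
        # ls exits non-zero when nothing matches the prefix on that host.
        ${REMOTE_SH:-ssh} "$host" "ls -l ${prefix}*" || echo "nothing matching ${prefix}* on $host"
    done
}
```

For the run above this would be called as `list_ckpt_files ./host /tmp/app.ckpt`.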
