[mvapich-discuss] BLCR+FTB+MVAPICH2: problems with migration between two nodes

hljgqz 15776869853 at 163.com
Wed Apr 1 22:48:52 EDT 2015


 To whom it may concern,
        I have two problems when using BLCR+FTB+MVAPICH2. I can restart an MPI job on the original set of nodes,
but I cannot restart it on a different set of nodes.
1) I can't restart the MPI job on a different node, even though I did as the userguide says.
I have two nodes, named node1 and node3.
First, I run "mpirun_rsh -np 2 -hostfile hosts ./prog" on node3; the hosts file consists of node1:4.
Then I run cr_checkpoint -p <pid>, and I can restart the job using the original hosts file.
But if I change hosts (the hostfile) to node3:4, as the userguide says, and restart the job, it fails with this error:
[node3:mpispawn_0][child_handler] MPI process (rank: 1, pid: 5174) terminated with signal 4 -> abort job
[node3:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node node3 aborted: MPI process error (1)
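For what it is worth, signal 4 is SIGILL, so the restarted rank seems to hit an illegal instruction on node3. The exact sequence I am running is roughly the following (a sketch; <pid> is the pid of the mpirun_rsh process, and context.<pid> is BLCR's default checkpoint file name, which I have not changed):

    # on node3: start the job; the hosts file contains "node1:4"
    $ mpirun_rsh -np 2 -hostfile hosts ./prog

    # checkpoint the whole job through the mpirun_rsh console process
    $ cr_checkpoint -p <pid>

    # edit hosts so it contains "node3:4", then restart from the context file
    $ cr_restart context.<pid>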
2) I tried "mpirun_rsh -np 4 -hostfile ./hosts -sparehosts ./spare_hosts ./prog", but it doesn't work,
and it reports these messages:
[root at node3 node0]# mpirun_rsh -np 4 -hostfile ./hosts -sparehosts ./spare_hosts ./hello
[root at node3 node0]# [FTB_WARNING][ftb_agent.c: line 46]FTBM_Wait failed 1
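For reference, my understanding from the FTB and MVAPICH2 documentation is that the migration framework needs one FTB database server plus an FTB agent on every node, started before the job. A rough sketch of what I run (the sbin paths come from my FTB install prefix, so they may differ on other setups):

    # on one node (node3 here): start the FTB database server
    $ $FTB_HOME/sbin/ftb_database_server &

    # on every compute node, spares included: start an FTB agent
    $ $FTB_HOME/sbin/ftb_agent &

    # then launch with spare hosts for migration
    $ mpirun_rsh -np 4 -hostfile ./hosts -sparehosts ./spare_hosts ./prog

The FTBM_Wait warning above makes me suspect the agent cannot reach the database server, but I am not sure.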
My configuration:
[root at node3 node0]# mpiname -a
MVAPICH2 2.1rc2 Thu Mar 12 20:00:00 EDT 2014 ch3:mrail
Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/home/node0/16/blcr/lib -L/lib   -O2
FC: gfortran   -O2
Configuration
--enable-ckpt-migration --with-blcr=/home/node0/16/blcr --without-hwloc
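In case it helps, this is the sanity check I run on both nodes before checkpointing (a sketch; blcr and blcr_imports are the standard BLCR kernel module names, and the library path is the one my build was configured with):

    # verify the BLCR kernel modules are loaded on node1 and on node3
    $ lsmod | grep blcr

    # verify the BLCR install used at configure time exists on both nodes
    $ ls /home/node0/16/blcr/lib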

Hoping to hear back!
Sincerely yours,
Gong Qingze