[mvapich-discuss] BLCR+FTB+MVAPICH2:problems about migration between two nodes

Jian Lin lin.2180 at osu.edu
Thu Apr 9 09:54:58 EDT 2015


The user has solved the problem, so we will close this issue on
discuss. 

Begin forwarded message:

Date: Thu, 9 Apr 2015 19:06:13 +0800
From: hljgqz <15776869853 at 163.com>
To: Jian Lin <lin.2180 at osu.edu>
Subject: Re:Re: [mvapich-discuss] BLCR+FTB+MVAPICH2:problems about
migration between two nodes

Hi Lin,
    Thank you for your help again! I have disabled prelinking, but I
still could not make migration work. However, I found that I had not
set up NFS, Lustre, or PVFS to share checkpoint files between the
original nodes and the target nodes. Finally, I set up NFS in my
cluster to share the checkpoint files between nodes, and now I can
restart MPI jobs on different nodes.
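
For anyone who hits the same problem, my setup was roughly along these
lines (the server name and paths below are placeholders rather than my
exact configuration, and I believe MV2_CKPT_FILE is the MVAPICH2
parameter that controls where the checkpoint files are written):

    # on the NFS server (here "node3"): export a shared checkpoint directory
    mkdir -p /export/ckpt
    echo '/export/ckpt *(rw,sync,no_root_squash)' >> /etc/exports
    exportfs -ra

    # on every compute node: mount it at the same path
    mkdir -p /export/ckpt
    mount -t nfs node3:/export/ckpt /export/ckpt

    # launch the job with checkpoints written to the shared directory
    mpirun_rsh -np 2 -hostfile hosts MV2_CKPT_FILE=/export/ckpt/ckpt ./prog
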
Sincerely yours,
Gong Qingze


On Fri, 3 Apr 2015 10:16:39 -0400
Jian Lin <lin.2180 at osu.edu> wrote:

> Hi, Qingze,
> 
> The problems are likely related to the prelinking feature of your OS.
> Prelinking must be disabled to allow restarting a job on other nodes.
> Please refer to the following note in the BLCR FAQ, modify your OS
> settings, and try again.
> <https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink>
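> 
> For example, on a RHEL/CentOS-style system the change is roughly as
> follows (file locations may differ on other distributions):
> 
>     # stop prelink from running in the future
>     sed -i 's/^PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink
>     # undo the prelinking already applied to installed binaries and libraries
>     /usr/sbin/prelink --undo --all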
> 
> Thanks!
> 
> 
> On Thu, 2 Apr 2015 10:48:52 +0800
> hljgqz <15776869853 at 163.com> wrote:
> 
> >  To whom it may concern,
> >         I have two problems when using BLCR+FTB+MVAPICH2. I can
> > restart an MPI job on the original nodes, but I cannot restart it
> > on a different set of nodes.
> >
> > 1) I cannot restart the MPI job on a different node. I did as the
> > user guide said. I have two nodes, named node1 and node3. First, I
> > ran "mpirun_rsh -np 2 -hostfile hosts ./proc" on node3; the hosts
> > file consists of node1:4. Then I ran cr_checkpoint -p <pid>, and I
> > can restart the job using the original hosts file. But if I change
> > the hosts file to node3:4 as the user guide says and restart the
> > job, it does not work and reports this error:
> >
> > [node3:mpispawn_0][child_handler] MPI process (rank: 1, pid: 5174)
> > terminated with signal 4 -> abort job
> > [node3:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> > node3 aborted: MPI process error (1)
> >
> > 2) I tried "mpirun_rsh -np 4 -hostfile ./hosts -sparehosts
> > ./spare_hosts ./prog", but it does not work and gives back these
> > messages:
> >
> > [root at node3 node0]# mpirun_rsh -np 4 -hostfile ./hosts -sparehosts ./spare_hosts ./hello
> > [root at node3 node0]# [FTB_WARNING][ftb_agent.c: line 46]FTBM_Wait failed 1
> >
> > My configuration:
> >
> > [root at node3 node0]# mpiname -a
> > MVAPICH2 2.1rc2 Thu Mar 12 20:00:00 EDT 2014 ch3:mrail
> > Compilation
> > CC: gcc    -DNDEBUG -DNVALGRIND -O2
> > CXX: g++   -DNDEBUG -DNVALGRIND -O2
> > F77: gfortran -L/home/node0/16/blcr/lib -L/lib   -O2
> > FC: gfortran -O2
> > Configuration
> > --enable-ckpt-migration --with-blcr=/home/node0/16/blcr
> > --without-hwloc
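> >
> > To summarize, the sequence for the first case was roughly the
> > following (context.<pid> is the default output file name of
> > cr_checkpoint, as far as I understand BLCR):
> >
> >     mpirun_rsh -np 2 -hostfile hosts ./proc   # hosts contains node1:4
> >     cr_checkpoint -p <pid>                    # writes context.<pid>
> >     # edit hosts so that it contains node3:4, then try:
> >     cr_restart context.<pid>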
> > 
> > Hope to hear back from you!
> > Sincerely yours,
> > Gong Qingze
> > 
> 
> 
> 




-- 
Jian Lin
http://linjian.org


