[mvapich-discuss] BLCR+MAVPICH2

Jian Lin lin.2180 at osu.edu
Wed Mar 11 15:53:55 EDT 2015


Hi, 

Thanks for your note.

Are you using the cpi program that comes with MPICH, without any
modification? This program runs very quickly, so there may not be enough
time to capture a snapshot. When you try to checkpoint a job that has
already completed, cr_checkpoint prints errors like the ones you posted.
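
To rule out that race, one option is to confirm the launcher is still alive
right before requesting the checkpoint. The following is only a minimal
sketch: the function name checkpoint_if_running is hypothetical (it is not
part of BLCR or MVAPICH2), and the only BLCR call used is the
"cr_checkpoint -p <pid>" form from your report.

```shell
#!/bin/sh
# Hypothetical helper: request a checkpoint only while the target is alive.
# checkpoint_if_running is an illustrative name, not a BLCR or MVAPICH2 tool.
checkpoint_if_running() {
    pid="$1"
    # kill -0 delivers no signal; it only tests that the PID exists
    # and that we are allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is running; requesting checkpoint"
        if command -v cr_checkpoint >/dev/null 2>&1; then
            # Same invocation as in the original report.
            cr_checkpoint -p "$pid"
        else
            echo "cr_checkpoint not found in PATH; skipping" >&2
            return 2
        fi
    else
        echo "process $pid has already exited; nothing to checkpoint" >&2
        return 1
    fi
}
```

For example, "checkpoint_if_running 3366" would refuse to call cr_checkpoint
if PID 3366 had already finished. The job can still exit between the check
and the checkpoint, so a longer-running test program remains the more
reliable way to reproduce the problem.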

Besides the output of "mpiname -a" and the output with
"MV2_DEBUG_FT_VERBOSE=1", can you please also provide the last few lines
of the dmesg output after the error occurs? That will help us
understand what is happening.
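
If it is convenient, the version and kernel-log information can be gathered
into one file with something like the sketch below. The function name
collect_ckpt_diag, the default file name, and the choice of 20 lines are all
assumptions made for illustration. The verbose run of the job itself still
has to be captured separately.

```shell
#!/bin/sh
# Hypothetical helper: write "mpiname -a" and the tail of dmesg to one file.
# collect_ckpt_diag and ckpt_diag.txt are illustrative names only.
#
# The verbose run is separate; mpirun_rsh takes VAR=value before the
# executable, for example:
#   mpirun_rsh -np 4 -hostfile hosts MV2_DEBUG_FT_VERBOSE=1 ./cpi
collect_ckpt_diag() {
    out="${1:-ckpt_diag.txt}"
    {
        echo "== mpiname -a =="
        if command -v mpiname >/dev/null 2>&1; then
            mpiname -a 2>&1
        else
            echo "mpiname not found in PATH"
        fi
        echo "== dmesg (last 20 lines) =="
        # stderr is folded into the pipe so that a permission error from
        # dmesg is recorded in the file instead of being lost.
        dmesg 2>&1 | tail -n 20
    } > "$out"
    # Print the path so the caller knows which file to attach.
    echo "$out"
}
```

Running "collect_ckpt_diag" right after the failed cr_checkpoint should keep
the relevant kernel messages within the last lines of dmesg.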

On Wed, 11 Mar 2015 14:55:09 +0000
Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:

> Hi, thanks for your note.
> 
> Can you provide us with the output of mpiname -a?  Also, can you rerun
> the job(s) with MV2_DEBUG_FT_VERBOSE set to 1?
> 
> On Wed, Mar 11, 2015 at 8:35 AM hljgqz <15776869853 at 163.com> wrote:
> 
> > Dear all,
> >     I have a problem using Checkpoint/Restart with mvapich2-2.1.
> > My cluster nodes run CentOS 6.6 x86_64 with Mellanox InfiniBand. BLCR
> > is installed correctly, and I can use it to checkpoint ordinary
> > programs. I configured mvapich2 with ./configure --enable-ckpt
> > --disable-shared.
> >    However, I can't checkpoint when I use mpirun_rsh -np 4 -hostfile
> > hosts ./cpi (or another MPI program such as lu.A.4). When the program
> > finished, this appeared:
> > [root at node3 node0]# cr_checkpoint -p 3366
> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3367) exited with
> > code 0 during checkpoint
> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3371) exited with
> > code 0 during checkpoint
> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3366) exited with
> > code 1 during checkpoint
> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3368) exited with
> > code 1 during checkpoint
> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3369) exited with
> > code 1 during checkpoint
> > Checkpoint failed: no processes checkpointed
> >
> >  And if I use mpiexec -n 4 ./cpi, I can run cr_checkpoint to get a
> > context file, but I can't restart. This is the output:
> > [root at node3 node0]# cr_restart context.3436
> > [mpiexec at node3] HYDT_dmxu_poll_wait_for_event
> > (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents &
> > ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
> > [mpiexec at node3] HYD_pmci_wait_for_completion
> > (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> > [mpiexec at node3] main (ui/mpich/mpiexec.c:344): process manager error
> > waiting for completion
> >



-- 
Jian Lin
http://linjian.org


