[mvapich-discuss] BLCR+MVAPICH2

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Mar 11 10:55:09 EDT 2015


Hi, thanks for your note.

Can you send us the output of mpiname -a?  Also, can you rerun the job(s)
with MV2_DEBUG_FT_VERBOSE set to 1?
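
For illustration, a minimal sketch of how this information could be
collected. It assumes the hostfile name (hosts), process count, and binary
(./cpi) from the quoted report below; the log file names are hypothetical.
With mpirun_rsh, environment variables are given as NAME=VALUE pairs before
the executable.

```shell
#!/bin/sh
# Sketch only: gather the build information and a verbose fault-tolerance log.
# NP, HOSTFILE and APP mirror the quoted report; adjust them for your cluster.
NP=4
HOSTFILE=hosts
APP=./cpi

# mpirun_rsh takes NAME=VALUE settings between its options and the executable.
RUN_CMD="mpirun_rsh -np $NP -hostfile $HOSTFILE MV2_DEBUG_FT_VERBOSE=1 $APP"

# 1) Record how this MVAPICH2 build was configured (hypothetical log name).
if command -v mpiname >/dev/null 2>&1; then
    mpiname -a > mpiname.log 2>&1
else
    echo "mpiname not found in PATH" >&2
fi

# 2) Rerun the job with verbose checkpoint/restart debugging enabled.
if command -v mpirun_rsh >/dev/null 2>&1; then
    $RUN_CMD 2>&1 | tee ft_verbose.log
else
    # Dry run when the launcher is not installed on this machine.
    echo "would run: $RUN_CMD"
fi
```

While the job is running, cr_checkpoint -p <PID of mpirun_rsh> can be issued
from a second terminal, as in the report, and ft_verbose.log attached to the
reply together with mpiname.log.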

On Wed, Mar 11, 2015 at 8:35 AM hljgqz <15776869853 at 163.com> wrote:

> Dear all,
>     I have a problem using Checkpoint/Restart with mvapich2-2.1. My
> cluster nodes run CentOS 6.6 x86_64 with Mellanox InfiniBand. BLCR is
> installed correctly, and I can use it to checkpoint ordinary programs.
> I configured mvapich2 with ./configure --enable-ckpt
> --disable-shared.
>    However, I can't checkpoint when I use mpirun_rsh -np 4 -hostfile hosts
> ./cpi (or another MPI program such as lu.A.4). When the program finished,
> I got the following:
> [root@node3 node0]# cr_checkpoint -p 3366
> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3367) exited with code 0
> during checkpoint
> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3371) exited with code 0
> during checkpoint
> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3366) exited with code 1
> during checkpoint
> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3368) exited with code 1
> during checkpoint
> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3369) exited with code 1
> during checkpoint
> Checkpoint failed: no processes checkpointed
>
>  Also, if I use mpiexec -n 4 ./cpi, I can run cr_checkpoint to get a
> context file, but I can't restart from it. This is the output:
> [root@node3 node0]# cr_restart context.3436
> [mpiexec@node3] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN &
> ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
> [mpiexec@node3] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> [mpiexec@node3] main (ui/mpich/mpiexec.c:344): process manager error
> waiting for completion
>