[mvapich-discuss] BLCR+MVAPICH2

hljgqz 15776869853 at 163.com
Wed Mar 11 08:34:55 EDT 2015


Dear all,
    I have a problem using Checkpoint/Restart with mvapich2-2.1. My cluster nodes run CentOS 6.6 x86_64 with Mellanox InfiniBand. BLCR is installed correctly, and I can use it to checkpoint ordinary programs.
I configured mvapich2 with ./configure --enable-ckpt --disable-shared .
    However, I can't take a checkpoint when I launch with mpirun_rsh -np 4 -hostfile hosts ./cpi (or another MPI program such as lu.A.4). When the program finishes, I get this output:
[root at node3 node0]# cr_checkpoint -p 3366
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3367) exited with code 0 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3371) exited with code 0 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3366) exited with code 1 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3368) exited with code 1 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3369) exited with code 1 during checkpoint
Checkpoint failed: no processes checkpointed

    Also, if I launch with mpiexec -n 4 ./cpi instead, I can run cr_checkpoint and get a context file, but I can't restart from it. This is the output:
[root at node3 node0]# cr_restart context.3436
[mpiexec at node3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[mpiexec at node3] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at node3] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion




