[mvapich-discuss] BLCR+MVAPICH2

hljgqz 15776869853 at 163.com
Wed Mar 11 22:54:49 EDT 2015


Thank you for your help, but I haven't solved the problem. Here are more details:
[root@node3 mvapich2-2.1rc1]# mpiname -a
MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail

Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -O2
FC: gfortran -O2

Configuration
--enable-ckpt --disable-shared
Then I ran the NPB3.3-MPI program bt.A.4:
[root@node3 NPB3.3-MPI]# mpirun_rsh -np 4 -hostfile hosts MV2_DEBUG_FT_VERBOSE=1 bin/bt.A.4
On another console I ran cr_checkpoint -p <PID> to take a checkpoint, but I could not checkpoint mpirun_rsh.
Here are the messages from the mpirun_rsh console:
[node3:mpirun_rsh][CR_Callback] Unexpected results from 0: ""
[node3:mpirun_rsh][CR_Callback] Some processes failed to checkpoint. Abort checkpoint...
If I use mpiexec to run the program, I can take a checkpoint, but I can't restart from the context file:
[root@node3 bin]# mpiexec -n 4 -f hosts ./bt.A.4
[root@node3 mvapich2-2.1rc1]# cr_restart context.465
[mpiexec@node3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[mpiexec@node3] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@node3] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

At 2015-03-11 22:55:09, "Jonathan Perkins" <perkinjo at cse.ohio-state.edu> wrote:

Hi, thanks for your note.

Can you provide us the output of mpiname -a? Also, can you rerun the job(s) with MV2_DEBUG_FT_VERBOSE set to 1?

On Wed, Mar 11, 2015 at 8:35 AM hljgqz <15776869853 at 163.com> wrote:

Dear all,
    I have a problem using Checkpoint/Restart with mvapich2-2.1. My cluster nodes run CentOS 6.6 x86_64 with Mellanox InfiniBand. BLCR is installed correctly, and I can use it to checkpoint ordinary programs.
I configured mvapich2 with ./configure --enable-ckpt --disable-shared.
    However, I can't checkpoint when I use mpirun_rsh -np 4 -hostfile hosts ./cpi (or another MPI program such as lu.A.4). When the program finished, this appeared:
[root@node3 node0]# cr_checkpoint -p 3366
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3367) exited with code 0 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3371) exited with code 0 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3366) exited with code 1 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3368) exited with code 1 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3369) exited with code 1 during checkpoint
Checkpoint failed: no processes checkpointed
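
To make the watchdog lines above easier to read, they can be reduced to tgid, pid, and exit code. This is only a minimal sketch, assuming the line format is exactly as shown in the output above; watchdog_summary is a hypothetical helper name and is not part of BLCR or MVAPICH2.

```shell
# Sketch: reduce BLCR chkpt_watchdog lines to "tgid pid exit_code".
# watchdog_summary is a hypothetical helper, not a BLCR or MVAPICH2 tool.
# It reads log text on stdin and ignores lines that do not match.
watchdog_summary() {
    sed -n "s|.*chkpt_watchdog: '[^']*' (tgid/pid \([0-9]*\)/\([0-9]*\)) exited with code \([0-9]*\) during checkpoint.*|\1 \2 \3|p"
}

# Usage with two of the lines from the output above.
watchdog_summary <<'EOF'
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3367) exited with code 0 during checkpoint
- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3366) exited with code 1 during checkpoint
Checkpoint failed: no processes checkpointed
EOF
```

In the output above, the thread whose pid equals the tgid (3366) is among those that exited with code 1.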

And if I use mpiexec -n 4 ./cpi, I can run cr_checkpoint to get a context file, but I can't restart from it. This is the output:
[root@node3 node0]# cr_restart context.3436
[mpiexec@node3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[mpiexec@node3] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@node3] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion