[mvapich-discuss] BLCR checkpoint support

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Mon Aug 10 11:20:33 EDT 2015


Hi Maksym,

Thanks for the note. We are investigating the issue. Can you please let us
know the version and configuration of MVAPICH2 you are using? The output
from mpiname -a would be helpful.
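
For example, assuming the ~/opt install prefix from your mail:

$ ~/opt/bin/mpiname -a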

Thanks,
Sourav Chakraborty


On Mon, Aug 10, 2015 at 5:11 AM, Maksym Planeta <mplaneta at os.inf.tu-dresden.de> wrote:

> Hello,
>
> I'm trying to find out whether BLCR still works with MVAPICH2. I have
> installed Debian on my 4-core machine. To test how checkpoints work, I
> compiled the LU benchmark from the NAS Parallel Benchmarks (NPB) suite.
> Unfortunately, the application always fails while taking checkpoints. I see
> only the first checkpoint created, and I did not manage to restart the
> application from that checkpoint.
>
> Could you please confirm that the checkpoint/restart mechanism is still
> supposed to work, given that the latest release of BLCR is from 2013?
>
> And if so, could you please tell me what I am doing wrong?
>
> The details:
>
> # uname -a
> Linux planeta7 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux
>
> BLCR version: blcr-0.8.6~b3 (the version available in the Debian
> experimental repository)
>
> NPB application: LU, class C, nprocs 4
>
> MVAPICH2 is compiled from source and installed in the ~/opt directory.
>
> How I start the application:
>
> $ MV2_CKPT_NO_SYNC=1 ~/opt/bin/mpiexec -np 4 -verbose -ckpoint-interval 120 -ckpoint-prefix /tmp/chkpt/ ./bin/lu.C.4
>
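> For completeness, the restart invocation I would expect to work (I am not
> sure this is correct; I am assuming Hydra's -ckpoint-num option is the way
> to select which checkpoint to restart from, here the first one):
>
> $ ~/opt/bin/mpiexec -np 4 -ckpoint-prefix /tmp/chkpt/ -ckpoint-num 1 ./bin/lu.C.4
>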
> How it fails (note that after the first checkpoint completes, the time
> steps no longer advance):
>
>  Time step   80
> [proxy:0:0 at planeta7] requesting checkpoint
> [proxy:0:0 at planeta7] checkpoint completed
> [proxy:0:0 at planeta7] requesting checkpoint
> [proxy:0:0 at planeta7] checkpoint completed
> [proxy:0:0 at planeta7] requesting checkpoint
> [proxy:0:0 at planeta7] HYDT_ckpoint_checkpoint (tools/ckpoint/ckpoint.c:115): Previous checkpoint has not completed.
> [proxy:0:0 at planeta7] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:931): checkpoint suspend failed
> [proxy:0:0 at planeta7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at planeta7] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec at planeta7] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
> [mpiexec at planeta7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [mpiexec at planeta7] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> [mpiexec at planeta7] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
> CPU model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (4 cores)
>
> dmesg output:
>
> # dmesg -c
> [425658.981022] blcr: warning: skipped a socket.
> [425658.981031] blcr: warning: skipped a socket.
> [425658.981036] blcr: warning: skipped a socket.
> [425658.981063] blcr: warning: skipped a socket.
> [425660.135281] blcr: warning: skipped a socket.
> [425661.269592] blcr: warning: skipped a socket.
> [425662.698799] blcr: warning: skipped a socket.
> [425896.865836] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2589) exited with signal 9 during checkpoint
> [425896.865839] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2599) exited with signal 9 during checkpoint
> [425896.865841] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2591) exited with signal 9 during checkpoint
> [425896.865842] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2598) exited with signal 9 during checkpoint
> [425896.881038] blcr: warning: skipped a socket.
> [425896.881042] blcr: warning: skipped a socket.
> [425896.881043] blcr: warning: skipped a socket.
> [425896.881051] blcr: warning: skipped a socket.
> [425896.881149] blcr: cr_freeze_threads failed (-4)
> [425898.414107] blcr: warning: skipped a socket.
>
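> To help rule out BLCR itself, I can also try checkpointing a plain non-MPI
> process with BLCR's own tools (some_program below is just a placeholder for
> any long-running test program):
>
> $ cr_run ./some_program &
> $ cr_checkpoint $!
> $ cr_restart context.<pid>
>
> If that fails as well, the problem would seem to lie in BLCR or the kernel
> rather than in MVAPICH2.
>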
> Complete log of mpiexec:
>
> http://paste.debian.net/290916/
>
> I also tried running the application with mpiexec.mpirun_rsh, but the
> behavior was very similar. If you think the mpiexec.mpirun_rsh results are
> worth showing, please let me know.
>
> --
> With best regards,
> Maksym Planeta.