[mvapich-discuss] BLCR checkpoint support

Maksym Planeta mplaneta at os.inf.tu-dresden.de
Mon Aug 10 11:53:06 EDT 2015


Thank you for your reply. The output of mpiname is as follows:
$ ~/opt/bin/mpiname -a
MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib   -O2
FC: gfortran   -O2

Configuration
--prefix=/home/planeta/opt --enable-fortran=all --enable-ckpt

I have been thinking about this issue and have the following thoughts. I run
this benchmark on a single machine with a single Ethernet network card. Thus,
the ranks most likely use sockets to communicate with each other.
The BLCR manual says that it does not record socket state and that the
application itself should restore socket-based connections. I very much
doubt that the NAS benchmarks try to do this (they seem to be unaware of
BLCR). This means that when a rank's process is cloned, its socket
connection is broken, and hence the computation cannot continue. Could this
be the case?

And if so, could you suggest an application or, preferably, a
benchmark that works with either BLCR or some other
checkpoint/restart or migration framework?
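
One way to check the socket hypothesis above is to look at which file
descriptors of a running rank are sockets. The following is only a minimal
sketch, assuming Linux with /proc mounted; the helper name socket_fds is
made up for illustration and is not part of BLCR or MVAPICH2. It only
lists socket descriptors; it does not tell what they are used for.

```python
import os
import sys


def socket_fds(pid):
    """Return {fd: link target} for descriptors of `pid` that are sockets.

    Assumption: Linux /proc layout, where a socket descriptor shows up in
    /proc/<pid>/fd as a symlink whose target looks like 'socket:[12345]'.
    """
    fd_dir = "/proc/%d/fd" % pid
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:["):
            found[int(name)] = target
    return found


if __name__ == "__main__":
    # Pass the pid of one rank (e.g. an lu.C.4 process); defaults to self.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for fd, target in sorted(socket_fds(pid).items()):
        print(fd, target)
```

Running it against the pid of one lu.C.4 process while the benchmark is
running should show whether that rank holds any open sockets at the time
of the checkpoint.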


On 08/10/2015 05:20 PM, Sourav Chakraborty wrote:
> Hi Maksym,
>
> Thanks for the note. We are investigating the issue. Can you please let
> us know the version and configuration of MVAPICH2 you are using? The
> output from mpiname -a would be helpful.
>
> Thanks,
> Sourav Chakraborty
>
>
> On Mon, Aug 10, 2015 at 5:11 AM, Maksym Planeta
> <mplaneta at os.inf.tu-dresden.de>
> wrote:
>
>     Hello,
>
>     I'm trying to find out whether BLCR still works with MVAPICH2. I
>     have installed Debian on my 4-core machine. To test how checkpoints
>     work, I compiled the LU benchmark from the NAS Parallel Benchmarks
>     suite. Unfortunately, the application always fails while taking
>     checkpoints. I see only the first checkpoint created, and I didn't
>     manage to restart the application from that checkpoint.
>
>     Could you please confirm that the checkpoint/restart mechanism
>     is still supposed to work, given that the latest release of BLCR is
>     from 2013?
>
>     And if so, could you please tell me what I am doing wrong?
>
>     The details:
>
>     # uname -a
>     Linux planeta7 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2
>     (2015-07-17) x86_64 GNU/Linux
>
>     blcr version: blcr-0.8.6~b3 (the version available in the Debian
>     experimental repository)
>
>     NPB application: lu class C nprocs 4
>
>     MVAPICH2 is compiled from source and installed in the ~/opt directory
>
>     How I start the application:
>
>     $ MV2_CKPT_NO_SYNC=1 ~/opt/bin/mpiexec -np 4  -verbose
>     -ckpoint-interval 120 -ckpoint-prefix /tmp/chkpt/ ./bin/lu.C.4
>
>     How it fails (note that after the first checkpoint completes, the
>     time steps do not advance):
>
>       Time step   80
>     [proxy:0:0 at planeta7] requesting checkpoint
>     [proxy:0:0 at planeta7] checkpoint completed
>     [proxy:0:0 at planeta7] requesting checkpoint
>     [proxy:0:0 at planeta7] checkpoint completed
>     [proxy:0:0 at planeta7] requesting checkpoint
>     [proxy:0:0 at planeta7] HYDT_ckpoint_checkpoint
>     (tools/ckpoint/ckpoint.c:115): Previous checkpoint has not
>     completed.[proxy:0:0 at planeta7] HYD_pmcd_pmip_control_cmd_cb
>     (pm/pmiserv/pmip_cb.c:931): checkpoint suspend failed
>     [proxy:0:0 at planeta7] HYDT_dmxu_poll_wait_for_event
>     (tools/demux/demux_poll.c:76): callback returned error status
>     [proxy:0:0 at planeta7] main (pm/pmiserv/pmip.c:206): demux engine
>     error waiting for event
>     [mpiexec at planeta7] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert
>     (!closed) failed
>     [mpiexec at planeta7] HYDT_dmxu_poll_wait_for_event
>     (tools/demux/demux_poll.c:76): callback returned error status
>     [mpiexec at planeta7] HYD_pmci_wait_for_completion
>     (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
>     [mpiexec at planeta7] main (ui/mpich/mpiexec.c:344): process manager
>     error waiting for completion
>
>     CPU model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>     with 4 CPU cores
>
>     dmesg output:
>
>     # dmesg -c
>     [425658.981022] blcr: warning: skipped a socket.
>     [425658.981031] blcr: warning: skipped a socket.
>     [425658.981036] blcr: warning: skipped a socket.
>     [425658.981063] blcr: warning: skipped a socket.
>     [425660.135281] blcr: warning: skipped a socket.
>     [425661.269592] blcr: warning: skipped a socket.
>     [425662.698799] blcr: warning: skipped a socket.
>     [425896.865836] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2589)
>     exited with signal 9 during checkpoint
>     [425896.865839] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2599)
>     exited with signal 9 during checkpoint
>     [425896.865841] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2591)
>     exited with signal 9 during checkpoint
>     [425896.865842] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2598)
>     exited with signal 9 during checkpoint
>     [425896.881038] blcr: warning: skipped a socket.
>     [425896.881042] blcr: warning: skipped a socket.
>     [425896.881043] blcr: warning: skipped a socket.
>     [425896.881051] blcr: warning: skipped a socket.
>     [425896.881149] blcr: cr_freeze_threads failed (-4)
>     [425898.414107] blcr: warning: skipped a socket.
>
>     Complete log of mpiexec:
>
>     http://paste.debian.net/290916/
>
>     I also tried to run the application with mpiexec.mpirun_rsh, but the
>     behavior there was very similar. If you think it is worth showing
>     the results of mpiexec.mpirun_rsh, please tell me.
>
>     --
>     With best regards,
>     Maksym Planeta.
>
