[mvapich-discuss] BLCR checkpoint support

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Mon Aug 10 12:45:57 EDT 2015


Hi Maksym,

Thanks for providing the mpiname output.

Regarding your query, by default MVAPICH2 uses high-performance
shared-memory channels for intra-node communication instead of TCP/IP
sockets. MVAPICH2 also takes care of re-establishing these intra-node
communication channels automatically after a restart. We are investigating
the issue and will get back to you soon.
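
If you would like to double-check which intra-node channel is selected at
run time, setting the MV2_SHOW_ENV_INFO run-time parameter should print the
relevant parameters, including whether shared memory is used for intra-node
traffic (the exact output differs between builds, so please treat this only
as a sanity check):

$ MV2_SHOW_ENV_INFO=1 ~/opt/bin/mpiexec -np 4 ./bin/lu.C.4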

Thanks,
Sourav


On Mon, Aug 10, 2015 at 8:53 AM, Maksym Planeta <
mplaneta at os.inf.tu-dresden.de> wrote:

> Thank you for your reply; the output of mpiname is as follows:
> $ ~/opt/bin/mpiname -a
> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>
> Compilation
> CC: gcc    -DNDEBUG -DNVALGRIND -O2
> CXX: g++   -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/lib -L/lib   -O2
> FC: gfortran   -O2
>
> Configuration
> --prefix=/home/planeta/opt --enable-fortran=all --enable-ckpt
>
> I have been thinking about this issue and have the following hypothesis. I
> run this benchmark on a single machine whose only network interface is an
> Ethernet card, so the different ranks most likely use sockets to
> communicate with each other. The BLCR manual says that it does not record
> socket state and that the application itself must restore socket-based
> connections. I very much doubt that the NAS benchmarks attempt this (they
> seem to be unaware of BLCR). This would mean that when a rank process is
> cloned, its socket connections are broken and the computation cannot
> continue. Could this be the case?
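>
> If it helps, a quick way to check whether the ranks really hold TCP
> sockets to each other would be something like the following while the
> benchmark is running (the pgrep pattern is only illustrative):
>
> $ lsof -p "$(pgrep -d, -f lu.C.4)" | grep -i tcp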
>
> And if so, could you suggest an application or, preferably, a benchmark
> that is known to work with BLCR or with some other checkpoint/restart or
> migration framework?
>
>
> On 08/10/2015 05:20 PM, Sourav Chakraborty wrote:
>
>> Hi Maksym,
>>
>> Thanks for the note. We are investigating the issue. Can you please let
>> us know the version and configuration of MVAPICH2 you are using? The
>> output from mpiname -a would be helpful.
>>
>> Thanks,
>> Sourav Chakraborty
>>
>>
>> On Mon, Aug 10, 2015 at 5:11 AM, Maksym Planeta
>> <mplaneta at os.inf.tu-dresden.de> wrote:
>>
>>     Hello,
>>
>>     I'm trying to find out whether BLCR still works with MVAPICH2. I
>>     have installed Debian on my 4-core machine. To test how checkpointing
>>     works I compiled the LU benchmark from the NAS Parallel Benchmarks
>>     suite. Unfortunately, the application always fails while taking
>>     checkpoints: I see only the first checkpoint being created, and I did
>>     not manage to restart the application from that checkpoint.
>>
>>     Could you please confirm that the checkpoint/restart mechanism is
>>     still supposed to work, given that the latest release of BLCR is
>>     from 2013?
>>
>>     And if so, could you please tell me what I am doing wrong?
>>
>>     The details:
>>
>>     # uname -a
>>     Linux planeta7 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2
>>     (2015-07-17) x86_64 GNU/Linux
>>
>>     BLCR version: blcr-0.8.6~b3 (the version available in the Debian
>>     experimental repository)
>>
>>     NPB application: lu class C nprocs 4
>>
>>     MVAPICH2 is compiled from source and installed in the ~/opt
>>     directory.
>>
>>     How I start the application:
>>
>>     $ MV2_CKPT_NO_SYNC=1 ~/opt/bin/mpiexec -np 4  -verbose
>>     -ckpoint-interval 120 -ckpoint-prefix /tmp/chkpt/ ./bin/lu.C.4
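>>
>>     For restart, if I read the Hydra checkpoint/restart notes correctly,
>>     the invocation should look roughly like this (the checkpoint number
>>     is illustrative, and apparently the executable is not given again on
>>     restart), which is what my attempts are based on:
>>
>>     $ ~/opt/bin/mpiexec -np 4 -ckpoint-prefix /tmp/chkpt/ -ckpoint-num 1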
>>
>>     How it fails (note that after the first checkpoint completes, the
>>     time steps no longer advance):
>>
>>       Time step   80
>>     [proxy:0:0 at planeta7] requesting checkpoint
>>     [proxy:0:0 at planeta7] checkpoint completed
>>     [proxy:0:0 at planeta7] requesting checkpoint
>>     [proxy:0:0 at planeta7] checkpoint completed
>>     [proxy:0:0 at planeta7] requesting checkpoint
>>     [proxy:0:0 at planeta7] HYDT_ckpoint_checkpoint
>>     (tools/ckpoint/ckpoint.c:115): Previous checkpoint has not
>>     completed.[proxy:0:0 at planeta7] HYD_pmcd_pmip_control_cmd_cb
>>     (pm/pmiserv/pmip_cb.c:931): checkpoint suspend failed
>>     [proxy:0:0 at planeta7] HYDT_dmxu_poll_wait_for_event
>>     (tools/demux/demux_poll.c:76): callback returned error status
>>     [proxy:0:0 at planeta7] main (pm/pmiserv/pmip.c:206): demux engine
>>     error waiting for event
>>     [mpiexec at planeta7] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert
>>     (!closed) failed
>>     [mpiexec at planeta7] HYDT_dmxu_poll_wait_for_event
>>     (tools/demux/demux_poll.c:76): callback returned error status
>>     [mpiexec at planeta7] HYD_pmci_wait_for_completion
>>     (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
>>     [mpiexec at planeta7] main (ui/mpich/mpiexec.c:344): process manager
>>     error waiting for completion
>>
>>     CPU model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (4 cores)
>>
>>     dmesg output:
>>
>>     # dmesg -c
>>     [425658.981022] blcr: warning: skipped a socket.
>>     [425658.981031] blcr: warning: skipped a socket.
>>     [425658.981036] blcr: warning: skipped a socket.
>>     [425658.981063] blcr: warning: skipped a socket.
>>     [425660.135281] blcr: warning: skipped a socket.
>>     [425661.269592] blcr: warning: skipped a socket.
>>     [425662.698799] blcr: warning: skipped a socket.
>>     [425896.865836] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2589)
>>     exited with signal 9 during checkpoint
>>     [425896.865839] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2599)
>>     exited with signal 9 during checkpoint
>>     [425896.865841] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2591)
>>     exited with signal 9 during checkpoint
>>     [425896.865842] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2598)
>>     exited with signal 9 during checkpoint
>>     [425896.881038] blcr: warning: skipped a socket.
>>     [425896.881042] blcr: warning: skipped a socket.
>>     [425896.881043] blcr: warning: skipped a socket.
>>     [425896.881051] blcr: warning: skipped a socket.
>>     [425896.881149] blcr: cr_freeze_threads failed (-4)
>>     [425898.414107] blcr: warning: skipped a socket.
>>
>>     Complete log of mpiexec:
>>
>>     http://paste.debian.net/290916/
>>
>>     I also tried running the application with mpiexec.mpirun_rsh, but the
>>     behavior there was very similar. Please tell me if it would be worth
>>     showing the mpiexec.mpirun_rsh results as well.
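>>
>>     For reference, the mpiexec.mpirun_rsh run was roughly along these
>>     lines (the hostfile and the interval value are only illustrative,
>>     following the MVAPICH2 user guide):
>>
>>     $ ~/opt/bin/mpiexec.mpirun_rsh -np 4 -hostfile hosts \
>>         MV2_CKPT_FILE=/tmp/chkpt/ MV2_CKPT_INTERVAL=2 ./bin/lu.C.4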
>>
>>     --
>>     With best regards,
>>     Maksym Planeta.
>>
>