[mvapich-discuss] BLCR checkpoint support
Maksym Planeta
mplaneta at os.inf.tu-dresden.de
Mon Aug 10 08:11:03 EDT 2015
Hello,
I'm trying to find out whether BLCR still works with MVAPICH2. I have
installed Debian on my 4-core machine. To test how checkpointing works, I
compiled the LU benchmark from the NAS Parallel Benchmarks suite.
Unfortunately, the application always fails while checkpointing. I
see only the first checkpoint created, and I have not managed to restart
the application from that checkpoint.
Could you please confirm that the checkpoint/restart mechanism is
still supposed to work, given that the latest release of BLCR is from 2013?
If so, could you please tell me what I am doing wrong?
The details:
# uname -a
Linux planeta7 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux
blcr version: blcr-0.8.6~b3 (the version available in the Debian
experimental repository)
NPB application: lu, class C, 4 processes
MVAPICH2 is compiled from source and installed in the ~/opt directory
How I start the application:
$ MV2_CKPT_NO_SYNC=1 ~/opt/bin/mpiexec -np 4 -verbose -ckpoint-interval 120 -ckpoint-prefix /tmp/chkpt/ ./bin/lu.C.4
How it fails (note that after the first checkpoint completes, the time
steps no longer advance):
Time step 80
[proxy:0:0 at planeta7] requesting checkpoint
[proxy:0:0 at planeta7] checkpoint completed
[proxy:0:0 at planeta7] requesting checkpoint
[proxy:0:0 at planeta7] checkpoint completed
[proxy:0:0 at planeta7] requesting checkpoint
[proxy:0:0 at planeta7] HYDT_ckpoint_checkpoint (tools/ckpoint/ckpoint.c:115): Previous checkpoint has not completed.
[proxy:0:0 at planeta7] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:931): checkpoint suspend failed
[proxy:0:0 at planeta7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at planeta7] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at planeta7] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
[mpiexec at planeta7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at planeta7] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at planeta7] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
CPU model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, with 4 CPU cores
dmesg output:
# dmesg -c
[425658.981022] blcr: warning: skipped a socket.
[425658.981031] blcr: warning: skipped a socket.
[425658.981036] blcr: warning: skipped a socket.
[425658.981063] blcr: warning: skipped a socket.
[425660.135281] blcr: warning: skipped a socket.
[425661.269592] blcr: warning: skipped a socket.
[425662.698799] blcr: warning: skipped a socket.
[425896.865836] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2589) exited with signal 9 during checkpoint
[425896.865839] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2599) exited with signal 9 during checkpoint
[425896.865841] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2591) exited with signal 9 during checkpoint
[425896.865842] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2598) exited with signal 9 during checkpoint
[425896.881038] blcr: warning: skipped a socket.
[425896.881042] blcr: warning: skipped a socket.
[425896.881043] blcr: warning: skipped a socket.
[425896.881051] blcr: warning: skipped a socket.
[425896.881149] blcr: cr_freeze_threads failed (-4)
[425898.414107] blcr: warning: skipped a socket.
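To make the timing of the dmesg messages above easier to read, here is a minimal Python sketch that groups the BLCR records into bursts by timestamp and prints the time between the start of each burst. The function name blcr_bursts and the 10-second threshold are illustrative assumptions, not part of BLCR or MVAPICH2; the sample text is copied from the dmesg output above.

```python
import re

# dmesg records copied from the report; wrapped continuation lines are kept
# to show that the parser tolerates them.
DMESG = """\
[425658.981022] blcr: warning: skipped a socket.
[425658.981031] blcr: warning: skipped a socket.
[425658.981036] blcr: warning: skipped a socket.
[425658.981063] blcr: warning: skipped a socket.
[425660.135281] blcr: warning: skipped a socket.
[425661.269592] blcr: warning: skipped a socket.
[425662.698799] blcr: warning: skipped a socket.
[425896.865836] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2589)
exited with signal 9 during checkpoint
[425896.865839] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2599)
exited with signal 9 during checkpoint
[425896.865841] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2591)
exited with signal 9 during checkpoint
[425896.865842] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2598)
exited with signal 9 during checkpoint
[425896.881038] blcr: warning: skipped a socket.
[425896.881042] blcr: warning: skipped a socket.
[425896.881043] blcr: warning: skipped a socket.
[425896.881051] blcr: warning: skipped a socket.
[425896.881149] blcr: cr_freeze_threads failed (-4)
[425898.414107] blcr: warning: skipped a socket.
"""

# Matches "[<seconds>] blcr: <message>" at the start of a line.
RECORD_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+blcr:\s+(.*)$")


def blcr_bursts(text, gap=10.0):
    """Group BLCR record timestamps into bursts.

    A new burst starts whenever a record is more than `gap` seconds
    (an arbitrary, illustrative threshold) after the previous record.
    """
    bursts = []
    for line in text.splitlines():
        match = RECORD_RE.match(line)
        if match is None:
            continue  # wrapped continuation line without a timestamp
        ts = float(match.group(1))
        if not bursts or ts - bursts[-1][-1] > gap:
            bursts.append([ts])
        else:
            bursts[-1].append(ts)
    return bursts


bursts = blcr_bursts(DMESG)
for burst in bursts:
    print(f"{burst[0]:.1f} .. {burst[-1]:.1f}: {len(burst)} records")
for prev, cur in zip(bursts, bursts[1:]):
    print(f"time between burst starts: {cur[0] - prev[0]:.1f} s")
```

The gap between bursts can then be compared with the 120-second value passed to -ckpoint-interval. Because the buffer was cleared with dmesg -c, earlier records may be missing, so this only summarizes what is shown here.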
Complete log of mpiexec:
http://paste.debian.net/290916/
I also tried to run the application with mpiexec.mpirun_rsh, but the
behavior there was very similar. If you think it would be worth showing
the results from mpiexec.mpirun_rsh, please let me know.
--
With best regards,
Maksym Planeta.