[mvapich-discuss] BLCR checkpoint support

Maksym Planeta mplaneta at os.inf.tu-dresden.de
Mon Aug 10 08:11:03 EDT 2015


Hello,

I'm trying to find out whether BLCR still works with MVAPICH2. I have
installed Debian on my 4-core machine. To test how checkpoints work, I
compiled the LU benchmark from the NAS Parallel Benchmarks (NPB) suite.
Unfortunately, the application always fails while checkpointing. Only
the first checkpoint is created, and I have not managed to restart the
application from it.

Could you please confirm that the checkpoint/restart mechanism is still
supposed to work, given that the latest release of BLCR is from 2013?

If it is, could you please tell me what I am doing wrong?

The details:

# uname -a
Linux planeta7 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 
(2015-07-17) x86_64 GNU/Linux

BLCR version: blcr-0.8.6~b3 (the version available in the Debian
experimental repository)

NPB application: lu class C nprocs 4

MVAPICH2 is compiled from source and installed in the ~/opt directory.

How I start the application:

$ MV2_CKPT_NO_SYNC=1 ~/opt/bin/mpiexec -np 4 -verbose \
      -ckpoint-interval 120 -ckpoint-prefix /tmp/chkpt/ ./bin/lu.C.4
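
For anyone reproducing this, the prerequisites of the command above could
be checked first with something like the sketch below. The function name
ckpt_preflight is made up for illustration, and the checks encode
assumptions rather than documented requirements: that the directory passed
to -ckpoint-prefix must already exist and be writable, that the BLCR kernel
module is listed as "blcr" in /proc/modules, and that the BLCR tools
cr_checkpoint and cr_restart are in PATH.

```shell
# Hypothetical pre-flight check; the name and the checks are illustrative
# assumptions, not part of MVAPICH2 or BLCR.
ckpt_preflight() {
    ckpt_dir=$1

    # Assumption: the -ckpoint-prefix directory must exist and be writable.
    if [ ! -d "$ckpt_dir" ] || [ ! -w "$ckpt_dir" ]; then
        echo "checkpoint directory missing or not writable: $ckpt_dir"
        return 1
    fi

    # Assumption: the BLCR kernel module appears as "blcr" in /proc/modules.
    if ! grep -q '^blcr ' /proc/modules 2>/dev/null; then
        echo "blcr kernel module not loaded"
        return 1
    fi

    # cr_checkpoint and cr_restart are the BLCR command-line tools.
    for tool in cr_checkpoint cr_restart; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found in PATH"
            return 1
        fi
    done

    echo "ok"
}
```

It would be run as "ckpt_preflight /tmp/chkpt/" before starting mpiexec; a
non-zero exit status means one of the assumed prerequisites is missing.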

How it fails (note that after the first checkpoint completes, the time
steps no longer advance):

  Time step   80
[proxy:0:0@planeta7] requesting checkpoint
[proxy:0:0@planeta7] checkpoint completed
[proxy:0:0@planeta7] requesting checkpoint
[proxy:0:0@planeta7] checkpoint completed
[proxy:0:0@planeta7] requesting checkpoint
[proxy:0:0@planeta7] HYDT_ckpoint_checkpoint (tools/ckpoint/ckpoint.c:115): Previous checkpoint has not completed.
[proxy:0:0@planeta7] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:931): checkpoint suspend failed
[proxy:0:0@planeta7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@planeta7] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@planeta7] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
[mpiexec@planeta7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@planeta7] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@planeta7] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

CPU model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, with 4 CPU cores

dmesg output:

# dmesg -c
[425658.981022] blcr: warning: skipped a socket.
[425658.981031] blcr: warning: skipped a socket.
[425658.981036] blcr: warning: skipped a socket.
[425658.981063] blcr: warning: skipped a socket.
[425660.135281] blcr: warning: skipped a socket.
[425661.269592] blcr: warning: skipped a socket.
[425662.698799] blcr: warning: skipped a socket.
[425896.865836] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2589) exited with signal 9 during checkpoint
[425896.865839] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2589/2599) exited with signal 9 during checkpoint
[425896.865841] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2591) exited with signal 9 during checkpoint
[425896.865842] blcr: chkpt_watchdog: 'lu.C.4' (tgid/pid 2591/2598) exited with signal 9 during checkpoint
[425896.881038] blcr: warning: skipped a socket.
[425896.881042] blcr: warning: skipped a socket.
[425896.881043] blcr: warning: skipped a socket.
[425896.881051] blcr: warning: skipped a socket.
[425896.881149] blcr: cr_freeze_threads failed (-4)
[425898.414107] blcr: warning: skipped a socket.
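
To make kernel logs like the one above easier to compare between runs, the
BLCR messages can be tallied with a small helper. This is only a sketch:
the function name summarize_blcr_log is made up, and it assumes the
message texts appear exactly as in the dmesg output above.

```shell
# Hypothetical helper: count the BLCR message types seen in a kernel log
# read from standard input. The patterns are copied from the log above.
summarize_blcr_log() {
    awk '
        /blcr: warning: skipped a socket/ { skipped++ }
        /blcr: chkpt_watchdog:/           { killed++ }
        /blcr: cr_freeze_threads failed/  { freeze++ }
        END {
            # Unset awk variables print as 0 with %d.
            printf "skipped_sockets=%d\n", skipped
            printf "watchdog_kills=%d\n", killed
            printf "freeze_failures=%d\n", freeze
        }
    '
}
```

It would be used as "dmesg | summarize_blcr_log". Each thread that the
checkpoint watchdog reports as killed counts as one watchdog_kills entry.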

Complete log of mpiexec:

http://paste.debian.net/290916/

I also tried running the application with mpiexec.mpirun_rsh, but the
behavior was much the same. If you think the mpiexec.mpirun_rsh results
would be useful, please let me know.

-- 
With best regards,
Maksym Planeta.
