[mvapich-discuss] Problem with checkpoint/restart - MVAPICH2-2.0b+BLCR+SLURM

miaocb miaocb at sina.cn
Mon Feb 10 00:50:50 EST 2014


Hi everyone,
    I am using SLURM (as the job resource manager), BLCR, and MVAPICH2 to test checkpoint/restart support.
The job was checkpointed successfully. After the job was restarted, the MPI process appeared to have been restarted
and its state was "R", but it did not appear to be really running.

   The BLCR documentation says: "BLCR will not checkpoint and/or restore open sockets (TCP/IP, Unix domain, etc.).
At restart time any sockets will appear to have been closed." It also says that BLCR supports checkpointing parallel/distributed applications only through checkpoint callbacks, and that many MPI implementations have made themselves checkpointable by BLCR.

I noticed that when the MPI program was started the first time, a socket connection was established (fd 3 in the listing below):
[root@node82 ~]# ll /proc/8953/fd
total 0
lr-x------ 1 wrf users 64 Feb  4 06:43 0 -> pipe:[28127264]
l-wx------ 1 wrf users 64 Feb  4 06:43 1 -> pipe:[28127265]
lrwx------ 1 wrf users 64 Feb  4 06:43 10 -> /public/sourcecode/slurm/test/Estuary/tstinp/RIVERS_NAMELIST.nml
lr-x------ 1 wrf users 64 Feb  4 06:43 16 -> /public/sourcecode/slurm/test/Estuary/tstinp/River_data.nc
lr-x------ 1 wrf users 64 Feb  4 06:43 17 -> /public/sourcecode/slurm/test/Estuary/tstinp/m2_only_1m.nc
l-wx------ 1 wrf users 64 Feb  4 06:43 2 -> pipe:[28127266]
lrwx------ 1 wrf users 64 Feb  4 06:43 3 -> socket:[28127330]
lr-x------ 1 wrf users 64 Feb  4 06:43 4 -> /
l-wx------ 1 wrf users 64 Feb  4 06:43 5 -> /proc/checkpoint/ctrl
lrwx------ 1 wrf users 64 Feb  4 06:43 6 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:43 7 -> /dev/shm/ib_pool-42.0-node82-503.tmp (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:43 8 -> /dev/shm/ib_shmem_coll-42.0-node82-503.tmp (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:43 9 -> /dev/shm/slot_shmem-coll-42.0-node82-0-503.tmp (deleted)
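
To make this check repeatable, here is a minimal sketch that lists only the socket descriptors of a process. It assumes a Linux-style /proc/<pid>/fd directory in which every entry is a symlink; the helper name socket_fds is hypothetical, and reading another user's process requires root or the process owner.

```python
import os

def socket_fds(fd_dir):
    """Return {fd_number: link_target} for entries that point at sockets.

    fd_dir is a directory such as /proc/<pid>/fd, where the symlink target
    of a socket descriptor looks like 'socket:[28127330]'.
    """
    result = {}
    for name in os.listdir(fd_dir):
        path = os.path.join(fd_dir, name)
        try:
            target = os.readlink(path)
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:["):
            result[int(name)] = target
    return result
```

Calling socket_fds("/proc/8953/fd") before the checkpoint should report only fd 3 for the listing above; an empty result after the restart would confirm that the socket is gone.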

When the job was checkpointed, BLCR logged three kernel warnings saying that sockets were skipped during the checkpoint:
Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.

But after the MPI process was restarted, the socket connection was not restored (fd 3 is missing from the listing below):
[root@node82 ~]# ll /proc/8953/fd
total 0
lr-x------ 1 wrf users 64 Feb  4 06:47 0 -> pipe:[28130247]
l-wx------ 1 wrf users 64 Feb  4 06:47 1 -> pipe:[28130248]
lrwx------ 1 wrf users 64 Feb  4 06:47 10 -> /public/sourcecode/slurm/test/Estuary/tstinp/RIVERS_NAMELIST.nml
lr-x------ 1 wrf users 64 Feb  4 06:47 11 -> /
lrwx------ 1 wrf users 64 Feb  4 06:47 12 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:47 13 -> /dev/shm/ib_pool-42.0-node82-503.tmp (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:47 14 -> /dev/shm/ib_shmem_coll-42.0-node82-503.tmp (deleted)
lr-x------ 1 wrf users 64 Feb  4 06:47 16 -> /public/sourcecode/slurm/test/Estuary/tstinp/River_data.nc
lr-x------ 1 wrf users 64 Feb  4 06:47 17 -> /public/sourcecode/slurm/test/Estuary/tstinp/m2_only_1m.nc
l-wx------ 1 wrf users 64 Feb  4 06:47 2 -> pipe:[28130249]
lr-x------ 1 wrf users 64 Feb  4 06:47 4 -> /
l-wx------ 1 wrf users 64 Feb  4 06:47 5 -> /proc/checkpoint/ctrl
lrwx------ 1 wrf users 64 Feb  4 06:47 6 -> /dev/zero (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:47 7 -> /dev/zero (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:47 8 -> /dev/zero (deleted)
lrwx------ 1 wrf users 64 Feb  4 06:47 9 -> /dev/zero (deleted)
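
The two listings can also be compared mechanically. The following is a rough sketch, not a tool from BLCR or MVAPICH2: it parses `ls -l` style output and reports which descriptors were lost, gained, or now point at a different kind of object. The function names are hypothetical, and the sample text is a shortened subset of the listings in this message.

```python
import re

# Matches the tail of an `ls -l /proc/<pid>/fd` line: "<fd> -> <target>".
_FD_LINE = re.compile(r"\s(\d+) -> (.+)$")

def parse_fd_listing(text):
    """Map fd number -> link target for each line containing 'N -> target'."""
    fds = {}
    for line in text.splitlines():
        m = _FD_LINE.search(line.rstrip())
        if m:
            fds[int(m.group(1))] = m.group(2)
    return fds

def kind(target):
    """Reduce 'pipe:[123]' or 'socket:[456]' to 'pipe' / 'socket'; keep paths as-is."""
    m = re.match(r"(\w+):\[\d+\]$", target)
    return m.group(1) if m else target

def diff_fds(before, after):
    """Return (lost, gained, changed) lists of fd numbers between two listings."""
    lost = sorted(set(before) - set(after))
    gained = sorted(set(after) - set(before))
    changed = sorted(fd for fd in set(before) & set(after)
                     if kind(before[fd]) != kind(after[fd]))
    return lost, gained, changed

# Shortened samples taken from the two listings in this message.
BEFORE = """
lr-x------ 1 wrf users 64 Feb  4 06:43 0 -> pipe:[28127264]
lrwx------ 1 wrf users 64 Feb  4 06:43 3 -> socket:[28127330]
lr-x------ 1 wrf users 64 Feb  4 06:43 4 -> /
lrwx------ 1 wrf users 64 Feb  4 06:43 6 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
"""
AFTER = """
lr-x------ 1 wrf users 64 Feb  4 06:47 0 -> pipe:[28130247]
lr-x------ 1 wrf users 64 Feb  4 06:47 11 -> /
lr-x------ 1 wrf users 64 Feb  4 06:47 4 -> /
lrwx------ 1 wrf users 64 Feb  4 06:47 6 -> /dev/zero (deleted)
"""

lost, gained, changed = diff_fds(parse_fd_listing(BEFORE), parse_fd_listing(AFTER))
print("lost:", lost, "gained:", gained, "changed:", changed)
```

Pipes whose inode number changed are deliberately treated as unchanged, since a new inode after restart is expected. Applied to the full listings, this should flag fd 3 (the socket) as lost, fds 11 to 14 as new, and fds 6 to 9 as changed from the /dev/shm files to /dev/zero.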

 The configure options are:
  ./configure CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail --with-rdma=gen2 --enable-ckpt --with-blcr=/usr --prefix=/public/software/mpi/mvapich2-20b-intel-blcr
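
As a sanity check that a build really was configured with the checkpoint options shown above, a small sketch like the following could be run against the recorded configure line (for example the one autoconf writes to config.log; where the line comes from is an assumption). The helper name and the list of required options are hypothetical and simply mirror the two checkpoint-related flags in my command.

```python
import shlex

# Checkpoint-related options taken from the configure line in this message.
REQUIRED_PREFIXES = ("--enable-ckpt", "--with-blcr")

def missing_ckpt_options(configure_line):
    """Return the required option names that are absent from a configure command line."""
    args = shlex.split(configure_line)
    return [p for p in REQUIRED_PREFIXES
            if not any(a == p or a.startswith(p + "=") for a in args)]
```

An empty list means both options were present; it says nothing about whether BLCR itself is loaded or working on the node.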

Any ideas?




miaocb