[mvapich-discuss] Problem with checkpoint/restart - MVAPICH2-2.0b+BLCR+SLURM

Raghu rajachan at cse.ohio-state.edu
Mon Feb 10 10:29:22 EST 2014


Hi Miao,

Although you mentioned that you are using SLURM to take a checkpoint,
I see that you have not configured MVAPICH2 to work well with SLURM
(--with-pmi=slurm and --with-pm=no). Which launcher did you use to run
the application, and how exactly did you checkpoint and restart the
job?
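
For reference, a minimal sketch of what the configure line might look like with those two flags appended to the options quoted below. The option list is copied from the original report; whether this combination works together with --enable-ckpt in 2.0b is an assumption, not something confirmed in this thread. The variable names are only for illustration.

```shell
#!/bin/sh
# Options copied verbatim from the original report.
base_opts="CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail --with-rdma=gen2 --enable-ckpt --with-blcr=/usr --prefix=/public/software/mpi/mvapich2-20b-intel-blcr"

# The two SLURM-related flags mentioned above.
slurm_opts="--with-pmi=slurm --with-pm=no"

# Assemble the full command line (hypothetical variable name).
configure_cmd="./configure $base_opts $slurm_opts"

# Only print the command for review; run it by hand from the
# MVAPICH2 source tree once it looks right.
printf '%s\n' "$configure_cmd"
```

With --with-pm=no no MVAPICH2 process manager is built, so a build like this would normally be launched with srun.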

Raghu


On Mon, Feb 10, 2014 at 12:50 AM, miaocb <miaocb at sina.cn> wrote:
> Hi, everyone
>     I am using SLURM (job resource manager), BLCR and MVAPICH2 to test
> checkpoint/restart support.
> The job was checkpointed successfully. When the job was restarted, the MPI
> process appeared to be restarted and its state was "R", but it was not
> really "running".
>
>    The BLCR documentation says that "BLCR will not checkpoint and/or restore
> open sockets (TCP/IP, Unix domain, etc.).
> At restart time any sockets will appear to have been closed."  And "BLCR
> supports checkpointing parallel/distributed applications only by using
> checkpoint callbacks, many MPI implementations have made themselves
> checkpointable by BLCR".
>
> I noticed that, when the MPI program was started the first time, a socket
> connection was established (see the following):
> [root@node82 ~]# ll /proc/8953/fd
> total 0
> lr-x------ 1 wrf users 64 Feb  4 06:43 0 -> pipe:[28127264]
> l-wx------ 1 wrf users 64 Feb  4 06:43 1 -> pipe:[28127265]
> lrwx------ 1 wrf users 64 Feb  4 06:43 10 -> /public/sourcecode/slurm/test/Estuary/tstinp/RIVERS_NAMELIST.nml
> lr-x------ 1 wrf users 64 Feb  4 06:43 16 -> /public/sourcecode/slurm/test/Estuary/tstinp/River_data.nc
> lr-x------ 1 wrf users 64 Feb  4 06:43 17 -> /public/sourcecode/slurm/test/Estuary/tstinp/m2_only_1m.nc
> l-wx------ 1 wrf users 64 Feb  4 06:43 2 -> pipe:[28127266]
> lrwx------ 1 wrf users 64 Feb  4 06:43 3 -> socket:[28127330]
> lr-x------ 1 wrf users 64 Feb  4 06:43 4 -> /
> l-wx------ 1 wrf users 64 Feb  4 06:43 5 -> /proc/checkpoint/ctrl
> lrwx------ 1 wrf users 64 Feb  4 06:43 6 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:43 7 -> /dev/shm/ib_pool-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:43 8 -> /dev/shm/ib_shmem_coll-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:43 9 -> /dev/shm/slot_shmem-coll-42.0-node82-0-503.tmp (deleted)
>
> When the job was checkpointed, BLCR logged three warnings saying that
> sockets were skipped during the checkpoint:
> Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
> Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
> Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
>
> But when the MPI process was restarted, the socket connection was not
> restored (see the following results).
> [root@node82 ~]# ll /proc/8953/fd
> total 0
> lr-x------ 1 wrf users 64 Feb  4 06:47 0 -> pipe:[28130247]
> l-wx------ 1 wrf users 64 Feb  4 06:47 1 -> pipe:[28130248]
> lrwx------ 1 wrf users 64 Feb  4 06:47 10 -> /public/sourcecode/slurm/test/Estuary/tstinp/RIVERS_NAMELIST.nml
> lr-x------ 1 wrf users 64 Feb  4 06:47 11 -> /
> lrwx------ 1 wrf users 64 Feb  4 06:47 12 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 13 -> /dev/shm/ib_pool-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 14 -> /dev/shm/ib_shmem_coll-42.0-node82-503.tmp (deleted)
> lr-x------ 1 wrf users 64 Feb  4 06:47 16 -> /public/sourcecode/slurm/test/Estuary/tstinp/River_data.nc
> lr-x------ 1 wrf users 64 Feb  4 06:47 17 -> /public/sourcecode/slurm/test/Estuary/tstinp/m2_only_1m.nc
> l-wx------ 1 wrf users 64 Feb  4 06:47 2 -> pipe:[28130249]
> lr-x------ 1 wrf users 64 Feb  4 06:47 4 -> /
> l-wx------ 1 wrf users 64 Feb  4 06:47 5 -> /proc/checkpoint/ctrl
> lrwx------ 1 wrf users 64 Feb  4 06:47 6 -> /dev/zero (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 7 -> /dev/zero (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 8 -> /dev/zero (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 9 -> /dev/zero (deleted)
>
>  The configure options are:
>   ./configure CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail
> --with-rdma=gen2 --enable-ckpt --with-blcr=/usr
> --prefix=/public/software/mpi/mvapich2-20b-intel-blcr
>
> Any idea?
>
> ________________________________
> miaocb


