[mvapich-discuss] Problem with checkpoint/restart - MVAPICH2-2.0b+BLCR+SLURM

miaocb miaocb at sina.cn
Mon Feb 10 22:08:38 EST 2014


Hi Raghu,
    Thanks for your reply, and sorry for not providing enough details.
    In order to use Slurm's "srun" command to run the MPI program, I add the option "-lpmi" when compiling the program, for example:
   mpicc -o test_mpi test_mpi.c -lpmi
   I also tried adding the options "--with-pmi=slurm --with-pm=no" when configuring MVAPICH2. MVAPICH2 is configured as follows (--enable-fast=none is added to get debug information):

[wrf at node81 bin]$ mpichversion
MVAPICH2 Version:       2.0b
MVAPICH2 Release date:  Fri Nov  8 11:17:40 EST 2013
MVAPICH2 Device:        ch3:mrail
MVAPICH2 configure:     CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail --with-rdma=gen2 --enable-ckpt --with-blcr=/usr --with-pmi=slurm --with-pm=no --enable-fast=none --prefix=/public/sourcecode/slurm/mpi/mvapich2-20b-intel-slurm-blcr
MVAPICH2 CC:    icc   
MVAPICH2 CXX:   icpc  
MVAPICH2 F77:   ifort -L/usr/lib  
MVAPICH2 FC:    ifort  
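
For reference, that corresponds to a configure command along the following lines (reconstructed from the "MVAPICH2 configure:" line above, with nothing added beyond what is shown there):

./configure CC=icc CXX=icpc FC=ifort F77=ifort \
    --with-device=ch3:mrail --with-rdma=gen2 \
    --enable-ckpt --with-blcr=/usr \
    --with-pmi=slurm --with-pm=no \
    --enable-fast=none \
    --prefix=/public/sourcecode/slurm/mpi/mvapich2-20b-intel-slurm-blcr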

The Slurm script used to run the program is:

#!/bin/bash
#SBATCH -J count
#SBATCH -n 2
#SBATCH -p low
#SBATCH -o out.%j
#SBATCH -e err.%j
source mvapich2-20b-intel-slurm-blcr-env.sh   # sets up the MVAPICH2 + BLCR environment
export MV2_DEBUG_CR_VERBOSE=10
srun_cr ./counting_mpi                        # Slurm's checkpoint-aware srun wrapper
 
The environment variable MV2_DEBUG_CR_VERBOSE is used to turn on the checkpoint/restart debug information.

The Slurm utility "sbatch" is used to submit the job. After the job has started and before a checkpoint is taken, the checkpoint/restart debug information is:
[node82:mpi_rank_1][MPIDI_CH3I_CR_Init] Initializing SMC locks
[node82:mpi_rank_1][MPIDI_CH3I_CR_Init] Initialized SMC locks
[node82:mpi_rank_0][MPIDI_CH3I_CR_Init] Creating a new thread for running cr controller: MPICR_cs_lock
[node82:mpi_rank_0][MPIDI_CH3I_CR_Init] Creating a new thread for running cr controller
[node82:mpi_rank_0][MPIDI_CH3I_CR_Init] Initializing SMC locks
[node82:mpi_rank_0][MPIDI_CH3I_CR_Init] Initialized SMC locks
[node82:mpi_rank_1][CR_Set_state] MPICR_STATE_RUNNING
[node82:mpi_rank_1][CR_Thread_entry] Finished initialization for the non-mpirun_rsh case
[node82:mpi_rank_0][CR_Set_state] MPICR_STATE_RUNNING
[node82:mpi_rank_0][CR_Thread_entry] Finished initialization for the non-mpirun_rsh case

The command "scontrol checkpoint create jobid" is used to take a checkpoint, and the checkpoint/restart debug information is:

[node82:mpi_rank_0][CR_Set_state] MPICR_STATE_REQUESTED
[node82:mpi_rank_0][norsh_cr_callback] Locking the shmem collectives
[node82:mpi_rank_0][norsh_cr_callback] Locking the critical-section
[node82:mpi_rank_0][norsh_cr_callback] Locking the channel manager
[node82:mpi_rank_0][CR_Set_state] MPICR_STATE_PRE_COORDINATION
[node82:mpi_rank_0][norsh_cr_callback] Suspending communication channels
[node82:mpi_rank_0][CR_IBU_Suspend_channels] PG_get_vc=1
[node82:mpi_rank_1][CR_Set_state] MPICR_STATE_REQUESTED
[node82:mpi_rank_1][norsh_cr_callback] Locking the shmem collectives
[node82:mpi_rank_1][norsh_cr_callback] Locking the critical-section
[node82:mpi_rank_1][norsh_cr_callback] Locking the channel manager
[node82:mpi_rank_1][CR_Set_state] MPICR_STATE_PRE_COORDINATION
[node82:mpi_rank_1][norsh_cr_callback] Suspending communication channels
[node82:mpi_rank_1][CR_IBU_Suspend_channels] PG_get_vc=0
[node82:mpi_rank_1][CR_IBU_Suspend_channels] fin:  MPIDI_CH3I_CM_suspend
[node82:mpi_rank_0][CR_IBU_Suspend_channels] fin:  MPIDI_CH3I_CM_suspend
[node82:mpi_rank_0][CR_IBU_Suspend_channels] fin:  MPICM_unlock
[node82:mpi_rank_0][CR_Set_state] MPICR_STATE_CHECKPOINTING
[node82:mpi_rank_0][norsh_cr_callback] Calling cr_checkpoint()
[node82:mpi_rank_1][CR_IBU_Suspend_channels] fin:  MPICM_unlock
[node82:mpi_rank_1][CR_Set_state] MPICR_STATE_CHECKPOINTING
[node82:mpi_rank_1][norsh_cr_callback] Calling cr_checkpoint()
[node82:mpi_rank_1][norsh_cr_callback] Reactivating the communication channels
[node82:mpi_rank_0][norsh_cr_callback] Reactivating the communication channels
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] MPIDI_CH3I_SMP_init()
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] MPIDI_CH3I_SMP_init()
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] Attempting to reattach SMP pool used for large-message exchange before checkpoint.
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] Attempting to reattach SMP pool used for large-message exchange before checkpoint.
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] SMP pool init and attached.
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] CR_IBU_Prep_remote_update
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] MPIDI_CH3I_CM_Reactivate
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] SMP pool init and attached.
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] CR_IBU_Prep_remote_update
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] MPIDI_CH3I_CM_Reactivate
[node82:mpi_rank_0][norsh_cr_callback] Unlocked shmem collectives!
[node82:mpi_rank_0][norsh_cr_callback] Unlocking the CH3 critical section lock
[node82:mpi_rank_0][CR_Set_state] MPICR_STATE_RUNNING
[node82:mpi_rank_0][norsh_cr_callback] Exiting norsh_cr_callback
[node82:mpi_rank_1][norsh_cr_callback] Unlocked shmem collectives!
[node82:mpi_rank_1][norsh_cr_callback] Unlocking the CH3 critical section lock
[node82:mpi_rank_1][CR_Set_state] MPICR_STATE_RUNNING
[node82:mpi_rank_1][norsh_cr_callback] Exiting norsh_cr_callback

Then the job is cancelled using the Slurm utility "scancel" and restarted using "scontrol checkpoint restart jobid", and the checkpoint/restart debug information is:

[node82:mpi_rank_1][norsh_cr_callback] Reactivating the communication channels
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] MPIDI_CH3I_SMP_init()
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] MPIDI_CH3I_SMP_init()
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] Attempting to reattach SMP pool used for large-message exchange before checkpoint.
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] Attempting to reattach SMP pool used for large-message exchange before checkpoint.
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] SMP pool init and attached.
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] CR_IBU_Prep_remote_update
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] SMP pool init and attached.
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] CR_IBU_Prep_remote_update
[node82:mpi_rank_0][CR_IBU_Reactivate_channels] MPIDI_CH3I_CM_Reactivate
[node82:mpi_rank_1][CR_IBU_Reactivate_channels] MPIDI_CH3I_CM_Reactivate
[node82:mpi_rank_0][norsh_cr_callback] Unlocked shmem collectives!
[node82:mpi_rank_0][norsh_cr_callback] Unlocking the CH3 critical section lock
[node82:mpi_rank_0][CR_Set_state] MPICR_STATE_RUNNING
[node82:mpi_rank_0][norsh_cr_callback] Exiting norsh_cr_callback
[node82:mpi_rank_1][norsh_cr_callback] Unlocked shmem collectives!
[node82:mpi_rank_1][norsh_cr_callback] Unlocking the CH3 critical section lock
[node82:mpi_rank_1][CR_Set_state] MPICR_STATE_RUNNING
[node82:mpi_rank_1][norsh_cr_callback] Exiting norsh_cr_callback
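
To summarize, the whole test sequence (submit, checkpoint, cancel, restart) is essentially the following sketch; "count.sh" is only a stand-in name for the batch script shown above, and JOBID is the job id reported by sbatch:

sbatch count.sh                      # submit the job script
scontrol checkpoint create JOBID     # take a BLCR checkpoint of the running job
scancel JOBID                        # cancel the job
scontrol checkpoint restart JOBID    # restart the job from the checkpoint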

Is this information enough? Is there anything wrong?  Thanks.





miaocb

From: Raghu
Sent: 2014-02-10 23:29
To: miaocb
Cc: mvapich-discuss
Subject: Re: [mvapich-discuss] Problem with checkpoint/restart - MVAPICH2-2.0b+BLCR+SLURM
Hi Miao,

Although you mentioned that you are using SLURM to take a checkpoint,
I see that you have not configured MVAPICH2 to work well with SLURM
(--with-pmi=slurm and --with-pm=no). Which launcher did you use to run
the application, and how exactly did you checkpoint and restart the
job?

Raghu


On Mon, Feb 10, 2014 at 12:50 AM, miaocb <miaocb at sina.cn> wrote:
> Hi everyone,
>     I use SLURM (the job resource manager), BLCR, and MVAPICH2 to test
> checkpoint/restart support.
> The job was successfully checkpointed, but when it was restarted, the MPI
> process only appeared to be restarted:
> its state was "R", yet it was not really running.
>
>    The BLCR documentation says that "BLCR will not checkpoint and/or restore
> open sockets (TCP/IP, Unix domain, etc.).
> At restart time any sockets will appear to have been closed." And "BLCR
> supports checkpointing parallel/distributed applications only by using
> checkpoint callbacks; many MPI implementations have made themselves
> checkpointable by BLCR".
>
> I noticed that, when the MPI program was started the first time, a socket
> connection was established (see the listing below, and the quick check
> sketched right after it):
> [root@node82 ~]# ll /proc/8953/fd
> total 0
> lr-x------ 1 wrf users 64 Feb  4 06:43 0 -> pipe:[28127264]
> l-wx------ 1 wrf users 64 Feb  4 06:43 1 -> pipe:[28127265]
> lrwx------ 1 wrf users 64 Feb  4 06:43 10 -> /public/sourcecode/slurm/test/Estuary/tstinp/RIVERS_NAMELIST.nml
> lr-x------ 1 wrf users 64 Feb  4 06:43 16 -> /public/sourcecode/slurm/test/Estuary/tstinp/River_data.nc
> lr-x------ 1 wrf users 64 Feb  4 06:43 17 -> /public/sourcecode/slurm/test/Estuary/tstinp/m2_only_1m.nc
> l-wx------ 1 wrf users 64 Feb  4 06:43 2 -> pipe:[28127266]
> lrwx------ 1 wrf users 64 Feb  4 06:43 3 -> socket:[28127330]
> lr-x------ 1 wrf users 64 Feb  4 06:43 4 -> /
> l-wx------ 1 wrf users 64 Feb  4 06:43 5 -> /proc/checkpoint/ctrl
> lrwx------ 1 wrf users 64 Feb  4 06:43 6 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:43 7 -> /dev/shm/ib_pool-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:43 8 -> /dev/shm/ib_shmem_coll-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:43 9 -> /dev/shm/slot_shmem-coll-42.0-node82-0-503.tmp (deleted)
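>
> (As a quick check, the listing can be filtered for sockets directly; a trivial
> sketch using the pid from above:
>
>     ls -l /proc/8953/fd | grep socket
>
> which prints the socket fd here, and prints nothing after the restart shown
> further below.)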
>
> When the job was checkpointed, BLCR gave three warnings saying that socket
> connections were skipped during the checkpoint:
> Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
> Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
> Feb  4 06:44:29 node82 kernel: blcr: warning: skipped a socket.
>
> But when the MPI process was restarted, the socket connection was not
> restored (see the following results).
> [root@node82 ~]# ll /proc/8953/fd
> total 0
> lr-x------ 1 wrf users 64 Feb  4 06:47 0 -> pipe:[28130247]
> l-wx------ 1 wrf users 64 Feb  4 06:47 1 -> pipe:[28130248]
> lrwx------ 1 wrf users 64 Feb  4 06:47 10 -> /public/sourcecode/slurm/test/Estuary/tstinp/RIVERS_NAMELIST.nml
> lr-x------ 1 wrf users 64 Feb  4 06:47 11 -> /
> lrwx------ 1 wrf users 64 Feb  4 06:47 12 -> /dev/shm/ib_shmem-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 13 -> /dev/shm/ib_pool-42.0-node82-503.tmp (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 14 -> /dev/shm/ib_shmem_coll-42.0-node82-503.tmp (deleted)
> lr-x------ 1 wrf users 64 Feb  4 06:47 16 -> /public/sourcecode/slurm/test/Estuary/tstinp/River_data.nc
> lr-x------ 1 wrf users 64 Feb  4 06:47 17 -> /public/sourcecode/slurm/test/Estuary/tstinp/m2_only_1m.nc
> l-wx------ 1 wrf users 64 Feb  4 06:47 2 -> pipe:[28130249]
> lr-x------ 1 wrf users 64 Feb  4 06:47 4 -> /
> l-wx------ 1 wrf users 64 Feb  4 06:47 5 -> /proc/checkpoint/ctrl
> lrwx------ 1 wrf users 64 Feb  4 06:47 6 -> /dev/zero (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 7 -> /dev/zero (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 8 -> /dev/zero (deleted)
> lrwx------ 1 wrf users 64 Feb  4 06:47 9 -> /dev/zero (deleted)
>
>  The configure options are:
>   ./configure CC=icc CXX=icpc FC=ifort F77=ifort --with-device=ch3:mrail \
>     --with-rdma=gen2 --enable-ckpt --with-blcr=/usr \
>     --prefix=/public/software/mpi/mvapich2-20b-intel-blcr
>
> Any idea?
>
> ________________________________
> miaocb
>

