[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

xiangyong ouyang ouyangx at cse.ohio-state.edu
Thu Jan 20 13:11:15 EST 2011


Hello Thomas,

Thanks for reporting this issue.

Using gcc as the compiler, we have run some tests with a
configuration similar to yours, and we are able to checkpoint/restart
the NAS benchmark bt.A.4 on two nodes successfully, using a Lustre
filesystem to store the checkpoint files.

Here are the configure flags we used:

./configure --prefix=/tmp/ouyangx/install-1.5.1p1 --enable-blcr \
    --with-file-system=lustre --enable-g=dbg --enable-debuginfo \
    --enable-sharedlibs=gcc
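
(To double-check how a given installation was built, MVAPICH2's
mpiname utility prints the version and configure options; the path
here assumes the install prefix above:

./install-1.5.1p1/bin/mpiname -a
)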

Here is how we run the application:
./install-1.5.1p1/bin/mpirun_rsh -np 4 -hostfile ./hostfile \
    MV2_CKPT_FILE=/tmp/lustre/d1/ckpt \
    /home/ouyangx/benchmark/NPB3.2-MPI/bin/bt.A.4

Then we use MVAPICH2's helper script to initiate a checkpoint:
./install-1.5.1p1/bin/mv2_checkpoint
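
A manual alternative (a sketch; <pid> stands for the pid of the
mpirun_rsh job console, as in the report below) is to invoke BLCR
directly:

cr_checkpoint -p <pid>

Restarting likewise goes through BLCR; the context-file name depends
on MV2_CKPT_FILE and the checkpoint number, so it is left as a
placeholder here:

cr_restart <checkpoint-file-of-mpirun_rsh>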


In this test we used mvapich2-1.5.1p1 and the Lustre kernel
2.6.18-128.7.1.el5-lustre.1.8.1.1smp.


We will next build mvapich2 with icc to see if there is any
difference. Meanwhile, could you try your test with gcc?



-Xiangyong Ouyang





On Wed, Jan 19, 2011 at 1:01 PM, Thomas Zeiser
<thomas.zeiser at rrze.uni-erlangen.de> wrote:
> Dear All,
>
> we are facing quite a strange situation when trying to
> BLCR-checkpoint applications built with MVAPICH2-1.5.1
> (see the very end of the email for how mvapich2 was compiled):
>
> - writing the BLCR-checkpoint of a _serial_ application to a
>  mounted Lustre filesystem works (a minimal sketch follows this
>  list)
>
> - writing the BLCR-checkpoints of a MVAPICH2-1.5.1-compiled
>  parallel application to the local /tmp directories works
>
> - writing the BLCR-checkpoints of a MVAPICH2-1.5.1-compiled
>  parallel application to an NFS mounted directory works
>
> - HOWEVER, writing the BLCR-checkpoints of a
>  MVAPICH2-1.5.1-compiled parallel application to a mounted Lustre
>  filesystem FAILS
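>
>  As a minimal sketch, the serial control case can be reproduced
>  with BLCR's tools alone ("serial_app" and the target path are
>  just placeholders):
>
>    cr_run ./serial_app &                         # run any BLCR-capable binary
>    cr_checkpoint -f /lustre/ckp-serial -p $!     # write context file to Lustre
>
>  This succeeds; the failure only appears once MPI is involved.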
>
> #on the shell calling cr_checkpoint#  cr_checkpoint -p PID-OF-MPIRUN
>
> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 21615/21617) exited with
>  signal 6 during checkpoint
> Checkpoint cancelled by application: try again later
>
>
> #on the shell with the application#
> mpirun_rsh -ssh -np 2 -hostfile $PBS_NODEFILE \
>     MV2_CKPT_FILE=/lustre/ckp-job ./a.out
>
> [0]: begin checkpoint...
> [Rank 0][/apps/mvapich2-1.5.1p1/src/mpid/ch3/channels/mrail/src/gen2/cr.c: line 721]cr_checkpoint failed
> [CR_Callback] Checkpoint of a Process Failed
> cr_core.c:244 cr_checkpoint: Unexpected return from
> CR_OP_HAND_ABORT
> Abort
>
>
> Doing an "ls -l" on the Lustre filesystem shows that only the first
> part of each checkpoint has been written (exactly 65 MiB per rank) ...
> -rw------- 1 uz uz 68157440 Jan 19 16:20 ckp-job.1.0
> -rw------- 1 uz uz 68157440 Jan 19 16:20 ckp-job.1.1
>
> The "correct" checkpoint files as written to NFS or /tmp are much
> larger
> -rw------- 1 uz uz 175164237 Jan 19 16:33 ckpt.1.0
> -rw------- 1 uz uz  99699381 Jan 19 16:33 ckpt.1.1
>
> The number of MPI processes does not matter: the behavior is the
> same for 1 MPI process on 1 node, 2 MPI processes on 2 nodes, and
> 4 MPI processes on 2 nodes. Writing the checkpoint hangs/aborts as
> soon as the job is an MPI job and the checkpoint is written to
> Lustre.
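>
> One simple way to confirm that the writes stall (just a sketch) is
> to watch the checkpoint files while the checkpoint is in flight:
>
>   watch -n 1 'ls -l /lustre/ckp-job.1.*'
>
> Both files stop at exactly 68157440 bytes, i.e. 65 MiB.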
>
>
> If OpenMPI-1.4.3 is used instead, very much the same behavior is
> observed, except that OpenMPI simply hangs instead of reporting a
> checkpoint failure. My initial guess was therefore that this might
> be a BLCR issue, but Paul Hargrove redirected me to the MPI
> people ...
> cf. https://hpcrdm.lbl.gov/pipermail/checkpoint/2011-January/000135.html
>
>
> Initial tests with writing the checkpoint to a mounted GPFS
> parallel filesystem also produced aborted checkpoints.
>
>
> I tested it on two different clusters:
> - both BLCR-0.8.2
> - both Lustre-1.8.x as far as I know, mounted over InfiniBand
> - Debian-based kernel 2.6.32.21 or the kernel from CentOS-5.5
> - InfiniBand interconnect with Mellanox HCAs
>
>
> MVAPICH2-1.5.1 was configured as follows:
>
> env CC=icc F77=ifort F90=ifort CXX=icpc ../mvapich2-1.5.1p1/configure \
>     --prefix=/apps/mvapich2/1.5.1p1-intel11.1up8-blcr \
>     --enable-blcr --with-file-system=lustre
>
> i.e., only BLCR was enabled, but none of the other fault-tolerance
> features.
>
>
> Any ideas or hints?
>
>
> Should checkpointing of MVAPICH2 applications to mounted Lustre
> filesystems work?
>
>
> Best regards,
>
> Thomas


