[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Wed Jan 19 13:01:23 EST 2011


Dear All,

we are facing quite a strange situation when trying to
BLCR-checkpoint applications built with MVAPICH2-1.5.1
(see the very end of this email for how MVAPICH2 was compiled):

- writing the BLCR-checkpoint of a _serial_ application to a
  mounted Lustre filesystem works

- writing the BLCR-checkpoints of an MVAPICH2-1.5.1-compiled
  parallel application to the local /tmp directories works

- writing the BLCR-checkpoints of an MVAPICH2-1.5.1-compiled
  parallel application to an NFS-mounted directory works

- HOWEVER, writing the BLCR-checkpoints of an
  MVAPICH2-1.5.1-compiled parallel application to a mounted Lustre
  filesystem FAILS

#on the shell calling cr_checkpoint#  cr_checkpoint -p PID-OF-MPIRUN

- chkpt_watchdog: 'mpirun_rsh' (tgid/pid 21615/21617) exited with
  signal 6 during checkpoint
Checkpoint cancelled by application: try again later


#on the shell with the application#  mpirun_rsh -ssh -np 2 -hostfile $PBS_NODEFILE  MV2_CKPT_FILE=/lustre/ckp-job ./a.out

[0]: begin checkpoint...
[Rank 0][/apps/mvapich2-1.5.1p1/src/mpid/ch3/channels/mrail/src/gen2/cr.c: line 721]cr_checkpoint failed
[CR_Callback] Checkpoint of a Process Failed
cr_core.c:244 cr_checkpoint: Unexpected return from
CR_OP_HAND_ABORT
Abort


Doing an "ls -l" on the Lustre filesystem shows that only the first
part of each checkpoint has been written ...
-rw------- 1 uz uz 68157440 Jan 19 16:20 ckp-job.1.0
-rw------- 1 uz uz 68157440 Jan 19 16:20 ckp-job.1.1

The "correct" checkpoint files as written to NFS or /tmp are much
larger
-rw------- 1 uz uz 175164237 Jan 19 16:33 ckpt.1.0
-rw------- 1 uz uz  99699381 Jan 19 16:33 ckpt.1.1
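
For what it's worth, 68157440 bytes is exactly 65*1024*1024
(65 MiB), i.e. the writes to Lustre seem to stop at a "round" size
rather than at an arbitrary point; I have no idea whether that is
significant.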

The number of MPI processes does not matter: the behavior is the
same for 1 MPI process on 1 node, 2 MPI processes on 2 nodes, and
4 MPI processes on 2 nodes. Writing the checkpoint hangs or aborts
as soon as the application is an MPI application and the checkpoint
is supposed to go to Lustre.
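
Just to give a concrete idea of what is being checkpointed: a
minimal sketch of the kind of MPI test code we use looks roughly
like the following (the actual a.out differs; the code below is
only for illustration):

  /* minimal illustrative MPI loop; the real test code differs */
  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      int rank, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (i = 0; i < 600; i++) {
          double local = rank + i, sum = 0.0;
          /* some communication so the InfiniBand channel is actually used */
          MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          if (rank == 0 && i % 60 == 0)
              printf("iteration %d, sum = %f\n", i, sum);
          sleep(1);   /* keep the job running long enough to checkpoint */
      }

      MPI_Finalize();
      return 0;
  }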


If OpenMPI-1.4.3 is used instead, very much the same behavior is
observed, except that OpenMPI just hangs instead of reporting a
checkpoint failure. My initial guess therefore was that it might be
a BLCR issue, but Paul Hargrove redirected me to the MPI people ...
cf. https://hpcrdm.lbl.gov/pipermail/checkpoint/2011-January/000135.html


Initial tests with writing the checkpoint to a mounted GPFS
parallel filesystem also produced aborted checkpoints.


I tested this on two different clusters:
- both with BLCR-0.8.2
- both with Lustre-1.8.x (as far as I know), mounted over InfiniBand
- Debian-based kernel 2.6.32.21 or the kernel from CentOS-5.5
- InfiniBand interconnect with Mellanox HCAs


MVAPICH2-1.5.1 was configured as follows:

env CC=icc F77=ifort F90=ifort CXX=icpc ../mvapich2-1.5.1p1/configure --prefix=/apps/mvapich2/1.5.1p1-intel11.1up8-blcr --enable-blcr --with-file-system=lustre

i.e., only BLCR was enabled, but none of the other fault-tolerance
features.


Any ideas or hints?


Should checkpointing of MVAPICH2 applications to mounted Lustre
filesystems work?


Best regards,

Thomas

