[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

xiangyong ouyang ouyangx at cse.ohio-state.edu
Fri Jan 21 16:12:03 EST 2011


Hello Thomas,

As a follow-up to my previous reply, we have built MVAPICH2-1.5.1p1
using icc. We are able to checkpoint/restart the NPB benchmark bt.B.4
on two nodes successfully, using Lustre as the backend filesystem.


Here are my configure flags (pretty much the same as yours):

export PATH=/opt/intel/Compiler/11.1/069/bin/intel64/:$PATH

env CC=icc F77=ifort F90=ifort CXX=icpc ./configure
--prefix=/home/ouyangx/mvapich2/mvapich2/branches/install-1.5.1p1
--enable-blcr --with-blcr=/home/ouyangx/blcr/install-0.8.2-orig
--with-file-system=lustre  --enable-sharedlibs=gcc  --enable-g=dbg
--enable-debuginfo;
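
For completeness, the build itself is just the usual "make; make install".
To double-check which options the library was actually built with, you can
use the mpiname utility that MVAPICH2 installs; the exact output varies
between versions, so take this only as a rough sketch:

  make && make install
  # print the MVAPICH2 version and the configure options it was built with
  /home/ouyangx/mvapich2/mvapich2/branches/install-1.5.1p1/bin/mpiname -a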


I'm using the following settings on all the compute nodes:
- BLCR 0.8.2
- Lustre 1.8.1.1
- Lustre-patched kernel 2.6.18-128.7.1.el5-lustre.1.8.1.1smp
- Mellanox HCA MT25208
- OFED 1.5.1
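
If it helps to compare environments, these are the kinds of commands I use
to confirm the versions listed above (paths and tools may differ slightly
on your distribution, so treat this only as a sketch):

  uname -r                      # running kernel
  cat /proc/fs/lustre/version   # Lustre client version
  ofed_info -s                  # installed OFED release
  ibv_devinfo                   # HCA model and firmware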


I have attached my "config.log" file for your reference.


Here is how I ran the benchmark:
[s5:~/benchmark/NPB3.2-MPI]$/home/ouyangx/mvapich2/mvapich2/branches/install-1.5.1p1/bin/mpirun_rsh
-np 4 -hostfile ./hostfile  MV2_CKPT_FILE=/tmp/lustre/d1/ckpt
./bin/bt.B.4
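
By the way, the hostfile used above is just a plain text file with one
hostname per line; a node can be listed multiple times to place several
processes on it. For the 4-process run above it could look like the sketch
below ("s5" and "s6" are my node names, yours will of course differ):

  s5
  s5
  s6
  s6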


Then I used the built-in helper script to take a checkpoint:
[s5:~/mvapich2/mvapich2/branches]$/home/ouyangx/mvapich2/mvapich2/branches/install-1.5.1p1/bin/mv2_checkpoint

  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
 7306 ouyangx  pts/2    mpirun_rsh   0.0  49212  15:48
/home/ouyangx/mvapich2/mvapich2/branches/install-1.5.1p1/bin/mpirun_rsh
-np 4 -hostfile ./hostfile MV2_CKPT_FILE=/tmp/lustre/d1/ckpt
./bin/bt.B.4

Enter PID to checkpoint or Control-C to exit: 7306
Checkpointing PID 7306
Checkpoint file: context.7306
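
Assuming the standard BLCR/MVAPICH2 restart workflow, the job can then be
restarted with BLCR's cr_restart on the context file reported above for the
mpirun_rsh process, run from the directory holding that file; roughly:

  # restart the whole MPI job from the saved mpirun_rsh context
  cr_restart ./context.7306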




If you still experience problems when doing CR with Lustre, could you
tell me which application you are running when you encounter the CR
problem? Are you making any MPI-IO calls in that program? Also, could
you provide us with core-dump / backtrace files for the failure? That
would help us investigate your case.
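
In case it is useful, here is a sketch of how a backtrace could be
collected, assuming core dumps are enabled on the compute nodes and gdb is
available (the binary and core file names below are just placeholders):

  ulimit -c unlimited                     # allow core files before launching the job
  gdb --batch -ex bt ./a.out core.<pid>   # print the backtrace from the core of the failed rank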




-Xiangyong Ouyang





On Thu, Jan 20, 2011 at 1:11 PM, xiangyong ouyang
<ouyangx at cse.ohio-state.edu> wrote:
> Hello Thomas,
>
> Thanks for reporting this issue.
>
> Using gcc as the compiler, we have conducted some testing with a similar
> config to yours, and we are able to checkpoint/restart the application
> bt.A.4 on two nodes successfully, using the Lustre filesystem to store the
> checkpoint files.
>
> Here are the configure flags we are using:
>
> ./configure --prefix=/tmp/ouyangx/install-1.5.1p1      --enable-blcr
> --with-file-system=lustre   --enable-g=dbg --enable-debuginfo
> --enable-sharedlibs=gcc
>
> Here is how we run the application:
> ./install-1.5.1p1/bin/mpirun_rsh -np 4 -hostfile ./hostfile
> MV2_CKPT_FILE=/tmp/lustre/d1/ckpt
> /home/ouyangx/benchmark/NPB3.2-MPI/bin/bt.A.4
>
> Then we use mvapich2's helper script to initiate a checkpoint:
> ./install-1.5.1p1/bin/mv2_checkpoint
>
>
> In this testing we use mvapich2-1.5.1p1 and the Lustre-patched kernel
> 2.6.18-128.7.1.el5-lustre.1.8.1.1smp.
>
>
> We'll go further and build mvapich2 with icc to see if there is any
> difference. Meanwhile, could you try your testing with gcc?
>
>
>
> -Xiangyong Ouyang
>
>
>
>
>
> On Wed, Jan 19, 2011 at 1:01 PM, Thomas Zeiser
> <thomas.zeiser at rrze.uni-erlangen.de> wrote:
>> Dear All,
>>
>> we are facing a rather strange situation when trying to
>> BLCR-checkpoint applications built with MVAPICH2-1.5.1
>> (see the very end of the email for how mvapich2 was compiled):
>>
>> - writing the BLCR-checkpoint of a _serial_ application to a
>>  mounted Lustre filesystem works
>>
>> - writing the BLCR-checkpoints of an MVAPICH2-1.5.1-compiled
>>  parallel application to the local /tmp directories works
>>
>> - writing the BLCR-checkpoints of an MVAPICH2-1.5.1-compiled
>>  parallel application to an NFS-mounted directory works
>>
>> - HOWEVER, writing the BLCR-checkpoints of an
>>  MVAPICH2-1.5.1-compiled parallel application to a mounted Lustre
>>  filesystem FAILS
>>
>> #on the shell calling cr_checkpoint#  cr_checkpoint -p PID-OF-MPIRUN
>>
>> - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 21615/21617) exited with
>>  signal 6 during checkpoint
>> Checkpoint cancelled by application: try again later
>>
>>
>> #on the shell with the application#  mpirun_rsh -ssh -np 2 -hostfile $PBS_NODEFILE  MV2_CKPT_FILE=/lustre/ckp-job ./a.out
>>
>> [0]: begin checkpoint...
>> [Rank 0][/apps/mvapich2-1.5.1p1/src/mpid/ch3/channels/mrail/src/gen2/cr.c: line 721]cr_checkpoint failed
>> [CR_Callback] Checkpoint of a Process Failed
>> cr_core.c:244 cr_checkpoint: Unexpected return from
>> CR_OP_HAND_ABORT
>> Abort
>>
>>
>> Doing an "ls -l" on the Lustre filesystem shows that only the first few
>> MB of the checkpoint have been written ...
>> -rw------- 1 uz uz 68157440 Jan 19 16:20 ckp-job.1.0
>> -rw------- 1 uz uz 68157440 Jan 19 16:20 ckp-job.1.1
>>
>> The "correct" checkpoint files as written to NFS or /tmp are much
>> larger
>> -rw------- 1 uz uz 175164237 Jan 19 16:33 ckpt.1.0
>> -rw------- 1 uz uz  99699381 Jan 19 16:33 ckpt.1.1
>>
>> The number of MPI processes does not matter. The behavior is the same
>> for 1 MPI process on 1 node, 2 MPI processes on 2 nodes, and 4 MPI
>> processes on 2 nodes. Writing the checkpoint hangs/aborts as soon as
>> the application is an MPI program and the checkpoint is supposed to be
>> written to Lustre.
>>
>>
>> If OpenMPI-1.4.3 is used instead, very much the same behavior is
>> observed, except that OpenMPI just hangs instead of throwing a
>> checkpoint-failure error message. Thus, my initial guess was that it
>> might be a BLCR issue, but Paul Hargrove redirected me to the MPI
>> people ...
>> cf. https://hpcrdm.lbl.gov/pipermail/checkpoint/2011-January/000135.html
>>
>>
>> Initial tests with writing the checkpoint to a mounted GPFS
>> parallel filesystem also produced aborted checkpoints.
>>
>>
>> I tested it on two different clusters:
>> - both BLCR-0.8.2
>> - both Lustre-1.8.x as far as I know, mounted over Infiniband
>> - Debian-based kernel 2.6.32.21 or kernel from CentOS-5.5
>> - Infiniband interconnect with Mellanox HCAs
>>
>>
>> MVAPICH2-1.5.1 was configured as follows:
>>
>> env CC=icc F77=ifort F90=ifort CXX=icpc ../mvapich2-1.5.1p1/configure --prefix=/apps/mvapich2/1.5.1p1-intel11.1up8-blcr --enable-blcr --with-file-system=lustre
>>
>> i.e., only BLCR was enabled, but none of the other fault-tolerance
>> features.
>>
>>
>> Any ideas or hints?
>>
>>
>> Should checkpointing of MVAPICH2 applications to mounted Lustre
>> filesystems work?
>>
>>
>> Best regards,
>>
>> Thomas
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: text/x-log
Size: 315957 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20110121/fd5dac90/config-0001.bin

