[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

xiangyong ouyang ouyangx at cse.ohio-state.edu
Fri Jan 28 00:12:12 EST 2011


Hello Thomas,

We re-ran the test with IMB, and we were able to checkpoint to Lustre
successfully.

First of all, please make sure you have the right permissions on the
Lustre filesystem, that enough free space is available in Lustre, that
you have not run out of quota, etc.  Some users have experienced
checkpoint problems that were caused by these filesystem issues.
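
A quick sanity check could look like this (a minimal sketch; the mount
point below is only taken from the paths in your earlier logs and may
differ on your system):

  # check free space, quota and write permission on the checkpoint directory
  CKPT_DIR=/lxfs/unrz/uz
  lfs df -h $CKPT_DIR                      # free space on the Lustre OSTs
  lfs quota -u $USER $CKPT_DIR             # user quota on this filesystem
  touch $CKPT_DIR/ckpt-write-test && rm $CKPT_DIR/ckpt-write-test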


It would be helpful to collect more information about the failure you
encountered.  I have attached a small patch that prints some error
messages when a checkpoint fails.  Can you apply this patch to your
MVAPICH2?  I'm assuming you are using MVAPICH2-1.5.1p1.  Please re-run
the checkpoint test and send us the error printouts.  Thanks!
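
Applying the patch should be something along these lines (a sketch; the
patch level and the name of your source directory are assumptions, please
adjust to your tree):

  # apply the attached cr_ckpt_retcode.patch to the MVAPICH2 source and rebuild
  cd mvapich2-1.5.1p1
  patch -p1 < /path/to/cr_ckpt_retcode.patch
  make && make install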


We have made some improvements since MVAPICH2-1.5.1p1.  If possible,
could you try our latest MVAPICH2-1.6RC2, which is available at:
http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.6rc2.tgz
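
For example (a sketch; it reuses the configure flags shown further below,
but your compilers and install prefix will of course differ):

  # fetch and build MVAPICH2-1.6RC2 with BLCR and Lustre support
  wget http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.6rc2.tgz
  tar xzf mvapich2-1.6rc2.tgz
  cd mvapich2-1.6rc2          # directory name assumed from the tarball name
  CC=icc F77=ifort CXX=icpc ./configure --prefix=$HOME/mvapich2-1.6rc2-install \
      --enable-blcr --with-file-system=lustre
  make && make install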

------------
FYI,  here is what we did in the successful checkpoint of IMB to
Lustre filesystem:

config:

CC=icc F77=ifort CXX=icpc ./configure \
    --prefix=/home/ouyangx/lustre-blcr-debug --enable-blcr \
    --with-file-system=lustre

launch:

../bin/mpirun_rsh -np 4 ws7 ws7 ws7 ws7 MV2_CKPT_FILE=/tmp/lustre/ckpt \
    ./IMB-MPI1 bcast

# List of Benchmarks to run:

# Bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]

        16384         1000        10.59        10.60        10.60
        32768         1000        15.27        15.28        15.28
        65536          640        28.19        28.21        28.20
       131072          320        41.13        41.17        41.15
       262144          160        79.64        79.74        79.69
       524288           80       364.01       364.47       364.24
      1048576           40       890.75       891.82       891.29
      2097152           20      1778.65      1779.85      1779.25
      4194304           10      3475.28      3478.98      3477.13

[0]:  CR completed...
[3]:  CR completed...

<snip>
 ./mv2_checkpoint

  PID USER     TT       COMMAND     %CPU    VSZ  START CMD

30754 ouyangx  pts/3    mpirun_rsh   0.0  49200  19:06
../bin/mpirun_rsh -np 4 ws7 ws7 ws7 ws7 MV2_CKPT_FILE=/tmp/lustre/ckpt
./IMB-EXT

Enter PID to checkpoint or Control-C to exit: 30754
Checkpointing PID 30754
Checkpoint file: context.30754
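
The job can later be restarted by running BLCR's cr_restart on the
mpirun_rsh context file written above, e.g. (sketch):

  # restart the whole MPI job from the checkpoint taken above
  cr_restart ./context.30754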



-Xiangyong Ouyang

On Sun, Jan 23, 2011 at 9:42 AM, Thomas Zeiser
<thomas.zeiser at rrze.uni-erlangen.de> wrote:
> Hello Xiangyong,
>
> On Fri, Jan 21, 2011 at 04:12:03PM -0500, xiangyong ouyang wrote:
>> Hello Thomas,
>>
>> As a follow-up to my previous reply,  we have built MVAPICH2-1.5.1p1
>> using icc.   We are able to checkpoint/restart a NPB benchmark bt.B.4
>> on two nodes successfully using Lustre as the backend filesystem.
>
> Checkpointing bt.B.4 to a Lustre filesystem works perfectly fine
> for me, too - I never managed to make the checkpoint fail. I also tried
> bt.D.16 on 4 nodes. No problem either.
>
> However, our home-grown CFD solver still refuses to be checkpointed to Lustre.
>
> Thus, I looked at different communication patterns available in the
> Pallas/Intel MPI benchmarks (IMB) ...
>
> ====================
>
> mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast
> => fails similarly to my application; but o.k. to NFS
>
>        32768         1000        23.17        23.17        23.17
>        65536          640        34.78        34.78        34.78
> [0]: begin checkpoint...
> [Rank 0][cr.c: line 721]cr_checkpoint failed
> [CR_Callback] Checkpoint of a Process Failed
> MPI process (rank: 0) terminated unexpectedly on l1401
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> [Rank 1][cr.c: line 721]cr_checkpoint failed
> MPI process (rank: 1) terminated unexpectedly on l1348
> ----------
> mv2_checkpoint
>  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
> 20630 uz  pts/0    mpirun_rsh   0.0  25268  14:48 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast bcast
> Enter PID to checkpoint or Control-C to exit: 20630
> Checkpointing PID 20630
> Checkpoint cancelled by application: try again later
> cr_checkpoint failed
>
> ====================
>
> mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
> => fails somewhat later (i.e. fsync and reactivate channels are shown); but o.k. to NFS
>
>       262144          160       102.47      2439.75
>       524288           80       191.31      2613.62
> [0]: begin checkpoint...
> [0]: fsync...
> [0]: Reactivate channels...
> [CR_Callback] Checkpoint of a Process Failed
> [Rank 1][cr.c: line 721]cr_checkpoint failed
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> MPI process (rank: 1) terminated unexpectedly on l1348
> ----------
> mv2_checkpoint
>  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
> 20675 uz  pts/0    mpirun_rsh   0.0  25268  14:52 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
> Enter PID to checkpoint or Control-C to exit: 20675
> Checkpointing PID 20675
> Checkpoint cancelled by application: try again later
> cr_checkpoint failed
>
> ====================
>
> mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
> => fails similarly to my application; but o.k. to NFS
>
>       524288           80       259.43       259.46       259.44
>      1048576           40       505.32       505.37       505.35
> [0]: begin checkpoint...
> [Rank 1][cr.c: line 721]cr_checkpoint failed
> [CR_Callback] Checkpoint of a Process Failed
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> MPI process (rank: 1) terminated unexpectedly on l1348
> [Rank 0][cr.c: line 721]cr_checkpoint failed
> MPI process (rank: 0) terminated unexpectedly on l1401
> ----------
> mv2_checkpoint
>  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
> 20794 uz  pts/0    mpirun_rsh   0.0  25268  15:02 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
> Enter PID to checkpoint or Control-C to exit: 20794
> Checkpointing PID 20794
> Checkpoint cancelled by application: try again later
> cr_checkpoint failed
>
> ====================
>
> Not a single of my IMB runs checkpointed successfully to the Lustre
> filesystem :-(
>
>> If you still experience problems when doing CR with Lustre, then could
>> you tell me what's the application you are running when encountering
>> the CR problem?  Are you making any MPI_IO calls in that program?
>
> My home-grown application can use MPI-IO, but at the time I try to checkpoint
> no MPI-IO is active, i.e. no file is opened with MPI-IO calls.
>
> The IMB tests do not use any MPI-IO at all; but the chance of being inside an MPI
> call when cr_checkpoint is called is almost 100%. Maybe that's the difference
> from bt.B.4?
>
>> And,  is it possible that you provide us core-dump / backtrace files
>> about the failure?   That will help us investigate your case.
>
> So far, I have not been successful in generating core dumps.
>
> Generating backtraces was not really successful either (IMB bcast follows):
>
> (gdb) bt # of master
> #0  0x00002b2afc205256 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #1  0x00002b2afbe92ce7 in MPIDI_CH3I_CR_lock () at cr.c:400
> #2  0x00002b2afbe6a637 in MPIDI_CH3I_Progress (is_blocking=-65343520, state=0x80) at ch3_progress.c:169
> #3  0x00002b2afbeb015a in MPIC_Wait (request_ptr=0x2b2afc1aefe0) at helper_fns.c:512
> #4  0x00002b2afbeafd23 in MPIC_Send (buf=0x2b2afc1aefe0, count=128, datatype=0, dest=-1, tag=-65127456, comm=1377023856) at helper_fns.c:40
> #5  0x00002b2afbe5de7c in MPIR_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm_ptr=0x2b2afc1e3be0) at bcast_osu.c:336
> #6  0x00002b2afbe5dad9 in PMPI_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm=-65127456) at bcast_osu.c:1174
> #7  0x000000000040ade4 in IMB_bcast (c_info=0x2b2afc1aefe0, size=128, ITERATIONS=0x0, RUN_MODE=0xffffffffffffffff, time=0x2b2afc1e3be0) at IMB_bcast.c:157
> #8  0x00000000004064a5 in IMB_init_buffers_iter (c_info=0x2b2afc1aefe0, ITERATIONS=0x80, Bmark=0x0, BMODE=0xffffffffffffffff, iter=-65127456, size=1377023856) at IMB_mem_manager.c:798
> #9  0x0000000000402edf in main (argc=17, argv=0x7fff5213bd68) at IMB.c:262
>
> On the slave node, gdb hangs while attaching to the process ...
>
> ====================
>
> Next, I attached gdb to the two processes before issuing the mv2_checkpoint:
>
> # master
> (gdb) c
> Continuing.
> [Thread 0x41b7b940 (LWP 22755) exited]
> [Thread 0x41c7c940 (LWP 22756) exited]
> [Thread 0x41a7a940 (LWP 22754) exited]
>
> Program received signal SIGBUS, Bus error.
> [Switching to Thread 0x414d5940 (LWP 22752)]
> 0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
> (gdb) bt
> #0  0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
> #1  0x00002b675fea3b3d in CR_Thread_loop () at cr.c:559
> #2  0x00002b675fea37e8 in CR_Thread_entry (arg=0x0) at cr.c:813
> #3  0x00002b676021173d in start_thread () from /lib64/libpthread.so.0
> #4  0x00002b67613b6f6d in clone () from /lib64/libc.so.6
> ----------
> # slave
> (gdb) c
> Continuing.
> [Thread 0x40fb0940 (LWP 10321) exited]
> [Thread 0x410b1940 (LWP 10322) exited]
> [Thread 0x40eaf940 (LWP 10320) exited]
>
> [Thread 0x409c8940 (LWP 10318) exited]
> [Thread 0x40bc9940 (LWP 10319) exited]
>
> Program exited with code 0377.
> (gdb)
> The program is not being run.
>
> On the slave, gdb only becomes responsive again once the program has exited.
>
> Probably not helpful either.
>
>> -Xiangyong Ouyang
>
> Thanks for your help,
>
> thomas
> --
> Dr.-Ing. Thomas Zeiser, HPC Services
> Friedrich-Alexander-Universitaet Erlangen-Nuernberg
> Regionales Rechenzentrum Erlangen (RRZE)
> Martensstrasse 1, 91058 Erlangen, Germany
> Tel.: +49 9131 85-28737, Fax: +49 9131 302941
> thomas.zeiser at rrze.uni-erlangen.de
> http://www.rrze.uni-erlangen.de/hpc/
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cr_ckpt_retcode.patch
Type: application/octet-stream
Size: 723 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20110128/bdca1271/cr_ckpt_retcode.obj

