[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1
application to Lustre parallel filesystem
xiangyong ouyang
ouyangx at cse.ohio-state.edu
Fri Jan 28 00:12:12 EST 2011
Hello Thomas,
We re-ran the test with IMB, and we were able to checkpoint to Lustre
successfully.
First of all, please make sure you have the right permissions on the
Lustre filesystem, enough free space available in Lustre, that you have
not run out of quota, etc. Some users have experienced checkpoint
problems that were caused by such filesystem issues.
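A quick sanity check along these lines might look as follows (a generic sketch: /tmp/lustre is a placeholder for your actual Lustre checkpoint directory, and the `lfs quota` line requires the Lustre client tools):

```shell
# Placeholder path -- replace with your Lustre checkpoint directory.
CKPT_DIR=/tmp/lustre

# Free space on the target filesystem.
df -h "$CKPT_DIR"

# Quota usage (Lustre-only; uncomment on a node with the 'lfs' client tool):
# lfs quota -u "$USER" "$CKPT_DIR"

# Verify the directory is actually writable.
touch "$CKPT_DIR/.ckpt-write-test" && rm "$CKPT_DIR/.ckpt-write-test" && echo "write OK"
```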
It would be helpful to collect more information about the failure you
encountered. I have attached a small patch that prints some error
messages when a checkpoint fails. Can you apply this patch to your
mvapich2? I'm assuming you are using MVAPICH2-1.5.1p1. Please re-run
the checkpoint test and send us the error printouts. Thanks!
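Applying the patch might look like this (a sketch only: the directory layout and the -p1 strip level are assumptions; adjust them to how the patch paths are actually rooted):

```shell
# Assumed layout (hypothetical): cr_ckpt_retcode.patch saved one level
# above the MVAPICH2-1.5.1p1 source tree, with paths in the patch relative
# to the tree root (hence -p1; try -p0 if that does not apply cleanly).
cd mvapich2-1.5.1p1
patch -p1 < ../cr_ckpt_retcode.patch
# Then rebuild and reinstall with the existing configuration:
#   make && make install
```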
We have made some improvements since MVAPICH2-1.5.1p1. If possible
can you try our latest MVAPICH2-1.6RC2 which is available at:
http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.6rc2.tgz
------------
FYI, here is what we did in the successful checkpoint of IMB to
Lustre filesystem:
config:
CC=icc F77=ifort CXX=icpc ./configure
--prefix=/home/ouyangx/lustre-blcr-debug --enable-blcr
--with-file-system=lustre
launch:
../bin/mpirun_rsh -np 4 ws7 ws7 ws7 ws7 MV2_CKPT_FILE=/tmp/lustre/ckpt
./IMB-MPI1 bcast
# List of Benchmarks to run:
# Bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
16384 1000 10.59 10.60 10.60
32768 1000 15.27 15.28 15.28
65536 640 28.19 28.21 28.20
131072 320 41.13 41.17 41.15
262144 160 79.64 79.74 79.69
524288 80 364.01 364.47 364.24
1048576 40 890.75 891.82 891.29
2097152 20 1778.65 1779.85 1779.25
4194304 10 3475.28 3478.98 3477.13
[0]: CR completed...
[3]: CR completed...
<snip>
./mv2_checkpoint
PID USER TT COMMAND %CPU VSZ START CMD
30754 ouyangx pts/3 mpirun_rsh 0.0 49200 19:06
../bin/mpirun_rsh -np 4 ws7 ws7 ws7 ws7 MV2_CKPT_FILE=/tmp/lustre/ckpt
./IMB-EXT
Enter PID to checkpoint or Control-C to exit: 30754
Checkpointing PID 30754
Checkpoint file: context.30754
-Xiangyong Ouyang
On Sun, Jan 23, 2011 at 9:42 AM, Thomas Zeiser
<thomas.zeiser at rrze.uni-erlangen.de> wrote:
> Hello Xiangyong,
>
> On Fri, Jan 21, 2011 at 04:12:03PM -0500, xiangyong ouyang wrote:
>> Hello Thomas,
>>
>> As a followup to my previous reply, we have built MVAPICH2-1.5.1p1
>> using icc. We are able to checkpoint/restart a NPB benchmark bt.B.4
>> on two nodes successfully using Lustre as the backend filesystem.
>
> Checkpointing bt.B.4 to a Lustre filesystem works perfectly fine
> for me, too - never managed to fail the checkpoint. I also tried
> bt.D.16 on 4 nodes. No problem either.
>
> However, our home-grown CFD solver still refuses to be checkpointed to Lustre.
>
> Thus, I looked at different communication patterns available in the
> Pallas/Intel MPI benchmarks (IMB) ...
>
> ====================
>
> mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast
> => fails similar to my application; but o.k. to NFS
>
> 32768 1000 23.17 23.17 23.17
> 65536 640 34.78 34.78 34.78
> [0]: begin checkpoint...
> [Rank 0][cr.c: line 721]cr_checkpoint failed
> [CR_Callback] Checkpoint of a Process Failed
> MPI process (rank: 0) terminated unexpectedly on l1401
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> [Rank 1][cr.c: line 721]cr_checkpoint failed
> MPI process (rank: 1) terminated unexpectedly on l1348
> ----------
> mv2_checkpoint
> PID USER TT COMMAND %CPU VSZ START CMD
> 20630 uz pts/0 mpirun_rsh 0.0 25268 14:48 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast bcast
> Enter PID to checkpoint or Control-C to exit: 20630
> Checkpointing PID 20630
> Checkpoint cancelled by application: try again later
> cr_checkpoint failed
>
> ====================
>
> mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
> => fails somewhat later (i.e. the fsync and "Reactivate channels" messages are shown); but o.k. to NFS
>
> 262144 160 102.47 2439.75
> 524288 80 191.31 2613.62
> [0]: begin checkpoint...
> [0]: fsync...
> [0]: Reactivate channels...
> [CR_Callback] Checkpoint of a Process Failed
> [Rank 1][cr.c: line 721]cr_checkpoint failed
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> MPI process (rank: 1) terminated unexpectedly on l1348
> ----------
> mv2_checkpoint
> PID USER TT COMMAND %CPU VSZ START CMD
> 20675 uz pts/0 mpirun_rsh 0.0 25268 14:52 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
> Enter PID to checkpoint or Control-C to exit: 20675
> Checkpointing PID 20675
> Checkpoint cancelled by application: try again later
> cr_checkpoint failed
>
> ====================
>
> mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
> => fails similar to my application; but o.k. to NFS
>
> 524288 80 259.43 259.46 259.44
> 1048576 40 505.32 505.37 505.35
> [0]: begin checkpoint...
> [Rank 1][cr.c: line 721]cr_checkpoint failed
> [CR_Callback] Checkpoint of a Process Failed
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> MPI process (rank: 1) terminated unexpectedly on l1348
> [Rank 0][cr.c: line 721]cr_checkpoint failed
> MPI process (rank: 0) terminated unexpectedly on l1401
> ----------
> mv2_checkpoint
> PID USER TT COMMAND %CPU VSZ START CMD
> 20794 uz pts/0 mpirun_rsh 0.0 25268 15:02 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
> Enter PID to checkpoint or Control-C to exit: 20794
> Checkpointing PID 20794
> Checkpoint cancelled by application: try again later
> cr_checkpoint failed
>
> ====================
>
> Every IMB run I tried failed when checkpointing to the Lustre file
> system :-(
>
>> If you still experience problems when doing CR with Lustre, then could
>> you tell me what's the application you are running when encountering
>> the CR problem? Are you making any MPI_IO calls in that program?
>
> My home-grown application can use MPI-IO, but at the time I try checkpointing
> no MPI-IO is active, i.e. no file is opened with MPI-IO calls.
>
> The IMB tests do not use any MPI-IO at all; but the chance of being inside an
> MPI call when cr_checkpoint is called is almost 100%. Maybe that's the
> difference from bt.B.4?
>
>> And, is it possible that you provide us core-dump / backtrace files
>> about the failure? That will help us investigate your case.
>
> So far, I have not been successful in generating core-dumps.
>
> Also generating backtraces was not really successful: (IMB bcast follows)
>
> (gdb) bt # of master
> #0 0x00002b2afc205256 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #1 0x00002b2afbe92ce7 in MPIDI_CH3I_CR_lock () at cr.c:400
> #2 0x00002b2afbe6a637 in MPIDI_CH3I_Progress (is_blocking=-65343520, state=0x80) at ch3_progress.c:169
> #3 0x00002b2afbeb015a in MPIC_Wait (request_ptr=0x2b2afc1aefe0) at helper_fns.c:512
> #4 0x00002b2afbeafd23 in MPIC_Send (buf=0x2b2afc1aefe0, count=128, datatype=0, dest=-1, tag=-65127456, comm=1377023856) at helper_fns.c:40
> #5 0x00002b2afbe5de7c in MPIR_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm_ptr=0x2b2afc1e3be0) at bcast_osu.c:336
> #6 0x00002b2afbe5dad9 in PMPI_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm=-65127456) at bcast_osu.c:1174
> #7 0x000000000040ade4 in IMB_bcast (c_info=0x2b2afc1aefe0, size=128, ITERATIONS=0x0, RUN_MODE=0xffffffffffffffff, time=0x2b2afc1e3be0) at IMB_bcast.c:157
> #8 0x00000000004064a5 in IMB_init_buffers_iter (c_info=0x2b2afc1aefe0, ITERATIONS=0x80, Bmark=0x0, BMODE=0xffffffffffffffff, iter=-65127456, size=1377023856) at IMB_mem_manager.c:798
> #9 0x0000000000402edf in main (argc=17, argv=0x7fff5213bd68) at IMB.c:262
>
> on the slave node, gdb hangs while attaching to the process ...
>
> ====================
>
> Next I attached to the two processes before issuing the mv2_checkpoint
>
> # master
> (gdb) c
> Continuing.
> [Thread 0x41b7b940 (LWP 22755) exited]
> [Thread 0x41c7c940 (LWP 22756) exited]
> [Thread 0x41a7a940 (LWP 22754) exited]
>
> Program received signal SIGBUS, Bus error.
> [Switching to Thread 0x414d5940 (LWP 22752)]
> 0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
> (gdb) bt
> #0 0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
> #1 0x00002b675fea3b3d in CR_Thread_loop () at cr.c:559
> #2 0x00002b675fea37e8 in CR_Thread_entry (arg=0x0) at cr.c:813
> #3 0x00002b676021173d in start_thread () from /lib64/libpthread.so.0
> #4 0x00002b67613b6f6d in clone () from /lib64/libc.so.6
> ----------
> # slave
> (gdb) c
> Continuing.
> [Thread 0x40fb0940 (LWP 10321) exited]
> [Thread 0x410b1940 (LWP 10322) exited]
> [Thread 0x40eaf940 (LWP 10320) exited]
>
> [Thread 0x409c8940 (LWP 10318) exited]
> [Thread 0x40bc9940 (LWP 10319) exited]
>
> Program exited with code 0377.
> (gdb)
> The program is not being run.
>
> On the slave, gdb only becomes reactive again once the program exited.
>
> Probably not helpful, either.
>
>> -Xiangyong Ouyang
>
> Thanks for your help,
>
> thomas
> --
> Dr.-Ing. Thomas Zeiser, HPC Services
> Friedrich-Alexander-Universitaet Erlangen-Nuernberg
> Regionales Rechenzentrum Erlangen (RRZE)
> Martensstrasse 1, 91058 Erlangen, Germany
> Tel.: +49 9131 85-28737, Fax: +49 9131 302941
> thomas.zeiser at rrze.uni-erlangen.de
> http://www.rrze.uni-erlangen.de/hpc/
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cr_ckpt_retcode.patch
Type: application/octet-stream
Size: 723 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20110128/bdca1271/cr_ckpt_retcode.obj