[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Sun Jan 23 09:42:32 EST 2011


Hello Xiangyong,

On Fri, Jan 21, 2011 at 04:12:03PM -0500, xiangyong ouyang wrote:
> Hello Thomas,
> 
> As a follow-up to my previous reply, we have built MVAPICH2-1.5.1p1
> using icc. We are able to checkpoint/restart an NPB benchmark, bt.B.4,
> on two nodes successfully using Lustre as the backend filesystem.

Checkpointing bt.B.4 to a Lustre filesystem works perfectly fine
for me, too; I never managed to make a checkpoint fail. I also tried
bt.D.16 on 4 nodes, with no problem either.

However, our home-grown CFD solver still refuses to be checkpointed to Lustre.

Thus, I looked at different communication patterns available in the
Pallas/Intel MPI benchmarks (IMB) ...

====================

mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast
=> fails similarly to my application, but works when checkpointing to NFS

        32768         1000        23.17        23.17        23.17
        65536          640        34.78        34.78        34.78
[0]: begin checkpoint...
[Rank 0][cr.c: line 721]cr_checkpoint failed
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1401
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 721]cr_checkpoint failed
MPI process (rank: 1) terminated unexpectedly on l1348
----------
mv2_checkpoint
  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
20630 uz  pts/0    mpirun_rsh   0.0  25268  14:48 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast bcast
Enter PID to checkpoint or Control-C to exit: 20630
Checkpointing PID 20630
Checkpoint cancelled by application: try again later
cr_checkpoint failed

====================

mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
=> fails somewhat later (i.e. the fsync and "Reactivate channels" messages appear), but works when checkpointing to NFS

       262144          160       102.47      2439.75
       524288           80       191.31      2613.62
[0]: begin checkpoint...
[0]: fsync...
[0]: Reactivate channels...
[CR_Callback] Checkpoint of a Process Failed
[Rank 1][cr.c: line 721]cr_checkpoint failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1348
----------
mv2_checkpoint
  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
20675 uz  pts/0    mpirun_rsh   0.0  25268  14:52 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
Enter PID to checkpoint or Control-C to exit: 20675
Checkpointing PID 20675
Checkpoint cancelled by application: try again later
cr_checkpoint failed

====================

mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
=> fails similarly to my application, but works when checkpointing to NFS

       524288           80       259.43       259.46       259.44
      1048576           40       505.32       505.37       505.35
[0]: begin checkpoint...
[Rank 1][cr.c: line 721]cr_checkpoint failed
[CR_Callback] Checkpoint of a Process Failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1348
[Rank 0][cr.c: line 721]cr_checkpoint failed
MPI process (rank: 0) terminated unexpectedly on l1401
----------
mv2_checkpoint
  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
20794 uz  pts/0    mpirun_rsh   0.0  25268  15:02 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
Enter PID to checkpoint or Control-C to exit: 20794
Checkpointing PID 20794
Checkpoint cancelled by application: try again later
cr_checkpoint failed

====================

Every IMB run I tried failed to checkpoint to the Lustre filesystem :-(

> If you still experience problems when doing CR with Lustre, then could
> you tell me what's the application you are running when encountering
> the CR problem?  Are you making any MPI_IO calls in that program?

My home-grown application can use MPI-IO, but at the time I try checkpointing
no MPI-IO is active, i.e. no file is open via MPI-IO calls.
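
To make that explicit: the solver's MPI-IO usage is essentially limited to
short routines like the following (a simplified sketch of my code's pattern;
function and file names are made up), and checkpoints are only ever requested
outside of them:

#include <mpi.h>

/* Simplified sketch of the solver's MPI-IO pattern (function and file
 * names are made up): MPI-IO is only active inside this routine, so no
 * file is open via MPI_File_open when a checkpoint is requested. */
void write_snapshot(double *field, int count, MPI_Offset offset)
{
    MPI_File fh;

    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, field, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);   /* closed again before returning to the solver loop */
}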

The IMB tests do not use any MPI-IO at all; but the chance of being inside an
MPI call when cr_checkpoint is called is almost 100%. Maybe that's the
difference to bt.B.4?
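
If that is indeed the trigger, even a trivial broadcast loop should show the
same failure; something along these lines (my own minimal sketch, not taken
from IMB) keeps both ranks inside MPI_Bcast virtually all of the time:

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of a reproducer (my own reduction, not part of IMB):
 * the ranks sit inside MPI_Bcast essentially all of the time, so an
 * external cr_checkpoint request almost always arrives mid-call. */
int main(int argc, char **argv)
{
    char buf[65536];
    long i;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 10000000; ++i) {     /* long enough to checkpoint into */
        MPI_Bcast(buf, (int) sizeof(buf), MPI_CHAR, 0, MPI_COMM_WORLD);
        if (rank == 0 && i % 100000 == 0) {
            printf("iteration %ld\n", i);
            fflush(stdout);
        }
    }

    MPI_Finalize();
    return 0;
}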

> And,  is it possible that you provide us core-dump / backtrace files
> about the failure?   That will help us investigate your case.

So far, I have not been successful in generating core-dumps.
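
One thing I still plan to try (just an idea, not verified yet) is raising the
core-file size limit from within the application itself, in case the soft
limit gets reset to zero on the remote nodes:

#include <stdio.h>
#include <sys/resource.h>

/* Idea (not verified yet): raise the soft core-file limit to the hard
 * limit early in main(), so that a crash during checkpointing can
 * actually leave a core dump behind on the remote nodes. */
static void enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;       /* soft limit up to the hard limit */
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            perror("setrlimit(RLIMIT_CORE)");
    }
}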

Generating backtraces was not really successful either (backtrace from an IMB bcast run follows):

(gdb) bt # of master
#0  0x00002b2afc205256 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00002b2afbe92ce7 in MPIDI_CH3I_CR_lock () at cr.c:400
#2  0x00002b2afbe6a637 in MPIDI_CH3I_Progress (is_blocking=-65343520, state=0x80) at ch3_progress.c:169
#3  0x00002b2afbeb015a in MPIC_Wait (request_ptr=0x2b2afc1aefe0) at helper_fns.c:512
#4  0x00002b2afbeafd23 in MPIC_Send (buf=0x2b2afc1aefe0, count=128, datatype=0, dest=-1, tag=-65127456, comm=1377023856) at helper_fns.c:40
#5  0x00002b2afbe5de7c in MPIR_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm_ptr=0x2b2afc1e3be0) at bcast_osu.c:336
#6  0x00002b2afbe5dad9 in PMPI_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm=-65127456) at bcast_osu.c:1174
#7  0x000000000040ade4 in IMB_bcast (c_info=0x2b2afc1aefe0, size=128, ITERATIONS=0x0, RUN_MODE=0xffffffffffffffff, time=0x2b2afc1e3be0) at IMB_bcast.c:157
#8  0x00000000004064a5 in IMB_init_buffers_iter (c_info=0x2b2afc1aefe0, ITERATIONS=0x80, Bmark=0x0, BMODE=0xffffffffffffffff, iter=-65127456, size=1377023856) at IMB_mem_manager.c:798
#9  0x0000000000402edf in main (argc=17, argv=0x7fff5213bd68) at IMB.c:262

On the slave node, gdb hangs while attaching to the process ...

====================

Next, I attached gdb to the two processes before issuing the mv2_checkpoint:

# master
(gdb) c
Continuing.
[Thread 0x41b7b940 (LWP 22755) exited]
[Thread 0x41c7c940 (LWP 22756) exited]
[Thread 0x41a7a940 (LWP 22754) exited]

Program received signal SIGBUS, Bus error.
[Switching to Thread 0x414d5940 (LWP 22752)]
0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
(gdb) bt
#0  0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
#1  0x00002b675fea3b3d in CR_Thread_loop () at cr.c:559
#2  0x00002b675fea37e8 in CR_Thread_entry (arg=0x0) at cr.c:813
#3  0x00002b676021173d in start_thread () from /lib64/libpthread.so.0
#4  0x00002b67613b6f6d in clone () from /lib64/libc.so.6
----------
# slave
(gdb) c
Continuing.
[Thread 0x40fb0940 (LWP 10321) exited]
[Thread 0x410b1940 (LWP 10322) exited]
[Thread 0x40eaf940 (LWP 10320) exited]

[Thread 0x409c8940 (LWP 10318) exited]
[Thread 0x40bc9940 (LWP 10319) exited]

Program exited with code 0377.
(gdb)
The program is not being run.

On the slave, gdb only becomes responsive again once the program has exited.

Probably not very helpful either.

> -Xiangyong Ouyang

Thanks for your help,

thomas
-- 
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28737, Fax: +49 9131 302941
thomas.zeiser at rrze.uni-erlangen.de
http://www.rrze.uni-erlangen.de/hpc/

