[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1
application to Lustre parallel filesystem
Thomas Zeiser
thomas.zeiser at rrze.uni-erlangen.de
Sun Jan 23 09:42:32 EST 2011
Hello Xiangyong,
On Fri, Jan 21, 2011 at 04:12:03PM -0500, xiangyong ouyang wrote:
> Hello Thomas,
>
> As a follow-up to my previous reply, we have built MVAPICH2-1.5.1p1
> using icc. We are able to checkpoint/restart a NPB benchmark bt.B.4
> on two nodes successfully using Lustre as the backend filesystem.
Checkpointing bt.B.4 to a Lustre filesystem works perfectly fine
for me, too - the checkpoint never failed. I also tried
bt.D.16 on 4 nodes; no problem either.
However, our home-grown CFD solver still refuses to be checkpointed to Lustre.
Thus, I looked at different communication patterns available in the
Pallas/Intel MPI benchmarks (IMB) ...
====================
mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast
=> fails similarly to my application; checkpointing to NFS is o.k., though
32768 1000 23.17 23.17 23.17
65536 640 34.78 34.78 34.78
[0]: begin checkpoint...
[Rank 0][cr.c: line 721]cr_checkpoint failed
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1401
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 721]cr_checkpoint failed
MPI process (rank: 1) terminated unexpectedly on l1348
----------
mv2_checkpoint
PID USER TT COMMAND %CPU VSZ START CMD
20630 uz pts/0 mpirun_rsh 0.0 25268 14:48 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-bcast ./IMB-MPI1 bcast bcast bcast bcast bcast bcast bcast bcast
Enter PID to checkpoint or Control-C to exit: 20630
Checkpointing PID 20630
Checkpoint cancelled by application: try again later
cr_checkpoint failed
====================
mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
=> fails somewhat later (i.e., "fsync" and "Reactivate channels" are printed); checkpointing to NFS is o.k., though
262144 160 102.47 2439.75
524288 80 191.31 2613.62
[0]: begin checkpoint...
[0]: fsync...
[0]: Reactivate channels...
[CR_Callback] Checkpoint of a Process Failed
[Rank 1][cr.c: line 721]cr_checkpoint failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1348
----------
mv2_checkpoint
PID USER TT COMMAND %CPU VSZ START CMD
20675 uz pts/0 mpirun_rsh 0.0 25268 14:52 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-pingpong ./IMB-MPI1 pingpong pingpong pingpong pingpong pingpong
Enter PID to checkpoint or Control-C to exit: 20675
Checkpointing PID 20675
Checkpoint cancelled by application: try again later
cr_checkpoint failed
====================
mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
=> fails similarly to my application; checkpointing to NFS is o.k., though
524288 80 259.43 259.46 259.44
1048576 40 505.32 505.37 505.35
[0]: begin checkpoint...
[Rank 1][cr.c: line 721]cr_checkpoint failed
[CR_Callback] Checkpoint of a Process Failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1348
[Rank 0][cr.c: line 721]cr_checkpoint failed
MPI process (rank: 0) terminated unexpectedly on l1401
----------
mv2_checkpoint
PID USER TT COMMAND %CPU VSZ START CMD
20794 uz pts/0 mpirun_rsh 0.0 25268 15:02 mpirun_rsh -ssh -np 2 l1401 l1348 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-alltoall ./IMB-MPI1 alltoall alltoall alltoall alltoall alltoall
Enter PID to checkpoint or Control-C to exit: 20794
Checkpointing PID 20794
Checkpoint cancelled by application: try again later
cr_checkpoint failed
====================
There was not a single IMB run that did not fail when checkpointing to the
Lustre filesystem :-(
> If you still experience problems when doing CR with Lustre, then could
> you tell me what's the application you are running when encountering
> the CR problem? Are you making any MPI_IO calls in that program?
My home-grown application can use MPI-IO, but at the time I try checkpointing,
no MPI-IO is active, i.e., no file is open through MPI-IO calls.
The IMB tests do not use any MPI-IO at all; however, the chance of being inside
an MPI call at the moment cr_checkpoint is called is almost 100%. Maybe that is
the difference from bt.B.4?
> And, is it possible that you provide us core-dump / backtrace files
> about the failure? That will help us investigate your case.
So far, I have not been successful in generating core-dumps.
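In case it helps for the next attempt: on Linux, the usual reason no cores appear is the default core-size limit of 0. A minimal sketch of what one could try before launching - assuming a bash-like shell and that mpirun_rsh and the local rank inherit the limit from the launching shell:

```shell
#!/bin/sh
# Raise the core-file size limit in the shell that starts the job;
# mpirun_rsh and the local MPI rank inherit it from this shell.
ulimit -c unlimited

# Confirm the soft limit actually changed.
ulimit -c
```

Since mpirun_rsh starts the remote ranks via ssh, the remote shells get a fresh environment; the ulimit line would probably have to go into the remote shell startup files (or the cluster prologue) as well before the slave node dumps core.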
Generating backtraces was not much more successful either (backtrace of the IMB bcast run follows):
(gdb) bt # on the master
#0 0x00002b2afc205256 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1 0x00002b2afbe92ce7 in MPIDI_CH3I_CR_lock () at cr.c:400
#2 0x00002b2afbe6a637 in MPIDI_CH3I_Progress (is_blocking=-65343520, state=0x80) at ch3_progress.c:169
#3 0x00002b2afbeb015a in MPIC_Wait (request_ptr=0x2b2afc1aefe0) at helper_fns.c:512
#4 0x00002b2afbeafd23 in MPIC_Send (buf=0x2b2afc1aefe0, count=128, datatype=0, dest=-1, tag=-65127456, comm=1377023856) at helper_fns.c:40
#5 0x00002b2afbe5de7c in MPIR_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm_ptr=0x2b2afc1e3be0) at bcast_osu.c:336
#6 0x00002b2afbe5dad9 in PMPI_Bcast (buffer=0x2b2afc1aefe0, count=128, datatype=0, root=-1, comm=-65127456) at bcast_osu.c:1174
#7 0x000000000040ade4 in IMB_bcast (c_info=0x2b2afc1aefe0, size=128, ITERATIONS=0x0, RUN_MODE=0xffffffffffffffff, time=0x2b2afc1e3be0) at IMB_bcast.c:157
#8 0x00000000004064a5 in IMB_init_buffers_iter (c_info=0x2b2afc1aefe0, ITERATIONS=0x80, Bmark=0x0, BMODE=0xffffffffffffffff, iter=-65127456, size=1377023856) at IMB_mem_manager.c:798
#9 0x0000000000402edf in main (argc=17, argv=0x7fff5213bd68) at IMB.c:262
On the slave node, gdb hangs while attaching to the process ...
====================
Next, I attached gdb to the two processes before issuing the mv2_checkpoint:
# master
(gdb) c
Continuing.
[Thread 0x41b7b940 (LWP 22755) exited]
[Thread 0x41c7c940 (LWP 22756) exited]
[Thread 0x41a7a940 (LWP 22754) exited]
Program received signal SIGBUS, Bus error.
[Switching to Thread 0x414d5940 (LWP 22752)]
0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
(gdb) bt
#0 0x00002b676042b2b1 in cr_poll_checkpoint_msg () from /usr/lib64/libcr.so.0
#1 0x00002b675fea3b3d in CR_Thread_loop () at cr.c:559
#2 0x00002b675fea37e8 in CR_Thread_entry (arg=0x0) at cr.c:813
#3 0x00002b676021173d in start_thread () from /lib64/libpthread.so.0
#4 0x00002b67613b6f6d in clone () from /lib64/libc.so.6
----------
# slave
(gdb) c
Continuing.
[Thread 0x40fb0940 (LWP 10321) exited]
[Thread 0x410b1940 (LWP 10322) exited]
[Thread 0x40eaf940 (LWP 10320) exited]
[Thread 0x409c8940 (LWP 10318) exited]
[Thread 0x40bc9940 (LWP 10319) exited]
Program exited with code 0377.
(gdb)
The program is not being run.
On the slave, gdb only becomes responsive again once the program has exited.
Probably not helpful either.
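Since interactive attaching hangs on the slave, a non-interactive batch attach might still grab a snapshot before gdb blocks; a sketch, where the PID is a hypothetical rank PID taken from ps on the slave node:

```shell
#!/bin/sh
# Hypothetical PID of the MPI rank on the slave node (from ps/top).
PID=10317

# Attach, dump all thread backtraces, and exit without a prompt:
# -batch makes gdb quit after running the -ex commands instead of
# waiting for interactive input.
gdb -p "$PID" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt'
```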
> -Xiangyong Ouyang
Thanks for your help,
thomas
--
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28737, Fax: +49 9131 302941
thomas.zeiser at rrze.uni-erlangen.de
http://www.rrze.uni-erlangen.de/hpc/