[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1
application to Lustre parallel filesystem
Thomas Zeiser
thomas.zeiser at rrze.uni-erlangen.de
Fri Jan 28 08:38:00 EST 2011
Hello Xiangyong,
On Fri, Jan 28, 2011 at 12:12:12AM -0500, xiangyong ouyang wrote:
> Hello Thomas,
>
> We re-ran the test with IMB, and we were able to checkpoint to Lustre
> successfully.
>
> First of all, please make sure you have the right permission in the
> Lustre filesystem, enough free space available in Lustre, not run out
> of quote, etc. Some users experienced problems with checkpoint which
> were caused by these filesystem issues.
Permissions cannot be the problem as the first few MBs of the
checkpoint always get written.
There are no quotas on the file system and creating a 100 GB file
with "dd" works fine:
uz at l1446: 13:47 [~] $ dd if=/dev/zero of=/lxfs/unrz/uz/chk-imb-45283.ladm1.dd bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 306.241 seconds, 342 MB/s
uz at l1446: 13:52 [~] $ ls -lh /lxfs/unrz/uz/chk-imb-45283.ladm1.dd
-rw------- 1 uz unrz 98G Jan 28 13:52 /lxfs/unrz/uz/chk-imb-45283.ladm1.dd
And of course also normal MPI-IO works fine.
> It would be helpful to collect more information about the failure you
> encountered. I have attached a small patch that will print some error
> messages when checkpoint fails. Can you apply this patch to your
> mvapich2? I'm assuming you are using MVAPICH2-1.5.1p1. Please re-run
> the checkpoint test and tell us the error print outs. Thanks!
It's error code -14: (full logs are in the attachment)
cr_checkpoint failed with return code=-14
> We have made some improvements since MVAPICH2-1.5.1p1. If possible
> can you try our latest MVAPICH2-1.6RC2 which is available at:
> http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.6rc2.tgz
Basically the same storry; the main differences are
- it seems to take slightly longer (i.e. the process stalls longer) until
the abort message comes (but that may be subjective)
- the way how STDOUT/STDERROR messages are generated must have
changed as the "[0]: begin checkpoint..." message does not appear
With your patch from 1.5.1p1 applied to 1.6rc2 I again get error
-14.
[Rank 0][cr.c: line 955]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1440
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
handle_mt_peer: fail to read...: Success
[Rank 1][cr.c: line 955]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1439
To tell the complete story, I have to admit that in today's tests I
had very few IMB runs which were successfully checkpointed to the
Lustre filesystem. However, I could not detect any systematics. The
good cases are also included in the attachment.
To me, it looks like some timeing race condition of some sort ...
Thanks for your help,
thomas
--
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28737, Fax: +49 9131 302941
thomas.zeiser at rrze.uni-erlangen.de
http://www.rrze.uni-erlangen.de/hpc/
-------------- next part --------------
CC=icc F77=ifort CXX=icpc. /configure --prefix=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.5.1p1-patched --enable-blcr --with-file-system=lustre
######################################################################
env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.5.1p1-patched/bin:$PATH mv2_checkpoint
PID USER TT COMMAND %CPU VSZ START CMD
1659 uz pts/0 mpirun_rsh 0.0 25268 13:39 mpirun_rsh -ssh -np 2 l1446 l1445 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-45283.ladm1 ./IMB-MPI1
Enter PID to checkpoint or Control-C to exit: 1659
Checkpointing PID 1659
Checkpoint cancelled by application: try again later
cr_checkpoint failed
====================
env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.5.1p1-patched/bin:$PATH mpirun_rsh -ssh -np 2 `uniq $PBS_NODEFILE` MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-$PBS_JOBID ./IMB-MPI1
*** output of several tests follows ***
#----------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.08 0.08 0.08
4 1000 0.97 0.97 0.97
8 1000 2.72 2.73 2.73
16 1000 2.69 2.69 2.69
32 1000 2.76 2.76 2.76
64 1000 2.78 2.79 2.79
128 1000 3.01 3.01 3.01
256 1000 4.02 4.03 4.02
512 1000 4.32 4.32 4.32
1024 1000 4.78 4.78 4.78
2048 1000 5.52 5.52 5.52
4096 1000 6.84 6.85 6.84
8192 1000 8.63 8.64 8.63
16384 1000 12.36 12.37 12.36
32768 1000 26.71 26.71 26.71
65536 640 40.20 40.21 40.20
131072 320 72.94 72.94 72.94
262144 160 135.31 135.31 135.31
[0]: begin checkpoint...
[0]: fsync...
[0]: Reactivate channels...
[0]: CR completed...
[1]: CR completed...
524288 80 272.74 272.77 272.76
1048576 40 445.50 445.80 445.65
2097152 20 777.59 778.04 777.82
4194304 10 4265.79 4266.21 4266.00
#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.29 0.29 0.29
4 1000 2.75 2.75 2.75
8 1000 2.71 2.71 2.71
16 1000 2.81 2.81 2.81
32 1000 2.82 2.82 2.82
64 1000 2.94 2.94 2.94
128 1000 4.04 4.04 4.04
256 1000 4.32 4.32 4.32
512 1000 4.80 4.80 4.80
1024 1000 5.13 5.14 5.14
2048 1000 6.32 6.33 6.32
4096 1000 8.28 8.29 8.29
8192 1000 10.70 10.70 10.70
16384 1000 14.79 14.80 14.80
32768 1000 39.59 39.59 39.59
65536 640 59.75 59.76 59.75
131072 320 102.02 102.02 102.02
262144 160 183.92 183.93 183.92
524288 80 338.54 338.55 338.54
1048576 40 896.40 900.67 898.54
[0]: begin checkpoint...
[Rank 0][cr.c: line 722]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1446
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 722]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1445
uz at l1446: 13:41 [~] $ ll /lxfs/unrz/uz/chk-imb-45283.ladm1*
-rw------- 1 uz unrz 8777728 Jan 28 13:39 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.0
-rw------- 1 uz unrz 9764864 Jan 28 13:39 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.1
#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.29 0.31 0.30
4 1000 2.81 2.81 2.81
8 1000 2.76 2.76 2.76
16 1000 2.82 2.82 2.82
32 1000 2.87 2.87 2.87
64 1000 2.97 2.97 2.97
128 1000 4.06 4.06 4.06
256 1000 4.33 4.33 4.33
512 1000 4.82 4.82 4.82
[0]: begin checkpoint...
[Rank 0][cr.c: line 722]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1446
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 722]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1445
uz at l1446: 13:46 [~] $ ll /lxfs/unrz/uz/chk-imb-45283.ladm1*
-rw------- 1 uz unrz 9543680 Jan 28 13:43 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.0
-rw------- 1 uz unrz 9768960 Jan 28 13:43 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.1
######################################################################
###
### MVAPICH2-1.6RC2
###
env CC=icc F77=ifort CXX=icpc ./configure --prefix=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.6rc2 --enable-blcr --with-file-system=lustre
====================
env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.6rc2/bin:$PATH mpirun_rsh -ssh -np 2 `uniq $PBS_NODEFILE` MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-$PBS_JOBID ./IMB-MPI1
*** output of several tests follows ***
#----------------------------------------------------------------
# Benchmarking Allgather
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.03 0.03 0.03
1 1000 2.54 2.55 2.55
2 1000 2.54 2.54 2.54
4 1000 2.57 2.57 2.57
8 1000 2.58 2.58 2.58
16 1000 2.53 2.54 2.54
32 1000 2.59 2.59 2.59
64 1000 2.65 2.65 2.65
128 1000 3.69 3.69 3.69
256 1000 3.86 3.87 3.87
512 1000 4.12 4.13 4.12
1024 1000 4.66 4.67 4.66
2048 1000 5.76 5.76 5.76
[0]: CR completed...
[1]: CR completed...
4096 1000 8.08 8.08 8.08
8192 1000 10.19 10.19 10.19
16384 1000 17.97 17.98 17.98
32768 1000 25.63 25.63 25.63
65536 640 40.58 40.58 40.58
131072 320 73.13 73.13 73.13
262144 160 138.05 138.05 138.05
524288 80 266.82 266.86 266.84
1048576 40 526.00 526.10 526.05
2097152 20 1036.64 1036.85 1036.75
4194304 10 2087.81 2087.88 2087.84
env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.6rc2/bin:$PATH mpirun_rsh -ssh -np 2 `uniq $PBS_NODEFILE` MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-$PBS_JOBID ./IMB-MPI1
#----------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.08 0.08 0.08
4 1000 0.96 0.97 0.97
8 1000 2.62 2.62 2.62
16 1000 2.62 2.63 2.62
32 1000 2.63 2.64 2.63
64 1000 2.71 2.72 2.72
128 1000 3.02 3.02 3.02
[CR_Callback] Checkpoint of a Process Failed
[Rank 1][cr.c: line 954]cr_checkpoint failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1445
handle_mt_peer: fail to read...: Success
[Rank 0][cr.c: line 954]cr_checkpoint failed
MPI process (rank: 0) terminated unexpectedly on l1446
#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.03 0.04 0.04
4 1000 2.87 2.88 2.88
8 1000 2.77 2.78 2.77
16 1000 2.73 2.73 2.73
32 1000 2.74 2.74 2.74
64 1000 2.83 2.83 2.83
128 1000 3.97 3.97 3.97
256 1000 4.18 4.18 4.18
512 1000 4.54 4.54 4.54
1024 1000 5.20 5.20 5.20
2048 1000 6.47 6.48 6.48
4096 1000 11.89 11.90 11.89
8192 1000 14.58 14.58 14.58
16384 1000 20.03 20.03 20.03
32768 1000 41.20 41.20 41.20
65536 640 66.66 66.66 66.66
131072 320 114.56 114.56 114.56
262144 160 188.93 188.94 188.93
524288 80 353.55 353.60 353.58
1048576 40 685.40 685.50 685.45
[CR_Callback] Checkpoint of a Process Failed
[Rank 1][cr.c: line 954]cr_checkpoint failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1445
handle_mt_peer: fail to read...: Success
[Rank 0][cr.c: line 954]cr_checkpoint failed
MPI process (rank: 0) terminated unexpectedly on l1446
*** now the additional debug patch from 1.5.1p1 applied to 1.6rc2 ***
#----------------------------------------------------------------
# Benchmarking Allgather
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.04 0.04 0.04
1 1000 2.61 2.61 2.61
2 1000 2.55 2.56 2.56
4 1000 2.59 2.59 2.59
8 1000 2.60 2.60 2.60
16 1000 2.57 2.57 2.57
32 1000 2.63 2.64 2.64
64 1000 2.67 2.67 2.67
128 1000 3.68 3.69 3.69
256 1000 3.86 3.86 3.86
512 1000 4.14 4.14 4.14
1024 1000 4.67 4.67 4.67
2048 1000 5.79 5.79 5.79
4096 1000 6.88 6.89 6.89
8192 1000 9.12 9.13 9.13
16384 1000 17.88 17.88 17.88
[0]: CR completed...
[1]: CR completed...
32768 1000 3010.72 3010.72 3010.72
65536 640 41.43 41.43 41.43
131072 320 73.52 73.52 73.52
262144 160 138.42 138.43 138.42
524288 80 268.99 269.00 268.99
1048576 40 527.78 527.82 527.80
2097152 20 1043.65 1043.75 1043.70
4194304 10 2146.89 2147.01 2146.95
#----------------------------------------------------------------
# Benchmarking Gatherv
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.07 0.09 0.08
1 1000 2.66 2.66 2.66
2 1000 2.52 2.52 2.52
4 1000 2.48 2.48 2.48
8 1000 2.47 2.47 2.47
16 1000 2.46 2.46 2.46
32 1000 2.51 2.51 2.51
64 1000 2.61 2.61 2.61
128 1000 3.57 3.57 3.57
256 1000 3.78 3.78 3.78
512 1000 4.02 4.02 4.02
1024 1000 4.51 4.52 4.51
2048 1000 5.50 5.50 5.50
4096 1000 6.56 6.57 6.56
8192 1000 8.52 8.52 8.52
16384 1000 15.24 15.24 15.24
32768 1000 20.99 20.99 20.99
65536 640 33.83 33.84 33.84
[Rank 0][cr.c: line 955]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1440
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
handle_mt_peer: fail to read...: Success
[Rank 1][cr.c: line 955]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1439
More information about the mvapich-discuss
mailing list