[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Fri Jan 28 08:38:00 EST 2011


Hello Xiangyong,

On Fri, Jan 28, 2011 at 12:12:12AM -0500, xiangyong ouyang wrote:
> Hello Thomas,
> 
> We re-ran the test with IMB, and we were able to checkpoint to Lustre
> successfully.
> 
> First of all, please make sure you have the right permission in the
> Lustre filesystem, enough free space available in Lustre,  not run out
> of quote, etc.  Some users experienced problems with checkpoint which
> were caused by these filesystem issues.

Permissions cannot be the problem as the first few MBs of the
checkpoint always get written.

There are no quotas on the file system and creating a 100 GB file
with "dd" works fine:

uz at l1446: 13:47 [~] $ dd if=/dev/zero of=/lxfs/unrz/uz/chk-imb-45283.ladm1.dd bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 306.241 seconds, 342 MB/s

uz at l1446: 13:52 [~] $ ls -lh /lxfs/unrz/uz/chk-imb-45283.ladm1.dd
-rw------- 1 uz unrz 98G Jan 28 13:52 /lxfs/unrz/uz/chk-imb-45283.ladm1.dd

And of course also normal MPI-IO works fine.

> It would be helpful to collect more information about the failure you
> encountered.  I have attached a small patch that will print some error
> messages when checkpoint fails.  Can you apply this patch to your
> mvapich2?  I'm assuming you are using MVAPICH2-1.5.1p1.  Please re-run
> the checkpoint test and tell us the error print outs. Thanks!

It's error code -14: (full logs are in the attachment)

cr_checkpoint failed with return code=-14

> We have made some improvements since MVAPICH2-1.5.1p1.   If possible
> can you try our latest MVAPICH2-1.6RC2 which is available at:
> http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.6rc2.tgz

Basically the same storry; the main differences are
- it seems to take slightly longer (i.e. the process stalls longer) until
  the abort message comes (but that may be subjective)
- the way how STDOUT/STDERROR messages are generated must have
  changed as the "[0]: begin checkpoint..." message does not appear

With your patch from 1.5.1p1 applied to 1.6rc2 I again get error
-14.

[Rank 0][cr.c: line 955]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1440
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
handle_mt_peer: fail to read...: Success
[Rank 1][cr.c: line 955]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1439


To tell the complete story, I have to admit that in today's tests I
had very few IMB runs which were successfully checkpointed to the
Lustre filesystem. However, I could not detect any systematics. The
good cases are also included in the attachment.
To me, it looks like some timeing race condition of some sort ...



Thanks for your help,

thomas
-- 
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28737, Fax: +49 9131 302941
thomas.zeiser at rrze.uni-erlangen.de
http://www.rrze.uni-erlangen.de/hpc/
-------------- next part --------------
CC=icc F77=ifort CXX=icpc. /configure --prefix=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.5.1p1-patched --enable-blcr --with-file-system=lustre

######################################################################

env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.5.1p1-patched/bin:$PATH mv2_checkpoint

  PID USER     TT       COMMAND     %CPU    VSZ  START CMD
 1659 uz  pts/0    mpirun_rsh   0.0  25268  13:39 mpirun_rsh -ssh -np 2 l1446 l1445 MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-45283.ladm1 ./IMB-MPI1

Enter PID to checkpoint or Control-C to exit: 1659
Checkpointing PID 1659
Checkpoint cancelled by application: try again later
cr_checkpoint failed

====================

env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.5.1p1-patched/bin:$PATH mpirun_rsh -ssh -np 2 `uniq $PBS_NODEFILE` MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-$PBS_JOBID ./IMB-MPI1

*** output of several tests follows ***

#----------------------------------------------------------------
# Benchmarking Reduce_scatter 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.08         0.08         0.08
            4         1000         0.97         0.97         0.97
            8         1000         2.72         2.73         2.73
           16         1000         2.69         2.69         2.69
           32         1000         2.76         2.76         2.76
           64         1000         2.78         2.79         2.79
          128         1000         3.01         3.01         3.01
          256         1000         4.02         4.03         4.02
          512         1000         4.32         4.32         4.32
         1024         1000         4.78         4.78         4.78
         2048         1000         5.52         5.52         5.52
         4096         1000         6.84         6.85         6.84
         8192         1000         8.63         8.64         8.63
        16384         1000        12.36        12.37        12.36
        32768         1000        26.71        26.71        26.71
        65536          640        40.20        40.21        40.20
       131072          320        72.94        72.94        72.94
       262144          160       135.31       135.31       135.31
[0]: begin checkpoint...
[0]: fsync...
[0]: Reactivate channels...
[0]:  CR completed...
[1]:  CR completed...
       524288           80       272.74       272.77       272.76
      1048576           40       445.50       445.80       445.65
      2097152           20       777.59       778.04       777.82
      4194304           10      4265.79      4266.21      4266.00


#----------------------------------------------------------------
# Benchmarking Reduce 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.29         0.29         0.29
            4         1000         2.75         2.75         2.75
            8         1000         2.71         2.71         2.71
           16         1000         2.81         2.81         2.81
           32         1000         2.82         2.82         2.82
           64         1000         2.94         2.94         2.94
          128         1000         4.04         4.04         4.04
          256         1000         4.32         4.32         4.32
          512         1000         4.80         4.80         4.80
         1024         1000         5.13         5.14         5.14
         2048         1000         6.32         6.33         6.32
         4096         1000         8.28         8.29         8.29
         8192         1000        10.70        10.70        10.70
        16384         1000        14.79        14.80        14.80
        32768         1000        39.59        39.59        39.59
        65536          640        59.75        59.76        59.75
       131072          320       102.02       102.02       102.02
       262144          160       183.92       183.93       183.92
       524288           80       338.54       338.55       338.54
      1048576           40       896.40       900.67       898.54
[0]: begin checkpoint...
[Rank 0][cr.c: line 722]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1446
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 722]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1445


uz at l1446: 13:41 [~] $ ll /lxfs/unrz/uz/chk-imb-45283.ladm1*
-rw------- 1 uz unrz 8777728 Jan 28 13:39 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.0
-rw------- 1 uz unrz 9764864 Jan 28 13:39 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.1


#----------------------------------------------------------------
# Benchmarking Reduce 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.29         0.31         0.30
            4         1000         2.81         2.81         2.81
            8         1000         2.76         2.76         2.76
           16         1000         2.82         2.82         2.82
           32         1000         2.87         2.87         2.87
           64         1000         2.97         2.97         2.97
          128         1000         4.06         4.06         4.06
          256         1000         4.33         4.33         4.33
          512         1000         4.82         4.82         4.82
[0]: begin checkpoint...
[Rank 0][cr.c: line 722]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1446
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 722]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1445


uz at l1446: 13:46 [~] $ ll /lxfs/unrz/uz/chk-imb-45283.ladm1*
-rw------- 1 uz unrz 9543680 Jan 28 13:43 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.0
-rw------- 1 uz unrz 9768960 Jan 28 13:43 /lxfs/unrz/uz/chk-imb-45283.ladm1.1.1


######################################################################

###
### MVAPICH2-1.6RC2
###

env CC=icc F77=ifort CXX=icpc ./configure --prefix=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.6rc2 --enable-blcr --with-file-system=lustre 

====================

env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.6rc2/bin:$PATH mpirun_rsh -ssh -np 2 `uniq $PBS_NODEFILE` MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-$PBS_JOBID ./IMB-MPI1

*** output of several tests follows ***

#----------------------------------------------------------------
# Benchmarking Allgather 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.03         0.03
            1         1000         2.54         2.55         2.55
            2         1000         2.54         2.54         2.54
            4         1000         2.57         2.57         2.57
            8         1000         2.58         2.58         2.58
           16         1000         2.53         2.54         2.54
           32         1000         2.59         2.59         2.59
           64         1000         2.65         2.65         2.65
          128         1000         3.69         3.69         3.69
          256         1000         3.86         3.87         3.87
          512         1000         4.12         4.13         4.12
         1024         1000         4.66         4.67         4.66
         2048         1000         5.76         5.76         5.76
[0]:  CR completed...
[1]:  CR completed...
         4096         1000         8.08         8.08         8.08
         8192         1000        10.19        10.19        10.19
        16384         1000        17.97        17.98        17.98
        32768         1000        25.63        25.63        25.63
        65536          640        40.58        40.58        40.58
       131072          320        73.13        73.13        73.13
       262144          160       138.05       138.05       138.05
       524288           80       266.82       266.86       266.84
      1048576           40       526.00       526.10       526.05
      2097152           20      1036.64      1036.85      1036.75
      4194304           10      2087.81      2087.88      2087.84


 env PATH=/home/cluster64/unrz/uz/xx-inst-mvapich2-1.6rc2/bin:$PATH mpirun_rsh -ssh -np 2 `uniq $PBS_NODEFILE` MV2_CKPT_FILE=/lxfs/unrz/uz/chk-imb-$PBS_JOBID ./IMB-MPI1

#----------------------------------------------------------------
# Benchmarking Reduce_scatter 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.08         0.08         0.08
            4         1000         0.96         0.97         0.97
            8         1000         2.62         2.62         2.62
           16         1000         2.62         2.63         2.62
           32         1000         2.63         2.64         2.63
           64         1000         2.71         2.72         2.72
          128         1000         3.02         3.02         3.02

[CR_Callback] Checkpoint of a Process Failed
[Rank 1][cr.c: line 954]cr_checkpoint failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1445
handle_mt_peer: fail to read...: Success
[Rank 0][cr.c: line 954]cr_checkpoint failed
MPI process (rank: 0) terminated unexpectedly on l1446


#----------------------------------------------------------------
# Benchmarking Allreduce 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.04         0.04
            4         1000         2.87         2.88         2.88
            8         1000         2.77         2.78         2.77
           16         1000         2.73         2.73         2.73
           32         1000         2.74         2.74         2.74
           64         1000         2.83         2.83         2.83
          128         1000         3.97         3.97         3.97
          256         1000         4.18         4.18         4.18
          512         1000         4.54         4.54         4.54
         1024         1000         5.20         5.20         5.20
         2048         1000         6.47         6.48         6.48
         4096         1000        11.89        11.90        11.89
         8192         1000        14.58        14.58        14.58
        16384         1000        20.03        20.03        20.03
        32768         1000        41.20        41.20        41.20
        65536          640        66.66        66.66        66.66
       131072          320       114.56       114.56       114.56
       262144          160       188.93       188.94       188.93
       524288           80       353.55       353.60       353.58
      1048576           40       685.40       685.50       685.45
[CR_Callback] Checkpoint of a Process Failed
[Rank 1][cr.c: line 954]cr_checkpoint failed
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
MPI process (rank: 1) terminated unexpectedly on l1445
handle_mt_peer: fail to read...: Success
[Rank 0][cr.c: line 954]cr_checkpoint failed
MPI process (rank: 0) terminated unexpectedly on l1446


*** now the additional debug patch from 1.5.1p1 applied to 1.6rc2 ***


#----------------------------------------------------------------
# Benchmarking Allgather 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000         2.61         2.61         2.61
            2         1000         2.55         2.56         2.56
            4         1000         2.59         2.59         2.59
            8         1000         2.60         2.60         2.60
           16         1000         2.57         2.57         2.57
           32         1000         2.63         2.64         2.64
           64         1000         2.67         2.67         2.67
          128         1000         3.68         3.69         3.69
          256         1000         3.86         3.86         3.86
          512         1000         4.14         4.14         4.14
         1024         1000         4.67         4.67         4.67
         2048         1000         5.79         5.79         5.79
         4096         1000         6.88         6.89         6.89
         8192         1000         9.12         9.13         9.13
        16384         1000        17.88        17.88        17.88
[0]:  CR completed...
[1]:  CR completed...
        32768         1000      3010.72      3010.72      3010.72
        65536          640        41.43        41.43        41.43
       131072          320        73.52        73.52        73.52
       262144          160       138.42       138.43       138.42
       524288           80       268.99       269.00       268.99
      1048576           40       527.78       527.82       527.80
      2097152           20      1043.65      1043.75      1043.70
      4194304           10      2146.89      2147.01      2146.95


#----------------------------------------------------------------
# Benchmarking Gatherv 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.07         0.09         0.08
            1         1000         2.66         2.66         2.66
            2         1000         2.52         2.52         2.52
            4         1000         2.48         2.48         2.48
            8         1000         2.47         2.47         2.47
           16         1000         2.46         2.46         2.46
           32         1000         2.51         2.51         2.51
           64         1000         2.61         2.61         2.61
          128         1000         3.57         3.57         3.57
          256         1000         3.78         3.78         3.78
          512         1000         4.02         4.02         4.02
         1024         1000         4.51         4.52         4.51
         2048         1000         5.50         5.50         5.50
         4096         1000         6.56         6.57         6.56
         8192         1000         8.52         8.52         8.52
        16384         1000        15.24        15.24        15.24
        32768         1000        20.99        20.99        20.99
        65536          640        33.83        33.84        33.84
[Rank 0][cr.c: line 955]cr_checkpoint failed with return code=-14
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1440
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
handle_mt_peer: fail to read...: Success
[Rank 1][cr.c: line 955]cr_checkpoint failed with return code=-14
MPI process (rank: 1) terminated unexpectedly on l1439


More information about the mvapich-discuss mailing list