[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR

sunway qilu sunwaycn at gmail.com
Tue Oct 23 06:04:11 EDT 2007


I'm testing mvapich2 + BLCR, but the results are not correct.
Would you please help me?
Many thanks!

This is my environment:

OS: Linux Kernel 2.6.42
C/Fortran: Intel C/C++/Fortran 10.0.0.23
mvapich2: mvapich2-trunk-2007-10-22
BLCR: 0.6.1
Program: NPB-2.4

The following are my test steps:

1. $ mpdboot -n 3
2. $ cat cfg
cn22
cn22
cn22
cn22
cn23
cn23
cn23
cn23

3. Normal test: the result is good.

$ mpirun -machinefile ./cfg -np 8 ./lu.A.8


 NAS Parallel Benchmarks 2.4 -- LU Benchmark

 Size:  64x 64x 64
 Iterations: 250
 Number of processes:     8

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
           2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
           3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
           4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
           5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
 Comparison of RMS-norms of solution error
           1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
           2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
           3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
           4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
           5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
 Comparison of surface integral
               0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
 Verification Successful


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                    17.72
 Total processes =                        8
 Compiled procs  =                        8
 Mop/s total     =                  6733.74
 Mop/s/process   =                   841.72
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                      2.4
 Compile date    =              23 Oct 2007

 Compile options:
    MPIF77       = mpif90
    FLINK        = mpif90
    FMPI_LIB     = (none)
    FMPI_INC     = (none)
    FFLAGS       = -O3
    FLINKFLAGS   = (none)
    RAND         = (none)


 Please send the results of this run to:

 NPB Development Team
 Internet: npb at nas.nasa.gov

 If email is not available, send this to:

 MS T27A-1
 NASA Ames Research Center
 Moffett Field, CA  94035-1000

 Fax: 650-604-3957

4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8 then continues
(4.3), and the result is good.

 4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8


 NAS Parallel Benchmarks 2.4 -- LU Benchmark

 Size:  64x 64x 64
 Iterations: 250
 Number of processes:     8

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120

...
4.2 $ mv2_checkpoint

  PID USER     TT       COMMAND     %CPU   VSZ  START CMD
 7968 yangshj  pts/0    mpirun       0.0 14672  17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8

Enter PID to checkpoint or Control-C to exit: 7968
Checkpointing PID 7968
Checkpoint file: context.7968

4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8


 NAS Parallel Benchmarks 2.4 -- LU Benchmark

 Size:  64x 64x 64
 Iterations: 250
 Number of processes:     8

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
           2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
           3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
           4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
           5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
 Comparison of RMS-norms of solution error
           1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
           2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
           3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
           4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
           5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
 Comparison of surface integral
               0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
 Verification Successful


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                    18.78
 Total processes =                        8
 Compiled procs  =                        8
 Mop/s total     =                  6352.76
 Mop/s/process   =                   794.10
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                      2.4
 Compile date    =              23 Oct 2007

 Compile options:
    MPIF77       = mpif90
    FLINK        = mpif90
    FMPI_LIB     = (none)
    FMPI_INC     = (none)
    FFLAGS       = -O3
    FLINKFLAGS   = (none)
    RAND         = (none)


 Please send the results of this run to:

 NPB Development Team
 Internet: npb at nas.nasa.gov

 If email is not available, send this to:

 MS T27A-1
 NASA Ames Research Center
 Moffett Field, CA  94035-1000

 Fax: 650-604-3957

5. Restart from the checkpoint of PID 7968: the result then contains "NaN"
(5.1), and sometimes "FAILURE:" and "UNSUCCESSFUL" (5.2).

5.1 $ cr_restart context.7968
mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1   NaN                 0.7790210760669E+03 NaN
           2   NaN                 0.6340276525969E+02 NaN
           3   NaN                 0.1949924972729E+03 NaN
           4   NaN                 0.1784530116042E+03 NaN
           5   NaN                 0.1838476034946E+04 NaN
 Comparison of RMS-norms of solution error
           1   NaN                 0.2996408568547E+02 NaN
           2   NaN                 0.2819457636500E+01 NaN
           3   NaN                 0.7347341269877E+01 NaN
           4   NaN                 0.6713922568778E+01 NaN
           5   NaN                 0.7071531568839E+02 NaN
 Comparison of surface integral
               NaN                 0.2603092560489E+02 NaN
 Verification Successful


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                    66.11
 Total processes =                        8
 Compiled procs  =                        8
 Mop/s total     =                  1804.50
 Mop/s/process   =                   225.56
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                      2.4
 Compile date    =              23 Oct 2007

 Compile options:
    MPIF77       = mpif90
    FLINK        = mpif90
    FMPI_LIB     = (none)
    FMPI_INC     = (none)
    FFLAGS       = -O3
    FLINKFLAGS   = (none)
    RAND         = (none)


 Please send the results of this run to:

 NPB Development Team
 Internet: npb at nas.nasa.gov

 If email is not available, send this to:

 MS T27A-1
 NASA Ames Research Center
 Moffett Field, CA  94035-1000

 Fax: 650-604-3957
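
Note that the run in 5.1 still prints "Verification Successful" even though every computed value is NaN. A likely explanation (an assumption, I have not checked the NPB source) is that the check is of the form "difference > epsilon", which is always false for NaN under IEEE 754. Below is a minimal Python sketch, not part of NPB or mvapich2, that scans saved benchmark output for NaN rows instead of trusting the summary line. The function name `nan_rows` is hypothetical, and the sample text is assembled from lines of the logs in this post.

```python
import math

def nan_rows(npb_output):
    """Return the output lines that contain at least one NaN field."""
    flagged = []
    for line in npb_output.splitlines():
        for field in line.split():
            # Skip words that are not numbers; float("NaN") parses in Python.
            try:
                value = float(field)
            except ValueError:
                continue
            if math.isnan(value):
                flagged.append(line.strip())
                break
    return flagged

# Sample assembled from the logs above (one NaN row, one good row).
sample = """\
 Comparison of RMS-norms of residual
           1   NaN                 0.7790210760669E+03 NaN
           2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
 Verification Successful
"""

epsilon = 1.0e-08
# Any ordered comparison with NaN is false, so a "difference > epsilon"
# test (assumed form of the benchmark's check) cannot flag a NaN row.
assert not (float("nan") > epsilon)

for row in nan_rows(sample):
    print("NaN in verification row:", row)
```

Running this over the saved output of each restart would catch the 5.1 case, which the "Verification = SUCCESSFUL" line alone does not.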

5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8


 NAS Parallel Benchmarks 2.4 -- LU Benchmark

 Size:  64x 64x 64
 Iterations: 250
 Number of processes:     8

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
 FAILURE:  1   0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
 FAILURE:  2   0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
 FAILURE:  3   0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
 FAILURE:  4   0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
 FAILURE:  5   0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
 Comparison of RMS-norms of solution error
 FAILURE:  1   0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
 FAILURE:  2   0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
 FAILURE:  3   0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
 FAILURE:  4   0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
 FAILURE:  5   0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
 Comparison of surface integral
 FAILURE:      0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
 Verification failed


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                    17.15
 Total processes =                        8
 Compiled procs  =                        8
 Mop/s total     =                  6956.73
 Mop/s/process   =                   869.59
 Operation type  =           floating point
 Verification    =             UNSUCCESSFUL
 Version         =                      2.4
 Compile date    =              22 Oct 2007

 Compile options:
    MPIF77       = mpif90
    FLINK        = mpif90
    FMPI_LIB     = (none)
    FMPI_INC     = (none)
    FFLAGS       = -O3
    FLINKFLAGS   = (none)
    RAND         = (none)


 Please send the results of this run to:

 NPB Development Team
 Internet: npb at nas.nasa.gov

 If email is not available, send this to:

 MS T27A-1
 NASA Ames Research Center
 Moffett Field, CA  94035-1000

 Fax: 650-604-3957
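
To show how far off the 5.2 values are, here is a small Python sketch that recomputes the third column of a few FAILURE rows. It assumes that column is the relative difference abs((computed - reference) / reference); this formula is inferred from the printed numbers, not taken from the NPB source. The helper name `relative_difference` is hypothetical, and the numbers are copied from the 5.2 output above.

```python
def relative_difference(computed, reference):
    """Assumed form of the benchmark's third column."""
    return abs((computed - reference) / reference)

# "Accuracy setting for epsilon" as printed by the benchmark.
EPSILON = 0.1000000000000E-07

rows = [
    # (label, computed, reference, difference printed by the benchmark)
    ("residual 1", 0.7790355334612E+03, 0.7790210760669E+03, 0.1855841227478E-04),
    ("solution error 1", 0.2996451081467E+02, 0.2996408568547E+02, 0.1418795824413E-04),
    ("surface integral", 0.2603109553197E+02, 0.2603092560489E+02, 0.6527892352571E-05),
]

for label, computed, reference, printed in rows:
    diff = relative_difference(computed, reference)
    print(f"{label}: recomputed {diff:.6e}, printed {printed:.6e}, "
          f"{diff / EPSILON:.0f}x epsilon")
```

The recomputed differences agree with the printed ones and are roughly three orders of magnitude above epsilon, so this is a real change in the computed solution and not a rounding effect.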