[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR

wei huang huanwei at cse.ohio-state.edu
Wed Oct 24 13:34:23 EDT 2007


Hi,

Thanks for your detailed note. We are looking at it and will get back to
you as soon as we find anything.

Also, could you please tell us a bit more about your computing platform,
such as CPU, memory size, etc.? BTW, do you mean kernel 2.6.22?

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Tue, 23 Oct 2007, sunway qilu wrote:

> I'm testing mvapich2 + BLCR, but the results are not correct.
> Would you please help me?
> Many thanks!
>
> This is my env:
>
> OS : Linux Kernel 2.6.42
> C/Fortran :  Intel C/C++/Fortran 10.0.0.23
> mvapich2 :  mvapich2-trunk-2007-10-22
> BLCR : 0.6.1
> Program: NPB-2.4
>
> These are my test steps:
>
> 1. $ mpdboot -n 3
> 2. $ cat cfg
> cn22
> cn22
> cn22
> cn22
> cn23
> cn23
> cn23
> cn23
>
> 3. Normal test; the result is good.
>
> $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
>  NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
>  Size:  64x 64x 64
>  Iterations: 250
>  Number of processes:     8
>
>  Time step    1
>  Time step   20
>  Time step   40
>  Time step   60
>  Time step   80
>  Time step  100
>  Time step  120
>  Time step  140
>  Time step  160
>  Time step  180
>  Time step  200
>  Time step  220
>  Time step  240
>  Time step  250
>
>  Verification being performed for class A
>  Accuracy setting for epsilon =  0.1000000000000E-07
>  Comparison of RMS-norms of residual
>            1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
>            2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
>            3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
>            4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
>            5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
>  Comparison of RMS-norms of solution error
>            1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
>            2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
>            3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
>            4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
>            5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
>  Comparison of surface integral
>                0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
>  Verification Successful
>
>
>  LU Benchmark Completed.
>  Class           =                        A
>  Size            =             64x  64x  64
>  Iterations      =                      250
>  Time in seconds =                    17.72
>  Total processes =                        8
>  Compiled procs  =                        8
>  Mop/s total     =                  6733.74
>  Mop/s/process   =                   841.72
>  Operation type  =           floating point
>  Verification    =               SUCCESSFUL
>  Version         =                      2.4
>  Compile date    =              23 Oct 2007
>
>  Compile options:
>     MPIF77       = mpif90
>     FLINK        = mpif90
>     FMPI_LIB     = (none)
>     FMPI_INC     = (none)
>     FFLAGS       = -O3
>     FLINKFLAGS   = (none)
>     RAND         = (none)
>
>
>  Please send the results of this run to:
>
>  NPB Development Team
>  Internet: npb at nas.nasa.gov
>
>  If email is not available, send this to:
>
>  MS T27A-1
>  NASA Ames Research Center
>  Moffett Field, CA  94035-1000
>
>  Fax: 650-604-3957
>
> 4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8 then
> continues (4.3), and the result is good.
>
>  4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
>  NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
>  Size:  64x 64x 64
>  Iterations: 250
>  Number of processes:     8
>
>  Time step    1
>  Time step   20
>  Time step   40
>  Time step   60
>  Time step   80
>  Time step  100
>  Time step  120
>
> ...
> 4.2 $ mv2_checkpoint
>
>   PID USER     TT       COMMAND     %CPU   VSZ  START CMD
>  7968 yangshj  pts/0    mpirun       0.0 14672  17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
> Enter PID to checkpoint or Control-C to exit: 7968
> Checkpointing PID 7968
> Checkpoint file: context.7968
>
> 4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
>  NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
>  Size:  64x 64x 64
>  Iterations: 250
>  Number of processes:     8
>
>  Time step    1
>  Time step   20
>  Time step   40
>  Time step   60
>  Time step   80
>  Time step  100
>  Time step  120
>  Time step  140
>  Time step  160
>  Time step  180
>  Time step  200
>  Time step  220
>  Time step  240
>  Time step  250
>
>  Verification being performed for class A
>  Accuracy setting for epsilon =  0.1000000000000E-07
>  Comparison of RMS-norms of residual
>            1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
>            2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
>            3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
>            4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
>            5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
>  Comparison of RMS-norms of solution error
>            1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
>            2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
>            3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
>            4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
>            5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
>  Comparison of surface integral
>                0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
>  Verification Successful
>
>
>  LU Benchmark Completed.
>  Class           =                        A
>  Size            =             64x  64x  64
>  Iterations      =                      250
>  Time in seconds =                    18.78
>  Total processes =                        8
>  Compiled procs  =                        8
>  Mop/s total     =                  6352.76
>  Mop/s/process   =                   794.10
>  Operation type  =           floating point
>  Verification    =               SUCCESSFUL
>  Version         =                      2.4
>  Compile date    =              23 Oct 2007
>
>  Compile options:
>     MPIF77       = mpif90
>     FLINK        = mpif90
>     FMPI_LIB     = (none)
>     FMPI_INC     = (none)
>     FFLAGS       = -O3
>     FLINKFLAGS   = (none)
>     RAND         = (none)
>
>
>  Please send the results of this run to:
>
>  NPB Development Team
>  Internet: npb at nas.nasa.gov
>
>  If email is not available, send this to:
>
>  MS T27A-1
>  NASA Ames Research Center
>  Moffett Field, CA  94035-1000
>
>  Fax: 650-604-3957
>
> 5. Restart from the checkpoint of PID 7968. The result then contains
> "NaN" (5.1), and sometimes "FAILURE:" and "UNSUCCESSFUL" (5.2).
>
> 5.1 $ cr_restart context.7968
> mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
>  Time step  120
>  Time step  140
>  Time step  160
>  Time step  180
>  Time step  200
>  Time step  220
>  Time step  240
>  Time step  250
>
>  Verification being performed for class A
>  Accuracy setting for epsilon =  0.1000000000000E-07
>  Comparison of RMS-norms of residual
>            1   NaN                 0.7790210760669E+03 NaN
>            2   NaN                 0.6340276525969E+02 NaN
>            3   NaN                 0.1949924972729E+03 NaN
>            4   NaN                 0.1784530116042E+03 NaN
>            5   NaN                 0.1838476034946E+04 NaN
>  Comparison of RMS-norms of solution error
>            1   NaN                 0.2996408568547E+02 NaN
>            2   NaN                 0.2819457636500E+01 NaN
>            3   NaN                 0.7347341269877E+01 NaN
>            4   NaN                 0.6713922568778E+01 NaN
>            5   NaN                 0.7071531568839E+02 NaN
>  Comparison of surface integral
>                NaN                 0.2603092560489E+02 NaN
>  Verification Successful
>
>
>  LU Benchmark Completed.
>  Class           =                        A
>  Size            =             64x  64x  64
>  Iterations      =                      250
>  Time in seconds =                    66.11
>  Total processes =                        8
>  Compiled procs  =                        8
>  Mop/s total     =                  1804.50
>  Mop/s/process   =                   225.56
>  Operation type  =           floating point
>  Verification    =               SUCCESSFUL
>  Version         =                      2.4
>  Compile date    =              23 Oct 2007
>
>  Compile options:
>     MPIF77       = mpif90
>     FLINK        = mpif90
>     FMPI_LIB     = (none)
>     FMPI_INC     = (none)
>     FFLAGS       = -O3
>     FLINKFLAGS   = (none)
>     RAND         = (none)
>
>
>  Please send the results of this run to:
>
>  NPB Development Team
>  Internet: npb at nas.nasa.gov
>
>  If email is not available, send this to:
>
>  MS T27A-1
>  NASA Ames Research Center
>  Moffett Field, CA  94035-1000
>
>  Fax: 650-604-3957
>
> 5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
>  NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
>  Size:  64x 64x 64
>  Iterations: 250
>  Number of processes:     8
>
>  Time step    1
>  Time step   20
>  Time step   40
>  Time step   60
>  Time step   80
>  Time step  100
>  Time step  120
>  Time step  140
>  Time step  160
>  Time step  180
>  Time step  200
>  Time step  220
>  Time step  240
>  Time step  250
>
>  Verification being performed for class A
>  Accuracy setting for epsilon =  0.1000000000000E-07
>  Comparison of RMS-norms of residual
>  FAILURE:  1   0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
>  FAILURE:  2   0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
>  FAILURE:  3   0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
>  FAILURE:  4   0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
>  FAILURE:  5   0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
>  Comparison of RMS-norms of solution error
>  FAILURE:  1   0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
>  FAILURE:  2   0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
>  FAILURE:  3   0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
>  FAILURE:  4   0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
>  FAILURE:  5   0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
>  Comparison of surface integral
>  FAILURE:      0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
>  Verification failed
>
>
>  LU Benchmark Completed.
>  Class           =                        A
>  Size            =             64x  64x  64
>  Iterations      =                      250
>  Time in seconds =                    17.15
>  Total processes =                        8
>  Compiled procs  =                        8
>  Mop/s total     =                  6956.73
>  Mop/s/process   =                   869.59
>  Operation type  =           floating point
>  Verification    =             UNSUCCESSFUL
>  Version         =                      2.4
>  Compile date    =              22 Oct 2007
>
>  Compile options:
>     MPIF77       = mpif90
>     FLINK        = mpif90
>     FMPI_LIB     = (none)
>     FMPI_INC     = (none)
>     FFLAGS       = -O3
>     FLINKFLAGS   = (none)
>     RAND         = (none)
>
>
>  Please send the results of this run to:
>
>  NPB Development Team
>  Internet: npb at nas.nasa.gov
>
>  If email is not available, send this to:
>
>  MS T27A-1
>  NASA Ames Research Center
>  Moffett Field, CA  94035-1000
>
>  Fax: 650-604-3957
>
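
In the restarted run quoted in step 5.1, every computed norm is NaN, yet the
log still ends with "Verification Successful", so the summary line alone is
not a reliable check. Below is a minimal sketch of a log scanner that flags
such runs. It is not part of NPB, MVAPICH2, or BLCR; the function name
check_npb_output and the reason strings are hypothetical, and it assumes the
log has the layout shown in the quoted output.

```python
import re

# Optional mail quote marker ("> ") at the start of a line.
QUOTE = re.compile(r"^\s*>\s?")
# A standalone NaN token, as printed in the verification table.
NAN = re.compile(r"\bnan\b", re.IGNORECASE)
# The summary line "Verification    =   UNSUCCESSFUL".
UNSUCCESSFUL = re.compile(r"Verification\s*=\s*UNSUCCESSFUL")


def check_npb_output(text):
    """Return a list of (line_number, reason) for suspicious log lines.

    Hypothetical helper: it looks only at the text of the log and does
    not recompute any of the benchmark's norms.
    """
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = QUOTE.sub("", raw).strip()
        if NAN.search(line):
            problems.append((number, "NaN in verification value"))
        elif line.startswith("FAILURE:"):
            problems.append((number, "value outside epsilon"))
        elif UNSUCCESSFUL.search(line):
            problems.append((number, "benchmark reported UNSUCCESSFUL"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_npb.py < lu.A.8.log
    for number, reason in check_npb_output(sys.stdin.read()):
        print(f"line {number}: {reason}")
```

An empty result means only that no NaN, FAILURE, or UNSUCCESSFUL marker was
found in the log text; it does not replace the benchmark's own verification.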


