[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR

wei huang huanwei at cse.ohio-state.edu
Fri Oct 26 00:14:15 EDT 2007


Hi,

Unfortunately, we cannot reproduce the problem. We tried on our
cluster with the settings closest to yours:

CPU:    Intel E5345 2.33GHz (dual-socket, quad-core)
Memory: 6GB
OS:     2.6.18-8.el5 kernel, checkpoint/restart to local file system

We ran 8 processes, 4 processes on each node, with block distribution as
you specified. We tried checkpoint/restart at various times during the
run, but we did not see the problem.

Do you see the problem consistently? Is it possible for you to try a new
kernel?

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Thu, 25 Oct 2007, sunway qilu wrote:

> Thanks for your response!
>
> Here is a bit more information about my computing platform:
> 1. CPU: Intel Woodcrest 5140 (2.33GHz, 4M Cache, 1333MHz)
> 2. Mem: 4GB (I tried setting mem=2046M at system boot in grub.config, but
> the error was still reproducible.)
> 3. OS Kernel: 2.6.9-42 + lustre 1.5.95
> 4. I also tested mvapich2_blcr on another platform (CPU: Intel Woodcrest;
> Mem: 16GB), but the error was reproducible there too.
>
> thanks
>
> 2007/10/25, wei huang <huanwei at cse.ohio-state.edu>:
> >
> > Hi,
> >
> > Thanks for your detailed note. We are looking at it and will get back to
> > you as soon as we find anything.
> >
> > Also, would you please let us know a bit more information on your
> > computing platform? Such as CPU, memory size, etc. BTW, do you mean
> > kernel 2.6.22?
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Tue, 23 Oct 2007, sunway qilu wrote:
> >
> > > I'm using mvapich2 + BLCR, but the results are not correct.
> > > Would you please help me?
> > > Many thanks!
> > >
> > > This is my env:
> > >
> > > OS : Linux Kernel 2.6.42
> > > C/Fortran :  intel C/C++/Fortran 10.0.0.23
> > > mvapich2 :  mvapich2-trunk-2007-10-22
> > > BLCR : 0.6.1
> > > Program: NPB-2.4
> > >
> > > The following are my test steps:
> > >
> > > 1. $ mpdboot -n 3
> > > 2. $ cat cfg
> > > cn22
> > > cn22
> > > cn22
> > > cn22
> > > cn23
> > > cn23
> > > cn23
> > > cn23
> > >
> > > 3. Normal test; the result is good.
> > >
> > > $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > >  Size:  64x 64x 64
> > >  Iterations: 250
> > >  Number of processes:     8
> > >
> > >  Time step    1
> > >  Time step   20
> > >  Time step   40
> > >  Time step   60
> > >  Time step   80
> > >  Time step  100
> > >  Time step  120
> > >  Time step  140
> > >  Time step  160
> > >  Time step  180
> > >  Time step  200
> > >  Time step  220
> > >  Time step  240
> > >  Time step  250
> > >
> > >  Verification being performed for class A
> > >  Accuracy setting for epsilon =  0.1000000000000E-07
> > >  Comparison of RMS-norms of residual
> > >            1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > >            2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > >            3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > >            4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > >            5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > >  Comparison of RMS-norms of solution error
> > >            1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > >            2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > >            3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > >            4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > >            5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > >  Comparison of surface integral
> > >                0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > >  Verification Successful
> > >
> > >
> > >  LU Benchmark Completed.
> > >  Class           =                        A
> > >  Size            =             64x  64x  64
> > >  Iterations      =                      250
> > >  Time in seconds =                    17.72
> > >  Total processes =                        8
> > >  Compiled procs  =                        8
> > >  Mop/s total     =                  6733.74
> > >  Mop/s/process   =                   841.72
> > >  Operation type  =           floating point
> > >  Verification    =               SUCCESSFUL
> > >  Version         =                      2.4
> > >  Compile date    =              23 Oct 2007
> > >
> > >  Compile options:
> > >     MPIF77       = mpif90
> > >     FLINK        = mpif90
> > >     FMPI_LIB     = (none)
> > >     FMPI_INC     = (none)
> > >     FFLAGS       = -O3
> > >     FLINKFLAGS   = (none)
> > >     RAND         = (none)
> > >
> > >
> > >  Please send the results of this run to:
> > >
> > >  NPB Development Team
> > >  Internet: npb at nas.nasa.gov
> > >
> > >  If email is not available, send this to:
> > >
> > >  MS T27A-1
> > >  NASA Ames Research Center
> > >  Moffett Field, CA  94035-1000
> > >
> > >  Fax: 650-604-3957
> > >
> > > 4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8 then
> > > continues (4.3), and the result is good.
> > >
> > >  4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > >  Size:  64x 64x 64
> > >  Iterations: 250
> > >  Number of processes:     8
> > >
> > >  Time step    1
> > >  Time step   20
> > >  Time step   40
> > >  Time step   60
> > >  Time step   80
> > >  Time step  100
> > >  Time step  120
> > >
> > > ...
> > > 4.2 $ mv2_checkpoint
> > >
> > >   PID USER     TT       COMMAND     %CPU   VSZ  START CMD
> > >  7968 yangshj  pts/0    mpirun       0.0 14672  17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > > Enter PID to checkpoint or Control-C to exit: 7968
> > > Checkpointing PID 7968
> > > Checkpoint file: context.7968
> > >
> > > 4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > >  Size:  64x 64x 64
> > >  Iterations: 250
> > >  Number of processes:     8
> > >
> > >  Time step    1
> > >  Time step   20
> > >  Time step   40
> > >  Time step   60
> > >  Time step   80
> > >  Time step  100
> > >  Time step  120
> > >  Time step  140
> > >  Time step  160
> > >  Time step  180
> > >  Time step  200
> > >  Time step  220
> > >  Time step  240
> > >  Time step  250
> > >
> > >  Verification being performed for class A
> > >  Accuracy setting for epsilon =  0.1000000000000E-07
> > >  Comparison of RMS-norms of residual
> > >            1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > >            2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > >            3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > >            4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > >            5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > >  Comparison of RMS-norms of solution error
> > >            1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > >            2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > >            3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > >            4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > >            5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > >  Comparison of surface integral
> > >                0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > >  Verification Successful
> > >
> > >
> > >  LU Benchmark Completed.
> > >  Class           =                        A
> > >  Size            =             64x  64x  64
> > >  Iterations      =                      250
> > >  Time in seconds =                    18.78
> > >  Total processes =                        8
> > >  Compiled procs  =                        8
> > >  Mop/s total     =                  6352.76
> > >  Mop/s/process   =                   794.10
> > >  Operation type  =           floating point
> > >  Verification    =               SUCCESSFUL
> > >  Version         =                      2.4
> > >  Compile date    =              23 Oct 2007
> > >
> > >  Compile options:
> > >     MPIF77       = mpif90
> > >     FLINK        = mpif90
> > >     FMPI_LIB     = (none)
> > >     FMPI_INC     = (none)
> > >     FFLAGS       = -O3
> > >     FLINKFLAGS   = (none)
> > >     RAND         = (none)
> > >
> > >
> > >  Please send the results of this run to:
> > >
> > >  NPB Development Team
> > >  Internet: npb at nas.nasa.gov
> > >
> > >  If email is not available, send this to:
> > >
> > >  MS T27A-1
> > >  NASA Ames Research Center
> > >  Moffett Field, CA  94035-1000
> > >
> > >  Fax: 650-604-3957
> > >
> > > 5. Restart PID 7968; the result then contains "NaN" (5.1), and
> > > sometimes "FAILURE:" and "UNSUCCESSFUL" (5.2).
> > >
> > > 5.1 $ cr_restart context.7968
> > > mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
> > >  Time step  120
> > >  Time step  140
> > >  Time step  160
> > >  Time step  180
> > >  Time step  200
> > >  Time step  220
> > >  Time step  240
> > >  Time step  250
> > >
> > >  Verification being performed for class A
> > >  Accuracy setting for epsilon =  0.1000000000000E-07
> > >  Comparison of RMS-norms of residual
> > >            1   NaN                 0.7790210760669E+03 NaN
> > >            2   NaN                 0.6340276525969E+02 NaN
> > >            3   NaN                 0.1949924972729E+03 NaN
> > >            4   NaN                 0.1784530116042E+03 NaN
> > >            5   NaN                 0.1838476034946E+04 NaN
> > >  Comparison of RMS-norms of solution error
> > >            1   NaN                 0.2996408568547E+02 NaN
> > >            2   NaN                 0.2819457636500E+01 NaN
> > >            3   NaN                 0.7347341269877E+01 NaN
> > >            4   NaN                 0.6713922568778E+01 NaN
> > >            5   NaN                 0.7071531568839E+02 NaN
> > >  Comparison of surface integral
> > >                NaN                 0.2603092560489E+02 NaN
> > >  Verification Successful
> > >
> > >
> > >  LU Benchmark Completed.
> > >  Class           =                        A
> > >  Size            =             64x  64x  64
> > >  Iterations      =                      250
> > >  Time in seconds =                    66.11
> > >  Total processes =                        8
> > >  Compiled procs  =                        8
> > >  Mop/s total     =                  1804.50
> > >  Mop/s/process   =                   225.56
> > >  Operation type  =           floating point
> > >  Verification    =               SUCCESSFUL
> > >  Version         =                      2.4
> > >  Compile date    =              23 Oct 2007
> > >
> > >  Compile options:
> > >     MPIF77       = mpif90
> > >     FLINK        = mpif90
> > >     FMPI_LIB     = (none)
> > >     FMPI_INC     = (none)
> > >     FFLAGS       = -O3
> > >     FLINKFLAGS   = (none)
> > >     RAND         = (none)
> > >
> > >
> > >  Please send the results of this run to:
> > >
> > >  NPB Development Team
> > >  Internet: npb at nas.nasa.gov
> > >
> > >  If email is not available, send this to:
> > >
> > >  MS T27A-1
> > >  NASA Ames Research Center
> > >  Moffett Field, CA  94035-1000
> > >
> > >  Fax: 650-604-3957
> > >
> > > 5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > >  Size:  64x 64x 64
> > >  Iterations: 250
> > >  Number of processes:     8
> > >
> > >  Time step    1
> > >  Time step   20
> > >  Time step   40
> > >  Time step   60
> > >  Time step   80
> > >  Time step  100
> > >  Time step  120
> > >  Time step  140
> > >  Time step  160
> > >  Time step  180
> > >  Time step  200
> > >  Time step  220
> > >  Time step  240
> > >  Time step  250
> > >
> > >  Verification being performed for class A
> > >  Accuracy setting for epsilon =  0.1000000000000E-07
> > >  Comparison of RMS-norms of residual
> > >  FAILURE:  1   0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
> > >  FAILURE:  2   0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
> > >  FAILURE:  3   0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
> > >  FAILURE:  4   0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
> > >  FAILURE:  5   0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
> > >  Comparison of RMS-norms of solution error
> > >  FAILURE:  1   0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
> > >  FAILURE:  2   0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
> > >  FAILURE:  3   0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
> > >  FAILURE:  4   0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
> > >  FAILURE:  5   0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
> > >  Comparison of surface integral
> > >  FAILURE:      0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
> > >  Verification failed
> > >
> > >
> > >  LU Benchmark Completed.
> > >  Class           =                        A
> > >  Size            =             64x  64x  64
> > >  Iterations      =                      250
> > >  Time in seconds =                    17.15
> > >  Total processes =                        8
> > >  Compiled procs  =                        8
> > >  Mop/s total     =                  6956.73
> > >  Mop/s/process   =                   869.59
> > >  Operation type  =           floating point
> > >  Verification    =             UNSUCCESSFUL
> > >  Version         =                      2.4
> > >  Compile date    =              22 Oct 2007
> > >
> > >  Compile options:
> > >     MPIF77       = mpif90
> > >     FLINK        = mpif90
> > >     FMPI_LIB     = (none)
> > >     FMPI_INC     = (none)
> > >     FFLAGS       = -O3
> > >     FLINKFLAGS   = (none)
> > >     RAND         = (none)
> > >
> > >
> > >  Please send the results of this run to:
> > >
> > >  NPB Development Team
> > >  Internet: npb at nas.nasa.gov
> > >
> > >  If email is not available, send this to:
> > >
> > >  MS T27A-1
> > >  NASA Ames Research Center
> > >  Moffett Field, CA  94035-1000
> > >
> > >  Fax: 650-604-3957
> > >
> >
> >
>
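
A note on the restarted run in step 5.1 of the quoted report: every computed norm is NaN, yet the benchmark still prints "Verification Successful". One plausible explanation, offered here as an assumption and not confirmed from the NPB source in this thread, is that the check fails a value only when its relative difference is greater than epsilon. Any ordered comparison with NaN is false, so a NaN would slip through. The sketch below is in Python, not the benchmark's Fortran, and the function names are hypothetical. It reuses the first residual norm and the epsilon value from the logs above.

```python
import math

EPSILON = 1.0e-8  # "Accuracy setting for epsilon" printed in the logs above


def rel_diff(computed, reference):
    """Relative difference, as shown in the third column of the logs."""
    return abs((computed - reference) / reference)


def verify_greater_than(computed, reference, eps=EPSILON):
    """ASSUMED form of the check: fail only if the difference exceeds eps."""
    return not (rel_diff(computed, reference) > eps)


def verify_strict(computed, reference, eps=EPSILON):
    """Stricter variant: pass only if the difference is finite and within eps."""
    d = rel_diff(computed, reference)
    return math.isfinite(d) and d <= eps


# First RMS-norm of residual, taken from the runs quoted above.
ref = 0.7790210760669e+03
good = 0.7790210760669e+03   # normal run (step 3)
drift = 0.7790355334612e+03  # failing run (step 5.2)
nan = float("nan")           # restarted run (step 5.1)

for label, value in [("normal", good), ("drift", drift), ("NaN", nan)]:
    print(label, verify_greater_than(value, ref), verify_strict(value, ref))
```

If the check does have that form, the "SUCCESSFUL" verdict of a restarted run cannot be trusted on its own, and it is safer to also scan the output for "NaN" when testing checkpoint/restart.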
