[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR

wei huang huanwei at cse.ohio-state.edu
Mon Oct 29 17:07:43 EDT 2007


Hi,

We've contacted the BLCR developers regarding this issue. Apparently other
people are reporting problems similar to yours (though not in an MPI
context). You can take a look at this Bugzilla entry from BLCR:

http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2001

This thread is still active right now. Let's hope the problem gets resolved
soon.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Mon, 29 Oct 2007, sunway wrote:

> > > > > 2007/10/26, wei huang <huanwei at cse.ohio-state.edu>:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Unfortunately we cannot reproduce the problem. We have tried on our
> > > > > > cluster with the settings closest to yours:
> > > > > >
> > > > > > CPU:    Intel E5345 2.33GHz (Dual-sockets quad-core)
> > > > > > Memory: 6GB
> > > > > > OS:     2.6.18-8.el5 kernel, cr to local file system
> > > > > >
> > > > > > We ran 8 processes, 4 on each node, with block distribution as you
> > > > > > specified. We tried checkpoint/restart at various times during the
> > > > > > run, but we did not see the problem.
> > > > > >
> > > > > > Do you see the problem consistently? Is it possible for you to try a
> > > > > > newer kernel?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Regards,
> > > > > > Wei Huang
> > > > > >
> > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > Dept. of Computer Science and Engineering
> > > > > > Ohio State University
> > > > > > OH 43210
> > > > > > Tel: (614)292-8501
> > > > > >
> > > > > >
> > > > > > On Thu, 25 Oct 2007, sunway qilu wrote:
> > > > > >
> > > > > > > Thanks for your response!
> > > > > > >
> > > > > > > Here is a bit more information about my computing platform:
> > > > > > > 1. CPU: Intel Woodcrest 5140 (2.33GHz, 4M cache, 1333MHz)
> > > > > > > 2. Mem: 4GB (I tried setting mem=2046M at system boot in grub.config,
> > > > > > >    but the error is still reproducible.)
> > > > > > > 3. OS kernel: 2.6.9-42 + lustre 1.5.95
> > > > > > > 4. I tested mvapich2_blcr on another platform (CPU: Intel Woodcrest
> > > > > > >    160; Mem: 16GB), but the error is reproducible there as well.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > 2007/10/25, wei huang < huanwei at cse.ohio-state.edu>:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Thanks for your detailed note. We are looking at it and will get back
> > > > > > > > to you as soon as we find anything.
> > > > > > > >
> > > > > > > > Also, would you please let us know a bit more about your computing
> > > > > > > > platform, such as CPU, memory size, etc.? BTW, do you mean kernel
> > > > > > > > 2.6.22?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Wei Huang
> > > > > > > >
> > > > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > > > Dept. of Computer Science and Engineering
> > > > > > > > Ohio State University
> > > > > > > > OH 43210
> > > > > > > > Tel: (614)292-8501
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, 23 Oct 2007, sunway qilu wrote:
> > > > > > > >
> > > > > > > > > I'm using mvapich2 + BLCR, but the results are not right.
> > > > > > > > > Would you please help me?
> > > > > > > > > Many thanks!
> > > > > > > > >
> > > > > > > > > This is my env:
> > > > > > > > >
> > > > > > > > > OS : Linux Kernel 2.6.42
> > > > > > > > > C/Fortran :  intel C/C++/Fortran 10.0.0.23
> > > > > > > > > mvapich2 :  mvapich2-trunk-2007-10-22
> > > > > > > > > BLCR : 0.6.1
> > > > > > > > > Program: NPB-2.4
> > > > > > > > >
> > > > > > > > > These are my test steps:
> > > > > > > > >
> > > > > > > > > 1. $ mpdboot -n 3
> > > > > > > > > 2. $ cat cfg
> > > > > > > > > cn22
> > > > > > > > > cn22
> > > > > > > > > cn22
> > > > > > > > > cn22
> > > > > > > > > cn23
> > > > > > > > > cn23
> > > > > > > > > cn23
> > > > > > > > > cn23
> > > > > > > > >
> > > > > > > > > 3. Normal run; the result is good.
> > > > > > > > >
> > > > > > > > > $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > >  Size:  64x 64x 64
> > > > > > > > >  Iterations: 250
> > > > > > > > >  Number of processes:     8
> > > > > > > > >
> > > > > > > > >  Time step    1
> > > > > > > > >  Time step   20
> > > > > > > > >  Time step   40
> > > > > > > > >  Time step   60
> > > > > > > > >  Time step   80
> > > > > > > > >  Time step  100
> > > > > > > > >  Time step  120
> > > > > > > > >  Time step  140
> > > > > > > > >  Time step  160
> > > > > > > > >  Time step  180
> > > > > > > > >  Time step  200
> > > > > > > > >  Time step  220
> > > > > > > > >  Time step  240
> > > > > > > > >  Time step  250
> > > > > > > > >
> > > > > > > > >  Verification being performed for class A
> > > > > > > > >  Accuracy setting for epsilon =   0.1000000000000E-07
> > > > > > > > >  Comparison of RMS-norms of residual
> > > > > > > > >            1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > > > > > > > >            2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > > > > > > > >            3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > > > > > > > >            4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > > > > > > > >            5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > > > > > > > >  Comparison of RMS-norms of solution error
> > > > > > > > >            1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > > > > > > > >            2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > > > > > > > >            3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > > > > > > > >            4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > > > > > > > >            5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > > > > > > > >  Comparison of surface integral
> > > > > > > > >                 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > > > > > > > >  Verification Successful
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  LU Benchmark Completed.
> > > > > > > > >  Class           =                        A
> > > > > > > > >  Size            =             64x  64x  64
> > > > > > > > >  Iterations      =                      250
> > > > > > > > >  Time in seconds =                     17.72
> > > > > > > > >  Total processes =                        8
> > > > > > > > >  Compiled procs  =                        8
> > > > > > > > >  Mop/s total     =                  6733.74
> > > > > > > > >  Mop/s/process   =                   841.72
> > > > > > > > >  Operation type  =           floating point
> > > > > > > > >  Verification    =               SUCCESSFUL
> > > > > > > > >  Version         =                      2.4
> > > > > > > > >  Compile date    =              23 Oct 2007
> > > > > > > > >
> > > > > > > > >  Compile options:
> > > > > > > > >     MPIF77       = mpif90
> > > > > > > > >     FLINK        = mpif90
> > > > > > > > >     FMPI_LIB     = (none)
> > > > > > > > >     FMPI_INC     = (none)
> > > > > > > > >     FFLAGS       = -O3
> > > > > > > > >     FLINKFLAGS   = (none)
> > > > > > > > >     RAND         = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  Please send the results of this run to:
> > > > > > > > >
> > > > > > > > >  NPB Development Team
> > > > > > > > >  Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > >  If email is not available, send this to:
> > > > > > > > >
> > > > > > > > >  MS T27A-1
> > > > > > > > >  NASA Ames Research Center
> > > > > > > > >  Moffett Field, CA  94035-1000
> > > > > > > > >
> > > > > > > > >  Fax: 650-604-3957
> > > > > > > > >
> > > > > > > > > 4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8 then
> > > > > > > > > continues (4.3), and the result is good.
> > > > > > > > >
> > > > > > > > >   4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > >  Size:  64x 64x 64
> > > > > > > > >  Iterations: 250
> > > > > > > > >  Number of processes:     8
> > > > > > > > >
> > > > > > > > >  Time step    1
> > > > > > > > >  Time step   20
> > > > > > > > >  Time step   40
> > > > > > > > >  Time step   60
> > > > > > > > >  Time step   80
> > > > > > > > >  Time step  100
> > > > > > > > >  Time step  120
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > > 4.2 $ mv2_checkpoint
> > > > > > > > >
> > > > > > > > >   PID USER     TT       COMMAND     %CPU   VSZ  START CMD
> > > > > > > > >  7968 yangshj  pts/0    mpirun       0.0 14672  17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > > Enter PID to checkpoint or Control-C to exit: 7968
> > > > > > > > > Checkpointing PID 7968
> > > > > > > > > Checkpoint file: context.7968
> > > > > > > > >
> > > > > > > > > 4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > >  Size:  64x 64x 64
> > > > > > > > >  Iterations: 250
> > > > > > > > >  Number of processes:     8
> > > > > > > > >
> > > > > > > > >  Time step    1
> > > > > > > > >  Time step   20
> > > > > > > > >  Time step   40
> > > > > > > > >  Time step   60
> > > > > > > > >  Time step   80
> > > > > > > > >  Time step  100
> > > > > > > > >  Time step  120
> > > > > > > > >  Time step  140
> > > > > > > > >  Time step  160
> > > > > > > > >  Time step  180
> > > > > > > > >  Time step  200
> > > > > > > > >  Time step  220
> > > > > > > > >  Time step  240
> > > > > > > > >  Time step  250
> > > > > > > > >
> > > > > > > > >  Verification being performed for class A
> > > > > > > > >  Accuracy setting for epsilon =   0.1000000000000E-07
> > > > > > > > >  Comparison of RMS-norms of residual
> > > > > > > > >            1   0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > > > > > > > >            2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > > > > > > > >            3   0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > > > > > > > >            4   0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > > > > > > > >            5   0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > > > > > > > >  Comparison of RMS-norms of solution error
> > > > > > > > >            1   0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > > > > > > > >            2   0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > > > > > > > >            3   0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > > > > > > > >            4   0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > > > > > > > >            5   0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > > > > > > > >  Comparison of surface integral
> > > > > > > > >                 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > > > > > > > >  Verification Successful
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  LU Benchmark Completed.
> > > > > > > > >  Class           =                        A
> > > > > > > > >  Size            =             64x  64x  64
> > > > > > > > >  Iterations      =                      250
> > > > > > > > >  Time in seconds =                     18.78
> > > > > > > > >  Total processes =                        8
> > > > > > > > >  Compiled procs  =                        8
> > > > > > > > >  Mop/s total     =                   6352.76
> > > > > > > > >  Mop/s/process   =                   794.10
> > > > > > > > >  Operation type  =           floating point
> > > > > > > > >  Verification    =               SUCCESSFUL
> > > > > > > > >  Version         =                       2.4
> > > > > > > > >  Compile date    =              23 Oct 2007
> > > > > > > > >
> > > > > > > > >  Compile options:
> > > > > > > > >     MPIF77       = mpif90
> > > > > > > > >     FLINK        = mpif90
> > > > > > > > >     FMPI_LIB     = (none)
> > > > > > > > >     FMPI_INC     = (none)
> > > > > > > > >     FFLAGS       = -O3
> > > > > > > > >     FLINKFLAGS   = (none)
> > > > > > > > >     RAND         = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  Please send the results of this run to:
> > > > > > > > >
> > > > > > > > >  NPB Development Team
> > > > > > > > >  Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > >  If email is not available, send this to:
> > > > > > > > >
> > > > > > > > >  MS T27A-1
> > > > > > > > >  NASA Ames Research Center
> > > > > > > > >  Moffett Field, CA  94035-1000
> > > > > > > > >
> > > > > > > > >  Fax: 650-604-3957
> > > > > > > > >
> > > > > > > > > 5. Restart PID 7968; the result then contains "NaN" (5.1), and
> > > > > > > > > sometimes "FAILURE:" and "UNSUCCESSFUL" (5.2).
> > > > > > > > >
> > > > > > > > > 5.1 $ cr_restart context.7968
> > > > > > > > > mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
> > > > > > > > >  Time step  120
> > > > > > > > >  Time step  140
> > > > > > > > >  Time step  160
> > > > > > > > >  Time step  180
> > > > > > > > >  Time step  200
> > > > > > > > >  Time step  220
> > > > > > > > >  Time step  240
> > > > > > > > >  Time step  250
> > > > > > > > >
> > > > > > > > >  Verification being performed for class A
> > > > > > > > >  Accuracy setting for epsilon =  0.1000000000000E-07
> > > > > > > > >  Comparison of RMS-norms of residual
> > > > > > > > >            1   NaN                 0.7790210760669E+03 NaN
> > > > > > > > >            2   NaN                 0.6340276525969E+02 NaN
> > > > > > > > >            3   NaN                 0.1949924972729E+03 NaN
> > > > > > > > >            4   NaN                 0.1784530116042E+03 NaN
> > > > > > > > >            5   NaN                 0.1838476034946E+04 NaN
> > > > > > > > >  Comparison of RMS-norms of solution error
> > > > > > > > >            1   NaN                 0.2996408568547E+02 NaN
> > > > > > > > >            2   NaN                 0.2819457636500E+01 NaN
> > > > > > > > >            3   NaN                 0.7347341269877E+01 NaN
> > > > > > > > >            4   NaN                 0.6713922568778E+01 NaN
> > > > > > > > >            5   NaN                 0.7071531568839E+02 NaN
> > > > > > > > >  Comparison of surface integral
> > > > > > > > >                NaN                 0.2603092560489E+02 NaN
> > > > > > > > >  Verification Successful
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  LU Benchmark Completed.
> > > > > > > > >  Class           =                        A
> > > > > > > > >  Size            =             64x  64x  64
> > > > > > > > >  Iterations      =                      250
> > > > > > > > >  Time in seconds =                     66.11
> > > > > > > > >  Total processes =                        8
> > > > > > > > >  Compiled procs  =                        8
> > > > > > > > >  Mop/s total     =                  1804.50
> > > > > > > > >  Mop/s/process   =                   225.56
> > > > > > > > >  Operation type  =           floating point
> > > > > > > > >  Verification    =               SUCCESSFUL
> > > > > > > > >  Version         =                      2.4
> > > > > > > > >  Compile date    =              23 Oct 2007
> > > > > > > > >
> > > > > > > > >  Compile options:
> > > > > > > > >     MPIF77       = mpif90
> > > > > > > > >     FLINK        = mpif90
> > > > > > > > >     FMPI_LIB     = (none)
> > > > > > > > >     FMPI_INC     = (none)
> > > > > > > > >     FFLAGS       = -O3
> > > > > > > > >     FLINKFLAGS   = (none)
> > > > > > > > >     RAND         = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  Please send the results of this run to:
> > > > > > > > >
> > > > > > > > >  NPB Development Team
> > > > > > > > >  Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > >  If email is not available, send this to:
> > > > > > > > >
> > > > > > > > >  MS T27A-1
> > > > > > > > >  NASA Ames Research Center
> > > > > > > > >  Moffett Field, CA  94035-1000
> > > > > > > > >
> > > > > > > > >  Fax: 650-604-3957
> > > > > > > > >
> > > > > > > > > 5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > >  Size:  64x 64x 64
> > > > > > > > >  Iterations: 250
> > > > > > > > >  Number of processes:     8
> > > > > > > > >
> > > > > > > > >  Time step    1
> > > > > > > > >  Time step   20
> > > > > > > > >  Time step   40
> > > > > > > > >  Time step   60
> > > > > > > > >  Time step   80
> > > > > > > > >  Time step  100
> > > > > > > > >  Time step  120
> > > > > > > > >  Time step  140
> > > > > > > > >  Time step  160
> > > > > > > > >  Time step  180
> > > > > > > > >  Time step  200
> > > > > > > > >  Time step  220
> > > > > > > > >  Time step  240
> > > > > > > > >  Time step  250
> > > > > > > > >
> > > > > > > > >  Verification being performed for class A
> > > > > > > > >  Accuracy setting for epsilon =   0.1000000000000E-07
> > > > > > > > >  Comparison of RMS-norms of residual
> > > > > > > > >  FAILURE:  1   0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
> > > > > > > > >  FAILURE:  2   0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
> > > > > > > > >  FAILURE:  3   0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
> > > > > > > > >  FAILURE:  4   0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
> > > > > > > > >  FAILURE:  5   0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
> > > > > > > > >  Comparison of RMS-norms of solution error
> > > > > > > > >  FAILURE:  1   0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
> > > > > > > > >  FAILURE:  2   0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
> > > > > > > > >  FAILURE:  3   0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
> > > > > > > > >  FAILURE:  4   0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
> > > > > > > > >  FAILURE:  5   0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
> > > > > > > > >  Comparison of surface integral
> > > > > > > > >  FAILURE:       0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
> > > > > > > > >  Verification failed
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  LU Benchmark Completed.
> > > > > > > > >  Class           =                        A
> > > > > > > > >  Size            =             64x  64x  64
> > > > > > > > >  Iterations      =                      250
> > > > > > > > >  Time in seconds =                     17.15
> > > > > > > > >  Total processes =                        8
> > > > > > > > >  Compiled procs  =                        8
> > > > > > > > >  Mop/s total     =                   6956.73
> > > > > > > > >  Mop/s/process   =                   869.59
> > > > > > > > >  Operation type  =           floating point
> > > > > > > > >  Verification    =             UNSUCCESSFUL
> > > > > > > > >  Version         =                       2.4
> > > > > > > > >  Compile date    =              22 Oct 2007
> > > > > > > > >
> > > > > > > > >  Compile options:
> > > > > > > > >     MPIF77       = mpif90
> > > > > > > > >     FLINK        = mpif90
> > > > > > > > >     FMPI_LIB     = (none)
> > > > > > > > >     FMPI_INC     = (none)
> > > > > > > > >     FFLAGS       = -O3
> > > > > > > > >     FLINKFLAGS   = (none)
> > > > > > > > >     RAND         = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  Please send the results of this run to:
> > > > > > > > >
> > > > > > > > >  NPB Development Team
> > > > > > > > >  Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > >  If email is not available, send this to:
> > > > > > > > >
> > > > > > > > >  MS T27A-1
> > > > > > > > >  NASA Ames Research Center
> > > > > > > > >  Moffett Field, CA  94035-1000
> > > > > > > > >
> > > > > > > > >  Fax: 650-604-3957
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>
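
A note on the step 5.1 output quoted above: every computed norm is NaN, yet the
run still prints "Verification Successful". One possible explanation, offered
here as an assumption and not checked against the NPB 2.4 source, is that the
verifier fails a row only when the relative difference is greater than epsilon.
Under IEEE 754 any ordered comparison with NaN is false, so a test of that shape
never fires. The sketch below reproduces the effect with values taken from the
logs in this thread; rel_diff, naive_verify and strict_verify are hypothetical
names used only for illustration.

```python
import math

EPSILON = 1.0e-8  # "Accuracy setting for epsilon" from the logs


def rel_diff(computed, reference):
    """Relative difference, as in the third column of the comparison rows."""
    return abs((computed - reference) / reference)


def naive_verify(computed, reference, eps=EPSILON):
    """Assumed rule: a row fails only if its difference is greater than eps."""
    return not (rel_diff(computed, reference) > eps)


def strict_verify(computed, reference, eps=EPSILON):
    """Stricter rule: a row passes only if the difference is finite and small."""
    diff = rel_diff(computed, reference)
    return math.isfinite(diff) and diff <= eps


reference = 0.7790210760669e+03  # residual norm 1, reference value

good = 0.7790210760669e+03       # normal run (step 3)
drifted = 0.7790355334612e+03    # step 5.2, reported as FAILURE
nan = float("nan")               # step 5.1, after cr_restart

for label, value in [("good", good), ("drifted", drifted), ("nan", nan)]:
    print(label, rel_diff(value, reference),
          naive_verify(value, reference), strict_verify(value, reference))
```

With the naive rule the NaN row is accepted, while the stricter rule rejects it.
The drifted value from step 5.2 fails under both rules, which matches the
"FAILURE:" rows in that log.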

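Because the final verdict line can be misleading (the step 5.1 log ends in
"Verification Successful" despite the NaN values), it may help to scan the
comparison rows directly when repeating checkpoint/restart tests. The following
is a minimal sketch that assumes the unwrapped output layout shown in this
thread; scan_npb_output is a hypothetical helper and is not part of NPB,
MVAPICH2 or BLCR.

```python
import re

# One value in a comparison row: either NaN or a Fortran-style float such as
# 0.7790210760669E+03.
_VALUE = r"(?:NaN|[-+]?\d*\.\d+E[-+]\d+)"
# A row is an optional "FAILURE:" tag, an optional row index, then three values.
_ROW = re.compile(
    rf"^\s*(FAILURE:)?\s*(\d+)?\s+({_VALUE})\s+({_VALUE})\s+({_VALUE})\s*$"
)


def scan_npb_output(text):
    """Return (row, reason) pairs for comparison rows that look suspicious."""
    problems = []
    for line in text.splitlines():
        match = _ROW.match(line)
        if not match:
            continue  # not a comparison row (banner, header, verdict, ...)
        if match.group(1):
            problems.append((line.strip(), "marked FAILURE"))
        elif "NaN" in match.groups()[2:]:
            problems.append((line.strip(), "contains NaN"))
    return problems


sample = """\
 Comparison of RMS-norms of residual
           1   NaN                 0.7790210760669E+03 NaN
           2   0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
 FAILURE:  3   0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
 Verification Successful
"""

for row, reason in scan_npb_output(sample):
    print(reason, "->", row)
```

A test script could capture the benchmark's standard output after each restart
and treat a non-empty result from this function as a failed run, whatever the
"Verification" line says.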


