[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR
wei huang
huanwei at cse.ohio-state.edu
Mon Oct 29 17:07:43 EDT 2007
Hi,
We've contacted the BLCR people regarding this issue. Apparently there are
other people reporting problems similar to yours (though not in an MPI
context). You can take a look at this Bugzilla entry from BLCR:
http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2001
This thread is still active right now. Let's hope the problem gets resolved
soon.
Thanks.
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
On Mon, 29 Oct 2007, sunway wrote:
> > > > > 2007/10/26, wei huang <huanwei at cse.ohio-state.edu>:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Unfortunately we cannot reproduce the problem. We have tried on our
> > > > > > cluster with the setting closest to yours:
> > > > > >
> > > > > > CPU: Intel E5345 2.33GHz (Dual-sockets quad-core)
> > > > > > Memory: 6GB
> > > > > > OS: 2.6.18-8.el5 kernel, cr to local file system
> > > > > >
> > > > > > We ran 8 processes, 4 on each node, with block distribution as you
> > > > > > specified. We tried checkpoint/restart at various timestamps, but we
> > > > > > did not see the problem.
> > > > > >
> > > > > > Do you see the problem consistently? Is it possible for you to try a
> > > > > > newer kernel?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Regards,
> > > > > > Wei Huang
> > > > > >
> > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > Dept. of Computer Science and Engineering
> > > > > > Ohio State University
> > > > > > OH 43210
> > > > > > Tel: (614)292-8501
> > > > > >
> > > > > >
> > > > > > On Thu, 25 Oct 2007, sunway qilu wrote:
> > > > > >
> > > > > > > Thanks for your response!
> > > > > > >
> > > > > > > Here is a bit more information about my computing platform:
> > > > > > > 1. CPU: Intel Woodcrest 5140 (2.33GHz, 4M cache, 1333MHz)
> > > > > > > 2. Mem: 4GB (I tried setting mem=2046M at boot in grub.config, but
> > > > > > > the error was still reproducible.)
> > > > > > > 3. OS kernel: 2.6.9-42 + lustre 1.5.95
> > > > > > > 4. I also tested mvapich2_blcr on another platform (CPU: Intel
> > > > > > > Woodcrest 160; Mem: 16GB), but the error was still reproducible.
> > > > > > >
> > > > > > > thanks
> > > > > > >
> > > > > > > 2007/10/25, wei huang < huanwei at cse.ohio-state.edu>:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Thanks for your detailed note. We are looking at it and will get
> > > > > > > > back to you as soon as we find anything.
> > > > > > > >
> > > > > > > > Also, would you please give us a bit more information about your
> > > > > > > > computing platform, such as CPU, memory size, etc.? BTW, do you
> > > > > > > > mean kernel 2.6.22?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Wei Huang
> > > > > > > >
> > > > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > > > Dept. of Computer Science and Engineering
> > > > > > > > Ohio State University
> > > > > > > > OH 43210
> > > > > > > > Tel: (614)292-8501
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, 23 Oct 2007, sunway qilu wrote:
> > > > > > > >
> > > > > > > > > I'm testing mvapich2 + BLCR, but the results are not correct.
> > > > > > > > > Would you please help me?
> > > > > > > > > Many thanks!
> > > > > > > > >
> > > > > > > > > This is my env:
> > > > > > > > >
> > > > > > > > > OS : Linux Kernel 2.6.42
> > > > > > > > > C/Fortran : intel C/C++/Fortran 10.0.0.23
> > > > > > > > > mvapich2 : mvapich2-trunk-2007-10-22
> > > > > > > > > BLCR : 0.6.1
> > > > > > > > > Program: NPB-2.4
> > > > > > > > >
> > > > > > > > > The following are my test steps:
> > > > > > > > >
> > > > > > > > > 1. $ mpdboot -n 3
> > > > > > > > > 2. $ cat cfg
> > > > > > > > > cn22
> > > > > > > > > cn22
> > > > > > > > > cn22
> > > > > > > > > cn22
> > > > > > > > > cn23
> > > > > > > > > cn23
> > > > > > > > > cn23
> > > > > > > > > cn23
> > > > > > > > >
> > > > > > > > > 3. Normal test; the result is good.
> > > > > > > > >
> > > > > > > > > $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > > Size: 64x 64x 64
> > > > > > > > > Iterations: 250
> > > > > > > > > Number of processes: 8
> > > > > > > > >
> > > > > > > > > Time step 1
> > > > > > > > > Time step 20
> > > > > > > > > Time step 40
> > > > > > > > > Time step 60
> > > > > > > > > Time step 80
> > > > > > > > > Time step 100
> > > > > > > > > Time step 120
> > > > > > > > > Time step 140
> > > > > > > > > Time step 160
> > > > > > > > > Time step 180
> > > > > > > > > Time step 200
> > > > > > > > > Time step 220
> > > > > > > > > Time step 240
> > > > > > > > > Time step 250
> > > > > > > > >
> > > > > > > > > Verification being performed for class A
> > > > > > > > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > > > > > > > Comparison of RMS-norms of residual
> > > > > > > > > 1 0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > > > > > > > > 2 0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > > > > > > > > 3 0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > > > > > > > > 4 0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > > > > > > > > 5 0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > > > > > > > > Comparison of RMS-norms of solution error
> > > > > > > > > 1 0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > > > > > > > > 2 0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > > > > > > > > 3 0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > > > > > > > > 4 0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > > > > > > > > 5 0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > > > > > > > > Comparison of surface integral
> > > > > > > > > 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > > > > > > > > Verification Successful
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > LU Benchmark Completed.
> > > > > > > > > Class = A
> > > > > > > > > Size = 64x 64x 64
> > > > > > > > > Iterations = 250
> > > > > > > > > Time in seconds = 17.72
> > > > > > > > > Total processes = 8
> > > > > > > > > Compiled procs = 8
> > > > > > > > > Mop/s total = 6733.74
> > > > > > > > > Mop/s/process = 841.72
> > > > > > > > > Operation type = floating point
> > > > > > > > > Verification = SUCCESSFUL
> > > > > > > > > Version = 2.4
> > > > > > > > > Compile date = 23 Oct 2007
> > > > > > > > >
> > > > > > > > > Compile options:
> > > > > > > > > MPIF77 = mpif90
> > > > > > > > > FLINK = mpif90
> > > > > > > > > FMPI_LIB = (none)
> > > > > > > > > FMPI_INC = (none)
> > > > > > > > > FFLAGS = -O3
> > > > > > > > > FLINKFLAGS = (none)
> > > > > > > > > RAND = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please send the results of this run to:
> > > > > > > > >
> > > > > > > > > NPB Development Team
> > > > > > > > > Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > > If email is not available, send this to:
> > > > > > > > >
> > > > > > > > > MS T27A-1
> > > > > > > > > NASA Ames Research Center
> > > > > > > > > Moffett Field, CA 94035-1000
> > > > > > > > >
> > > > > > > > > Fax: 650-604-3957
> > > > > > > > >
> > > > > > > > > 4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8
> > > > > > > > > continues (4.3), and the result is good.
> > > > > > > > >
> > > > > > > > > 4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > > Size: 64x 64x 64
> > > > > > > > > Iterations: 250
> > > > > > > > > Number of processes: 8
> > > > > > > > >
> > > > > > > > > Time step 1
> > > > > > > > > Time step 20
> > > > > > > > > Time step 40
> > > > > > > > > Time step 60
> > > > > > > > > Time step 80
> > > > > > > > > Time step 100
> > > > > > > > > Time step 120
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > > 4.2 $ mv2_checkpoint
> > > > > > > > >
> > > > > > > > > PID USER TT COMMAND %CPU VSZ START CMD
> > > > > > > > > 7968 yangshj pts/0 mpirun 0.0 14672 17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > > Enter PID to checkpoint or Control-C to exit: 7968
> > > > > > > > > Checkpointing PID 7968
> > > > > > > > > Checkpoint file: context.7968
> > > > > > > > >
> > > > > > > > > 4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > > Size: 64x 64x 64
> > > > > > > > > Iterations: 250
> > > > > > > > > Number of processes: 8
> > > > > > > > >
> > > > > > > > > Time step 1
> > > > > > > > > Time step 20
> > > > > > > > > Time step 40
> > > > > > > > > Time step 60
> > > > > > > > > Time step 80
> > > > > > > > > Time step 100
> > > > > > > > > Time step 120
> > > > > > > > > Time step 140
> > > > > > > > > Time step 160
> > > > > > > > > Time step 180
> > > > > > > > > Time step 200
> > > > > > > > > Time step 220
> > > > > > > > > Time step 240
> > > > > > > > > Time step 250
> > > > > > > > >
> > > > > > > > > Verification being performed for class A
> > > > > > > > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > > > > > > > Comparison of RMS-norms of residual
> > > > > > > > > 1 0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > > > > > > > > 2 0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > > > > > > > > 3 0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > > > > > > > > 4 0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > > > > > > > > 5 0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > > > > > > > > Comparison of RMS-norms of solution error
> > > > > > > > > 1 0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > > > > > > > > 2 0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > > > > > > > > 3 0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > > > > > > > > 4 0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > > > > > > > > 5 0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > > > > > > > > Comparison of surface integral
> > > > > > > > > 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > > > > > > > > Verification Successful
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > LU Benchmark Completed.
> > > > > > > > > Class = A
> > > > > > > > > Size = 64x 64x 64
> > > > > > > > > Iterations = 250
> > > > > > > > > Time in seconds = 18.78
> > > > > > > > > Total processes = 8
> > > > > > > > > Compiled procs = 8
> > > > > > > > > Mop/s total = 6352.76
> > > > > > > > > Mop/s/process = 794.10
> > > > > > > > > Operation type = floating point
> > > > > > > > > Verification = SUCCESSFUL
> > > > > > > > > Version = 2.4
> > > > > > > > > Compile date = 23 Oct 2007
> > > > > > > > >
> > > > > > > > > Compile options:
> > > > > > > > > MPIF77 = mpif90
> > > > > > > > > FLINK = mpif90
> > > > > > > > > FMPI_LIB = (none)
> > > > > > > > > FMPI_INC = (none)
> > > > > > > > > FFLAGS = -O3
> > > > > > > > > FLINKFLAGS = (none)
> > > > > > > > > RAND = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please send the results of this run to:
> > > > > > > > >
> > > > > > > > > NPB Development Team
> > > > > > > > > Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > > If email is not available, send this to:
> > > > > > > > >
> > > > > > > > > MS T27A-1
> > > > > > > > > NASA Ames Research Center
> > > > > > > > > Moffett Field, CA 94035-1000
> > > > > > > > >
> > > > > > > > > Fax: 650-604-3957
> > > > > > > > >
> > > > > > > > > 5. Restart PID 7968; the result then contains "NaN" (5.1), and
> > > > > > > > > sometimes "FAILURE:" and "UNSUCCESSFUL" (5.2).
> > > > > > > > >
> > > > > > > > > 5.1 $ cr_restart context.7968
> > > > > > > > > mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
> > > > > > > > > Time step 120
> > > > > > > > > Time step 140
> > > > > > > > > Time step 160
> > > > > > > > > Time step 180
> > > > > > > > > Time step 200
> > > > > > > > > Time step 220
> > > > > > > > > Time step 240
> > > > > > > > > Time step 250
> > > > > > > > >
> > > > > > > > > Verification being performed for class A
> > > > > > > > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > > > > > > > Comparison of RMS-norms of residual
> > > > > > > > > 1 NaN 0.7790210760669E+03 NaN
> > > > > > > > > 2 NaN 0.6340276525969E+02 NaN
> > > > > > > > > 3 NaN 0.1949924972729E+03 NaN
> > > > > > > > > 4 NaN 0.1784530116042E+03 NaN
> > > > > > > > > 5 NaN 0.1838476034946E+04 NaN
> > > > > > > > > Comparison of RMS-norms of solution error
> > > > > > > > > 1 NaN 0.2996408568547E+02 NaN
> > > > > > > > > 2 NaN 0.2819457636500E+01 NaN
> > > > > > > > > 3 NaN 0.7347341269877E+01 NaN
> > > > > > > > > 4 NaN 0.6713922568778E+01 NaN
> > > > > > > > > 5 NaN 0.7071531568839E+02 NaN
> > > > > > > > > Comparison of surface integral
> > > > > > > > > NaN 0.2603092560489E+02 NaN
> > > > > > > > > Verification Successful
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > LU Benchmark Completed.
> > > > > > > > > Class = A
> > > > > > > > > Size = 64x 64x 64
> > > > > > > > > Iterations = 250
> > > > > > > > > Time in seconds = 66.11
> > > > > > > > > Total processes = 8
> > > > > > > > > Compiled procs = 8
> > > > > > > > > Mop/s total = 1804.50
> > > > > > > > > Mop/s/process = 225.56
> > > > > > > > > Operation type = floating point
> > > > > > > > > Verification = SUCCESSFUL
> > > > > > > > > Version = 2.4
> > > > > > > > > Compile date = 23 Oct 2007
> > > > > > > > >
> > > > > > > > > Compile options:
> > > > > > > > > MPIF77 = mpif90
> > > > > > > > > FLINK = mpif90
> > > > > > > > > FMPI_LIB = (none)
> > > > > > > > > FMPI_INC = (none)
> > > > > > > > > FFLAGS = -O3
> > > > > > > > > FLINKFLAGS = (none)
> > > > > > > > > RAND = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please send the results of this run to:
> > > > > > > > >
> > > > > > > > > NPB Development Team
> > > > > > > > > Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > > If email is not available, send this to:
> > > > > > > > >
> > > > > > > > > MS T27A-1
> > > > > > > > > NASA Ames Research Center
> > > > > > > > > Moffett Field, CA 94035-1000
> > > > > > > > >
> > > > > > > > > Fax: 650-604-3957
> > > > > > > > >
> > > > > > > > > 5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > > > > > > > >
> > > > > > > > > Size: 64x 64x 64
> > > > > > > > > Iterations: 250
> > > > > > > > > Number of processes: 8
> > > > > > > > >
> > > > > > > > > Time step 1
> > > > > > > > > Time step 20
> > > > > > > > > Time step 40
> > > > > > > > > Time step 60
> > > > > > > > > Time step 80
> > > > > > > > > Time step 100
> > > > > > > > > Time step 120
> > > > > > > > > Time step 140
> > > > > > > > > Time step 160
> > > > > > > > > Time step 180
> > > > > > > > > Time step 200
> > > > > > > > > Time step 220
> > > > > > > > > Time step 240
> > > > > > > > > Time step 250
> > > > > > > > >
> > > > > > > > > Verification being performed for class A
> > > > > > > > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > > > > > > > Comparison of RMS-norms of residual
> > > > > > > > > FAILURE: 1 0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
> > > > > > > > > FAILURE: 2 0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
> > > > > > > > > FAILURE: 3 0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
> > > > > > > > > FAILURE: 4 0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
> > > > > > > > > FAILURE: 5 0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
> > > > > > > > > Comparison of RMS-norms of solution error
> > > > > > > > > FAILURE: 1 0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
> > > > > > > > > FAILURE: 2 0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
> > > > > > > > > FAILURE: 3 0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
> > > > > > > > > FAILURE: 4 0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
> > > > > > > > > FAILURE: 5 0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
> > > > > > > > > Comparison of surface integral
> > > > > > > > > FAILURE: 0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
> > > > > > > > > Verification failed
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > LU Benchmark Completed.
> > > > > > > > > Class = A
> > > > > > > > > Size = 64x 64x 64
> > > > > > > > > Iterations = 250
> > > > > > > > > Time in seconds = 17.15
> > > > > > > > > Total processes = 8
> > > > > > > > > Compiled procs = 8
> > > > > > > > > Mop/s total = 6956.73
> > > > > > > > > Mop/s/process = 869.59
> > > > > > > > > Operation type = floating point
> > > > > > > > > Verification = UNSUCCESSFUL
> > > > > > > > > Version = 2.4
> > > > > > > > > Compile date = 22 Oct 2007
> > > > > > > > >
> > > > > > > > > Compile options:
> > > > > > > > > MPIF77 = mpif90
> > > > > > > > > FLINK = mpif90
> > > > > > > > > FMPI_LIB = (none)
> > > > > > > > > FMPI_INC = (none)
> > > > > > > > > FFLAGS = -O3
> > > > > > > > > FLINKFLAGS = (none)
> > > > > > > > > RAND = (none)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please send the results of this run to:
> > > > > > > > >
> > > > > > > > > NPB Development Team
> > > > > > > > > Internet: npb at nas.nasa.gov
> > > > > > > > >
> > > > > > > > > If email is not available, send this to:
> > > > > > > > >
> > > > > > > > > MS T27A-1
> > > > > > > > > NASA Ames Research Center
> > > > > > > > > Moffett Field, CA 94035-1000
> > > > > > > > >
> > > > > > > > > Fax: 650-604-3957
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>
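In the restarted run quoted above (step 5.1), every computed norm is NaN, yet the log still ends with "Verification Successful". One possible explanation, which is an assumption and not confirmed against the NPB 2.4 source, is that a row is only flagged as a failure when its difference is greater than epsilon, and that comparison is false for NaN. Whatever the cause, a script that judges a checkpoint/restart test only by the summary line could miss this case. The sketch below re-checks the printed output instead; `check_npb_output` and `exceeds_tolerance` are hypothetical helper names, not part of NPB, MVAPICH2, or BLCR.

```python
# Hypothetical helpers for re-checking NPB LU output text.
# They are not part of NPB, MVAPICH2, or BLCR.

EPSILON = 1e-8  # matches "Accuracy setting for epsilon = 0.1000000000000E-07"


def exceeds_tolerance(diff, epsilon=EPSILON):
    """Assumed shape of a failure test: flag a row only if diff > epsilon.

    For diff = NaN this returns False, so a NaN row would not be flagged.
    """
    return diff > epsilon


def check_npb_output(text):
    """Return a list of reasons to distrust the run (empty if none found)."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        body = line.lstrip("> \t")  # tolerate mail-quote prefixes
        # A NaN printed as its own column means the numerical state is corrupt.
        if any(token.lower() == "nan" for token in body.split()):
            problems.append(f"line {lineno}: NaN in output")
        # Rows the benchmark itself marked as failed.
        if body.startswith("FAILURE:"):
            problems.append(f"line {lineno}: FAILURE row")
    # Require the final summary line as well as clean rows.
    has_summary = any(
        line.lstrip("> \t").split() == ["Verification", "=", "SUCCESSFUL"]
        for line in text.splitlines()
    )
    if not has_summary:
        problems.append("no 'Verification = SUCCESSFUL' summary line")
    return problems


if __name__ == "__main__":
    restarted = "1 NaN 0.7790210760669E+03 NaN\nVerification = SUCCESSFUL\n"
    for problem in check_npb_output(restarted):
        print(problem)
```

Saving the output of `cr_restart context.7968` to a file and passing its contents to `check_npb_output` would report the NaN rows even though the summary line says SUCCESSFUL.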