[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR
wei huang
huanwei at cse.ohio-state.edu
Fri Oct 26 00:14:15 EDT 2007
Hi,
Unfortunately, we cannot reproduce the problem. We tried on our
cluster with the settings closest to yours:
CPU: Intel E5345 2.33GHz (Dual-sockets quad-core)
Memory: 6GB
OS: 2.6.18-8.el5 kernel, checkpoint/restart to the local file system
We ran 8 processes, 4 on each node, with the block distribution you
specified. We tried checkpoint/restart at various points in the run, but
did not see the problem.
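
A note on reading the restarted run in your step 5.1: it prints NaN for
every norm yet still ends with "Verification Successful". That is what one
would expect if the benchmark flags a failure only when the relative
difference is greater than epsilon, because any ordered comparison with NaN
is false. The exact form of the test in the NPB source is an assumption
here, and the function names below are hypothetical, for illustration only.
A minimal Python sketch of the arithmetic:

```python
import math

EPSILON = 1.0e-8  # the "Accuracy setting for epsilon" printed by the run


def relative_difference(computed, reference):
    """Relative difference, matching the third column of the output."""
    return abs((computed - reference) / reference)


def fails_gt_test(computed, reference, eps=EPSILON):
    # ASSUMED form of the benchmark check: failure only when diff > eps.
    # For NaN this comparison is False, so NaN is silently accepted.
    return relative_difference(computed, reference) > eps


def fails_nan_safe(computed, reference, eps=EPSILON):
    # A stricter variant that also treats NaN as a failure.
    diff = relative_difference(computed, reference)
    return math.isnan(diff) or diff > eps


if __name__ == "__main__":
    ref = 0.7790210760669e+03
    print(fails_gt_test(float("nan"), ref))    # NaN slips through
    print(fails_nan_safe(float("nan"), ref))   # NaN is caught
    print(relative_difference(0.7790355334612e+03, ref))
```

With the first residual norm from your step 5.2 the relative difference
comes out near 1.86E-05, far above epsilon, which agrees with the FAILURE
lines there.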
Do you see the problem consistently? Would it be possible for you to try a
newer kernel?
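
If you repeat the restart several times to check consistency, note that the
summary line alone is not a reliable indicator: your step 5.1 reports
"Verification = SUCCESSFUL" even though every norm is NaN. A small script
along the following lines could scan the saved output of each run. It is
only a sketch; the function names are hypothetical, and it assumes the
output format shown in your mail.

```python
import re


def scan_npb_output(text):
    """Return (reported_ok, nan_lines, failure_lines) for one run's output.

    reported_ok is None when no "Verification = ..." summary line is found.
    """
    nan_lines = 0
    failure_lines = 0
    reported_ok = None
    for line in text.splitlines():
        # Strip mail quoting such as "> > > " before inspecting the line.
        body = re.sub(r"^(\s*>)+\s?", "", line).strip()
        if "NaN" in body.split():
            nan_lines += 1
        if body.startswith("FAILURE:"):
            failure_lines += 1
        match = re.match(r"Verification\s*=\s*(\w+)", body)
        if match:
            reported_ok = match.group(1) == "SUCCESSFUL"
    return reported_ok, nan_lines, failure_lines


def run_is_trustworthy(text):
    """True only if the run reports success and shows no NaN or FAILURE."""
    reported_ok, nan_lines, failure_lines = scan_npb_output(text)
    return reported_ok is True and nan_lines == 0 and failure_lines == 0
```

For example, the output of each restart could be saved to a file, read back,
and passed to run_is_trustworthy() to count how many restarts are affected.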
Thanks.
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
On Thu, 25 Oct 2007, sunway qilu wrote:
> Thanks for your response!
>
> Here is a bit more information about my computing platform:
> 1. CPU : Intel Woodcrest 5140(2.33GHz,4M Cache,1333MHz)
> 2. Mem : 4GB (I also tried setting mem=2046M at boot in grub.config, but
> the error still reproduces.)
> 3. OS Kernel : 2.6.9-42 + lustre 1.5.95
> 4. I also tested mvapich2_blcr on another platform (CPU: Intel Woodcrest
> 160; Mem: 16GB), but the error still reproduces.
>
> thanks
>
> 2007/10/25, wei huang <huanwei at cse.ohio-state.edu>:
> >
> > Hi,
> >
> > Thanks for your detailed note. We are looking at it and will get back to
> > you as soon as we find anything.
> >
> > Also, could you please tell us a bit more about your computing
> > platform, such as CPU, memory size, etc.? BTW, do you mean
> > kernel 2.6.22?
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Tue, 23 Oct 2007, sunway qilu wrote:
> >
> > > I'm testing mvapich2 + BLCR, but the results are not right.
> > > Would you please help me?
> > > Many thanks!
> > >
> > > This is my env:
> > >
> > > OS : Linux Kernel 2.6.42
> > > C/Fortran : intel C/C++/Fortran 10.0.0.23
> > > mvapich2 : mvapich2-trunk-2007-10-22
> > > BLCR : 0.6.1
> > > Program: NPB-2.4
> > >
> > > These are my test steps:
> > >
> > > 1. $ mpdboot -n 3
> > > 2. $ cat cfg
> > > cn22
> > > cn22
> > > cn22
> > > cn22
> > > cn23
> > > cn23
> > > cn23
> > > cn23
> > >
> > > 3. Normal run; the result is good.
> > >
> > > $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > > Size: 64x 64x 64
> > > Iterations: 250
> > > Number of processes: 8
> > >
> > > Time step 1
> > > Time step 20
> > > Time step 40
> > > Time step 60
> > > Time step 80
> > > Time step 100
> > > Time step 120
> > > Time step 140
> > > Time step 160
> > > Time step 180
> > > Time step 200
> > > Time step 220
> > > Time step 240
> > > Time step 250
> > >
> > > Verification being performed for class A
> > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > Comparison of RMS-norms of residual
> > > 1 0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > > 2 0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > > 3 0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > > 4 0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > > 5 0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > > Comparison of RMS-norms of solution error
> > > 1 0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > > 2 0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > > 3 0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > > 4 0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > > 5 0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > > Comparison of surface integral
> > > 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > > Verification Successful
> > >
> > >
> > > LU Benchmark Completed.
> > > Class = A
> > > Size = 64x 64x 64
> > > Iterations = 250
> > > Time in seconds = 17.72
> > > Total processes = 8
> > > Compiled procs = 8
> > > Mop/s total = 6733.74
> > > Mop/s/process = 841.72
> > > Operation type = floating point
> > > Verification = SUCCESSFUL
> > > Version = 2.4
> > > Compile date = 23 Oct 2007
> > >
> > > Compile options:
> > > MPIF77 = mpif90
> > > FLINK = mpif90
> > > FMPI_LIB = (none)
> > > FMPI_INC = (none)
> > > FFLAGS = -O3
> > > FLINKFLAGS = (none)
> > > RAND = (none)
> > >
> > >
> > > Please send the results of this run to:
> > >
> > > NPB Development Team
> > > Internet: npb at nas.nasa.gov
> > >
> > > If email is not available, send this to:
> > >
> > > MS T27A-1
> > > NASA Ames Research Center
> > > Moffett Field, CA 94035-1000
> > >
> > > Fax: 650-604-3957
> > >
> > > 4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8 then
> > > continues (4.3); the result is good.
> > >
> > > 4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > > Size: 64x 64x 64
> > > Iterations: 250
> > > Number of processes: 8
> > >
> > > Time step 1
> > > Time step 20
> > > Time step 40
> > > Time step 60
> > > Time step 80
> > > Time step 100
> > > Time step 120
> > >
> > > ...
> > > 4.2 $ mv2_checkpoint
> > >
> > > PID USER TT COMMAND %CPU VSZ START CMD
> > > 7968 yangshj pts/0 mpirun 0.0 14672 17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > > Enter PID to checkpoint or Control-C to exit: 7968
> > > Checkpointing PID 7968
> > > Checkpoint file: context.7968
> > >
> > > 4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > > Size: 64x 64x 64
> > > Iterations: 250
> > > Number of processes: 8
> > >
> > > Time step 1
> > > Time step 20
> > > Time step 40
> > > Time step 60
> > > Time step 80
> > > Time step 100
> > > Time step 120
> > > Time step 140
> > > Time step 160
> > > Time step 180
> > > Time step 200
> > > Time step 220
> > > Time step 240
> > > Time step 250
> > >
> > > Verification being performed for class A
> > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > Comparison of RMS-norms of residual
> > > 1 0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> > > 2 0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> > > 3 0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> > > 4 0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> > > 5 0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> > > Comparison of RMS-norms of solution error
> > > 1 0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> > > 2 0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> > > 3 0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> > > 4 0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> > > 5 0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> > > Comparison of surface integral
> > > 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> > > Verification Successful
> > >
> > >
> > > LU Benchmark Completed.
> > > Class = A
> > > Size = 64x 64x 64
> > > Iterations = 250
> > > Time in seconds = 18.78
> > > Total processes = 8
> > > Compiled procs = 8
> > > Mop/s total = 6352.76
> > > Mop/s/process = 794.10
> > > Operation type = floating point
> > > Verification = SUCCESSFUL
> > > Version = 2.4
> > > Compile date = 23 Oct 2007
> > >
> > > Compile options:
> > > MPIF77 = mpif90
> > > FLINK = mpif90
> > > FMPI_LIB = (none)
> > > FMPI_INC = (none)
> > > FFLAGS = -O3
> > > FLINKFLAGS = (none)
> > > RAND = (none)
> > >
> > >
> > > Please send the results of this run to:
> > >
> > > NPB Development Team
> > > Internet: npb at nas.nasa.gov
> > >
> > > If email is not available, send this to:
> > >
> > > MS T27A-1
> > > NASA Ames Research Center
> > > Moffett Field, CA 94035-1000
> > >
> > > Fax: 650-604-3957
> > >
> > > 5. Restart PID 7968; the result then contains "NaN" (5.1), and
> > > sometimes "FAILURE:" and "UNSUCCESSFUL".
> > >
> > > 5.1 $ cr_restart context.7968
> > > mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
> > > Time step 120
> > > Time step 140
> > > Time step 160
> > > Time step 180
> > > Time step 200
> > > Time step 220
> > > Time step 240
> > > Time step 250
> > >
> > > Verification being performed for class A
> > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > Comparison of RMS-norms of residual
> > > 1 NaN 0.7790210760669E+03 NaN
> > > 2 NaN 0.6340276525969E+02 NaN
> > > 3 NaN 0.1949924972729E+03 NaN
> > > 4 NaN 0.1784530116042E+03 NaN
> > > 5 NaN 0.1838476034946E+04 NaN
> > > Comparison of RMS-norms of solution error
> > > 1 NaN 0.2996408568547E+02 NaN
> > > 2 NaN 0.2819457636500E+01 NaN
> > > 3 NaN 0.7347341269877E+01 NaN
> > > 4 NaN 0.6713922568778E+01 NaN
> > > 5 NaN 0.7071531568839E+02 NaN
> > > Comparison of surface integral
> > > NaN 0.2603092560489E+02 NaN
> > > Verification Successful
> > >
> > >
> > > LU Benchmark Completed.
> > > Class = A
> > > Size = 64x 64x 64
> > > Iterations = 250
> > > Time in seconds = 66.11
> > > Total processes = 8
> > > Compiled procs = 8
> > > Mop/s total = 1804.50
> > > Mop/s/process = 225.56
> > > Operation type = floating point
> > > Verification = SUCCESSFUL
> > > Version = 2.4
> > > Compile date = 23 Oct 2007
> > >
> > > Compile options:
> > > MPIF77 = mpif90
> > > FLINK = mpif90
> > > FMPI_LIB = (none)
> > > FMPI_INC = (none)
> > > FFLAGS = -O3
> > > FLINKFLAGS = (none)
> > > RAND = (none)
> > >
> > >
> > > Please send the results of this run to:
> > >
> > > NPB Development Team
> > > Internet: npb at nas.nasa.gov
> > >
> > > If email is not available, send this to:
> > >
> > > MS T27A-1
> > > NASA Ames Research Center
> > > Moffett Field, CA 94035-1000
> > >
> > > Fax: 650-604-3957
> > >
> > > 5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
> > >
> > >
> > > NAS Parallel Benchmarks 2.4 -- LU Benchmark
> > >
> > > Size: 64x 64x 64
> > > Iterations: 250
> > > Number of processes: 8
> > >
> > > Time step 1
> > > Time step 20
> > > Time step 40
> > > Time step 60
> > > Time step 80
> > > Time step 100
> > > Time step 120
> > > Time step 140
> > > Time step 160
> > > Time step 180
> > > Time step 200
> > > Time step 220
> > > Time step 240
> > > Time step 250
> > >
> > > Verification being performed for class A
> > > Accuracy setting for epsilon = 0.1000000000000E-07
> > > Comparison of RMS-norms of residual
> > > FAILURE: 1 0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
> > > FAILURE: 2 0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
> > > FAILURE: 3 0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
> > > FAILURE: 4 0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
> > > FAILURE: 5 0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
> > > Comparison of RMS-norms of solution error
> > > FAILURE: 1 0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
> > > FAILURE: 2 0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
> > > FAILURE: 3 0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
> > > FAILURE: 4 0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
> > > FAILURE: 5 0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
> > > Comparison of surface integral
> > > FAILURE: 0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
> > > Verification failed
> > >
> > >
> > > LU Benchmark Completed.
> > > Class = A
> > > Size = 64x 64x 64
> > > Iterations = 250
> > > Time in seconds = 17.15
> > > Total processes = 8
> > > Compiled procs = 8
> > > Mop/s total = 6956.73
> > > Mop/s/process = 869.59
> > > Operation type = floating point
> > > Verification = UNSUCCESSFUL
> > > Version = 2.4
> > > Compile date = 22 Oct 2007
> > >
> > > Compile options:
> > > MPIF77 = mpif90
> > > FLINK = mpif90
> > > FMPI_LIB = (none)
> > > FMPI_INC = (none)
> > > FFLAGS = -O3
> > > FLINKFLAGS = (none)
> > > RAND = (none)
> > >
> > >
> > > Please send the results of this run to:
> > >
> > > NPB Development Team
> > > Internet: npb at nas.nasa.gov
> > >
> > > If email is not available, send this to:
> > >
> > > MS T27A-1
> > > NASA Ames Research Center
> > > Moffett Field, CA 94035-1000
> > >
> > > Fax: 650-604-3957
> > >
> >
> >
>