[mvapich-discuss] Problem with NPB-2.4/mvapich2/BLCR
wei huang
huanwei at cse.ohio-state.edu
Wed Oct 24 13:34:23 EDT 2007
Hi,
Thanks for your detailed note. We are looking at it and will get back to
you as soon as we find anything.
Could you also tell us a bit more about your computing platform, such as
CPU, memory size, etc.? BTW, do you mean kernel 2.6.22?
Thanks.
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
On Tue, 23 Oct 2007, sunway qilu wrote:
> I'm testing mvapich2 + BLCR, but the results are not correct.
> Would you please help me?
> Many thanks!
>
> This is my env:
>
> OS : Linux Kernel 2.6.42
> C/Fortran : intel C/C++/Fortran 10.0.0.23
> mvapich2 : mvapich2-trunk-2007-10-22
> BLCR : 0.6.1
> Program: NPB-2.4
>
> The following are my test steps:
>
> 1. $ mpdboot -n 3
> 2. $ cat cfg
> cn22
> cn22
> cn22
> cn22
> cn23
> cn23
> cn23
> cn23
>
> 3. Normal test; the result is good.
>
> $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
> NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of processes: 8
>
> Time step 1
> Time step 20
> Time step 40
> Time step 60
> Time step 80
> Time step 100
> Time step 120
> Time step 140
> Time step 160
> Time step 180
> Time step 200
> Time step 220
> Time step 240
> Time step 250
>
> Verification being performed for class A
> Accuracy setting for epsilon = 0.1000000000000E-07
> Comparison of RMS-norms of residual
> 1 0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> 2 0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> 3 0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> 4 0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> 5 0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> Comparison of RMS-norms of solution error
> 1 0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> 2 0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> 3 0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> 4 0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> 5 0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> Comparison of surface integral
> 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> Verification Successful
>
>
> LU Benchmark Completed.
> Class = A
> Size = 64x 64x 64
> Iterations = 250
> Time in seconds = 17.72
> Total processes = 8
> Compiled procs = 8
> Mop/s total = 6733.74
> Mop/s/process = 841.72
> Operation type = floating point
> Verification = SUCCESSFUL
> Version = 2.4
> Compile date = 23 Oct 2007
>
> Compile options:
> MPIF77 = mpif90
> FLINK = mpif90
> FMPI_LIB = (none)
> FMPI_INC = (none)
> FFLAGS = -O3
> FLINKFLAGS = (none)
> RAND = (none)
>
>
> Please send the results of this run to:
>
> NPB Development Team
> Internet: npb at nas.nasa.gov
>
> If email is not available, send this to:
>
> MS T27A-1
> NASA Ames Research Center
> Moffett Field, CA 94035-1000
>
> Fax: 650-604-3957
>
> 4. While lu.A.8 is running (4.1), checkpoint it (4.2). lu.A.8 then continues (4.3), and the
> result is good.
>
> 4.1 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
> NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of processes: 8
>
> Time step 1
> Time step 20
> Time step 40
> Time step 60
> Time step 80
> Time step 100
> Time step 120
>
> ...
> 4.2 $ mv2_checkpoint
>
> PID USER TT COMMAND %CPU VSZ START CMD
> 7968 yangshj pts/0 mpirun 0.0 14672 17:25 mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
> Enter PID to checkpoint or Control-C to exit: 7968
> Checkpointing PID 7968
> Checkpoint file: context.7968
>
> 4.3 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
> NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of processes: 8
>
> Time step 1
> Time step 20
> Time step 40
> Time step 60
> Time step 80
> Time step 100
> Time step 120
> Time step 140
> Time step 160
> Time step 180
> Time step 200
> Time step 220
> Time step 240
> Time step 250
>
> Verification being performed for class A
> Accuracy setting for epsilon = 0.1000000000000E-07
> Comparison of RMS-norms of residual
> 1 0.7790210760669E+03 0.7790210760669E+03 0.1386387341159E-13
> 2 0.6340276525969E+02 0.6340276525969E+02 0.5603404937070E-14
> 3 0.1949924972729E+03 0.1949924972729E+03 0.9036993778374E-14
> 4 0.1784530116042E+03 0.1784530116042E+03 0.3185343769198E-15
> 5 0.1838476034946E+04 0.1838476034946E+04 0.1187280792767E-13
> Comparison of RMS-norms of solution error
> 1 0.2996408568547E+02 0.2996408568547E+02 0.1185657295234E-14
> 2 0.2819457636500E+01 0.2819457636500E+01 0.1370326007271E-13
> 3 0.7347341269878E+01 0.7347341269877E+01 0.7373944071964E-14
> 4 0.6713922568778E+01 0.6713922568778E+01 0.7937342832911E-15
> 5 0.7071531568839E+02 0.7071531568839E+02 0.1185656063379E-13
> Comparison of surface integral
> 0.2603092560489E+02 0.2603092560489E+02 0.2729609951429E-15
> Verification Successful
>
>
> LU Benchmark Completed.
> Class = A
> Size = 64x 64x 64
> Iterations = 250
> Time in seconds = 18.78
> Total processes = 8
> Compiled procs = 8
> Mop/s total = 6352.76
> Mop/s/process = 794.10
> Operation type = floating point
> Verification = SUCCESSFUL
> Version = 2.4
> Compile date = 23 Oct 2007
>
> Compile options:
> MPIF77 = mpif90
> FLINK = mpif90
> FMPI_LIB = (none)
> FMPI_INC = (none)
> FFLAGS = -O3
> FLINKFLAGS = (none)
> RAND = (none)
>
>
> Please send the results of this run to:
>
> NPB Development Team
> Internet: npb at nas.nasa.gov
>
> If email is not available, send this to:
>
> MS T27A-1
> NASA Ames Research Center
> Moffett Field, CA 94035-1000
>
> Fax: 650-604-3957
>
> 5. Restart PID 7968; the result then contains "NaN" (5.1), and sometimes
> "FAILURE:" and "UNSUCCESSFUL" (5.2).
>
> 5.1 $ cr_restart context.7968
> mpiexec_cn21 (mpiexec 335): mpiexec: Restarting
> Time step 120
> Time step 140
> Time step 160
> Time step 180
> Time step 200
> Time step 220
> Time step 240
> Time step 250
>
> Verification being performed for class A
> Accuracy setting for epsilon = 0.1000000000000E-07
> Comparison of RMS-norms of residual
> 1 NaN 0.7790210760669E+03 NaN
> 2 NaN 0.6340276525969E+02 NaN
> 3 NaN 0.1949924972729E+03 NaN
> 4 NaN 0.1784530116042E+03 NaN
> 5 NaN 0.1838476034946E+04 NaN
> Comparison of RMS-norms of solution error
> 1 NaN 0.2996408568547E+02 NaN
> 2 NaN 0.2819457636500E+01 NaN
> 3 NaN 0.7347341269877E+01 NaN
> 4 NaN 0.6713922568778E+01 NaN
> 5 NaN 0.7071531568839E+02 NaN
> Comparison of surface integral
> NaN 0.2603092560489E+02 NaN
> Verification Successful
>
>
> LU Benchmark Completed.
> Class = A
> Size = 64x 64x 64
> Iterations = 250
> Time in seconds = 66.11
> Total processes = 8
> Compiled procs = 8
> Mop/s total = 1804.50
> Mop/s/process = 225.56
> Operation type = floating point
> Verification = SUCCESSFUL
> Version = 2.4
> Compile date = 23 Oct 2007
>
> Compile options:
> MPIF77 = mpif90
> FLINK = mpif90
> FMPI_LIB = (none)
> FMPI_INC = (none)
> FFLAGS = -O3
> FLINKFLAGS = (none)
> RAND = (none)
>
>
> Please send the results of this run to:
>
> NPB Development Team
> Internet: npb at nas.nasa.gov
>
> If email is not available, send this to:
>
> MS T27A-1
> NASA Ames Research Center
> Moffett Field, CA 94035-1000
>
> Fax: 650-604-3957
>
> 5.2 $ mpirun -machinefile ./cfg -np 8 ./lu.A.8
>
>
> NAS Parallel Benchmarks 2.4 -- LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of processes: 8
>
> Time step 1
> Time step 20
> Time step 40
> Time step 60
> Time step 80
> Time step 100
> Time step 120
> Time step 140
> Time step 160
> Time step 180
> Time step 200
> Time step 220
> Time step 240
> Time step 250
>
> Verification being performed for class A
> Accuracy setting for epsilon = 0.1000000000000E-07
> Comparison of RMS-norms of residual
> FAILURE: 1 0.7790355334612E+03 0.7790210760669E+03 0.1855841227478E-04
> FAILURE: 2 0.6340489955249E+02 0.6340276525969E+02 0.3366245600758E-04
> FAILURE: 3 0.1949964027466E+03 0.1949924972729E+03 0.2002884068547E-04
> FAILURE: 4 0.1784563048837E+03 0.1784530116042E+03 0.1845460320509E-04
> FAILURE: 5 0.1838499810682E+04 0.1838476034946E+04 0.1293230623563E-04
> Comparison of RMS-norms of solution error
> FAILURE: 1 0.2996451081467E+02 0.2996408568547E+02 0.1418795824413E-04
> FAILURE: 2 0.2819496132217E+01 0.2819457636500E+01 0.1365358930094E-04
> FAILURE: 3 0.7347450238213E+01 0.7347341269877E+01 0.1483098878912E-04
> FAILURE: 4 0.6714013230847E+01 0.6713922568778E+01 0.1350359173032E-04
> FAILURE: 5 0.7071607035800E+02 0.7071531568839E+02 0.1067194005931E-04
> Comparison of surface integral
> FAILURE: 0.2603109553197E+02 0.2603092560489E+02 0.6527892352571E-05
> Verification failed
>
>
> LU Benchmark Completed.
> Class = A
> Size = 64x 64x 64
> Iterations = 250
> Time in seconds = 17.15
> Total processes = 8
> Compiled procs = 8
> Mop/s total = 6956.73
> Mop/s/process = 869.59
> Operation type = floating point
> Verification = UNSUCCESSFUL
> Version = 2.4
> Compile date = 22 Oct 2007
>
> Compile options:
> MPIF77 = mpif90
> FLINK = mpif90
> FMPI_LIB = (none)
> FMPI_INC = (none)
> FFLAGS = -O3
> FLINKFLAGS = (none)
> RAND = (none)
>
>
> Please send the results of this run to:
>
> NPB Development Team
> Internet: npb at nas.nasa.gov
>
> If email is not available, send this to:
>
> MS T27A-1
> NASA Ames Research Center
> Moffett Field, CA 94035-1000
>
> Fax: 650-604-3957
>
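A note on reading the output above: in 5.1 the summary still says "Verification Successful" even though every computed value is NaN, so the summary line alone cannot be trusted after a restart. A plausible explanation, which is an assumption and has not been checked against the NPB source, is that the benchmark only flags a row when its difference is greater than epsilon, and that comparison is false for NaN. Below is a minimal sketch of an independent check on saved benchmark output. The function name `check_rows` and the regular expression are made up for illustration and are not part of NPB, mvapich2, or BLCR.

```python
import math
import re

# One verification row: optional "> " quote prefix, optional "FAILURE:",
# optional row index, then computed value, reference value, reported difference.
NUM = r"[-+]?\d*\.\d+E[-+]\d+"
ROW = re.compile(
    rf"^(?:>\s*)?(?:FAILURE:\s*)?(?:(\d+)\s+)?"
    rf"(NaN|{NUM})\s+({NUM})\s+(NaN|{NUM})\s*$",
    re.IGNORECASE,
)

def check_rows(lines, epsilon=1e-8):
    """Return (row, reason) pairs for rows that should not count as verified.

    epsilon defaults to the 0.1E-07 accuracy setting printed in the log.
    """
    problems = []
    for line in lines:
        row = line.strip()
        m = ROW.match(row)
        if not m:
            continue  # not a verification row (banner, timing, etc.)
        computed = float(m.group(2))   # float("NaN") yields a NaN
        reference = float(m.group(3))
        if math.isnan(computed):
            problems.append((row, "computed value is NaN"))
            continue
        # Recompute the relative difference instead of trusting the log.
        rel = abs(computed - reference) / abs(reference)
        if rel > epsilon:
            problems.append((row, f"relative difference {rel:.3e} > epsilon"))
    return problems

if __name__ == "__main__":
    import sys
    for row, reason in check_rows(sys.stdin):
        print(f"{reason}: {row}")
```

If the output of a run is saved to a file, the script can be fed that file on standard input; any printed row means the run should be treated as failed, whatever the summary line says.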