Frequently Asked Questions on the Linpack Benchmark and Top500
(Last updated 5/8/2007 3:29 PM)
What is the Linpack Benchmark?
What is the Linpack Benchmark report?
What is the reference for the Linpack Benchmark
Report?
Is there a paper which describes the benchmark in
some detail and gives a historical perspective?
What is the theoretical peak performance?
What are the three benchmarks in the Linpack
Benchmark report?
What is the Linpack Fortran n = 100 benchmark?
What exactly does the Linpack Fortran n=100
benchmark time?
What is the Linpack n = 1000 benchmark (TPP, Best
Effort)?
What is the Linpack’s “Highly Parallel Computing”
benchmark?
Why are my performance results below the theoretical peak?
What are the ground rules for the first benchmark?
What are the ground rules for the second
benchmark?
What are the ground rules for the third benchmark?
To what accuracy must the solution conform?
Can I get a more personalized list of machine and
performance results?
How can I get the Linpack Benchmark program?
Is there a Java version of the Linpack Benchmark?
What do I do to run the Linpack Benchmark Program?
How does the Linpack Benchmark performance relate
to my application?
Are there errors in the Linpack Benchmark report?
How can I get the complete Linpack software
collection?
Where can I get an optimized version of the BLAS?
Is Linpack the most efficient way to solve systems
of equations?
How can I get the whole LAPACK software
collection?
What is the history behind the Linpack Benchmark?
How can I add my computer's result to the table?
How can I measure the execution time more
accurately and reliably?
Should I run the single and double precision of
the benchmarks?
How can I interpret the results from the
benchmark?
What matrix is used to run the benchmark?
How can I get my computer listed on the Top500?
For HPL, what problem size N should I run?
For HPL, what block size NB should I use?
For HPL, what process grid ratio P x Q should I use?
For HPL, what about the one processor case?
For HPL, why so many options in HPL.dat?
Can I use Strassen's Method when doing
the matrix multiplies?
Where can I get a copy of the Top500 report?
Where can I get the software to generate
performance results for the Top500?
Why would a machine appear in the Linpack Benchmark report but not in the Top500 list?
Why would a machine appear in the Top500 list and not in the Linpack Benchmark report?
What about a list of clusters?
How can I interpret the results from the Linpack
100x100 benchmark?
Do you have an archive of previous Linpack Benchmark
reports or results?
What is the HPC Challenge benchmark?
Where can I get additional information
on the HPC Challenge benchmark?
Is there a benchmark for sparse matrices?
Where can I get additional information on
benchmarks?
The Linpack Benchmark is a measure of a
computer’s floating-point rate of execution. It is determined by running a
computer program that solves a dense system of linear equations. Over the years
the characteristics of the benchmark have changed a bit. In fact, there are
three benchmarks included in the Linpack Benchmark report.
The Linpack Benchmark grew out
of the Linpack software project. It was originally intended to give users of
the package a feeling for how long it would take to solve certain matrix
problems. The benchmark started as an appendix to the Linpack Users' Guide and
has grown since the guide was published in 1979.
The Linpack Benchmark report is entitled
“Performance of Various Computers Using Standard Linear Equations Software”.
The report lists the performance in Mflop/s of a number of computer systems. A
copy of the report is available at http://www.netlib.org/benchmark/performance.ps.
The Linpack Benchmark report should be
referenced in the following way:
“Performance of
Various Computers Using Standard Linear Equations Software”, Jack Dongarra,
The paper “The LINPACK Benchmark: Past, Present,
and Future” by
Mflop/s is a rate of execution, millions of
floating point operations per second. Whenever this term is used it will refer
to 64 bit floating point operations and the operations will be either addition
or multiplication. Gflop/s refers to billions of floating point operations per
second and Tflop/s refers to trillions of floating
point operations per second.
The theoretical peak is based not on an actual performance from a benchmark
run, but on a paper computation to determine the theoretical peak rate of
execution of floating-point operations for the machine. This is the number
manufacturers often cite; it represents an upper bound on performance. That is,
the manufacturer guarantees that programs will not exceed this rate, a sort of
"speed of light" for a given computer. The theoretical peak performance is
determined by counting the number of floating-point additions and multiplications
(in full precision) that can be completed during a period of time, usually the
cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can
complete 4 floating-point operations per cycle, for a theoretical peak
performance of 6 Gflop/s.
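The paper computation described above amounts to multiplying the clock rate by the number of floating-point operations completed per cycle. A minimal sketch in Python; the function name and the `processors` parameter are illustrative assumptions, not part of the benchmark:

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, processors=1):
    """Return the paper peak rate in Gflop/s.

    clock_ghz       -- clock rate in GHz (cycles per nanosecond)
    flops_per_cycle -- floating-point adds/multiplies completed per cycle
    processors      -- number of processors (hypothetical convenience argument)
    """
    return clock_ghz * flops_per_cycle * processors


# The Itanium 2 example from the text: 1.5 GHz x 4 flops per cycle.
print(theoretical_peak_gflops(1.5, 4))  # 6.0
```

A measured Linpack rate can then be divided by this figure to express efficiency as a fraction of peak.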
The three benchmarks in the Linpack Benchmark
report are the Linpack Fortran n = 100 benchmark (see
Table 1 of the report), the Linpack n = 1000 benchmark (see Table 1 of the
report), and Linpack’s Highly Parallel Computing
benchmark (see Table 3 of the report).
The first benchmark is for a matrix of order 100
using the Linpack software in Fortran. The results can
be found in Table 1 of the benchmark report. To run this benchmark,
download the file from http://www.netlib.org/benchmark/Linpackd
(a Fortran program). To run the program
you will need to supply a timing function called SECOND, which should report the
CPU time that has elapsed. The ground rules for running this benchmark are that
you can make no changes to the Fortran code, not even
to the comments. Only compiler optimization can be used to enhance performance.
The Linpack benchmark measures the performance
of two routines from the Linpack collection of software. These routines are
DGEFA and DGESL (these are double-precision versions; SGEFA and SGESL are their
single-precision counterparts). DGEFA performs the LU decomposition with
partial pivoting, and DGESL uses that decomposition to solve the given system
of linear equations.
Most of the time is spent in DGEFA. Once the
matrix has been decomposed, DGESL is used to find the solution; this process
requires $O(n^{2})$ floating-point operations,
as opposed to the $O(n^{3})$
floating-point operations of DGEFA. The
results for this benchmark can be found in Table 1, second column, under “LINPACK
Benchmark n = 100” of the Linpack Benchmark Report.
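To make the split between the two routines concrete, here is a small pure-Python sketch of LU factorization with partial pivoting followed by the triangular solves. It is not the Fortran DGEFA/DGESL code: `lu_factor` and `lu_solve` are hypothetical names, and the storage layout (a list of rows, with whole-row swaps) is an assumption chosen for readability.

```python
def lu_factor(a):
    """Factor the square matrix `a` (list of row lists) in place.

    After the call, the strict lower triangle holds the multipliers (L)
    and the upper triangle holds U. Returns the pivot row chosen at each
    elimination step. This is the O(n^3) part of the work.
    """
    n = len(a)
    pivots = []
    for k in range(n - 1):
        # Partial pivoting: use the largest-magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        pivots.append(p)
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]  # multiplier, stored where the zero would go
            m = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    return pivots


def lu_solve(a, pivots, b):
    """Solve A x = b using the factors from lu_factor. O(n^2) work."""
    n = len(a)
    x = list(b)
    # Apply the same row interchanges to the right-hand side.
    for k, p in enumerate(pivots):
        x[k], x[p] = x[p], x[k]
    # Forward substitution with the unit lower triangle.
    for i in range(n):
        for j in range(i):
            x[i] -= a[i][j] * x[j]
    # Back substitution with the upper triangle.
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            x[i] -= a[i][j] * x[j]
        x[i] /= a[i][i]
    return x


# Right-hand side chosen as the row sums, so the exact solution is all ones.
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [sum(row) for row in A]
piv = lu_factor(A)
print(lu_solve(A, piv, b))
```

The triple loop in `lu_factor` is where nearly all the time goes, which matches the observation that most of the benchmark time is spent in the factorization.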
The second benchmark is for a matrix of size
1000 and can be found in Table 1 of the benchmark report. To run this
benchmark, download the file from http://www.netlib.org/benchmark/1000d
(a Fortran driver). The ground rules for running
this benchmark are a bit more relaxed in that you can specify any linear
equation solver you wish, implemented in any language. A requirement is that
your method must compute a solution and the solution must meet
the prescribed accuracy. TPP stands for Toward Peak Performance; this is the
title of the column in the benchmark report that lists the results.
The performance of a computer
is a complicated issue, a function of many interrelated quantities. These
quantities include the application, the algorithm, the size of the problem, the
high-level language, the implementation, the human level of effort used to
optimize the program, the compiler's ability to optimize, the age of the
compiler, the operating system, the architecture of the computer, and the
hardware characteristics. The results
presented for this benchmark suite should not be
extolled as measures of total system performance (unless enough analysis has
been performed to indicate a reliable correlation of the benchmarks to the
workload of interest) but, rather, as reference points for further evaluations.
There are many reasons why
your results may vary from results recorded in the Linpack Benchmark Report.
Issues such as load on the system, accuracy of the clock, compiler options,
version of the compiler, size of cache, bandwidth from memory, amount of
memory, etc. can affect the performance even when the processors are the same.
The third benchmark is called the Highly
Parallel Computing Benchmark and can be found in Table 3 of the Benchmark
Report. (This is the benchmark used for the Top500 report.) This benchmark
attempts to measure the best performance of a machine in solving a system of
equations. The problem size and software can be chosen to produce the best
performance.
http://www.netlib.org/benchmark/hpl/
The “ground rules” for running the first
benchmark in the report, the n=100 case, are that the program is run as is, with no
changes to the source code; not even changes to the comments are allowed. The
compiler may perform optimization at compile time through compiler switches.
The user must supply a timing function called SECOND. SECOND returns the
running CPU time for the process. The matrix generated by the benchmark program
must be used to run this case.
The “ground rules” for running the second
benchmark in the report, the n=1000 case, allow for a complete user replacement of
the LU factorization and solver steps. The calling sequence should be the same
as the original routines. The problem
size should be of order 1000. The accuracy of the solution must satisfy the
following bound:
$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} \le O(1)$, where $\varepsilon$ is the machine precision (on IEEE machines this is $2^{-53}$) and $n$ is the size of the
problem. The matrix used must be the same matrix used in the driver program
available from netlib.
The “ground rules” for running the third
benchmark in the report, the Highly Parallel case, allow for a complete user
replacement of the LU factorization and solver steps. The accuracy of the
solution must satisfy the following bound:
$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} \le O(1)$, where $\varepsilon$ is the machine precision (on IEEE machines this is $2^{-53}$) and $n$ is the size of the
problem. The matrix used must be the same matrix used in the driver program
available from netlib. There is no restriction on the
problem size.
The solution to all three benchmarks must
satisfy the following bound:
$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} \le O(1)$, where $\varepsilon$ is the machine precision (on IEEE machines this is $2^{-53}$) and $n$ is the size of the
problem. This implies the computation must be done in 64-bit floating-point
arithmetic.
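The accuracy requirement describes a scaled residual. The sketch below assumes the bound has the form $\|Ax - b\| / (\|A\| \, \|x\| \, n \, \varepsilon)$ and assumes infinity norms throughout; the benchmark codes may use different norms, so treat the function name and norm choice as illustrative only.

```python
def normalized_residual(a, x, b, eps=2.0 ** -53):
    """Scaled residual ||Ax - b|| / (||A|| * ||x|| * n * eps).

    a   -- square matrix as a list of row lists
    x   -- computed solution
    b   -- right-hand side
    eps -- machine precision; 2**-53 is the IEEE value quoted in the text
    Infinity norms are an assumption of this sketch.
    """
    n = len(a)
    resid = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    norm_a = max(sum(abs(v) for v in row) for row in a)  # max absolute row sum
    norm_x = max(abs(v) for v in x)
    return resid / (norm_a * norm_x * n * eps)


A = [[4.0, 1.0], [2.0, 3.0]]
b = [5.0, 5.0]
print(normalized_residual(A, [1.0, 1.0], b))    # exact solution: 0.0
print(normalized_residual(A, [1.001, 1.0], b))  # perturbed solution: very large
```

A value of order 1 indicates a solution that is as accurate as 64-bit arithmetic allows; a value many orders of magnitude larger indicates a wrong or low-precision result.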
In order to have an entry included in the
Linpack Benchmark report the results must be computed using full precision. By
full precision we generally mean 64-bit floating-point arithmetic or higher.
Note that this is not an issue of single or double precision, as some systems
have 64-bit floating-point arithmetic as single precision. It is a function of
the arithmetic used.
You can get a more personalized listing of
machines by using the interface at http://performance.netlib.org/performance/html/PDSbrowse.html
This list is not kept current however and may
lag the Linpack benchmark report by months.
You can download the programs used to generate
the Linpack benchmark results from http://www.netlib.org/benchmark/linpackd.
This is a Fortran program. There is a C version of the
benchmark located at: http://www.netlib.org/benchmark/linpackc.
There is a Java version of the benchmark that can be downloaded as an applet
at:
There is a Java program at:
http://www.netlib.org/benchmark/linpackjava/
There is a Java version of the benchmark that
can be downloaded as an applet at:
There is a Java program at: http://www.netlib.org/benchmark/linpackjava/
For the 100x100 based Fortran
version, you need to supply a timing function called SECOND. SECOND is a
timer function that will be called from Fortran
and is expected to return the running CPU time in seconds. In the program two
calls to SECOND are made and the difference is taken to obtain the time.
The performance of the Linpack benchmark is
typical for applications where the basic operation is based on vector
primitives such as adding a scalar multiple of a vector to another vector. Many
applications exhibit the same performance as the Linpack Benchmark. However,
results should not be taken too seriously. In order to measure the performance
of any computer it’s critical to probe for the performance of your
applications. The Linpack Benchmark can only give one point of reference. In addition, in multiprogramming environments
it is often difficult to reliably measure the execution time of a single
program. We trust that anyone actually evaluating machines and operating
systems will gather more reliable and more representative data.
While we make every attempt to verify the
results obtained from users and vendors, errors are bound to exist and should
be brought to our attention. We encourage users to obtain the programs and run
the routines on their machines, reporting any discrepancies with the numbers
listed here.
The Linpack package is a collection of Fortran subroutines for solving various systems of linear
equations (http://www.netlib.org/Linpack/). The software in Linpack is based on
a decompositional approach to numerical linear
algebra. The general idea is the following. Given a problem involving a matrix,
one factors or decomposes the matrix into a product of simple, well-structured
matrices which can be easily manipulated to solve the original problem. The
package has the capability of handling many different matrix types and
different data types, and provides a range of options. Linpack itself is built
on another package called the BLAS. Linpack was designed in the late 70's and
has been superseded by a package called LAPACK.
The Linpack software library is available from netlib. See http://www.netlib.org/Linpack/
The BLAS (Basic Linear Algebra Subprograms) are high quality "building block"
routines for performing basic vector and matrix operations. Level 1 BLAS do
vector-vector operations, Level 2 BLAS do matrix-vector operations, and Level 3
BLAS do matrix-matrix operations. Because the BLAS are efficient, portable, and
widely available, they're commonly used in the development of high quality
linear algebra software, LINPACK and LAPACK for example. For additional
information see: http://www.netlib.org/blas/
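The three BLAS levels can be illustrated with one representative operation each. The pure-Python functions below borrow the names of the AXPY, GEMV and GEMM routine families, but their simplified signatures are an assumption of this sketch and do not match the real BLAS calling sequences.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y. O(n) work."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Level 2 (matrix-vector): return A x. O(n^2) work."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Level 3 (matrix-matrix): return A B. O(n^3) work on O(n^2) data."""
    cols = list(zip(*b))  # columns of B
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


print(axpy(2.0, [1.0, 2.0], [3.0, 4.0]))                    # [5.0, 8.0]
print(gemv([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))           # [3.0, 7.0]
print(gemm([[1, 2], [3, 4]], [[0, 1], [1, 0]]))             # [[2, 1], [4, 3]]
```

Level 3 operations do the most arithmetic per data element moved, which is why optimized libraries concentrate on them.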
The
ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing
research effort focusing on applying empirical techniques in order to provide
portable performance for the BLAS routines. At present, it provides C and Fortran77
interfaces to a portably efficient BLAS implementation, as well as a few
routines from LAPACK. For additional information see: http://www.netlib.org/atlas/
Linpack is not the most efficient software for
solving matrix problems. This is mainly due to the way the algorithm and
resulting software access memory. The
memory access patterns of the algorithm disregard the multi-layered
memory hierarchies of RISC architecture and vector computers, thereby spending
too much time moving data instead of doing useful floating-point operations.
LAPACK addresses this problem by reorganizing the algorithms to use block
matrix operations, such as matrix multiplication, in the innermost loops. For each computer architecture, block operations can be
optimized to account for memory hierarchies, providing a transportable way to
achieve high efficiency on diverse modern machines. We use the term
“transportable” instead of “portable” because, for fastest possible
performance, LAPACK requires that highly optimized block matrix operations be
already implemented on each machine. These operations are performed by the
Level 3 BLAS in most cases.
LAPACK is a software collection for solving various
matrix problems in linear algebra, in particular systems of linear equations, least squares problems,
eigenvalue problems, and singular value decompositions. The software is based on
the use of block partitioned matrix techniques that aid in achieving high
performance on RISC based systems, vector computers, and shared memory parallel
processors.
LAPACK can be obtained from netlib,
see (http://www.netlib.org/lapack/)
The Linpack Benchmark is, in some sense, an
accident. It was originally designed to assist users of the Linpack package by
providing information on execution times required to solve a system of linear
equations. The first ``Linpack Benchmark'' report appeared as an appendix in
the Linpack Users' Guide in 1979. The appendix comprised data for one commonly
used path in Linpack for a matrix problem of size 100, on a collection of
widely used computers (23 in all), so users could estimate the time required to
solve their matrix problem.
Over the years other data was added, more as a
hobby than anything else, and today the collection includes hundreds of
different computer systems.
You can contact Jack Dongarra and send him the
output from the benchmark program. When sending results please include the
specific information on the computer on which the test was run, the compiler,
the optimization that was used, and the site it was run on. You can contact
Dongarra by sending email to dongarra@cs.utk.edu.
In order to run the benchmark program you will
have to supply a function to gather the execution time on your computer. The
execution time is requested by a call to the Fortran
function SECOND. It is expected that the routine returns the accumulated
execution time of your program. Two calls to SECOND are
made and the difference is taken to compute the execution time.
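The two-call timing pattern can be sketched as follows. This is a Python stand-in, not the Fortran SECOND routine: it assumes `time.process_time()` (accumulated CPU time of the current process) is an acceptable analogue, and `timed` is a hypothetical helper name.

```python
import time


def second():
    """Stand-in for SECOND: accumulated CPU time of this process, in seconds."""
    return time.process_time()


def timed(work):
    """Run work() once and return (result, CPU seconds used)."""
    t0 = second()          # first call
    result = work()
    t1 = second()          # second call
    return result, t1 - t0  # the difference is the execution time


result, cpu_seconds = timed(lambda: sum(i * i for i in range(100000)))
print(result, cpu_seconds)
```

Because the timer reports CPU time rather than wall-clock time, the measurement is less sensitive to other programs sharing the machine.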
The Performance API (PAPI)
project specifies a standard application programming interface (API) for
accessing hardware performance counters available on most modern microprocessors.
These counters exist as a small set of registers that count Events, occurrences
of specific signals related to the processor's function. Monitoring these
events facilitates correlation between the structure of source/object code and
the efficiency of the mapping of that code to the underlying architecture.
For additional information see:
http://icl.cs.utk.edu/projects/papi/
The results reported in the benchmark report
reflect performance for 64-bit floating-point arithmetic. On some machines this
may be DOUBLE PRECISION, such as computers that have IEEE floating-point
arithmetic, and on other computers this may be single precision (declared REAL
in Fortran), such as Cray’s vector computers.
When and how often are the results updated in
the benchmark report?
The benchmark report is updated continuously as
new results arrive. They are posted to the web as they are updated.
The matrices are generated using a pseudorandom
number generator. The matrices are designed to force partial pivoting to be
performed in Gaussian Elimination.
The Top500 lists the 500 fastest computer systems in use today. The collection was started in 1993
and has been updated twice a year since June of that year. The report lists the sites that
have the 500 most powerful computer systems installed. The best Linpack
benchmark performance achieved is used as the performance measure for ranking the
computers.
To be listed on the Top500 list you have to run the software that can be found at http://www.netlib.org/benchmark/hpl/
and the performance of the benchmark run must be within the range of the 500
fastest computers for that period of time.
HPL is a software package
that solves a (random) dense linear system in double precision (64 bits)
arithmetic on distributed-memory computers. It can thus be regarded as a
portable as well as freely available implementation of the High Performance
Computing Linpack Benchmark.
In order to find out the best performance of your system, the largest problem size
fitting in memory is what you should aim for. The amount of memory used by HPL
is essentially the size of the coefficient matrix. So for example, if you have
4 nodes with 256 MB of memory on each, this corresponds to 1 GB total, i.e., 125 M double
precision (8 bytes) elements. The square root of that number is 11180. One
definitely needs to leave some memory for the OS as well as for other things,
so a problem size of 10000 is likely to fit. As a rule of thumb, 80% of the
total amount of memory is a good guess. If the problem size you pick is too
large, swapping will occur, and the performance will drop. If multiple
processes are spawned on each node (say you have 2 processors per node), what
counts is the amount of memory available to each process.
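The sizing rule above is simple arithmetic: take a fraction of total memory, divide by 8 bytes per element, and take the square root. A minimal sketch, assuming 1 GB means 10^9 bytes as in the worked example; `hpl_problem_size` is a hypothetical helper, not part of HPL.

```python
import math


def hpl_problem_size(total_memory_bytes, fraction=0.80, element_bytes=8):
    """Largest N such that an N x N matrix fits in `fraction` of memory."""
    usable_bytes = int(total_memory_bytes * fraction)
    elements = usable_bytes // element_bytes
    return math.isqrt(elements)


total = 4 * 250 * 10**6  # 4 nodes, about 1 GB in total (10**9 bytes)
print(hpl_problem_size(total, fraction=1.0))  # 11180, the upper limit
print(hpl_problem_size(total))                # 10000, using the 80% rule
```

In practice the result is usually rounded down further, for example to a multiple of the block size NB.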
HPL uses the block size NB for the data distribution as well as for the
computational granularity. From a data distribution point of view, the smaller
NB, the better the load balance. You definitely want to stay away from very
large values of NB. From a computation point of view, too small a value of NB
may limit the computational performance by a large factor because almost no
data reuse will occur in the highest level of the memory hierarchy. The number
of messages will also increase. Efficient matrix-multiply routines are often
internally blocked. Small multiples of this blocking factor
are likely to be good block sizes for HPL. The bottom line is that
"good" block sizes are almost always in the [32 ..
256] interval. The best values depend on the computation / communication
performance ratio of your system. To a much lesser extent, the problem size matters
as well. Say, for example, you empirically found that 44 was a good block size
with respect to performance. 88 or 132 are likely to give slightly better
results for large problem sizes because of a slightly higher flop rate.
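One way to turn this advice into a list of values to try is to enumerate the multiples of the kernel's blocking factor that fall inside the [32 .. 256] interval. The helper below is a hypothetical illustration; the best NB still has to be found empirically.

```python
def candidate_block_sizes(kernel_block, low=32, high=256):
    """Multiples of kernel_block that lie in the interval [low, high]."""
    if kernel_block <= 0:
        raise ValueError("kernel_block must be positive")
    return [nb for nb in range(kernel_block, high + 1, kernel_block) if nb >= low]


# Using the blocking factor 44 from the example in the text.
print(candidate_block_sizes(44))  # [44, 88, 132, 176, 220]
```

Each candidate can then be listed in the HPL input file and timed in turn.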
This depends on the physical interconnection network you have. Assuming a mesh or a
switch, HPL "likes" a 1:k ratio with k in
[1..3]. In other words, P and Q should be approximately equal, with Q slightly
larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8
... If you are running on a simple Ethernet network, there is only one wire
through which all the messages are exchanged. On such a network, the
performance and scalability of HPL are strongly limited and very flat process
grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4
...
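For the mesh or switch case, the 1:k guideline can be applied mechanically by listing the factor pairs P x Q of the process count with P <= Q <= 3P. This is a sketch under that reading of the rule; `process_grids` is a hypothetical helper and it does not cover the flat grids recommended for simple Ethernet.

```python
import math


def process_grids(nprocs, max_ratio=3):
    """Return (P, Q) pairs with P * Q == nprocs and P <= Q <= max_ratio * P."""
    grids = []
    for p in range(1, math.isqrt(nprocs) + 1):
        if nprocs % p == 0:
            q = nprocs // p
            if q <= max_ratio * p:
                grids.append((p, q))
    return grids


print(process_grids(8))   # [(2, 4)]
print(process_grids(16))  # [(4, 4)]
print(process_grids(24))  # [(3, 8), (4, 6)]
```

When several grids qualify, each one is worth timing, since the best choice depends on the interconnect.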
HPL has been designed to perform well for large problem sizes on hundreds of nodes
and more. The software works on one node and for large problem sizes, one can
usually achieve pretty good performance on a single processor as well. For
small problem sizes however, the overhead due to message-passing, local
indexing and so on can be significant.
There
are quite a few reasons. First off, these options are useful to determine what
matters and what does not on your system. Second, HPL is often used in the
context of early evaluation of new systems. In such a case, everything is
usually not quite working right, and it is convenient to be able to vary these
parameters without recompiling. Finally, every system has its own peculiarities
and one is likely to be willing to empirically determine the best set of
parameters. In any case, one can always follow the advice provided in the tuning section of the
HPL document and not worry about the complexity of the input file.
Certainly. There is always room for performance
improvements. Specific knowledge about a particular system is always a source
of performance gains. Even from a generic point of view, better algorithms or
more efficient formulation of the classic ones are potential winners.
The normal matrix multiplication algorithm requires $n^{3} + O(n^{2})$
multiplications and about the same number of additions. Strassen's algorithm reduces the total number
of operations to $O(n^{\log_2 7}) \approx O(n^{2.81})$ by recursively
multiplying 2n × 2n matrices using seven n × n matrix multiplications. Thus
using Strassen’s Algorithm will distort the true execution rate. As a result we
do not allow Strassen’s Algorithm to be used for the TOP500 reporting. As a
side note, in the "usual" matrix multiplication, we have an $n^{2}$ error
term. In Strassen's method, the error exponent p for $n^{p}$
ranges from 2 to 3.85 and the numerical error can be 10 to 100 times greater than
that for standard multiplication.
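The distortion can be seen by counting scalar multiplications. The sketch below compares the classical count with a Strassen recursion carried all the way down to 1 x 1 blocks, assuming n is a power of two; real implementations stop the recursion earlier, so the `cutoff` parameter and both function names are illustrative assumptions.

```python
def classical_multiplications(n):
    """Scalar multiplications in the classical n x n matrix product."""
    return n ** 3


def strassen_multiplications(n, cutoff=1):
    """Scalar multiplications when Strassen recursion is applied down to
    blocks of size `cutoff`, which are multiplied classically."""
    if n <= cutoff:
        return n ** 3
    if n % 2:
        raise ValueError("n must be a power of two times the cutoff")
    # Seven half-size products replace the usual eight.
    return 7 * strassen_multiplications(n // 2, cutoff)


n = 1024
classical = classical_multiplications(n)  # 1073741824
strassen = strassen_multiplications(n)    # 7**10 = 282475249
print(classical, strassen, classical / strassen)
```

If a rate were computed from the classical operation count while the program actually performed the smaller Strassen count, the reported figure would overstate the floating-point work done per second.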
The Top500 reports are maintained at http://www.top500.org/.
Optimized software that many people use to generate the Top500 performance results is available. This benchmark attempts to measure the best
performance of a machine in solving a system of equations. The problem size and
software can be chosen to produce the best performance. A copy of that software
can be downloaded from:
http://www.netlib.org/benchmark/hpl/
In order to run this you will need MPI and an
optimized version of the BLAS. For MPI you can see: http://www-unix.mcs.anl.gov/mpi/mpich/download.html
and for the BLAS see: http://www.netlib.org/atlas/
There could be two reasons.
First, the Linpack Benchmark report contains historic information. Even if a
computer is no longer in existence it can appear in the Linpack benchmark
report. This is unlike the Top500, which reports the 500 fastest computers in
existence at a given point in time. The second reason is that the Top500 list comes out twice a year and the Linpack Benchmark report
is updated continuously.
If a machine is in the Top500
list it should appear in the Linpack Benchmark report. If you see an instance
where this is not the case, it is probably a mistake;
please send email about the situation to Jack Dongarra at dongarra@cs.utk.edu.
We are starting a new list on clusters. For more information see http://clusters.top500.org/.
When the Linpack Fortran
n = 100 benchmark is run it produces the following kind of results:
Please send the results of this run to:

Jack J. Dongarra
Computer Science Department
Fax: 865-974-8296
Internet: dongarra@cs.utk.edu

     norm. resid           resid          machep            x(1)            x(n)
  1.67005097E+00  7.41628980E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

times are reported for matrices of order 100
      dgefa      dgesl      total     mflops       unit      ratio
 times for array with leading dimension of 201
  1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
  1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
  1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
  1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02
 times for array with leading dimension of 200
  1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
  1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
  1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
  1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02
The norm. resid is a measure of the
accuracy of the computation. The value should be O(1).
If the value is much greater than O(100) it suggests
that the results are not correct.
The resid is the unnormalized quantity.
The term machep
measures the precision used to carry out the computation. On an IEEE floating
point computer the value should be 2.22044605e-16.
The values of x(1) and
x(n) are the first and last components of the solution. The problem is
constructed so that the values of the solution should all be ones.
There are two sets of timings, both performed on
matrices of size 100. The first is where the two-dimensional array that
contains the matrix has a leading dimension of 201, and the second is where the
leading dimension is 200. This is done to see what effect, if any, the placement
of the arrays in memory has on the performance.
Times for dgefa and dgesl are reported. dgefa
factors the matrix using Gaussian
elimination with partial pivoting and dgesl
solves a system based on the factorization. dgefa requires about $\frac{2}{3} n^{3}$
operations and dgesl requires about $2 n^{2}$
operations. The value of total is the sum of the times and mflops
is the execution rate, in millions of floating-point operations per second.
Here a floating-point operation is taken to be
a floating-point addition or multiplication. Unit and ratio are obsolete and
should be ignored.
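The mflops column can be reproduced from the total time. The sketch below assumes an operation count of $\frac{2}{3} n^{3} + 2 n^{2}$ (factorization plus forward and back substitution); that assumption is consistent with the first row of the sample output, where a total of 1.609E-03 seconds is reported as 4.268E+02 Mflop/s. The function name is hypothetical.

```python
def linpack_mflops(n, seconds):
    """Execution rate in Mflop/s for solving an order-n system in `seconds`.

    Assumes the nominal operation count 2/3 n^3 + 2 n^2.
    """
    ops = 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2
    return ops / seconds / 1.0e6


# First row of the sample output: total = 1.609E-03 seconds for n = 100.
print(round(linpack_mflops(100, 1.609e-3), 1))  # 426.8
```

The rate uses the nominal count regardless of how many operations the program actually performs, which is why the benchmark fixes the algorithm or, in the other cases, checks the accuracy of the result.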
If the time reported is negative or zero then
the clock resolution is not accurate enough for the granularity of the work. In
this case a different timing routine should be used that has better resolution.
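The text recommends switching to a timer with better resolution. A common alternative workaround, shown here as an assumption rather than as what the benchmark program does, is to repeat the work until the measured interval is comfortably larger than the clock granularity and then divide by the repetition count. The `clock` argument is injectable so that the logic can be checked with a fake clock.

```python
import time


def time_with_repeats(work, min_seconds=0.1, clock=time.process_time):
    """Return the average time of one work() call.

    The number of repetitions is doubled until the total elapsed time
    reaches min_seconds, so coarse clocks still give a usable figure.
    """
    reps = 1
    while True:
        t0 = clock()
        for _ in range(reps):
            work()
        elapsed = clock() - t0
        if elapsed >= min_seconds:
            return elapsed / reps
        reps *= 2


per_call = time_with_repeats(lambda: sum(range(1000)), min_seconds=0.05)
print(per_call)
```

Repetition changes cache behaviour compared with a single cold run, so figures obtained this way are not directly comparable with single-run timings.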
No archive is maintained of previous results.
However here is some information to provide a historical perspective. The numbers in the following tables have been
extracted from old Linpack Benchmark Reports.
It took a bit of ``file archaeology'' to put the list together since I
don't have the complete set of reports.
Top Computers Over Time for the Linpack n=100
Benchmark
(Entries for this
table began in 1979.)
Year | Computer                                    | Number of Processors | Cycle time | Mflop/s
2006 | NEC SX8/1 (1 proc)                          | 1                    | 2 GHz      | 2177
2004 | Intel Pentium Nocona (1 proc 3.6 GHz)       | 1                    | 3.6 GHz    | 1803
2003 | HP Integrity Server rx2600 (1 proc 1.5GHz)  | 1                    | 1.5 GHz    | 1635
2002 | Intel Pentium 4 (3.06 GHz)                  | 1                    | 2.06 GHz   | 1414
2001 | Fujitsu VPP5000/1                           | 1                    | 3.33 nsec  | 1156
2000 | Fujitsu VPP5000/1                           | 1                    | 3.33 nsec  | 1156
1999 | CRAY T916                                   | 4                    | 2.2 nsec   | 1129
1995 | CRAY T916                                   | 1                    | 2.2 nsec   | 522
1994 | CRAY C90                                    | 16                   | 4.2 nsec   | 479
1993 | CRAY C90                                    | 16                   | 4.2 nsec   | 479
1992 | CRAY C90                                    | 16                   | 4.2 nsec   | 479
1991 | CRAY C90                                    | 16                   | 4.2 nsec   | 403
1990 | CRAY YMP                                    | 8                    | 6.0 nsec   | 275
1989 | CRAY YMP                                    | 8                    | 6.0 nsec   | 275
1988 | CRAY YMP                                    | 1                    | 6.0 nsec   | 74
1987 | ETA 10E                                     | 1                    | 10.5 nsec  | 52
1986 | NEC SX2                                     | 1                    | 6.0 nsec   | 46
1985 | NEC SX2                                     | 1                    | 6.0 nsec   | 46
1984 | CRAY XMP                                    | 1                    | 9.5 nsec   | 21
1983 | CRAY 1                                      | 1                    | 12.5 nsec  | 12
...
1979 | CRAY 1                                      | 1                    | 12.5 nsec  | 3.4
These numbers come from the Linpack Benchmark
Report Table 1.
=====================================================================
Top Computers Over Time for the Linpack n=1000
Benchmark
(Entries for this
table began in 1986.)
Year | Computer          | Number of Processors | Cycle time in nsec. | Measured Mflop/s | Peak Mflop/s
2006 | NEC SX8/8         | 8                    | 2 GHz               | 75140            | 128000
2000 | NEC SX5/16        | 16                   | 4.0                 | 45030            | 64000
1995 | CRAY T916         | 16                   | 2.2                 | 19400            | 28800
1994 |                   | 4                    | 2                   | 16170            | 32000
1993 | NEC SX3/44R       | 4                    | 2.5                 | 15120            | 25600
1992 | NEC SX3/44        | 4                    | 2.9                 | 13420            | 22000
1991 | Fujitsu VP2600/10 | 1                    | 3.2                 | 4009             | 5000
1990 | Fujitsu VP2600/10 | 1                    | 3.2                 | 2919             | 5000
1989 | CRAY YMP/832      | 8                    | 6                   | 2144             | 2667
1988 | CRAY YMP/832      | 8                    | 6                   | 2144             | 2667
1987 | NEC SX2           | 1                    | 6                   | 885              | 1300
1986 | CRAY XMP4         | 4                    | 9.5                 | 713              | 840
These numbers come from the Linpack Benchmark
Report Table 1.
(Full precision; matrix size 1000; best effort
programming, maximum optimization permitted.)
Top Computers Over Time
for the HighlyParallel Linpack Benchmark
(Entries for this
table began in 1991.)
Year      | Computer                                    | Number of Processors | Measured Gflop/s | Size of Problem | Size of 1/2 Perf | Theoretical Peak Gflop/s
2005-2006 | IBM Blue Gene/L                             | 131072               | 280600           | 1769471         |                  | 367001
2002-2004 | Earth Simulator Computer, NEC               | 5104                 | 35610            | 1041216         | 265408           | 40832
2001      | ASCI WhitePacific, IBM SP Power 3           | 7424                 | 7226             | 518096          | 179000           | 11136
2000      | ASCI WhitePacific, IBM SP Power 3           | 7424                 | 4938             | 430000          |                  | 11136
1999      | ASCI Red Intel Pentium II Xeon core         | 9632                 | 2379             | 362880          | 75400            | 3207
1998      | ASCI BluePacific SST, IBM SP 604E           | 5808                 | 2144             | 431344          |                  | 3868
1997      | Intel ASCI Option Red (200 MHz Pentium Pro) | 9152                 | 1338             | 235000          | 63000            | 1830
1996      |                                             | 2048                 | 368.2            | 103680          | 30720            | 614
1995      | Intel Paragon XP/S MP                       | 6768                 | 281.1            | 128600          | 25700            | 338
1994      | Intel Paragon XP/S MP                       | 6768                 | 281.1            | 128600          | 25700            | 338
1993      | Fujitsu NWT                                 | 140                  | 124.5            | 31920           | 11950            | 236
1992      | NEC SX3/44                                  | 4                    | 20.0             | 6144            | 832              | 22
1991      | Fujitsu VP2600/10                           | 1                    | 4.0              | 1000            | 200              | 5
These numbers come from the Linpack Benchmark
Report Table 3.
(Full precision; the manufacturer is allowed to
solve as large a problem as desired, maximum optimization permitted.)
Measured Gflop/s is the measured peak rate of
execution for running the benchmark in billions of floating point operations
per second.
Size of Problem is the matrix size at which the
measured performance was observed.
Size of ½ Perf is the
size of problem needed to achieve ½ the measured peak performance.
The HPC Challenge
benchmark consists at this time of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff
Latency/Bandwidth. HPL is the Linpack TPP benchmark. The test stresses the
floating-point performance of a system. STREAM is a benchmark that measures
sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory.
PTRANS measures the rate of transfer for large arrays of data from
a multiprocessor’s memory. Latency/Bandwidth measures (as the name suggests)
latency and bandwidth of communication patterns of increasing complexity
between as many nodes as is time-wise feasible.
For additional
information on the benchmark see: http://icl.cs.utk.edu/hpcc/
The Linpack Benchmark suite is built around
software for dense matrix problems. In May 2000 we started to put together a
benchmark for sparse iterative matrix problems. For additional information see:
http://www.netlib.org/benchmark/sparsebench/
For additional information on benchmarks see: http://www.netlib.org/benchweb/
Please send your comments to Jack Dongarra at dongarra@cs.utk.edu.