Frequently Asked Questions on the Linpack Benchmark and Top500

(Last updated 5/8/2007 3:29 PM)

 

 

What is the Linpack Benchmark?. 2

What is the Linpack Benchmark report?. 2

What is the reference for the Linpack Benchmark Report?. 2

Is there a paper which describes the benchmark in some detail and gives a historical perspective?. 2

What is a Mflop/s?. 2

         What is the theoretical peak performance?

What are the three benchmarks in the Linpack Benchmark report?. 3

What is the Linpack Fortran n = 100 benchmark?. 3

What exactly does the Linpack Fortran n=100 benchmark time?. 3

What is the Linpack n = 1000 benchmark (TPP, Best Effort)?. 3

What is the Linpack’s “Highly Parallel Computing” benchmark?. 3

         Why is my performance results below the theoritical peak?

         Why are the performance results for my computer different than the same machine’s results in the Linpack Report?

What are the ground rules for the first benchmark?. 4

What are the ground rules for the second benchmark?. 4

What are the ground rules for the third benchmark?. 4

To what accuracy must be the solution conform?. 4

What numerical precision is required to run and benchmark and gain an entry in the Linpack Benchmark report?. 5

Can I get a more personalized list of machine and performance results?. 5

How can I get the Linpack Benchmark program?. 5

Is there a Java version of the Linpack Benchmark?. 5

What do I do to run the Linpack Benchmark Program?. 5

How does the Linpack Benchmark performance relate to my application?. 5

Are there errors in the Linpack Benchmark report?. 6

What is Linpack?. 6

How can I get the complete Linpack software collection?. 6

What are the BLAS?. 6

Where can I get an optimized version of the BLAS?. 6

Is Linpack the most efficient way to solve systems of equations?. 7

What is LAPACK?. 7

How can I get the whole LAPACK software collection?. 7

What is the history behind the Linpack Benchmark?. 7

How can I add my computer's result to the table?. 8

What is the SECOND function?. 8

How can I measure the execution time more accurately and reliably?. 8

Should I run the single and double precision of the benchmarks?. 8

How can I interpret the results from the benchmark?. 8

What matrix is used to run the benchmark?. 8

What is the Top500?. 9

How can I get my computer listed on the Top500?. 9

What is HPL?

For HPL What problem size N should I run?

For HPL what block size NB should I use?

For HPL what process grid ratio P x Q should I use?

For HPL what about the one processor case?

For HPL why so many options in HPL.dat?

Can HPL be Outperformed?

Can I use Strassen's Method when doing the matrix multiplies?

Where can I get a copy of the Top500 report?. 9

Where can I get the software to generate performance results for the Top500?. 9

         Why would a machine appear in the Linpack Benchmark report but not in the Top500 list?

         Why would a machine appear in the Top500 list and not in the Linpack Benchmark report?

What about a list of clusters?. 9

How can I interpret the results from the Linpack 100x100 benchmark?. 9

Do you have an archive of previous Linpack Benchmark reports or results?. 11

         What is the HPC Challenge benchmark?

   Where can I get additional information on the HPC Challenge benchmark?

Is there a benchmark for sparse matrices?. 13

Where can I get additional information on benchmarks?. 13

Where can I send comments?. 13

 

What is the Linpack Benchmark?

The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark has changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.

 

The Linpack Benchmark is something that grew out of the Linpack software project. It was originally intended to give users of the package a feeling for how long it would take to solve certain matrix problems. The benchmark stated as an appendix to the Linpack Users' Guide and has grown since the Linpack User’s Guide was published in 1979.

 

 

What is the Linpack Benchmark report?

The Linpack Benchmark report is entitled “Performance of Various Computers Using Standard Linear Equations Software”. The report lists the performance in Mflop/s of a number of computer systems. A copy of the report is available at http://www.netlib.org/benchmark/performance.ps.

 

 

What is the reference for the Linpack Benchmark Report?

The Linpack Benchmark report should be referenced in the following way:

“Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN, 37996, Computer Science Technical Report Number CS - 89 – 85, today’s date, url:http://www.netlib.org/benchmark/performance.ps.

 

Is there a paper which describes the benchmark in some detail and gives a historical perspective?

The paper “The LINPACK Benchmark: Past, Present, and Future” by Jack Dongarra, Piotr Luszczek, and Antoine Petitet provides a look at the details of the benchmark and provides performance data in graphics form for a number of machines on basic operations. A copy of the paper is available at http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf.

 

What is a Mflop/s?

Mflop/s is a rate of execution, millions of floating point operations per second. Whenever this term is used it will refer to 64 bit floating point operations and the operations will be either addition or multiplication. Gflop/s refers to billions of floating point operations per second and Tflop/s refers to trillions of floating point operations per second.

 

 

What is the theoretical peak performance?

The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate-sort of a "speed of light" for a given computer.  The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle or a theoretical peak performance of 6 GFlop/s.

 

What are the three benchmarks in the Linpack Benchmark report?

The three benchmarks in the Linpack Benchmark report are for Linpack Fortran n = 100 benchmark (see Table 1 for the report), Linpack n = 1000 benchmark (see Table 1 of the report), and Linpack’s Highly Parallel Computing benchmark (see Table 3 of the report).

 

 

What is the Linpack Fortran n = 100 benchmark?

The first benchmark is for a matrix of order 100 using the Linpack software in Fortran. The results can be found in Table 1 of the benchmark report. In order to run this benchmark download the file from http://www.netlib.org/benchmark/Linpackd, this is a Fortran program. In order to run the program you will need to supply a timing function called SECOND which should report the CPU time that has elapsed. The ground rules for running this benchmark are that you can make no changes to the Fortran code, not even to the comments. Only compiler optimization can be used to enhance performance.

 

 

What exactly does the Linpack Fortran n=100 benchmark time?

The Linpack benchmark measures the performance of two routines from the Linpack collection of software. These routines are DGEFA and DGESL (these are double-precision versions; SGEFA and SGESL are their single-precision counterparts). DGEFA performs the LU decomposition with partial pivoting, and DGESL uses that decomposition to solve the given system of linear equations.

Most of the time is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this process requires O(n2) floating-point operations, as opposed to the  O(n3) floating-point operations of  DGEFA. The results for this benchmark can be found in Table 1 second column under “LINPACK Benchmark n = 100” of the Linpack Benchmark Report.

 

 

What is the Linpack n = 1000 benchmark (TPP, Best Effort)?

The second benchmark is for a matrix of size 1000 and can be found in Table 1 of the benchmark report. In order to run this benchmark download the file from http://www.netlib.org/benchmark/1000d, this is a Fortran driver. The ground rules for running this benchmark are a bit more relaxed in that you can specify any linear equation solve you wish, implemented in any language. A requirement is that your method must compute a solution and the solution must return a result to the prescribed accuracy. TPP stands for Toward Peak Performance; this is the title of the column in the benchmark report that lists the results.

 

Why is my performance results below the theoritical peak?

The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics.  The results presented for this benchmark suites should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.

 

Why are the performance results for my computer different than the same machine’s results in the Linpack Report?

There are many reasons why your results may vary from results recorded in the Linpack Benchmark Report. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc can effect the performance even when the processors are the same.

 

 

What is the Linpack’s “Highly Parallel Computing” benchmark?

The third benchmark is called the Highly Parallel Computing Benchmark and can be found in Table 3 of the Benchmark Report. (This is the benchmark use for the Top500 report). This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance.

http://www.netlib.org/benchmark/hpl/

 

 

What are the ground rules for the first benchmark?

The “ground rules” for running the first benchmark in the report, n=100 case, are that the program is run as is with no changes to the source code, not even changes to the comments are allowed. The compiler through compiler switches can perform optimization at compile time. The user must supply a timing function called SECOND. SECOND returns the running CPU time for the process. The matrix generated by the benchmark program must be used to run this case.

 

 

What are the ground rules for the second benchmark?

The “ground rules” for running the second benchmark in the report, n=1000 case, allows for a complete user replacement of the LU factorization and solver steps. The calling sequence should be the same as the original routines.  The problem size should be of order 1000. The accuracy of the solution must satisfy the following bound:

(On IEEE machines this is 2-53 ) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib.

 

 

What are the ground rules for the third benchmark?

The “ground rules” for running the third benchmark in the report, Highly Parallel case, allows for a complete user replacement of the LU factorization and solver steps. The accuracy of the solution must satisfy the following bound:

(On IEEE machines this is 2-53 ) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib. There is no restriction on the problem size.

 

 

To what accuracy must be the solution conform?

The solution to all three benchmarks must satisfy the following mathematical formula:

(On IEEE machines this is 2-53 ) and n is the size of the problem. This implies the computation must be done in 64 bit floating point arithmetic.

 

 

What numerical precision is required to run and benchmark and gain an entry in the Linpack Benchmark report?

In order to have an entry included in the Linpack Benchmark report the results must be computed using full precision. By full precision we generally mean 64 bit floating point arithmetic or higher. Note that this is not an issue of single or double precision as some systems have 64-bit floating point arithmetic as single precision. It is a function of the arithmetic used.

 

 

Can I get a more personalized list of machine and performance results?

You can get a more personalized listing of machines by using the interface at http://performance.netlib.org/performance/html/PDSbrowse.html

This list is not kept current however and may lag the Linpack benchmark report by months.

 

How can I get the Linpack Benchmark program?

You can download the programs used to generate the Linpack benchmark results by using the URL is http://www.netlib.org/benchmark/linpackd. This is a Fortran program. There is a C version of the benchmark located at: http://www.netlib.org/benchmark/linpackc. There is a Java version of the benchmark that can be downloaded as an applet at:

There is a Java program at:

http://www.netlib.org/benchmark/linpackjava/

 

Is there a Java version of the Linpack Benchmark?

There is a Java version of the benchmark that can be downloaded as an applet at:

There is a Java program at: http://www.netlib.org/benchmark/linpackjava/

 

 

What do I do to run the Linpack Benchmark Program?

For the 100x100 based Fortran version, you need to supply a timing function called SECOND. SECOND is an elapse timer function that will be called from Fortran and is expected to return the running CPU time in seconds. In the program two called to SECOND are made and the difference taken to gather the time.

 

 

How does the Linpack Benchmark performance relate to my application?

The performance of the Linpack benchmark is typical for applications where the basic operation is based on vector primitives such as added a scalar multiple of a vector to another vector. Many applications exhibit the same performance as the Linpack Benchmark. However, results should not be taken too seriously. In order to measure the performance of any computer it’s critical to probe for the performance of your applications. The Linpack Benchmark can only give one point of reference.  In addition, in multiprogramming environments it is often difficult to reliably measure the execution time of a single program. We trust that anyone actually evaluating machines and operating systems will gather more reliable and more representative data.

 

 

Are there errors in the Linpack Benchmark report?

While we make every attempt to verify the results obtained from users and vendors, errors are bound to exist and should be brought to our attention. We encourage users to obtain the programs and run the routines on their machines, reporting any discrepancies with the numbers listed here.

 

 

What is Linpack?

The Linpack package is a collection of Fortran subroutines for solving various systems of linear equations. (http://www.netlib.org/Linpack/) The software in Linpack is based on a decompositional approach to numerical linear algebra. The general idea is the following. Given a problem involving a matrix, one factors or decomposes the matrix into a product of simple, well-structured matrices which can be easily manipulated to solve the original problem. The package has the capability of handling many different matrix types and different data types, and provides a range of options. Linpack itself is built on another package called the BLAS. Linpack was designed in the late 70's and has been superseded by a package called LAPACK. 

 

 

How can I get the complete Linpack software collection?

The Linpack software library is available from netlib. See http://www.netlib.org/Linpack/

 

 

What are the BLAS?

The BLAS (Basic Linear Algebra Subprograms) are high quality "building block" routines for performing basic vector and matrix operations. Level 1 BLAS do vector-vector operations, Level 2 BLAS do matrix-vector operations, and Level 3 BLAS do matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they're commonly used in the development of high quality linear algebra software, LINPACK and LAPACK for example. For additional information see: http://www.netlib.org/blas/

 

 

Where can I get an optimized version of the BLAS?

The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance for the BLAS routines. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK. For additional information see: http://www.netlib.org/atlas/

 

 

Is Linpack the most efficient way to solve systems of equations?

Linpack is not the most efficient software for solving matrix problems. This is mainly due to the way the algorithm and resulting software accesses memory.  The memory access patterns of the algorithm has disregard for the multi-layered memory hierarchies of RISC architecture and vector computers, thereby spending too much time moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication in the innermost loops. For each computer architecture block operations can be optimized to account for memory hierarchies, providing a transportable way to achieve high efficiency on diverse modern machines. We use the term “Transportable” instead of “portable” because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine. These operations are performed by the Level 3 BLAS in most cases.

 

 

What is LAPACK?

LAPACK is a software collection to solve various matrix problem in linear algebra. In particular, systems of linear equations, least squares problems, eigenvalue problems, and singular value decomposition. The software is based on the use of block partitioned matrix techniques that aid in achieving high performance on RISC based systems, vector computers, and shared memory parallel processors.

 

 

How can I get the whole LAPACK software collection?

LAPACK can be obtained from netlib, see (http://www.netlib.org/lapack/)

 

What is the history behind the Linpack Benchmark?

The Linpack Benchmark is, in some sense, an accident. It was originally designed to assist users of the Linpack package by providing information on execution times required to solve a system of linear equations. The first ``Linpack Benchmark'' report appeared as an appendix in the Linpack Users' Guide in 1979. The appendix comprised data for one commonly used path in Linpack for a matrix problem of size 100, on a collection of widely used computers (23 in all), so users could estimate the time required to solve their matrix problem.

 

Over the years other data was added, more as a hobby than anything else, and today the collection includes hundreds of different computer systems.

 

 

How can I add my computer's result to the table?

You can contact Jack Dongarra and send him the output from the benchmark program. When sending results please include the specific information on the computer on which the test was run, the compiler, the optimization that was used, and the site it was run on. You can contact Dongarra by sending email to dongarra@cs.utk.edu.

 

What is the SECOND function?

In order to run the benchmark program you will have to supply a function to gather the execution time on your computer. The execution time is requested by a call to the Fortran function SECOND. It is expected that the routine returns the accumulated execution time of your program. Two called to SECOND are made and the difference taken to compute the execution time.

 

 

How can I measure the execution time more accurately and reliably?

The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count Events, occurrences of specific signals related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture.

For addition information see: http://icl.cs.utk.edu/projects/papi/

 

 

Should I run the single and double precision of the benchmarks?

The results reported in the benchmark report reflect performance for 64 bit floating point arithmetic. On some machines this may be DOUBLE PERCISION, such as computers that have IEEE floating point arithmetic and on other computers this may be single precision, (declared REAL in Fortran), such as Cray’s vector computers.

 

 

How can I interpret the results from the benchmark?

When and how often are the results updated in the benchmark report?

The benchmark report is updated continuously as new results arrive. They are posted to the web as they are updated.

 

 

What matrix is used to run the benchmark?

The matrices are generated using a pseudo-random number generator. The matrices are designed to force partial pivoting to be performed in Gaussian Elimination.

 

 

What is the Top500?

The Top500 list the 500 fastest computer system being used today. In 1993 the collection was started and has been updated every 6 months since then. The report lists the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as a performance measure in ranking the computers. The TOP500 list has been updated twice a year since June 1993.

 

How can I get my computer listed on the Top500?

To be listed on the Top500 list you have to run the software that can be found at http://www.netlib.org/benchmark/hpl/ and the performance of the benchmark run must be within the range of the 500 fasted computers for that period of time.

 

What is HPL?

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

 

For HPL What problem size N should I run ?

In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 Mb of memory on each, this corresponds to 1 Gb total, i.e., 125 M double precision (8 bytes) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80 % of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawn on each node (say you have 2 processors per node), what counts is the available amount of memory to each process.

 

For HPL what block size NB should I use ?

HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smallest NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, a too small value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much less extent, the problem size matters as well. Say for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.

 

For HPL what process grid ratio P x Q should I use ?

This depends on the physical interconnection network you have. Assuming a mesh or a switch HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL is strongly limited and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...

 

For HPL what about the one processor case ?

HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software works on one node and for large problem sizes, one can usually achieve pretty good performance on a single processor as well. For small problem sizes however, the overhead due to message-passing, local indexing and so on can be significant.

 

For HPL why so many options in HPL.dat ?

There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such a case, everything is usually not quite working right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities and one is likely to be willing to empirically determine the best set of parameters. In any case, one can always follow the advice provided in the tuning section of the HPL document and not worry about the complexity of the input file.

 

Can HPL be Outperformed ?

Certainly. There is always room for performance improvements. Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulation of the classic ones are potential winners.

 

Can I use Strassen’s Method when doing the matrix multiples in the HPL benchmark or for the Top500 run?

The normal matrix multination algorithm requires n3 + O(n2) multiplications and a