HPL Performance Results

The performance achieved by this software package on a few machine configurations is shown below. These results are provided for illustrative purposes only. By the time you read this, those systems will have changed and may no longer exist, and the exact state they were in when these measurements were obtained certainly cannot be reproduced. To obtain accurate figures for your own system, it is absolutely necessary to download the software and run it there.


4 AMD Athlon K7 500 MHz (256 MB) - (2x) 100 Mb/s Switched - 2 NICs per node (channel bonding)

OS          : RedHat Linux 6.2 (Kernel 2.2.14)
C compiler  : gcc (egcs-2.91.66, egcs-1.1.2 release)
C flags     : -fomit-frame-pointer -O3 -funroll-loops
MPI         : MPICH 1.2.1
BLAS        : ATLAS (Version 3.0 beta)
Comments    : 09 / 00

Performance (Gflops) w.r.t. problem size N on 4 nodes:

  GRID     N=2000   N=5000   N=8000   N=10000
  1 x 4      1.28     1.73     1.89      1.95
  2 x 2      1.17     1.68     1.88      1.93
  4 x 1      0.81     1.43     1.70      1.80
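
For reference, the Gflops figures reported in these tables are derived from the problem size N and the measured wall-clock time. The short stand-alone program below is only a sketch (a hypothetical helper, not part of the HPL distribution) and assumes the customary LU factorization operation count of roughly 2/3 N^3 + 3/2 N^2 floating-point operations:

    /*
     * estimate_gflops.c -- hypothetical helper, NOT part of the HPL
     * distribution.  Converts a problem size N and a measured wall-clock
     * time into a Gflops figure, assuming the customary LU factorization
     * operation count of roughly 2/3*N^3 + 3/2*N^2 flops.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static double estimate_gflops(double n, double seconds)
    {
        double flops = (2.0 / 3.0) * n * n * n + (3.0 / 2.0) * n * n;
        return flops / seconds / 1.0e9;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s N seconds\n", argv[0]);
            return EXIT_FAILURE;
        }
        double n = atof(argv[1]);       /* problem size N            */
        double t = atof(argv[2]);       /* wall-clock time (seconds) */
        printf("N = %.0f, time = %.2f s  ->  %.3f Gflops\n",
               n, t, estimate_gflops(n, t));
        return EXIT_SUCCESS;
    }

Under that assumed operation count, for example, the 1.95 Gflops obtained on the 1 x 4 grid at N = 10000 corresponds to a wall-clock time of roughly 340 seconds.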


8 dual Intel PIII 550 MHz nodes (512 MB) - Myrinet

OS          : RedHat Linux 6.1 (Kernel 2.2.15)
C compiler  : gcc (egcs-2.91.66, egcs-1.1.2 release)
C flags     : -fomit-frame-pointer -O3 -funroll-loops
MPI         : MPI GM (Version 1.2.3)
BLAS        : ATLAS (Version 3.0 beta)
Comments    : UTK / ICL - Torc cluster - 09 / 00

Performance (Gflops) w.r.t. problem size N on 8- and 16-processor grids:

  GRID     N=2000   N=5000   N=8000   N=10000   N=15000   N=20000
  2 x 4      1.76     2.32     2.51      2.58      2.72      2.73
  4 x 4      2.27     3.94     4.46      4.68      5.00      5.16


Compaq AlphaServer SC, 64 nodes (4 EV67 667 MHz processors per node)

OS          : Tru64 Version 5
C compiler  : cc Version 6.1
C flags     : -arch host -tune host -std -O5
MPI         : -lmpi -lelan
BLAS        : CXML
Comments    : ORNL / NCCS - falcon - 09 / 00

In the table below, each row corresponds to a given number of CPUs (processors) and nodes; the first row, denoted 1 / 1, corresponds to 1 CPU on 1 node. Rmax is given in Gflops, and Nmax was chosen so that the matrix occupies about 351 MB per CPU for all machine configurations (each matrix element is an 8-byte double, so the footprint per CPU is 8 * Nmax^2 / P bytes, where P is the number of CPUs).

  CPUS / NODES    GRID       N_1/2      Nmax    Rmax (Gflops)   Parallel Efficiency
     1 /  1      1 x 1         150      6625        1.136             1.000
     4 /  1      2 x 2         800     13250        4.360             0.960
    16 /  4      4 x 4        2300     26500       17.00              0.935
    64 / 16      8 x 8        5700     53000       67.50              0.928
   256 / 64     16 x 16      14000    106000      263.6               0.906

For the Rmax values shown in the table, the parallel efficiency per CPU has been computed with respect to the performance achieved by HPL on 1 CPU. This is a fair baseline: since the CXML matrix-multiply routine achieved at best 1.24 Gflops for large matrix operands on one CPU, it would have been difficult for a sequential Linpack benchmark implementation to achieve much more than 1.136 Gflops on that same CPU. For a constant load per CPU (351 MB per CPU at Nmax, as in the table), HPL scales almost linearly, as it should.
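
Both of these quantities can be recomputed directly from the table. The sketch below (a hypothetical stand-alone helper, not part of the HPL distribution) reproduces the 351 MB per CPU footprint and the parallel efficiency column from Nmax, the number of CPUs, and Rmax:

    /*
     * scaling_check.c -- hypothetical helper, NOT part of the HPL
     * distribution.  Recomputes, from the AlphaServer SC table above,
     * the per-CPU memory footprint at Nmax and the parallel efficiency
     * with respect to the 1-CPU run.
     */
    #include <stdio.h>

    int main(void)
    {
        /* Values copied from the table above: CPUs, Nmax, Rmax (Gflops). */
        const struct { int cpus; double nmax, rmax; } run[] = {
            {   1,   6625.0,   1.136 },
            {   4,  13250.0,   4.360 },
            {  16,  26500.0,  17.00  },
            {  64,  53000.0,  67.50  },
            { 256, 106000.0, 263.6   },
        };
        const double r1 = 1.136;   /* Rmax measured on 1 CPU (Gflops) */
        const int nruns = (int)(sizeof(run) / sizeof(run[0]));

        for (int i = 0; i < nruns; i++) {
            /* 8 bytes per double-precision matrix element, split over all
             * CPUs; decimal megabytes, matching the 351 MB quoted above. */
            double mb_per_cpu = 8.0 * run[i].nmax * run[i].nmax
                                / run[i].cpus / 1.0e6;
            double efficiency = run[i].rmax / (run[i].cpus * r1);
            printf("%4d CPUs: %6.1f MB per CPU, parallel efficiency %.3f\n",
                   run[i].cpus, mb_per_cpu, efficiency);
        }
        return 0;
    }

Each configuration works out to about 351 MB per CPU, and the computed ratios reproduce the parallel efficiency column shown above.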

The authors acknowledge the use of the Oak Ridge National Laboratory Compaq computer, funded by the Department of Energy's Office of Science and Energy Efficiency programs.
