HPL Performance Results

The performance achieved by this software package on a few machine configurations is shown below. These results are provided for illustrative purposes only. By the time you read this, those systems will have changed and may no longer exist, and the exact state they were in when these measurements were obtained certainly cannot be reproduced. To obtain accurate figures for your own system, it is absolutely necessary to download the software and run it there.


4 AMD Athlon K7 500 MHz (256 MB) - (2x) 100 Mb/s Switched - 2 NICs per node (channel bonding)

OS          : RedHat Linux 6.2 (Kernel 2.2.14)
C compiler  : gcc (egcs-2.91.66, egcs-1.1.2 release)
C flags     : -fomit-frame-pointer -O3 -funroll-loops
MPI         : MPICH 1.2.1
BLAS        : ATLAS (Version 3.0 beta)
Comments    : 09 / 00

Performance (Gflops) w.r.t. problem size N on 4 nodes:

  GRID     N=2000   N=5000   N=8000   N=10000
  1 x 4      1.28     1.73     1.89      1.95
  2 x 2      1.17     1.68     1.88      1.93
  4 x 1      0.81     1.43     1.70      1.80
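
For reference, the Gflops figures reported in these tables are derived from the problem size N and the measured wall-clock time. The short stand-alone program below is only a sketch (a hypothetical helper, not part of the HPL distribution) and assumes the customary LU factorization operation count of roughly 2/3 N^3 + 3/2 N^2 floating-point operations:

    /*
     * estimate_gflops.c -- hypothetical helper, NOT part of the HPL
     * distribution.  Converts a problem size N and a measured wall-clock
     * time into a Gflops figure, assuming the customary LU factorization
     * operation count of roughly 2/3*N^3 + 3/2*N^2 flops.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static double estimate_gflops(double n, double seconds)
    {
        double flops = (2.0 / 3.0) * n * n * n + (3.0 / 2.0) * n * n;
        return flops / seconds / 1.0e9;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s N seconds\n", argv[0]);
            return EXIT_FAILURE;
        }
        double n = atof(argv[1]);       /* problem size N            */
        double t = atof(argv[2]);       /* wall-clock time (seconds) */
        printf("N = %.0f, time = %.2f s  ->  %.3f Gflops\n",
               n, t, estimate_gflops(n, t));
        return EXIT_SUCCESS;
    }

Under that assumed operation count, for example, the 1.95 Gflops obtained on the 1 x 4 grid at N = 10000 corresponds to a wall-clock time of roughly 340 seconds.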


8 dual Intel PIII 550 MHz nodes (512 MB) - Myrinet

OS          : RedHat Linux 6.1 (Kernel 2.2.15)
C compiler  : gcc (egcs-2.91.66, egcs-1.1.2 release)
C flags     : -fomit-frame-pointer -O3 -funroll-loops
MPI         : MPI GM (Version 1.2.3)
BLAS        : ATLAS (Version 3.0 beta)
Comments    : UTK / ICL - Torc cluster - 09 / 00

Performance (Gflops) w.r.t. problem size N on 8- and 16-processor grids:

  GRID     N=2000   N=5000   N=8000   N=10000   N=15000   N=20000
  2 x 4      1.76     2.32     2.51      2.58      2.72      2.73
  4 x 4      2.27     3.94     4.46      4.68      5.00      5.16


Compaq AlphaServer SC, 64 nodes (4 EV67 667 MHz processors per node)

OS          : Tru64 Version 5
C compiler  : cc Version 6.1
C flags     : -arch host -tune host -std -O5
MPI         : -lmpi -lelan
BLAS        : CXML
Comments    : ORNL / NCCS - falcon - 09 / 00

In the table below, each row corresponds to a given number of CPUs (processors) and nodes; the first row, denoted 1 / 1, corresponds to 1 CPU on 1 node. Rmax is given in Gflops, and Nmax was chosen so that the matrix occupies about 351 MB per CPU for all machine configurations (each matrix element is an 8-byte double, so the footprint per CPU is 8 * Nmax^2 / P bytes, where P is the number of CPUs).

  CPUS / NODES    GRID       N_1/2      Nmax    Rmax (Gflops)   Parallel Efficiency
     1 /  1      1 x 1         150      6625        1.136             1.000
     4 /  1      2 x 2         800     13250        4.360             0.960
    16 /  4      4 x 4        2300     26500       17.00              0.935
    64 / 16      8 x 8        5700     53000       67.50              0.928
   256 / 64     16 x 16      14000    106000      263.6               0.906

For the Rmax values shown in the table, the parallel efficiency per CPU has been computed with respect to the performance achieved by HPL on 1 CPU. This is a fair baseline: since the CXML matrix-multiply routine achieved at best 1.24 Gflops for large matrix operands on one CPU, it would have been difficult for a sequential Linpack benchmark implementation to achieve much more than 1.136 Gflops on that same CPU. For a constant load per CPU (351 MB per CPU at Nmax, as in the table), HPL scales almost linearly, as it should.
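
Both of these quantities can be recomputed directly from the table. The sketch below (a hypothetical stand-alone helper, not part of the HPL distribution) reproduces the 351 MB per CPU footprint and the parallel efficiency column from Nmax, the number of CPUs, and Rmax:

    /*
     * scaling_check.c -- hypothetical helper, NOT part of the HPL
     * distribution.  Recomputes, from the AlphaServer SC table above,
     * the per-CPU memory footprint at Nmax and the parallel efficiency
     * with respect to the 1-CPU run.
     */
    #include <stdio.h>

    int main(void)
    {
        /* Values copied from the table above: CPUs, Nmax, Rmax (Gflops). */
        const struct { int cpus; double nmax, rmax; } run[] = {
            {   1,   6625.0,   1.136 },
            {   4,  13250.0,   4.360 },
            {  16,  26500.0,  17.00  },
            {  64,  53000.0,  67.50  },
            { 256, 106000.0, 263.6   },
        };
        const double r1 = 1.136;   /* Rmax measured on 1 CPU (Gflops) */
        const int nruns = (int)(sizeof(run) / sizeof(run[0]));

        for (int i = 0; i < nruns; i++) {
            /* 8 bytes per double-precision matrix element, split over all
             * CPUs; decimal megabytes, matching the 351 MB quoted above. */
            double mb_per_cpu = 8.0 * run[i].nmax * run[i].nmax
                                / run[i].cpus / 1.0e6;
            double efficiency = run[i].rmax / (run[i].cpus * r1);
            printf("%4d CPUs: %6.1f MB per CPU, parallel efficiency %.3f\n",
                   run[i].cpus, mb_per_cpu, efficiency);
        }
        return 0;
    }

Each configuration works out to about 351 MB per CPU, and the computed ratios reproduce the parallel efficiency column shown above.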

The authors acknowledge the use of the Oak Ridge National Laboratory Compaq computer, funded by the Department of Energy's Office of Science and Energy Efficiency programs.
