LAPACK Benchmark

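This section contains performance numbers for selected LAPACK driver routines. These routines provide complete solutions for the most common problems of numerical linear algebra, and are the routines users are most likely to call: DGESV for solving systems of linear equations, DGEEV for the nonsymmetric eigenvalue problem, and DGESVD and DGESDD for the singular value decomposition.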

For the singular-values-only case we present data for DGESDD but not DGESVD, because both use the same algorithm. For computing all the singular values and singular vectors we include both DGESVD and DGESDD, to illustrate the speedup of the new algorithm DGESDD over its predecessor DGESVD: for 1000-by-1000 matrices DGESDD is between 6 and 7 times faster than DGESVD on most machines.
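The speedup quoted above can be checked directly against the N=1000 run times in Tables 3.18 and 3.19. The following is a minimal sketch in Python (not part of LAPACK); the dictionary and function names are hypothetical, and only a subset of machines is copied from the tables.

```python
# Run times in seconds for N = 1000, singular values and both sets of
# singular vectors, copied from Table 3.18 (DGESVD) and Table 3.19 (DGESDD).
dgesvd_time = {
    "Dec Alpha Miata": 320.985,
    "Compaq AS DS-20": 142.843,
    "Intel Pentium II": 282.550,
    "Intel Pentium III": 244.690,
    "Sun Ultra 2": 570.290,
}
dgesdd_time = {
    "Dec Alpha Miata": 47.206,
    "Compaq AS DS-20": 20.658,
    "Intel Pentium II": 44.270,
    "Intel Pentium III": 38.930,
    "Sun Ultra 2": 93.417,
}


def speedup(machine):
    """Ratio of DGESVD time to DGESDD time on one machine (hypothetical helper)."""
    return dgesvd_time[machine] / dgesdd_time[machine]


if __name__ == "__main__":
    for machine in dgesvd_time:
        print(f"{machine:20s} {speedup(machine):5.2f}")
```

For the machines listed here every ratio falls between 6 and 7. The IBM Power 3 and the 4-processor SGI Origin 2000 rows, which are not included in the sketch, show larger ratios.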

The above drivers are timed on a variety of computers. In addition, we present data on fewer machines to compare the performance of the five different routines for solving linear least squares problems, and several different routines for the symmetric eigenvalue problem. Again, the purpose is to illustrate the performance improvements in LAPACK 3.0.

Data is provided for PCs, shared memory parallel computers, and high performance workstations. All timings were obtained by using the machine-specific optimized BLAS available on each machine. For machines running the Linux operating system, the ATLAS[102] BLAS were used. In all cases the data consisted of 64-bit floating point numbers (double precision). For each machine and each driver, a small problem (N=100 with LDA=101) and a large problem (N=1000 with LDA=1001) were run. Block sizes NB = 1, 16, 32 and 64 were tried, with data only for the fastest run reported in the tables below. For DGEEV, ILO=1 and IHI=N. The test matrices were generated with randomly distributed entries. All run times are reported in seconds, and block size is denoted by nb. The value of nb was chosen to make N=1000 optimal. It is not necessarily the best choice for N=100. See Section 6.2 for details.
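The pairs N=100 with LDA=101 and N=1000 with LDA=1001 mean that each matrix is stored in an array whose leading dimension is one larger than the matrix order. The sketch below illustrates what that implies for addressing, assuming the usual Fortran column-major layout used by LAPACK; it is written in Python with 0-based indices, and the function name is hypothetical.

```python
def column_major_offset(i, j, lda):
    """Offset of entry (i, j), 0-based, in a column-major array with
    leading dimension lda (hypothetical helper, not a LAPACK routine)."""
    return i + j * lda


n, lda = 100, 101

# The array holding the matrix has lda rows and n columns, so each
# column is followed by (lda - n) unused storage locations.
storage = [0.0] * (lda * n)

# Consecutive entries within a column are adjacent in memory ...
assert column_major_offset(1, 0, lda) - column_major_offset(0, 0, lda) == 1
# ... while consecutive entries within a row are lda locations apart.
assert column_major_offset(0, 1, lda) - column_major_offset(0, 0, lda) == lda

if __name__ == "__main__":
    print(len(storage), column_major_offset(n - 1, n - 1, lda))
```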

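The performance data is reported using three or four statistics. First, the run-time in seconds is given. The second statistic measures how well our performance compares to the speed of the BLAS, specifically DGEMM. This ``equivalent matrix multiplies'' statistic is calculated as

$\frac{\rm Time}{\rm T(MM)} = \frac{\mbox{run time of the LAPACK driver on an n-by-n matrix}}{\mbox{run time of DGEMM multiplying two n-by-n matrices}}$

and appears in the tables under the heading $\frac{\rm Time}{\rm T(MM)}$; the DGEMM times are those of Table 3.12. The remaining one or two statistics are megaflop rates: a single Mflops column for DGESV, and True and Synth Mflops columns for the other drivers.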
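As a worked check, the ``equivalent matrix multiplies'' statistic can be reproduced from the tabulated times by dividing a driver's run time by the DGEMM run time for the same n on the same machine. This is a small Python sketch with a hypothetical function name; the input values are copied from Tables 3.12 and 3.14.

```python
def equivalent_matrix_multiplies(driver_time, dgemm_time):
    """Driver run time expressed in units of one DGEMM of the same order."""
    return driver_time / dgemm_time


# Dec Alpha Miata: DGESV times from Table 3.14, DGEMM times from Table 3.12.
small = equivalent_matrix_multiplies(0.004, 0.0018)   # n = 100
large = equivalent_matrix_multiplies(1.903, 1.712)    # n = 1000

if __name__ == "__main__":
    print(round(small, 1), round(large, 2))  # 2.2 1.11
```

Both values agree with the Time/T(MM) entries for the Dec Alpha Miata in Table 3.14.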

We also include several figures comparing the speed of several routines for the symmetric eigenvalue problem and several least squares drivers to highlight the performance improvements in LAPACK 3.0.

First consider Figure 3.1, which compares the performance of three routines, DSTEQR, DSTEDC and DSTEGR, for computing all the eigenvalues and eigenvectors of a symmetric tridiagonal matrix. The times are shown on a Compaq AlphaServer DS-20 for matrix dimensions from 100 to 1000. The symmetric tridiagonal matrix was obtained by taking a random dense symmetric matrix and reducing it to tridiagonal form (the performance can vary depending on the distribution of the eigenvalues of the matrix, but the data shown here is typical). DSTEQR (used in driver DSYEV) was the only algorithm available in LAPACK 1.0, DSTEDC (used in driver DSYEVD) was introduced in LAPACK 2.0, and DSTEGR (used in driver DSYEVR) was introduced in LAPACK 3.0. As can be seen, for large matrices DSTEGR is about 14 times faster than DSTEDC and nearly 50 times faster than DSTEQR.

Next consider Figure 3.2, which compares the performance of four driver routines, DSYEV, DSYEVX, DSYEVD and DSYEVR, for computing all the eigenvalues and eigenvectors of a dense symmetric matrix. The times are shown on an IBM Power 3 for matrix dimensions from 100 to 2000. The symmetric matrix was chosen randomly. The cost of these drivers is essentially the cost of phases 1 and 3 (reduction to tridiagonal form and backtransformation) plus the cost of phase 2 (the symmetric tridiagonal eigenproblem) discussed in the previous paragraph. Since the cost of phases 1 and 3 is large, performance differences in phase 2 are no longer as visible. We note that if we had chosen a test matrix with a large cluster of nearby eigenvalues, then the cost of DSYEVX would have been much larger, without significantly affecting the timings of the other drivers. DSYEVR is the driver of choice.

Finally consider Figure 3.3, which compares the performance of five drivers for the linear least squares problem, DGELS, DGELSY, DGELSX, DGELSD and DGELSS, listed here in order of decreasing speed. DGELS is the fastest. DGELSY and DGELSX use QR with pivoting, and so handle rank-deficient problems more reliably than DGELS but can be more expensive. DGELSD and DGELSS use the SVD, and so are the most reliable (and expensive) ways to solve rank-deficient least squares problems. DGELS, DGELSX and DGELSS were in LAPACK 1.0, and DGELSY and DGELSD were introduced in LAPACK 3.0. The times are shown on a Compaq AlphaServer DS-20 for square matrices with dimensions from 100 to 1000, and for one right-hand side. The matrices were chosen at random (which means they are full rank). First consider DGELSY, which is meant to replace DGELSX. We can see that the speed of DGELSY is nearly indistinguishable from the fastest routine DGELS, whereas DGELSX is over 2.5 times slower for large matrices. Next consider DGELSD, which is meant to replace DGELSS. It is 3 to 5 times slower than the fastest routine, DGELS, whereas its predecessor DGELSS was 7 to 34 times slower. Thus both DGELSD and DGELSY are significantly faster than their predecessors.

**Figure 3.1:** Timings of routines for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix. The upper graph shows times in seconds on a Compaq AlphaServer DS-20. The lower graph shows times relative to the fastest routine DSTEGR, which appears as a horizontal line at 1.

**Figure 3.2:** Timings of driver routines for computing all eigenvalues and eigenvectors of a dense symmetric matrix. The upper graph shows times in seconds on an IBM Power 3. The lower graph shows times relative to the fastest routine DSYEVR, which appears as a horizontal line at 1.

**Figure 3.3:** Timings of driver routines for the least squares problem. The upper graph shows times in seconds on a Compaq AlphaServer DS-20. The lower graph shows times relative to the fastest routine DGELS, which appears as a horizontal line at 1.

**Table 3.12:** Execution time and Megaflop rates for DGEMV and DGEMM
	DGEMV				DGEMM
	Values of n=m=k
	100		1000		100		1000
	Time	Mflops	Time	Mflops	Time	Mflops	Time	Mflops
Dec Alpha Miata	.0151	66	27.778	36	.0018	543	1.712	584
Compaq AlphaServer DS-20	.0027	376	8.929	112	.0019	522	2.000	500
IBM Power 3	.0032	304	2.857	350	.0018	567	1.385	722
IBM PowerPC	.0435	23	40.000	25	.0063	160	4.717	212
Intel Pentium II	.0075	134	16.969	59	.0031	320	3.003	333
Intel Pentium III	.0071	141	14.925	67	.0030	333	2.500	400
SGI O2K (1 proc)	.0046	216	4.762	210	.0018	563	1.801	555
SGI O2K (4 proc)	5.000	0.2	2.375	421	.0250	40	0.517	1936
Sun Ultra 2 (1 proc)	.0081	124	17.544	57	.0033	302	3.484	287
Sun Enterprise 450 (1 proc)	.0037	267	11.628	86	.0021	474	1.898	527

**Table 3.13:** ``Standard'' floating point operation counts for LAPACK drivers for n-by-n matrices
Driver	Options	Operation Count
xGESV	1 right hand side	$.67 \cdot N^3$
xGEEV	eigenvalues only	$10.00 \cdot N^3$
xGEEV	eigenvalues and right eigenvectors	$26.33 \cdot N^3$
xGES{VD,DD}	singular values only	$2.67 \cdot N^3$
xGES{VD,DD}	singular values and left and right singular vectors	$6.67 \cdot N^3$
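The ``standard'' counts above can be turned into a megaflop rate by dividing by the measured run time. The Python sketch below assumes that the Synth Mflops columns in the following tables are computed this way; the assumption is consistent with the tabulated values checked here, but this section does not state it explicitly. The dictionary keys and function name are hypothetical.

```python
# Coefficients c of the standard operation counts c * N^3 from Table 3.13.
STANDARD_COUNT_COEFF = {
    ("xGESV", "1 right hand side"): 0.67,
    ("xGEEV", "eigenvalues only"): 10.00,
    ("xGEEV", "eigenvalues and right eigenvectors"): 26.33,
    ("xGESDD", "singular values only"): 2.67,
    ("xGESDD", "singular values and singular vectors"): 6.67,
}


def synthetic_mflops(driver, options, n, seconds):
    """Standard operation count divided by run time, in Mflops.

    Assumption: this mirrors how the Synth columns are derived.
    """
    coeff = STANDARD_COUNT_COEFF[(driver, options)]
    return coeff * n**3 / seconds / 1.0e6


if __name__ == "__main__":
    # Dec Alpha Miata, N = 1000, run times from Tables 3.15 and 3.19.
    print(round(synthetic_mflops("xGEEV", "eigenvalues only", 1000, 116.480)))
    print(round(synthetic_mflops(
        "xGESDD", "singular values and singular vectors", 1000, 47.206)))
```

The two printed values, 86 and 141, match the Synth Mflops entries for the Dec Alpha Miata in Tables 3.15 and 3.19.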

**Table 3.14:** Performance of DGESV for n-by-n matrices
	No. of		Values of n
	proc.	nb	100			1000
			Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops
Dec Alpha Miata	1	28	.004	2.2	164	1.903	1.11	351
Compaq AlphaServer DS-20	1	28	.002	1.05	349	1.510	0.76	443
IBM Power 3	1	32	.003	1.67	245	1.210	0.87	552
Intel Pentium II	1	40	.006	1.94	123	2.730	0.91	245
Intel Pentium III	1	40	.005	1.67	136	2.270	0.91	294
SGI Origin 2000	1	64	.003	1.67	227	1.454	0.81	460
SGI Origin 2000	4	64	.004	0.16	178	1.204	2.33	555
Sun Ultra 2	1	64	.008	2.42	81	5.460	1.57	122
Sun Enterprise 450	1	64	.006	2.86	114	3.698	1.95	181

**Table 3.15:** Performance of DGEEV, eigenvalues only
	No. of		Values of n
	proc.	nb	100				1000
					True	Synth			True	Synth
			Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops	Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops
Dec Alpha Miata	1	28	.157	87.22	70	64	116.480	68.04	81	86
Compaq AS DS-20	1	28	.044	23.16	423	228	52.932	26.47	177	189
IBM Power 3	1	32	.060	33.33	183	167	91.210	65.86	103	110
Intel Pentium II	1	40	.100	32.26	110	100	107.940	35.94	87	93
Intel Pentium III	1	40	.080	26.67	137	133	91.230	36.49	103	110
SGI Origin 2000	1	64	.074	41.11	148	135	54.852	30.46	172	182
SGI Origin 2000	4	64	.093	3.72	117	107	42.627	82.45	222	235
Sun Ultra 2	1	64	.258	78.18	43	38	246.151	70.65	38	41
Sun Enterprise 450	1	64	.178	84.76	62	56	163.141	85.95	57	61

**Table 3.16:** Performance of DGEEV, eigenvalues and right eigenvectors
	No. of		Values of n
	proc.	nb	100				1000
					True	Synth			True	Synth
			Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops	Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops
Dec Alpha Miata	1	28	.308	171.11	86	86	325.650	190.22	73	81
Compaq AS DS-20	1	28	.092	48.42	290	287	159.409	79.70	149	165
IBM Power 3	1	32	.130	72.22	204	203	230.650	166.53	103	114
Intel Pentium II	1	40	.200	64.52	133	132	284.020	94.58	84	93
Intel Pentium III	1	40	.170	56.67	156	155	239.070	95.63	100	110
SGI Origin 2000	1	64	.117	65.00	228	226	197.455	109.64	121	133
SGI Origin 2000	4	64	.159	6.36	167	166	146.975	284.28	164	179
Sun Ultra 2	1	64	.460	139.39	58	58	601.732	172.71	39	44
Sun Enterprise 450	1	64	.311	148.10	85	85	418.011	220.24	57	63

**Table 3.17:** Performance of DGESDD, singular values only
	No. of		Values of n
	proc.	nb	100				1000
					True	Synth			True	Synth
			Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops	Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops
Dec Alpha Miata	1	28	.043	23.89	61	61	36.581	21.37	73	73
Compaq AS DS-20	1	28	.011	5.79	236	236	11.789	5.89	226	226
IBM Power 3	1	32	.020	11.11	133	133	8.090	5.84	330	330
Intel Pentium II	1	40	.040	12.90	67	67	29.120	9.70	92	92
Intel Pentium III	1	40	.030	10.00	89	89	25.830	10.33	103	103
SGI Origin 2000	1	64	.024	13.33	113	113	12.407	6.89	215	215
SGI Origin 2000	4	64	.058	2.32	46	46	4.926	9.53	541	541
Sun Ultra 2	1	64	.088	26.67	30	30	60.478	17.36	44	44
Sun Enterprise 450	1	64	.060	28.57	92	45	47.813	25.19	56	56

**Table 3.18:** Performance of DGESVD, singular values and left and right singular vectors
	No. of		Values of n
	proc.	nb	100				1000
					True	Synth			True	Synth
			Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops	Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops
Dec Alpha Miata	1	28	.222	123.33	77	30	320.985	187.49	48	21
Compaq AS DS-20	1	28	.053	27.89	326	126	142.843	71.42	107	47
IBM Power 3	1	32	.070	38.89	245	95	251.940	181.91	61	26
Intel Pentium II	1	40	.150	48.39	114	44	282.550	94.09	54	24
Intel Pentium III	1	40	.120	40.00	142	56	244.690	97.88	62	27
SGI Origin 2000	1	64	.074	41.11	232	90	176.134	97.80	87	38
SGI Origin 2000	4	64	.145	5.80	118	46	198.656	384.25	77	34
Sun Ultra 2	1	64	.277	83.94	62	24	570.290	163.69	27	12
Sun Enterprise 450	1	64	.181	86.19	95	37	402.456	212.04	38	17

**Table 3.19:** Performance of DGESDD, singular values and left and right singular vectors
	No. of		Values of n
	proc.	nb	100				1000
					True	Synth			True	Synth
			Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops	Time	$\frac{\rm Time}{\rm T(MM)}$	Mflops	Mflops
Dec Alpha Miata	1	28	.055	30.56	123	121	47.206	27.57	141	141
Compaq AS DS-20	1	28	.021	11.05	310	318	20.658	10.33	323	323
IBM Power 3	1	32	.025	13.89	268	267	15.230	11.00	438	438
Intel Pentium II	1	40	.060	19.35	112	111	44.270	14.74	151	151
Intel Pentium III	1	40	.050	16.67	134	133	38.930	15.57	171	171
SGI Origin 2000	1	64	.035	19.44	189	191	24.985	13.87	267	267
SGI Origin 2000	4	64	.091	3.64	73	73	8.779	16.89	759	760
Sun Ultra 2	1	64	.149	45.15	45	45	93.417	26.81	72	71
Sun Enterprise 450	1	64	.102	48.57	66	65	70.597	37.20	94	94