This section contains performance numbers for selected LAPACK driver routines. These routines provide complete solutions for the most common problems of numerical linear algebra, and are the routines users are most likely to call:

- Solve an
-by-*n*system of linear equations with 1 right hand side using DGESV. side*n* - Find only the eigenvalues of an
-by-*n*nonsymmetric matrix using DGEEV.*n* - Find the eigenvalues and right eigenvectors of an
-by-*n*nonsymmetric matrix using*n*

DGEEV. DSYEVD. matrix using DSYEVD. - Find only the singular values of an
-by-*n*matrix using DGESDD.*n* - Find the singular values and right and left singular vectors of an
-by-*n*matrix using*n*

DGESVD. - Find the singular values and right and left singular vectors of an
-by-*n*matrix using*n*

DGESDD.

We only present data on DGESDD for singular values only, and not DGESVD, because both use the same algorithm. We include both DGESVD and DGESDD for computing all the singular values and singular vectors to illustrate the speedup of the new algorithm DGESDD over its predecessor DGESVD: For 1000-by-1000 matrices DGESDD is between 6 and 7 times faster than DGESVD on most machines.

The above drivers are timed on a variety of computers. In addition, we present data on fewer machines to compare the performance of the five different routines for solving linear least squares problems, and several different routines for the symmetric eigenvalue problem. Again, the purpose is to illustrate the performance improvements in LAPACK 3.0.

Data is provided for PCs, shared memory parallel
computers, and high performance workstations. All timings were obtained by
using the machine-specific optimized BLAS available on each machine. For
machines running the Linux operating system, the ATLAS[102] BLAS
were used.
In all cases the data consisted
of 64-bit floating point numbers (double precision).
For each machine and each driver, a small problem
(** N=100** with

The performance data is reported using three or four statistics. First, the run-time in seconds is given. The second statistic measures how well our performance compares to the speed of the BLAS, specifically DGEMM. This ``equivalent matrix multiplies'' statistic is calculated as

DGEMV (TRANS='N') and DGEMM (TRANSA='N', TRANSB='N') is provided in Table 3.12, along with the clock speed for each machine in Tables 3.2, 3.3, 3.4, 3.5, and 3.6. The third statistic is the true megaflop rating. For the eigenvalue and singular value drivers, a fourth ``synthetic megaflop'' statistic is also presented. We provide this statistic because the number of floating point operations needed to find eigenvalues and singular values depends on the input data, unlike linear equation solving or linear least squares solving with DGELS, and on the algorithm. The synthetic megaflop rating is defined to be a ``standard'' number of flops required to solve the problem, divided by the run-time in microseconds. This ``standard'' number of flops is taken to be the average for a standard algorithm over a variety of problems, as given in Table 3.13. In all cases we ignore terms of

We also include several figures comparing the speed of several routines for the symmetric eigenvalue problem and several least squares drivers to highlight the performance improvements in LAPACK 3.0.

First consider Figure 3.1, which compares the performance of three routines, DSTEQR, DSTEDC and DSTEGR, for computing all the eigenvalues and eigenvectors of a symmetric tridiagonal matrix. The times are shown on a Compaq AlphaServer DS-20 for matrix dimensions from 100 to 1000. The symmetric tridiagonal matrix was obtained by taking a random dense symmetric matrix and reducing it to tridiagonal form (the performance can vary depending on the distribution of the eigenvalues of the matrix, but the data shown here is typical). DSTEQR (used in driver DSYEV) was the only algorithm available in LAPACK 1.0, DSTEDC (used in driver DSYEVD) was introduced in LAPACK 2.0, and DSTEGR (used in driver DSYEVR) was introduced in LAPACK 3.0. As can be seen, for large matrices DSTEGR is about 14 times faster than DSTEDC and nearly 50 times faster than DSTEQR.

Next consider Figure 3.2, which compares the performance of four driver routines, DSYEV, DSYEVX, DSYEVD and DSYEVR, for computing all the eigenvalues and eigenvectors of a dense symmetric matrix. The times are shown on an IBM Power 3 for matrix dimensions from 100 to 2000. The symmetric matrix was chosen randomly. The cost of these drivers is essentially the cost of phases 1 and 3 (reduction to tridiagonal form and backtransformation) plus the cost of phase 2 (the symmetric tridiagonal eigenproblem) discussed in the last paragraph. Since the cost of phases 1 and 3 is large, performance differences in phase 2 are no longer as visible. We note that if we had chosen a test matrix with a large cluster of nearby eigenvalues, then the cost of DSYEVX would have been much larger, without significantly affecting the timings of the other drivers. DSYEVR is the driver of choice.

Finally consider Figure 3.3, which compares the performance of five drivers for the linear least squares problem, DGELS, DGELSY, DGELSX, DGELSD and DGELSS, which are shown in order of decreasing speed. DGELS is the fastest. DGELSY and DGELSX use QR with pivoting, and so handle rank-deficient problems more reliably than DGELS but can be more expensive. DGELSD and DGELSS use the SVD, and so are the most reliable (and expensive) ways to solve rank deficient least squares problems. DGELS, DGELSX and DGELSS were in LAPACK 1.0, and DGELSY and DGELSD were introduced in LAPACK 3.0. The times are shown on a Compaq AlphaServer DS-20 for squares matrices with dimensions from 100 to 1000, and for one right-hand-side. The matrices were chosen at random (which means they are full rank). First consider DGELSY, which is meant to replace DGELSX. We can see that the speed of DGELSY is nearly indistinguishable from the fastest routine DGELS, whereas DGELSX is over 2.5 times slower for large matrices. Next consider DGELSD, which is meant to replace DGELSS. It is 3 to 5 times slower than the fastest routine, DGELS, whereas its predecessor DGELSS was 7 to 34 times slower. Thus both DGELSD and DGELSY are significantly faster than their predecessors.

DGEMV | DGEMM | |||||||

Values of n=m=k |
||||||||

100 |
1000 |
100 |
1000 |
|||||

Time | Mflops | Time | Mflops | Time | Mflops | Time | Mflops | |

Dec Alpha Miata | .0151 | 66 | 27.778 | 36 | .0018 | 543 | 1.712 | 584 |

Compaq AlphaServer DS-20 | .0027 | 376 | 8.929 | 112 | .0019 | 522 | 2.000 | 500 |

IBM Power 3 | .0032 | 304 | 2.857 | 350 | .0018 | 567 | 1.385 | 722 |

IBM PowerPC | .0435 | 23 | 40.000 | 25 | .0063 | 160 | 4.717 | 212 |

Intel Pentium II | .0075 | 134 | 16.969 | 59 | .0031 | 320 | 3.003 | 333 |

Intel Pentium III | .0071 | 141 | 14.925 | 67 | .0030 | 333 | 2.500 | 400 |

SGI O2K (1 proc) | .0046 | 216 | 4.762 | 210 | .0018 | 563 | 1.801 | 555 |

SGI O2K (4 proc) | 5.000 | 0.2 | 2.375 | 421 | .0250 | 40 | 0.517 | 1936 |

Sun Ultra 2 (1 proc) | .0081 | 124 | 17.544 | 57 | .0033 | 302 | 3.484 | 287 |

Sun Enterprise 450 (1 proc) | .0037 | 267 | 11.628 | 86 | .0021 | 474 | 1.898 | 527 |

megaflop SSYEV/DSYEV.

Driver | Options | Operation |

Count | ||

xGESV | 1 right hand side | |

xGEEV | eigenvalues only | |

xGEEV | eigenvalues and right eigenvectors | |

xGES{VD,DD} | singular values only | |

xGES{VD,DD} | singular values and left and right singular vectors |

No. of | Values of n |
|||||||

proc. | nb |
100 | 1000 | |||||

Time | Mflops | Time | Mflops | |||||

Dec Alpha Miata | 1 | 28 | .004 | 2.2 | 164 | 1.903 | 1.11 | 351 |

Compaq AlphaServer DS-20 | 1 | 28 | .002 | 1.05 | 349 | 1.510 | 0.76 | 443 |

IBM Power 3 | 1 | 32 | .003 | 1.67 | 245 | 1.210 | 0.87 | 552 |

Intel Pentium II | 1 | 40 | .006 | 1.94 | 123 | 2.730 | 0.91 | 245 |

Intel Pentium III | 1 | 40 | .005 | 1.67 | 136 | 2.270 | 0.91 | 294 |

SGI Origin 2000 | 1 | 64 | .003 | 1.67 | 227 | 1.454 | 0.81 | 460 |

SGI Origin 2000 | 4 | 64 | .004 | 0.16 | 178 | 1.204 | 2.33 | 555 |

Sun Ultra 2 | 1 | 64 | .008 | 2.42 | 81 | 5.460 | 1.57 | 122 |

Sun Enterprise 450 | 1 | 64 | .006 | 2.86 | 114 | 3.698 | 1.95 | 181 |

No. of | Values of n |
|||||||||

proc. | nb |
100 | 1000 | |||||||

True | Synth | True | Synth | |||||||

Time | Mflops | Mflops | Time | Mflops | Mflops | |||||

Dec Alpha Miata | 1 | 28 | .157 | 87.22 | 70 | 64 | 116.480 | 68.04 | 81 | 86 |

Compaq AS DS-20 | 1 | 28 | .044 | 23.16 | 423 | 228 | 52.932 | 26.47 | 177 | 189 |

IBM Power 3 | 1 | 32 | .060 | 33.33 | 183 | 167 | 91.210 | 65.86 | 103 | 110 |

Intel Pentium II | 1 | 40 | .100 | 32.26 | 110 | 100 | 107.940 | 35.94 | 87 | 93 |

Intel Pentium III | 1 | 40 | .080 | 26.67 | 137 | 133 | 91.230 | 36.49 | 103 | 110 |

SGI Origin 2000 | 1 | 64 | .074 | 41.11 | 148 | 135 | 54.852 | 30.46 | 172 | 182 |

SGI Origin 2000 | 4 | 64 | .093 | 3.72 | 117 | 107 | 42.627 | 82.45 | 222 | 235 |

Sun Ultra 2 | 1 | 64 | .258 | 78.18 | 43 | 38 | 246.151 | 70.65 | 38 | 41 |

Sun Enterprise 450 | 1 | 64 | .178 | 84.76 | 62 | 56 | 163.141 | 85.95 | 57 | 61 |

No. of | Values of n |
|||||||||

proc. | nb |
100 | 1000 | |||||||

True | Synth | True | Synth | |||||||

Time | Mflops | Mflops | Time | Mflops | Mflops | |||||

Dec Alpha Miata | 1 | 28 | .308 | 171.11 | 86 | 86 | 325.650 | 190.22 | 73 | 81 |

Compaq AS DS-20 | 1 | 28 | .092 | 48.42 | 290 | 287 | 159.409 | 79.70 | 149 | 165 |

IBM Power 3 | 1 | 32 | .130 | 72.22 | 204 | 203 | 230.650 | 166.53 | 103 | 114 |

Intel Pentium II | 1 | 40 | .200 | 64.52 | 133 | 132 | 284.020 | 94.58 | 84 | 93 |

Intel Pentium III | 1 | 40 | .170 | 56.67 | 156 | 155 | 239.070 | 95.63 | 100 | 110 |

SGI Origin 2000 | 1 | 64 | .117 | 65.00 | 228 | 226 | 197.455 | 109.64 | 121 | 133 |

SGI Origin 2000 | 4 | 64 | .159 | 6.36 | 167 | 166 | 146.975 | 284.28 | 164 | 179 |

Sun Ultra 2 | 1 | 64 | .460 | 139.39 | 58 | 58 | 601.732 | 172.71 | 39 | 44 |

Sun Enterprise 450 | 1 | 64 | .311 | 148.10 | 85 | 85 | 418.011 | 220.24 | 57 | 63 |

No. of | Values of n |
|||||||||

proc. | nb |
100 | 1000 | |||||||

True | Synth | True | Synth | |||||||

Time | Mflops | Mflops | Time | Mflops | Mflops | |||||

Dec Alpha Miata | 1 | 28 | .043 | 23.89 | 61 | 61 | 36.581 | 21.37 | 73 | 73 |

Compaq AS DS-20 | 1 | 28 | .011 | 5.79 | 236 | 236 | 11.789 | 5.89 | 226 | 226 |

IBM Power 3 | 1 | 32 | .020 | 11.11 | 133 | 133 | 8.090 | 5.84 | 330 | 330 |

Intel Pentium II | 1 | 40 | .040 | 12.90 | 67 | 67 | 29.120 | 9.70 | 92 | 92 |

Intel Pentium III | 1 | 40 | .030 | 10.00 | 89 | 89 | 25.830 | 10.33 | 103 | 103 |

SGI Origin 2000 | 1 | 64 | .024 | 13.33 | 113 | 113 | 12.407 | 6.89 | 215 | 215 |

SGI Origin 2000 | 4 | 64 | .058 | 2.32 | 46 | 46 | 4.926 | 9.53 | 541 | 541 |

Sun Ultra 2 | 1 | 64 | .088 | 26.67 | 30 | 30 | 60.478 | 17.36 | 44 | 44 |

Sun Enterprise 450 | 1 | 64 | .060 | 28.57 | 92 | 45 | 47.813 | 25.19 | 56 | 56 |

No. of | Values of n |
|||||||||

proc. | nb |
100 | 1000 | |||||||

True | Synth | True | Synth | |||||||

Time | Mflops | Mflops | Time | Mflops | Mflops | |||||

Dec Alpha Miata | 1 | 28 | .222 | 123.33 | 77 | 30 | 320.985 | 187.49 | 48 | 21 |

Compaq AS DS-20 | 1 | 28 | .053 | 27.89 | 326 | 126 | 142.843 | 71.42 | 107 | 47 |

IBM Power 3 | 1 | 32 | .070 | 38.89 | 245 | 95 | 251.940 | 181.91 | 61 | 26 |

Intel Pentium II | 1 | 40 | .150 | 48.39 | 114 | 44 | 282.550 | 94.09 | 54 | 24 |

Intel Pentium III | 1 | 40 | .120 | 40.00 | 142 | 56 | 244.690 | 97.88 | 62 | 27 |

SGI Origin 2000 | 1 | 64 | .074 | 41.11 | 232 | 90 | 176.134 | 97.80 | 87 | 38 |

SGI Origin 2000 | 4 | 64 | .145 | 5.80 | 118 | 46 | 198.656 | 384.25 | 77 | 34 |

Sun Ultra 2 | 1 | 64 | .277 | 83.94 | 62 | 24 | 570.290 | 163.69 | 27 | 12 |

Sun Enterprise 450 | 1 | 64 | .181 | 86.19 | 95 | 37 | 402.456 | 212.04 | 38 | 17 |

No. of | Values of n |
|||||||||

proc. | nb |
100 | 1000 | |||||||

True | Synth | True | Synth | |||||||

Time | Mflops | Mflops | Time | Mflops | Mflops | |||||

Dec Alpha Miata | 1 | 28 | .055 | 30.56 | 123 | 121 | 47.206 | 27.57 | 141 | 141 |

Compaq AS DS-20 | 1 | 28 | .021 | 11.05 | 310 | 318 | 20.658 | 10.33 | 323 | 323 |

IBM Power 3 | 1 | 32 | .025 | 13.89 | 268 | 267 | 15.230 | 11.00 | 438 | 438 |

Intel Pentium II | 1 | 40 | .060 | 19.35 | 112 | 111 | 44.270 | 14.74 | 151 | 151 |

Intel Pentium III | 1 | 40 | .050 | 16.67 | 134 | 133 | 38.930 | 15.57 | 171 | 171 |

SGI Origin 2000 | 1 | 64 | .035 | 19.44 | 189 | 191 | 24.985 | 13.87 | 267 | 267 |

SGI Origin 2000 | 4 | 64 | .091 | 3.64 | 73 | 73 | 8.779 | 16.89 | 759 | 760 |

Sun Ultra 2 | 1 | 64 | .149 | 45.15 | 45 | 45 | 93.417 | 26.81 | 72 | 71 |

Sun Enterprise 450 | 1 | 64 | .102 | 48.57 | 66 | 65 | 70.597 | 37.20 | 94 | 94 |