.B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers .ps 11 .sp 3 Jack J. Dongarra, Linda Kaufman and, Sven Hammarling .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 46 .sp .7i January, 1985 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i This paper describes modifications to many of the standard algorithms used in computing eigenvalues and eigenvectors of matrices. These modifications can dramatically increase the performance of the underlying software on high performance computers without resorting to assembler language, without significantly influencing the floating point operation count, and without affecting the roundoff error properties of the algorithms. The techniques are applied to a wide variety of algorithms and are beneficial in various architectural settings. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim $$ .EN .TL .ps 12 .in 0 Squeezing the Most out of Eigenvalue Solvers\h'.35i' .br on High-Performance Computers\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|$size -1 {"" sup \(dg}$\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Linda Kaufman\h'.20i' .AI .ps 10 .in 0 AT&T Bell Laboratories\h'.20i' Murray Hill, New Jersey 07974\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .FS .ps 9 .vs 11p $size -1 {"" sup \(dg}$\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. .FE .QS .sp 2 .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes modifications to many of the standard algorithms used in computing eigenvalues and eigenvectors of matrices. These modifications can dramatically increase the performance of the underlying software on high performance computers without resorting to assembler language, without significantly influencing the floating point operation count, and without affecting the roundoff error properties of the algorithms. The techniques are applied to a wide variety of algorithms and are beneficial in various architectural settings. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH Introduction .PP On high-performance vector computers like the CRAY-1, CRAY X-MP, Fujitsu VP, Hitachi S-810, and Amdahl 1200, there are three basic performance levels \(em \fIscalar, vector, \f1and \f2super-vector. \f1For example, on the CRAY-1 [5,7,10], these levels produce the following execution rates: .KS .TS center; c c. Performance Level Rate of Execution* _ _ Scalar 0\(mi4 MFLOPS Vector 4\(mi50 MFLOPS Super-Vector 50\(mi160 MFLOPS .TE .KE .FS .ps 9 .vs 11p .ll .sp .25v *\|MFLOPS is an acronym for Million FLoating-point OPerations (additions or multiplications) per Second. .FE .R Scalar performance is obtained when no advantage is taken of the special features of the machine architecture. Vector performance is obtained by using the vector instructions to eliminate loop overhead and to take full advantage of the pipelined functional units. 
Super-vector performance is obtained by using vector registers to reduce the number of memory references and thus avoid letting the paths to/from memory become a bottleneck.
.PP Typically, programs written in Fortran run at scalar or vector speeds, so that one must resort to assembler language (or assembler language kernels) to improve performance. But in [2], Dongarra and Eisenstat describe a technique for attaining super-vector speeds \fIfrom Fortran\fP for certain algorithms in numerical linear algebra. They noticed that many algorithms had the basic form .EQ delim $$ .EN .nf .sp .KS .B Algorithm A: .R For $i ~=~$ 1 to $m$ $ y ~<-~ alpha sub i x sub i ~+~ y $ End .KE .fi .sp where $ alpha sub i $ is a scalar and $x sub i $ and $y$ are vectors. Unfortunately, when this algorithm is implemented in a straightforward way, the CRAY, Fujitsu, and Hitachi Fortran compilers do not recognize that it is the "\f2same\f1 $y$" acted upon every time, and they issue store vector $y$ and load vector $y$ commands between each vector addition. Thus the path to/from memory becomes the bottleneck. The compilers generate vector code of the general form: .DS .I .TS center; l. Load vector \fRY\fP Load scalar \fR$ alpha sub i $ \fP Load vector \fRX(I)\fP Multiply scalar \fR $ alpha sub i $ \fP times vector \fRX(I)\fP Add result to vector \fRY\fP Store result in \fRY\fP .TE .R .DE This gives 2 vector operations to 3 vector memory references. Moreover, because of the concept called "chaining" on the CRAY, Fujitsu, and Hitachi, the time for the vector multiply and add is practically insignificant. In most circumstances these may be initiated soon after the loading of the vector X(I) has begun, and for vectors of significant length the load, multiply, and add may be thought of as practically simultaneous operations.
.PP Dongarra and Eisenstat showed that if one unrolled the loop several times*, the number of memory references could be reduced and execution times often decreased by a factor of 2 or 3. .FS .ps 9 .vs 11p .ll .sp .25v *\|The loops have been unrolled to different depths on different machines depending on the effect on performance; on the CRAY the depth is 16 and on the Fujitsu and Hitachi the depth is 8. .FE For example, unrolling Algorithm A to a depth of two gives: .sp .nf .KS .B Algorithm A.2: .R For $i ~=~$ 2 to $m$ in steps of 2 $ y ~<-~ alpha sub i-1 x sub i-1 ~+~ alpha sub i x sub i ~+~ y $ End if ($m$ is odd) $ y ~<-~ alpha sub m x sub m ~+~ y $ .KE .fi .sp The compilers generate vector code of the general form: .DS .I .TS center; l. Load vector \fRY\fP Load scalar \fR$ alpha sub i-1 $ \fP Load vector \fRX(I-1)\fP Multiply scalar \fR $ alpha sub i-1 $ \fP times vector \fRX(I-1)\fP Add result to vector \fRY\fP Load scalar \fR$ alpha sub i $ \fP Load vector \fRX(I)\fP Multiply scalar \fR $ alpha sub i $ \fP times vector \fRX(I)\fP Add result to vector \fRY\fP Store result in \fRY\fP .TE .R .DE This gives 4 vector operations to 4 vector memory references. The larger the ratio of vector operations to vector memory references becomes, the better the performance of the program segment. This is because vector memory operations, i.e., loads and stores, cost as much as other vector operations. When the loop is unrolled to a depth of 8 there are 16 vector operations to 10 vector memory references.
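As a concrete illustration, a minimal Fortran sketch of Algorithm A.2 might take the following form; the subroutine and variable names are ours rather than those of any actual library kernel, and the columns of the array X are assumed to hold the vectors $x sub i$.
.sp
.KS
.nf
      SUBROUTINE YPAX2(N, M, ALPHA, X, LDX, Y)
C     Sketch of Algorithm A.2: form y <- alpha(i)*x(i) + y for i=1..m,
C     with the loop over i unrolled to a depth of two so that the
C     vector y is loaded and stored only once per pair of terms.
      INTEGER N, M, LDX, I, J
      REAL ALPHA(M), X(LDX,M), Y(N)
      DO 20 I = 2, M, 2
         DO 10 J = 1, N
            Y(J) = Y(J) + ALPHA(I-1)*X(J,I-1) + ALPHA(I)*X(J,I)
   10    CONTINUE
   20 CONTINUE
      IF (MOD(M,2) .NE. 0) THEN
         DO 30 J = 1, N
            Y(J) = Y(J) + ALPHA(M)*X(J,M)
   30    CONTINUE
      END IF
      RETURN
      END
.fi
.KE
.sp
Deeper unrolling follows the same pattern and further improves the ratio of vector operations to vector memory references, since $y$ is loaded and stored only once for each group of unrolled terms.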
Dongarra and Eisenstat incorporated this idea into two "kernel" subroutines: SMXPY, which added a matrix times a vector to another vector ($ Ax ~+~ y$), and SXMPY, which added a vector times a matrix to another vector ($ x sup T A ~+~ y sup T $). They showed that several linear system solvers could be rewritten using these kernel subroutines. .PP In this paper we try to apply the same concept to algorithms used in solving the eigenvalue problem. Normally these problems are solved in several steps: .sp .nf (1) Reduce the problem to a simpler problem (e.g., a tridiagonal matrix if the matrix was symmetric), (2) Solve the eigenproblem for the simpler problem, (3) If eigenvectors are requested, transform the eigenvectors of the simplified problem to those of the original problem. .fi .sp For symmetric problems, step (2) usually has the fewest floating point operations, while for nonsymmetric matrices step (2) has the most floating point operations. Because steps (1) and (3) often involve transformations that can be forced into the form of Algorithm A, we will concentrate our efforts on these steps. In certain cases speeding up these steps will not significantly affect the overall time required to solve the eigenproblem, but in other cases, such as the symmetric generalized eigenvalue problem, we will be speeding up the most time-consuming portion of the whole operation. Sometimes part of the algorithm simply has a matrix by vector multiplication; then application of Dongarra and Eisenstat's idea is straightforward. At other times, the code will need to be radically transformed to fit the form of Algorithm A. .PP In Section 2 we describe some underlying ideas that can be used to decrease memory references in various subroutines in the matrix eigenvalue package EISPACK [6,11,13]. In Section 3 we apply the concepts of Section 2 to specific subroutines in EISPACK and provide execution timing information on the CRAY-1 for subroutines provided in EISPACK 3, the current version of EISPACK [3], (The appendix contains execution timing information on the Hitachi S-810/20, and Fujitsu VP-200 (Amdahl 1200)). (In [4] we presented reprogramming of selected subroutines that are radically different from the original or are representative of a class of changes that might be applied to several subroutines.) .EQ delim $$ .EN .SH 2. Underlying Ideas .PP In this section we outline some of the underlying methods that occur throughout the algorithms used in the EISPACK package. We also discuss how they can be implemented to decrease vector memory references, without significantly increasing the number of floating point operations. .SH 2.1 Transformations .PP Many of the algorithms implemented in EISPACK have the following form: .sp .nf .KS .B Algorithm B: .R For i = 1,.... Generate matrix $T sub i$ Perform transformation $A sub i+1 ~<-~ T sub i A sub i T sub i sup -1$ End . .KE .fi .sp Because we are applying similarity transformations, the eigenvalues of $A sub i+1$ are those of $A sub i$. In this section we examine various types of transformation matrices $T sub i$. .SH 2.1.1 Stabilized elementary transformations .PP Stabilized elementary transformation matrices have the form $T~=~PL$, where $P$ is a permutation matrix, required to maintain numerical stability [12], and $L$ has the form .sp .KS .EQ left ( ~ matrix { ccol { 1 above nothing above nothing above nothing above nothing above nothing above nothing } ccol { nothing above . 
above nothing above nothing above nothing above nothing above nothing } ccol { nothing above nothing above 1 above nothing above nothing above nothing above nothing } ccol { nothing above nothing above nothing above 1 above * above . above * } ccol { nothing above nothing above nothing above nothing above 1 above nothing above nothing } ccol { nothing above nothing above nothing above nothing above nothing above . above nothing } ccol { nothing above nothing above nothing above nothing above nothing above nothing above 1 } } ~ right ) . .EN .KE The inverse of $L$ has the same structure as $L$. When $L sup -1 $ is applied on the right of a matrix, one has a subalgorithm with the exact form of Algorithm A, which can be implemented using SMXPY. Unfortunately, when applying $L$ on the left as in Algorithm B, one does not get the same situation. The vector $y$ changes, but the vector $x$ remains the same. However, in the sequence of transformations in Algorithm B, $T sub i$ consists of a matrix $L sub i$ whose off-diagonals are nonzero, only in the $i sup th$ column, and at the $i sup th$ step, one might apply transformations $T sub 1 $ through $T sub i$ only to the $i sup th $ row of the matrix. Subsequent row transformations from the left will not affect this row, and one can implement this part of the algorithm using SXMPY. .PP This idea was incorporated in the subroutine ELMHES, which will be discussed in Section 3. .SH 2.1.2 Householder transformations .PP In most of the algorithms the transformation matrices $T sub i$ are Householder matrices of the form .EQ Q ~=~ I - beta u u sup T , ~where ~beta u sup T u ~=~ 2, .EN so that $Q$ is orthogonal. To apply $Q$ from the left to a matrix $A$, one would proceed as follows: .KS .TS center; l. \f3Algorithm C\f1: 1. $v sup T ~=~ u sup T A$ 2. Replace $A$ by $A - beta uv sup T$ .TE .KE Naturally the first step in Algorithm C can be implemented using SXMPY, but the second step, the rank-one update, does not fall into the form of Algorithm A. However, when applying a sequence of Householder transformations, one may mitigate the circumstances somewhat by combining more than one transformation and thus performing a higher than rank one update to $A$. This is somewhat akin to the technique of loop unrolling discussed earlier. We give two illustrative examples. .PP Firstly suppose that we wish to form $(I- alpha ww sup T )A(I- beta uu sup T )$, where for a similarity transformation $alpha ~=~ beta$ and $w ~=~ u$. This is normally formed by first applying the left hand transformation as in Algorithm C, and then similarly applying the right hand transformation. But we may replace the two rank one updates by a single rank two update using the following algorithm. .KS .TS center; l. \f3Algorithm D.1 1. $v sup T ~=~ w sup T A$ 2. $x ~=~ Au$ 3. $y sup T ~=~ v sup T -( beta w sup T x ) u sup T$ 4. Replace $A $ by $ A - beta xu sup T - alpha w y sup T$ .TE .KE As a second example suppose that we wish to form $(I- alpha ww sup T )(I- beta uu sup T )A$, then as with Algorithm D.1 we might proceed as follows. .KS .TS center; l. \f3Algorithm D.2\f1: 1. $v sup T ~=~ w sup T A$ 2. $x sup T ~=~ u sup T A$ 3. $y sup T ~=~ v sup T -( beta w sup T u ) x sup T$ 4. Replace $A $ by $ A - beta ux sup T - alpha w y sup T$ .TE .KE In both cases we can see that Steps 1 and 2 can be achieved by calls to SXMPY and SMXPY. 
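For concreteness, a minimal Fortran sketch of Algorithm D.1, which forms $(I- alpha ww sup T )A(I- beta uu sup T )$, is given below; the subroutine name, the work arrays, and the explicit loops are ours, and in a production code Steps 1 and 2 would be calls to SXMPY and SMXPY while the rank-two update of Step 4 would itself be unrolled to reduce memory references.
.sp
.KS
.nf
      SUBROUTINE HSRK2(N, A, LDA, W, U, ALPHA, BETA, V, X, Y)
C     Sketch of Algorithm D.1: overwrite A with
C     (I - alpha*w*w') * A * (I - beta*u*u')
C     using a single rank-two update.  V, X and Y are work vectors
C     of length N supplied by the caller.
      INTEGER N, LDA, I, J
      REAL A(LDA,N), W(N), U(N), V(N), X(N), Y(N), ALPHA, BETA, S
C     Step 1:  v' = w'A   (in practice a call to SXMPY)
      DO 20 J = 1, N
         V(J) = 0.0
         DO 10 I = 1, N
            V(J) = V(J) + W(I)*A(I,J)
   10    CONTINUE
   20 CONTINUE
C     Step 2:  x = Au     (in practice a call to SMXPY)
      DO 40 I = 1, N
         X(I) = 0.0
         DO 30 J = 1, N
            X(I) = X(I) + A(I,J)*U(J)
   30    CONTINUE
   40 CONTINUE
C     Step 3:  y' = v' - (beta*w'x)*u'
      S = 0.0
      DO 50 I = 1, N
         S = S + W(I)*X(I)
   50 CONTINUE
      DO 60 J = 1, N
         Y(J) = V(J) - BETA*S*U(J)
   60 CONTINUE
C     Step 4:  A <- A - beta*x*u' - alpha*w*y'  (rank-two update)
      DO 80 J = 1, N
         DO 70 I = 1, N
            A(I,J) = A(I,J) - BETA*X(I)*U(J) - ALPHA*W(I)*Y(J)
   70    CONTINUE
   80 CONTINUE
      RETURN
      END
.fi
.KE
.sp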
Step 3 is a simple vector operation and Step 4 is now a rank-two correction, and one gets four vector memory references for each four vector floating point operations (rather than the three vector memory references for every two vector floating point operations, as in Step 2 of Algorithm C). The increased saving is not as much as is realized with the initial substitution of SXMPY for the inner products in Step 1 of Algorithm C, but it more than pays for the additional $2n$ operations incurred at Step 3 and exemplifies a technique that might pay off in certain situations. This technique was used to speed up a number of routines that require Householder transformations.
.SH 2.1.3 Plane Rotations
.PP Some of the most time-consuming subroutines in EISPACK, e.g., HQR2, QZIT, IMTQL2, and TQL2, spend most of their time applying transformations in 2 or 3 planes to rows or columns of matrices. We have been able to speed up the application of these transformations by only about 15%, but when 90% of the computation time is spent here, the overall effect is greater than that of a larger improvement to a part contributing only 10% of the total computation time.
.PP First of all, we should mention that on the CRAY-1 the time required by a 3-multiply Householder transformation in 2 planes is hardly less than that required by a 4-multiply Givens transformation [12]. Thus once again the computation time is influenced more by the number of vector memory references than by the number of floating point operations. We were able to eliminate several vector loads and stores by noticing that one of the planes used in one transformation is usually present in the next. Thus a typical Givens code which originally looked like .sp .KS .nf For $i=1$ to $n-1$ Compute $c sub i$ and $s sub i$ For $j=1$ to $n$ $t ~<-~h sub ji$ $h sub ji ~<-~ c sub i t + s sub i h sub j,i+1$ $h sub j,i+1 ~<-~ s sub i t - c sub i h sub j,i+1$ End End .fi .KE .sp would become .sp .nf .KS For $j=1$ to $n$ $t sub j ~<-~ h sub j1$ End For $i=1$ to $n-1$ Compute $c sub i$ and $s sub i$ For $j=1$ to $n$ $h sub ji ~<-~ c sub i t sub j + s sub i h sub j,i+1$ $t sub j ~<-~ s sub i t sub j - c sub i h sub j,i+1$ End End For $j=1$ to $n$ $h sub jn ~<-~t sub j$ End .KE .fi .sp and a typical Householder code which looked like .sp .KS .nf For $i=1$ to $n-1$ Compute $q sub i , x sub i$ and $ y sub i$ For $j=1$ to $n$ $p ~<-~ h sub ji + q sub i h sub j,i+1$ $h sub ji ~<-~ h sub ji + p x sub i$ $h sub j,i+1 ~<-~ h sub j,i+1 + p y sub i$ End End .KE .sp .fi would become .sp .nf .KS For $i=1$ to $ n-2$ in steps of 2 Compute $q sub i , x sub i , y sub i , q sub i+1 , x sub i+1 ,$ and $y sub i+1$ For $j=1$ to $n$ $p ~<-~ h sub ji + q sub i h sub j,i+1$ $r ~<-~ h sub j,i+1 + q sub i+1 h sub j,i+2 + y sub i p$ $h sub ji ~<-~ h sub ji + p x sub i$ $h sub j,i+1 ~<-~ h sub j,i+1 + p y sub i + r x sub i+1 $ $h sub j,i+2 ~<-~ h sub j,i+2 + r y sub i+1 $ End End .fi .KE .sp Notice that for the Householder transformations we have actually increased the total number of multiplications, but the execution time has still decreased. For a three-plane Householder transformation, like that found in HQR2, unrolling the loop twice causes about a 10% drop in execution time.
.PP Inserting the modified Givens code into a routine like TQL2 is an easy task. Changing codes like HQR2 to use the unrolled Householder transformations is rather unpleasant.
.SH 2.2 Triangular Solvers
.PP Assume one has an $n times n$ nonsingular lower triangular matrix $L$ and an $n times m$ matrix $B$, and wishes to solve .EQ (2.1) LY~=~B .
.EN .PP If $m~=~1$ one might normally proceed as follows: .sp .KS .nf \f3Algorithm E\f1: $y ~<-~ b$ For $i ~=~1$ to $n$ $y sub i ~<-~ y sub i /l sub ii$ For $j~=~ i+1$ to $n$ $y sub j ~<-~ y sub j ~-~l sub ji y sub i$ (2.2) End End .fi .KE .sp Equation (2.2) almost looks like Algorithm A, but the length of the $y$ vector decreases. Unrolling the $i$ loop once decreases the number of vector memory references from 3 for every 2 vector floating point operations to 4 for every 4 vector floating point operations. The unrolled code would be of the following form: .sp .KS .nf \f3Algorithm F\f1: $y ~<-~ b$ For $i~=~1$ to $n-1$ in steps of 2 $y sub i ~<-~y sub i /l sub ii$ $y sub i+1 ~<-~ (y sub i+1 ~-~ l sub i+1,i y sub i ) /l sub i+1,i+1$ For $j~=~i+2,...,n$ $y sub j ~<-~ y sub j ~-~y sub i l sub ji ~-~ y sub i+1 l sub j,i+1$ End End If ($n~mod~2 ~!=~ 0$) $y sub n ~<-~ y sub n /l sub nn$ .fi .KE .sp On the CRAY-1 the ratio of the execution time of Algorithm E to that of Algorithm F is about 1.5, as Table 2.1 indicates.
.PP However, when $m$ is sufficiently large that it makes computational sense to treat vectors of length $m$, one can do much, much better by computing $Y$ by rows rather than repeating either Algorithm E or Algorithm F for each column. Let $Y sub j$ denote the first $j$ rows of the matrix $Y$, $y sub j sup T$ denote its $j sup th$ row, and $l sub j sup T$ denote the $j sup th$ row of $L$. Then one might proceed as follows: .sp .KS .nf \f3Algorithm G\f1: $Y ~<-~ B$ For $j~=~1$ to $n$ $y sub j sup T ~<-~ b sub j sup T ~-~ l sub j sup T Y sub j-1$ (2.3) $y sub j sup T ~<-~ y sub j sup T / l sub jj$ End .fi .KE .sp Step (2.3) can be implemented using SXMPY. Obviously, working by rows is superior if $m$ is sufficiently large. Since Algorithm G uses vectors of length $m$, when $m$ is small one should use Algorithm F. We have discussed triangular solvers using a lower triangular matrix. One can implement the last three algorithms for an upper triangular matrix, in which case Algorithm G would determine the last row first and work backwards. Triangular solvers occur in the EISPACK subroutines REDUC and REBAK used in the solution of the symmetric generalized eigenvalue problem $Ax~=~ lambda B x$. Inserting calls to SXMPY and SMXPY decreases the time required by these subroutines to such an extent that the time required for the generalized eigenvalue problem is not appreciably more than that required for the standard eigenvalue problem on the high-performance computers under discussion. .KS .TS center; c s s s s c s s s s c c c c c n n n n n. Table 2.1: CRAY-1 Times in $10 sup -3$ Seconds for Triangular Solvers _ $n$ $m$ Algorithm E Algorithm F Algorithm G _ 100 1 .506 .340 5.14 25 12.5 8.32 6.92 100 49.9 33.3 14.4 _ 200 1 1.55 1.02 17.2 25 38.6 25.5 24.1 100 151 102 52.2 200 308 202 93.7 _ 300 1 3.15 2.05 35.9 25 78.5 51.1 51.8 150 472 306 162 300 940 613 290 .TE .KE .EQ delim $$ .EN
.SH 2.3 Matrix Multiplication with Symmetric Packed Storage
.EQ delim $$ .EN .PP The algorithms in EISPACK that deal with symmetric matrices permit the user to specify only the lower triangular part of the matrix. There are routines that require a two-dimensional array and use only the information in its lower portion, and routines that accept the matrix packed into a one-dimensional array. In either case one frequently needs to form the product of the symmetric matrix with a vector, $y ~<-~ Ax ~+~ y$.
The normal scheme for doing this matrix vector product would be .sp .KS .nf \f3Algorithm H\f1 For $j$=1 to $n$ $t ~<-~ y sub j$ For $i=j+1$ to $n$ $y sub i ~<-~ y sub i ~+~ a sub ij x sub j $ $t ~<-~ t ~+~ a sub ij x sub i $ End $y sub j ~<-~ t ~+~ a sub jj x sub j$ End .fi .KE .sp Certainly one might consider stepping the outer loop by 2, doing two inner products followed by a rank-two correction. Another alternative in the same vein, which unfortunately would not be amenable to a subroutine that packed the symmetric matrix into a one-dimensional array, is the following: .sp .KS .nf \f3Algorithm I\f1 For $i$ = 1 to $n-1$ in steps of 2 For $j$ = 1 to $i-1$ $y sub j ~<-~ y sub j ~+~ a sub ij x sub i ~+~ a sub i+1,j x sub i+1$ End For $j$ = $i+1$ to $n$ $y sub j ~<-~ y sub j ~+~ a sub ji x sub i ~+~ a sub j,i+1 x sub i+1$ End $y sub i ~<-~ y sub i ~+~ a sub i+1,i x sub i+1 ~+~ a sub ii x sub i$ End .KE .fi .sp .PP A less obvious technique is a divide-and-conquer approach. If we consider referencing a symmetric matrix in a matrix vector product where the matrix is specified in the lower triangular matrix, we have .EQ left ( matrix { ccol { y sub 1 above y sub 2 } } right ) mark ~=~ left ( matrix { ccol { y sub 1 above y sub 2 } } right ) ~+~ left ( matrix { ccol { T sub 1 above B } ccol { B sup T above T sub 2 } } right ) ~ times ~ left ( matrix { ccol { x sub 1 above x sub 2 } } right ) , .EN where $T sub 1 $ and $ T sub 2 $ are symmetric matrices stored in the lower portion and $ B$ is full. This can be written as .sp .KS .nf \f3Algorithm J\f1: Set $ y sub 1 ~=~ T sub 1 x sub 1 ~+~ B sup T x sub 2 $ Set $ y sub 2 ~=~ B x sub 1 ~+~ T sub 2 x sub 2 $ . .fi .KE .sp .PP In writing the matrix multiply this way, two things should be noted. There are two square $ n over 2 ~ times n over 2 $ full matrix vector multiplications, and two symmetric matrix vector products. .PP Table 2.2 gives a comparison on the CRAY 1 of Algorithm H (a standard approach), to Algorithm I, Algorithm HJ (where $T sub 1 x sub 1$ and $T sub 2 x sub 2$ of Algorithm J are done according to Algorithm H), and Algorithm IJ (where these are done according to Algorithm I). .KS .ce Table 2.2 .ce Comparison of Execution Times on the CRAY 1 .ce for Symmetric Matrix Multiply .TS center; l c s s l l l l n n n n. Order Ratio of Execution Times Alg. H/Alg. I Alg. H/Alg. HJ Alg. H/Alg. IJ _ 100 2.14 1.24 2.13 200 1.87 1.43 2.12 300 1.78 1.53 2.08 .TE .KE .PP When the matrix is packed in a one-dimensional array stored by column, the same divide and conquer approach can be applied. .EQ delim $$ .EN .SH 3. Subroutines in EISPACK .SH 3.1 The Unsymmetric Eigenvalue Problem .PP In this section we investigate methods for the efficient implementation of the algorithms that deal with the standard eigenvalue problem .EQ A x ~=~ lambda x , .EN where $A$ is a real general matrix. The algorithms for dealing with this problem follow the form .sp .IP (1) Reduce A to upper Hessenberg form (ELMHES or ORTHES). .IP (2) Find the eigensystem of the upper Hessenberg matrix. .IP (3) If eigenvectors are requested, back transform the eigenvectors of the Hessenberg matrix to form the eigenvectors of the original system (ELMBAK or ORTBAK). .in For this particular problem, most of the time is spent in the second step but as was discussed in section 2.1.3 this part is not easy to vectorize so we concentrate our discussion on Steps 1 and 3. 
.EQ delim $$ .EN .SH 3.1.1 ELMHES and ORTHES .PP In the subroutine ELMHES, which reduces a matrix to upper Hessenberg form while preserving the eigenvalues of the original matrix, a sequence of stabilized elementary transformations are used. The transformations are of the form .EQ T sub k ... T sub 2 T sub 1 A T sub 1 sup -1 T sub 2 sup -1 ... T sub k sup -1 . .EN This set of transformations has the effect of reducing the first $k$ columns of $A$ to upper Hessenberg form. .PP The transformations can be applied in such a way that matrix-vector operations are used in the time consuming part. At the $k sup th$ stage of the reduction we apply the previous $k-1$ transformations on the left side of the reduced $A$ to the last $(n-k)$ elements of the $k+1 sup st$ row. Then, on the right the inverse of the $k sup th$ transformation is applied to the reduced matrix $A$ followed by the application on the left of all $k$ transformations to the elements below the diagonal of the $k sup th$ column. Because of the structure of the transformations (see 2.1.1) both these steps are simple matrix-vector multiplication. The application of transformations from the left follows essentially the algorithm given in Dongarra and Eisenstat [2] for the LU decomposition of a matrix. In the original EISPACK codes at the $k sup th$ stage permutations from the left are applied to only the last $n-k+1$ columns of the matrix. In our new code in order to use SMXPY and SXMPY, we must apply these permutations to the whole matrix. Thus the elements below the subdiagonal of the $A$ matrix which are necessary for finding the eigenvectors might be slightly scrambled and hence the user must use the modified ELMBAK given in section 3.1.3. .PP The subroutine ORTHES uses Householder orthogonal similarity transformations to reduce $A$ to upper Hessenberg form. At the $k sup th$ stage we perform the operation .EQ Q sub k A Q sub k sup T , .EN where $Q sub k ~=~ I ~-~ beta u u sup T $. As shown in Algorithm D.1, the usual two rank one updates may be replaced by a rank one update to the first $k$ rows of $A$ followed by a rank two update to rows $k+1$ through $n$. In this case Algorithm D.1 becomes .KS .EQ I 1.~~~ v sup T ~=~ u sup T A .EN .EQ I 2.~~~ x ~=~ Au .EN .EQ I 3.~~~ y sup T ~=~ v sup T -( beta u sup T x ) u sup T .EN .EQ I 4.~~~ Replace ~A ~by ~ A - beta ( xu sup T + u y sup T ) .EN .KE .PP Seeing the transformations applied in this way leads to a straightforward matrix-vector implementation. Table 3.1 reports the comparison between the EISPACK implementations and the ones just described. Significant speedups are accomplished using these constructs. .sp .KS .ce Table 3.1 .ce Comparison of Execution on the CRAY 1 .ce for Routine ELMHES and ORTHES .TS center; l l s s l l s s l c c s l c c c n n n n. Order Ratio of Execution Times EISPACK / MV version ELMHES ORTHES rank 1 only rank 2 _ _ _ _ 50 1.5 2.0 2.5 100 2.2 1.9 2.5 150 2.4 1.8 2.4 .TE .KE .SH 3.1.2 ELTRAN and ORTRAN .PP If all the eigenvectors are requested, one might choose to use either ELTRAN or ORTRAN (depending on whether one used ELMHES or ORTHES) followed by a call to HQR2 rather than finding the eigenvectors using INVIT and then back transforming using ELMBAK or ORTBAK. ELTRAN requires no floating point operations, but because of the use of stabilized elementary transformations in ELMHES, it may require swapping of various rows of the partial eigenvector matrix being constructed. Because ELMHES has changed, the swapping in ELTRAN is slightly different. 
ORTRAN applies the Householder transformations determined in ORTHES to the identity matrix. By combining two Householder transformations we can perform a rank two update to $I$ using the technique described in section 2.1.2, and this realizes a cut in the execution time for this routine by a factor of two. .SH 3.1.3 ELMBAK and ORTBAK .PP Both ELMBAK and ORTBAK compute the eigenvectors of the original matrix given the eigenvectors of the upper Hessenberg matrix and the transformations used to reduce the original matrix. This requires that a set of transformations be applied on the left to the matrix of eigenvectors in reverse order. The reduction is of the form $T A T sup -1 T X ~=~ lambda T X ,$ where $~ T ~=~ T sub n-1 ... T sub 2 T sub 1 $. The eigenvectors, say $Y$, of the reduced matrix $H$ are found using, .EQ I H ~=~ T A T sup -1 ~~ and ~~H Y ~=~ lambda Y , .EN .sp then the eigenvectors for the original problem are computed as, .sp .EQ I X ~=~ T sup -1 Y ~=~ T sub 1 sup -1 T sub 2 sup -1 ... T sub n sup -1 Y . .EN .PP The original EISPACK subroutines use $T$ as a product of transformations as given above. For ELMBAK we use a slightly different approach. As in Section 2.1.1, each $T sub i$ may be written as $L sub i P sub i$, where $P sub i$ is a permutation matrix and $L sub i$ is a lower triangular matrix. On output from the new ELMHES, let $B$ be the $(n-1) times (n-1)$ lower triangular matrix below the subdiagonal of the reduced $A$. Let $C$ be the unit lower triangular matrix .EQ C ~=~ left ( matrix { ccol { 1 above 0 above . above . above . above 0 } ccol { nothing above 1 above nothing above nothing above nothing above nothing } ccol { nothing above nothing above . above nothing above B above nothing } ccol { nothing above nothing above nothing above . above nothing above nothing } ccol { nothing above nothing above nothing above nothing above . above nothing } ccol { nothing above nothing above nothing above nothing above nothing above 1 } } right ) . .EN Then one can show that $T sup -1 ~=~P sub 1 P sub 2 ......P sub n-1 C$. .PP Since ORTBAK involves a product of Householder transformations, reducing the number of vector memory references is again a straightforward task. Dramatic improvements are seen in these back transformation routines, as shown in Tables 3.2 below. Originally ELMBAK was 2.4 times faster than ORTBAK; in the MV version it only enjoys an advantage of 1.9 over ORTBAK using $(n-1)/2$ rank 2 changes. .sp .KS .ce Table 3.2 .ce Comparison of Execution on the CRAY 1 .ce for EISPACK Routines ORTBAK and ELMBAK .TS center; l c s s l c s s l c c s l c c c n n n n. Order Ratio of Execution Times EISPACK/ MV Version ELMBAK ORTBAK Rank 1 Rank 2 _ _ _ _ 50 2.2 2.8 3.6 100 2.6 2.5 3.3 150 2.7 2.3 3.0 .TE .KE .EQ delim $$ .EN .SH 3.2 The Symmetric Eigenvalue Problem .EQ delim $$ .EN .PP In this section we look at the methods for efficient implementation of the algorithms that deal with the symmetric eigenvalue problem .EQ A x ~=~ lambda x , .EN where $A$ is a real symmetric matrix. The algorithms for dealing with this problem have two possible paths .sp .B Path 1 .sp .IP (1) Transform $A$ to tridiagonal form (TRED1). .IP (2) Find the eigenvalues of the tridiagonal matrix (IMTQLV). .IP (3) If eigenvectors are requested, find the eigenvectors of the tridiagonal matrix by inverse iteration (TINVIT). .IP (4) If eigenvectors are requested, back transform the vectors of the tridiagonal matrix to form the eigenvectors of the original system (TRBAK1). 
.in or .br .B Path 2 .sp .IP (1) Transform $A$ to tridiagonal form, accumulating the transformations (TRED2). .IP (2) Find the eigenvalues of the tridiagonal matrix and accumulate the transformations to give the eigenvectors of the original matrix (IMTQL2). .sp .in
.PP On conventional serial machines, Path 2 typically requires nearly twice as much time as Path 1. On vector machines, however, we do not see this relationship. For the EISPACK codes Path 2 is slightly faster, and after the modification described below the two paths require roughly the same amount of time. This is the result of two problems in routine TINVIT. First, TINVIT has not been modified to induce vectorization at any level. One can achieve an increase in performance by vectorizing across the eigenvectors being computed. We have not presented an algorithm of this form since it requires a different technique to achieve performance and cannot run at super-vector rates. The time spent in TINVIT on serial machines is inconsequential with respect to the total time to execute Path 1. However, on vector machines TINVIT becomes a significant contributor to the total execution time of the path. The second factor influencing performance for Path 1 is that the current version of TINVIT has a call to an auxiliary routine, PYTHAG, in an inner loop of the algorithm. PYTHAG is used to safely and portably compute the square root of the sum of squares. If TINVIT is modified to replace the call to PYTHAG by a simple square root, the time for TINVIT improves by about 30%.
.PP We note that the advantage of Path 2 is that near orthogonality of the eigenvectors is guaranteed, while with Path 1 one may see some degradation in this property for eigenvectors corresponding to close eigenvalues. Both paths give excellent eigensystems in the sense that they are eigensystems of a problem close to the original problem [12,13].
.PP We will now describe the implementation of routines TRED1 and TRED2 using matrix vector operations.
.SH 3.2.1 TRED1, TRED2 and TRBAK1
.PP Routines TRED1 and TRED2 reduce a real symmetric matrix to a symmetric tridiagonal matrix using orthogonal similarity transformations. An $n times n$ matrix requires $n-2$ transformations, each of which introduces zeros into a particular row and column of the matrix, while preserving symmetry and preserving the zeros introduced by previous transformations. TRED1 is used just to compute the tridiagonal matrix, while TRED2, in addition to computing the tridiagonal matrix, also returns the orthogonal matrix which would transform the original matrix to this tridiagonal matrix. This orthogonal matrix will later be used in computing the eigenvectors of the original matrix. TRBAK1 forms the eigenvectors of the real symmetric matrix from the eigenvectors of the symmetric tridiagonal matrix determined by TRED1. These subroutines deal with the real symmetric matrix as stored in the lower triangle of an array.
.PP The sequence of transformations applied to the matrix $A$ is of the form .EQ I A sub i+1 ~<-~ Q sub i A sub i Q sub i ~, i ~=~ 1,2,...,n-2 , .EN where each $Q sub i$ is a Householder matrix of the form described in section 2.1.2. Each of the similarity transformations is applied as in Algorithm D.1 with the simplification that $w$ and $u$ are the same, so that the application becomes: .KS .TS center; l. 1. $x ~=~ Au$ 2. $y sup T ~=~ x sup T -( beta u sup T x ) u sup T$ 3.
Replace $A $ by $ A - beta xu sup T - beta u y sup T$ .TE .KE Since the matrix $A$ is symmetric and stored in the lower triangle of the array, the matrix vector operation in step 1 follows the form described in Section 2.3 as implemented in Algorithm IJ. .PP TRED2 differs from TRED1 in that the transformation matrices are accumulated in an array $Z$. The sequence of transformations applied to the matrix $Z$ is of the form .EQ I Z sub n-2 ~=~ Q sub n-2 .EN .EQ I Z sub i ~<-~ Q sub i Z sub i+1 ~, i ~=~ n-3,...,2,1 . .EN This can be implemented in a straightforward manner as in Algorithm C of section 2.1.2 using matrix vector multiply and a rank one update. Since all transformations are available at the time they are to be accumulated, more than one transformation can be accumulated at a time, say two at a time, thus giving a rank two update. This then gives an implementation that has the form of Algorithm D.2 in section 2.1.2. .PP When TRED1 and TRED2 are implemented as described, significant improvements in the execution time can be realized on vector machines. Table 3.3 displays the execution time for the current EISPACK versions of TRED1 and TRED2 as well as the modified matrix vector implementations, referred to as TRED1V and TRED2V. .PP TRBAK1 applies the transformations used by TRED1 to reduce the matrix to tridiagonal form. This can be organized as in TRED2, by matrix vector multiplication and a rank-2 update. The table below shows the improvement in performance when this is implemented. .sp .KS .ce Comparison of Execution on the CRAY 1 .ce for EISPACK Routine TRBAK1 .TS center; l l l l n n. Order Ratio of Execution Times EISPACK/MV Version _ _ 50 4.20 100 3.66 .TE .KE .sp .SH 3.3 The Symmetric Generalized Eigenvalue Problem .PP In this section we consider methods for increasing the efficiency of the subroutines in EISPACK for solving the generalized eigenvalue problem .EQ Ax ~=~ lambda B x , .EN where $A$ and $B$ are symmetric matrices and $B$ is positive definite. In EISPACK this problem is solved in the following steps, with the name of the corresponding subroutine in the package given in parenthesis: .sp .in .5i .IP (1) Factor $B$ into $LL sup T$ and form $C~=~L sup -1 AL sup -T $ (REDUC). .br .IP (2) Solve the symmetric eigenvalue problem $C y ~=~ lambda y $. .br .IP (3) If eigenvectors are requested, transform the eigenvectors of $C$ into those of the original system (REBAK). .sp .in In general the majority of the execution time is spent in REDUC and REBAK, and it will be these routines on which we will concentrate. .SH 3.3.1 REDUC .PP REDUC has three main sections: .KS .TS center; l. 1. Find the Cholesky factors of B, i.e., finding lower triangular $L$ such that $B~=~LL sup T$ . 2. Find the upper triangle of $E~=~L sup -1 A$ . 3. Find the lower triangle of $C ~=~ L sup -1 E sup T$ . .TE .KE .PP Step 1, the Cholesky factorization, was discussed in Dongarra and Eisenstat [2]; its inner loop can be replaced by a call to SMXPY. Step 2 is a lower triangular solve. The original code in EISPACK follows the suggestion in Section 2.2 and computes $E$ by rows. Thus it is a simple matter to replace the inner loop by a call to SMXPY. Step 3 is another lower triangular solve. The EISPACK encoding computes $C$ by columns and uses the fact that $C$ is symmetric. Thus the first $i-1$ elements of the $i sup th$ column of $C$ are already known before the code commences to work on the $i sup th$ column. For the $i sup th$ column REDUC has 2 inner loops. 
The first updates the last $n-i$ elements with the previous known $i-1$ elements. The second does a lower triangle solve with an $(n-1) times (n-1) $ matrix as in Algorithm E of Section 2.2. The first loop can be easily implemented using a SMXPY; the only hope for easily increasing the efficiency of the second loop is using Algorithm F of 2.2. .PP Thus it is straightforward to replace three of the four inner loops of REDUC by SMXPY, and this is accomplished by REDUC3 listed in Table 3.3. The decrease in execution time of REDUC3 from REDUC is quite surprising considering the changes being made affect only how the matrix is accessed. REDUCV replaces the fourth loop of REDUC by Algorithm F of Section 2.2. It produces a further modest saving. .PP REDUC4 replaces the two inner loops of Step 3 of REDUC by a modification of Algorithm G which computes only the first $i$ elements of the $i sup th$ row of $C$ rather than the whole row. Because $C$ is symmetric, these first $i$ elements are also the first $i$ elements of the $i sup th$ column of $C$. Thus by the end of the $i sup th$ stage of Step 3 of REDUC4, the top $i times i$ submatrix of $C$ has been computed while at the same stage of REDUC and REDUC3, the first $i$ columns of $C$ have been computed. REDUC4, as Table 3.3 indicates, is the least expensive of the subroutines, but it has one major drawback. REDUC, REDUC3, and REDUCV overwrite only the lower triangular portions of the A and B matrices while forming $L,~E,$ and $C$. REDUC4 overwrites the whole $A$ matrix. .sp .SH 3.3.2 REBAK .PP The subroutine REBAK takes the eigenvectors $Y$ of the standard symmetric eigenproblem and forms those of the original problem $X$ by multiplying $Y$ by $L sup -T$. Thus it is an upper triangular solve with many righthand sides. The original REBAK computes $X$ one column at a time using inner products. REBAKV uses the upper triangular version of Algorithm G to compute $X$. The difference in computation times given in Table 3.3 for REBAK and REBAKV is really remarkable considering that they require the same number of floating point operations. .EQ delim $$ .EN .KS .TS center; c s s c s s c s s c | c | c l | n | n. Table 3.3 CRAY -1 Times (in $10 sup -2$ sec) for the Symmetric Generalized Eigenvalue Problem _ Subroutine n=100 n=200 _ REDUC 16.9 85.5 REDUC3 4.26 23.6 REDUCV 3.62 19.5 REDUC4 3.00 16.1 _ TRED1 6.94 38.5 TRED1V 4.95 29.7 _ TRED2 14.3 84.5 TRED2V 8.31 51.3 _ TQL1 7.58 29.1 TQL2 19.8 117 _ REBAK 9.79 52.5 REBAKV 2.20 15.3 _ No vectors total old 32.92 165.4 total new 16.15 78.3 _ Vectors total old 60.8 339.6 total new 33.9 203.1 .TE .KE .SH 3.4 The Singular Value Decomposition .PP The singular value decomposition (the SVD) of an $m times n$ matrix $A$ is given by .EQ A~=~U SIGMA V sup T , .EN where $U$ is an $m times m$ orthogonal matrix, $V$ is an $n times n$ orthogonal matrix, and $SIGMA$ is an $m times n$ diagonal matrix containing the singular values of $A$, which are the non-negative square roots of the eigenvalues of $A sup T A$. Amongst the many applications of the SVD algorithm is the solution of least squares problems and the determination of the condition number of a matrix. .PP The algorithm implemented in EISPACK's SVD is usually considered to have two stages: .KS .TS center; l. (1) Determine $Q$ and $Z$ such that $J~=~QAZ$ is bidiagonal (2) Iteratively reduce $J$ to a diagonal matrix .TE .KE .PP In typical applications where $m >> n$ the first stage of the SVD calculation is the most time-consuming. 
The matrices $Q$ and $Z$ are products of Householder transformations and, as described in Section 2.1, the number of vector memory references can be reduced by replacing all vector matrix multiplications by calls to SXMPY, as is done in the subroutine SVDI given in Table 3.4. Moreover, since each Householder transformation from the left is followed by a Householder transformation from the right, one may use Algorithm D.1, and this is implemented in subroutine SVDV in Table 3.4. The second stage of the SVD calculation involves plane rotations which, when only the singular values are requested, do not involve any vector operations.
.PP When the singular vectors are requested, the Householder transformations which form $Q$ and $Z$ are accumulated in reverse order. Here again we can use the techniques described earlier. In the second stage, plane rotations are applied to vectors, and it is not easy to decrease the number of vector memory references, as was discussed in section 2.1.3.
.PP Chan [1] has described a modification of SVD which, when $m>>n$, requires up to 50% fewer floating point operations. Chan suggests first applying Householder transformations from the left to reduce $A$ to triangular form before applying the traditional SVD algorithm. Thus the Householder transformations applied from the right, which are designed to annihilate elements above the super-diagonal, would be applied to vectors of length $n$ rather than to vectors of length $m$. Unfortunately, on the CRAY-1 Chan's suggestion seems to produce at most a 10% speedup in execution time. When the inner product loops in all Householder transformation applications are replaced by calls to SMXPY, the execution times are still about the same as those of SVDV. .KS .TS center; c s s s s s s c s s s s s s c s s s s s s c| c s s | c s s c| n n n | n n n l| n n n | n n n. Table 3.4 CRAY-1 Times (in $10 sup -2$ sec) for the Singular Value Decomposition _ m 100 200 n 10 50 100 10 50 100 _ No singular vectors: SVD 1.02 10.8 30.9 1.92 18.4 54.7 SVDI 0.57 7.57 23.1 0.94 11.5 37.3 SVDV 0.38 6.1 19.4 0.55 8.2 28.0 _ With singular vectors: SVD 1.31 19.8 70.1 2.41 32.6 115 SVDI 0.85 16.5 61.8 1.43 25.6 97.3 SVDV 0.68 13.5 51.0 1.09 20.6 79.5 .TE .KE
.SH Conclusions
.PP In this paper we have shown how to make some of the subroutines in EISPACK more efficient on vector machines such as the CRAY-1. We have concentrated our efforts on speeding up programs that already run at vector speed but, because of bottlenecks caused by referencing vectors in memory, do not run at super-vector speeds. We have not considered subroutines which currently run at scalar speeds, such as BANDR [8] on small-bandwidth problems, TINVIT, and BISECT, all of which can be vectorized.
.PP Our techniques for speeding up the eigenvalue solvers do not significantly change the number of floating point operations, only the number of vector loads and stores. Since we have been able to obtain speedups in the range of 2 to 5, vector loads and stores seem to be the dominant factor in determining the time required by an algorithm. Thus the traditional merit function, the number of floating point operations, seems not to be as relevant for these machines as for scalar machines.
.PP For the most part we have been able to isolate computationally intensive sections of code into well-defined modules. This has made some of the programs shorter and has made their mathematical function clearer.
Some of the techniques used to gain better performance could be applied by an extremely clever vectorizing compiler. However, this is not usually the case. Certainly a clever compiler would not know that one could delay transformations as is done in the new ELMHES.
.PP Our techniques will always produce faster code, even on machines with conventional architecture. We have never resorted to assembly language. Thus our programs are transportable. Moreover, there is still room for some improvement by using assembly language modules in critical places.
.SH References
.IP [1] T. Chan, "An Improved Algorithm for Computing the Singular Value Decomposition", .I ACM Transactions on Mathematical Software .R \fB8\fP (1982), 72-89.
.IP [2] .R J.J. Dongarra and S.C. Eisenstat, "Squeezing the Most out of an Algorithm in CRAY Fortran," .I ACM Trans. Math. Software, .R Vol. 10, No. 3, 221-230, (1984).
.IP [3] J.J. Dongarra and C. B. Moler, .I EISPACK - A Package for Solving Matrix Eigenvalue Problems, .R in Sources and Development of Mathematical Software, Ed. W. R. Cowell, Prentice-Hall, New Jersey, (1984).
.IP [4] Jack J. Dongarra, Linda Kaufman, and Sven Hammarling, .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Argonne National Laboratory, ANL-MCSD-TM/46, January 1985.
.IP [5] Kirby Fong and Thomas L. Jordan, .I Some Linear Algebra Algorithms and Their Performance on the CRAY-1, .R Los Alamos Scientific Laboratory Report, UC-32, June 1977.
.IP [6] B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977.
.IP [7] R.W. Hockney and C.R. Jesshope, .I Parallel Computers, .R J.W. Arrowsmith Ltd, Bristol, Great Britain, 1981.
.IP [8] L. Kaufman, "Banded Eigenvalue Solvers on Vector Machines", .I ACM Transactions on Mathematical Software .R Vol. 10, No. 1, 73-86, (1984).
.IP [9] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 308-371.
.IP [10] R.M. Russell, .I The CRAY-1 Computer System, .R CACM, 21, 1 (January 1978), 63-72.
.IP [11] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976.
.IP [12] J.H. Wilkinson, .I The Algebraic Eigenvalue Problem, .R Oxford University Press, London (1965).
.IP [13] J.H. Wilkinson and C. Reinsch, eds., .I Handbook for Automatic Computation, Vol II, Linear Algebra, .R Springer-Verlag, New York (1971).
.bp
.SH Appendix
.PP This appendix contains timing information in the form of ratios of increased performance over the existing EISPACK routines on the Fujitsu VP-200 and Hitachi S-810/20 as they existed in September 1984. The matrix vector multiply routines SMXPY and SXMPY have been unrolled to a depth of eight for both the Fujitsu and Hitachi machines. A depth of eight gives the best performance on these machines. Subsequent hardware and software changes may affect the timing information to some extent. $n$ refers to the order of the matrix, and ratio is the execution time for the current version of the EISPACK routine divided by the time for the modified version. .TS center; l s s l l|l c c|c n n|n.
ELMHES Hitachi S-810/20 Fujitsu VP-200 n ratio ratio EISPACK/MV Version EISPACK/MV Version _ _ _ 50 1.1 1.1 100 1.6 1.6 150 1.9 1.8 200 2.0 1.8 250 2.1 1.8 300 2.2 1.9 .TE .TS center; l s s|s s l l s|l s l l l|l l c c c|c c n n n|n n. ORTHES ORTBAK Hitachi S-810/20 Fujitsu VP-200 ORTHES ORTBAK ORTHES ORTBAK n ratio ratio ratio ratio EISPACK/MV EISPACK/MV EISPACK/MV EISPACK/MV _ _ _ _ _ 50 1.8 3.6 1.9 3.2 100 2.1 4.6 2.3 3.2 150 2.2 4.9 2.5 3.6 200 2.2 4.6 2.7 3.6 250 2.2 4.0 2.8 3.9 300 2.2 3.8 2.9 4.0 .TE (Routines ORTHES and ORTBAK here are implemented using rank 1 updates only.) .KS .TS center; l l l s l l |l s c c |c c n n |n n. REDUC Hitachi S-810/20 Fujitsu VP-200 n ratio ratio EISPACK/MV Version EISPACK/MV Version _ _ _ _ 50 1.7 1.8 100 2.1 2.2 150 2.3 2.4 200 2.4 2.5 250 2.5 2.6 300 2.5 2.6 .TE .KE .TS center; l l s s s s s l l s l s s s c c c c c c c l|n n|n n n n. SVD Hitachi S-810/20 m = 100 m = 200 n=50 n=100 n=50 n=100 n=150 n=200 _ _ _ _ _ _ _ novect 1.7 1.6 2.0 1.9 1.8 1.7 vect 2.0 1.7 3.0 2.5 2.2 2.0 .TE .TS center; l l s s s s s l l s l s s s c c c c c c c l|n n|n n n n. SVD Fujitsu VP-200 m = 100 m = 200 n=50 n=100 n=50 n=100 n=150 n=200 _ _ _ _ _ _ _ novect 1.6 1.5 1.9 1.8 1.7 1.7 vect 1.9 1.6 2.7 2.4 2.1 1.7 .TE ($no~vect$ refers to computing just the singular values and $vect$ refers to computing both the singular values and left and right singular vectors. $m$ is the number of rows and $n$ the number of columns in the matrix.)