.B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers .ps 11 .sp 3 Jack J. Dongarra, Linda Kaufman and, Sven Hammarling .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 46 .sp .7i January, 1985 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i This paper describes modifications to many of the standard algorithms used in computing eigenvalues and eigenvectors of matrices. These modifications can dramatically increase the performance of the underlying software on high performance computers without resorting to assembler language, without significantly influencing the floating point operation count, and without affecting the roundoff error properties of the algorithms. The techniques are applied to a wide variety of algorithms and are beneficial in various architectural settings. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim $$ .EN .TL .ps 12 .in 0 Squeezing the Most out of Eigenvalue Solvers\h'.35i' .br on High-Performance Computers\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|$size -1 {"" sup \(dg}$\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Linda Kaufman\h'.20i' .AI .ps 10 .in 0 AT&T Bell Laboratories\h'.20i' Murray Hill, New Jersey 07974\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .FS .ps 9 .vs 11p $size -1 {"" sup \(dg}$\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. .FE .QS .sp 2 .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes modifications to many of the standard algorithms used in computing eigenvalues and eigenvectors of matrices. These modifications can dramatically increase the performance of the underlying software on high performance computers without resorting to assembler language, without significantly influencing the floating point operation count, and without affecting the roundoff error properties of the algorithms. The techniques are applied to a wide variety of algorithms and are beneficial in various architectural settings. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH Introduction .PP On high-performance vector computers like the CRAY-1, CRAY X-MP, Fujitsu VP, Hitachi S-810, and Amdahl 1200, there are three basic performance levels \(em \fIscalar, vector, \f1and \f2super-vector. \f1For example, on the CRAY-1 [5,7,10], these levels produce the following execution rates: .KS .TS center; c c. Performance Level Rate of Execution* _ _ Scalar 0\(mi4 MFLOPS Vector 4\(mi50 MFLOPS Super-Vector 50\(mi160 MFLOPS .TE .KE .FS .ps 9 .vs 11p .ll .sp .25v *\|MFLOPS is an acronym for Million FLoating-point OPerations (additions or multiplications) per Second. .FE .R Scalar performance is obtained when no advantage is taken of the special features of the machine architecture. Vector performance is obtained by using the vector instructions to eliminate loop overhead and to take full advantage of the pipelined functional units. 
Super-vector performance is obtained by using vector registers to reduce the number of memory references and thus avoid letting the paths to/from memory become a bottleneck.
.PP Typically, programs written in Fortran run at scalar or vector speeds, so that one must resort to assembler language (or assembler language kernels) to improve performance. But in [2], Dongarra and Eisenstat describe a technique for attaining super-vector speeds \fIfrom Fortran\fP for certain algorithms in numerical linear algebra. They noticed that many algorithms had the basic form .EQ delim $$ .EN .nf .sp .KS .B Algorithm A: .R For $i ~=~$ 1 to $m$ $ y ~<-~ alpha sub i x sub i ~+~ y $ End .KE .fi .sp where $ alpha sub i $ is a scalar and $x sub i $ and $y$ are vectors. Unfortunately, when this algorithm is implemented in a straightforward way, the CRAY, Fujitsu, and Hitachi Fortran compilers do not recognize that it is the "\f2same\f1 $y$" acted upon every time, and they issue store vector $y$ and load vector $y$ commands between each vector addition. Thus the path to/from memory becomes the bottleneck. The compilers generate vector code of the general form: .DS .I .TS center; l. Load vector \fRY\fP Load scalar \fR$ alpha sub i $ \fP Load vector \fRX(I)\fP Multiply scalar \fR $ alpha sub i $ \fP times vector \fRX(I)\fP Add result to vector \fRY\fP Store result in \fRY\fP .TE .R .DE This gives 2 vector operations to 3 vector memory references. Moreover, because of the concept called "chaining" on the CRAY, Fujitsu, and Hitachi, the time for the vector multiply and add is practically insignificant. In most circumstances these may be initiated soon after the loading of the vector X(I) has begun, and for vectors of significant length the load, multiply, and add may be thought of as practically simultaneous operations.
.PP Dongarra and Eisenstat showed that if one unrolled the loop several times*, the number of memory references could be reduced and execution times often decreased by a factor of 2 or 3. .FS .ps 9 .vs 11p .ll .sp .25v *\|The loops have been unrolled to different depths on different machines depending on the effect on performance; on the CRAY the depth is 16 and on the Fujitsu and Hitachi the depth is 8. .FE For example, unrolling Algorithm A to a depth of two gives: .sp .nf .KS .B Algorithm A.2: .R For $i ~=~$ 2 to $m$ in steps of 2 $ y ~<-~ alpha sub i-1 x sub i-1 ~+~ alpha sub i x sub i ~+~ y $ End if ($m$ is odd) $ y ~<-~ alpha sub m x sub m ~+~ y $ .KE .fi .sp The compilers generate vector code of the general form: .DS .I .TS center; l. Load vector \fRY\fP Load scalar \fR$ alpha sub i-1 $ \fP Load vector \fRX(I-1)\fP Multiply scalar \fR $ alpha sub i-1 $ \fP times vector \fRX(I-1)\fP Add result to vector \fRY\fP Load scalar \fR$ alpha sub i $ \fP Load vector \fRX(I)\fP Multiply scalar \fR $ alpha sub i $ \fP times vector \fRX(I)\fP Add result to vector \fRY\fP Store result in \fRY\fP .TE .R .DE This gives 4 vector operations to 4 vector memory references. The larger the ratio of vector operations to vector memory references becomes, the better the performance of the program segment. This is because vector memory operations, i.e., loads and stores, cost as much as other vector operations. When the loop is unrolled to a depth of 8 there are 16 vector operations to 10 vector memory references.
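As a concrete illustration, a minimal Fortran sketch of Algorithm A.2 might take the following form; the subroutine and variable names are ours rather than those of any actual library kernel, and the columns of the array X are assumed to hold the vectors $x sub i$.
.sp
.KS
.nf
      SUBROUTINE YPAX2(N, M, ALPHA, X, LDX, Y)
C     Sketch of Algorithm A.2: form y <- alpha(i)*x(i) + y for i=1..m,
C     with the loop over i unrolled to a depth of two so that the
C     vector y is loaded and stored only once per pair of terms.
      INTEGER N, M, LDX, I, J
      REAL ALPHA(M), X(LDX,M), Y(N)
      DO 20 I = 2, M, 2
         DO 10 J = 1, N
            Y(J) = Y(J) + ALPHA(I-1)*X(J,I-1) + ALPHA(I)*X(J,I)
   10    CONTINUE
   20 CONTINUE
      IF (MOD(M,2) .NE. 0) THEN
         DO 30 J = 1, N
            Y(J) = Y(J) + ALPHA(M)*X(J,M)
   30    CONTINUE
      END IF
      RETURN
      END
.fi
.KE
.sp
Deeper unrolling follows the same pattern and further improves the ratio of vector operations to vector memory references, since $y$ is loaded and stored only once for each group of unrolled terms.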
Dongarra and Eisenstat incorporated this idea into two "kernel" subroutines: SMXPY, which added a matrix times a vector to another vector ($ Ax ~+~ y$), and SXMPY, which added a vector times a matrix to another vector ($ x sup T A ~+~ y sup T $). They showed that several linear system solvers could be rewritten using these kernel subroutines. .PP In this paper we try to apply the same concept to algorithms used in solving the eigenvalue problem. Normally these problems are solved in several steps: .sp .nf (1) Reduce the problem to a simpler problem (e.g., a tridiagonal matrix if the matrix was symmetric), (2) Solve the eigenproblem for the simpler problem, (3) If eigenvectors are requested, transform the eigenvectors of the simplified problem to those of the original problem. .fi .sp For symmetric problems, step (2) usually has the fewest floating point operations, while for nonsymmetric matrices step (2) has the most floating point operations. Because steps (1) and (3) often involve transformations that can be forced into the form of Algorithm A, we will concentrate our efforts on these steps. In certain cases speeding up these steps will not significantly affect the overall time required to solve the eigenproblem, but in other cases, such as the symmetric generalized eigenvalue problem, we will be speeding up the most time-consuming portion of the whole operation. Sometimes part of the algorithm simply has a matrix by vector multiplication; then application of Dongarra and Eisenstat's idea is straightforward. At other times, the code will need to be radically transformed to fit the form of Algorithm A. .PP In Section 2 we describe some underlying ideas that can be used to decrease memory references in various subroutines in the matrix eigenvalue package EISPACK [6,11,13]. In Section 3 we apply the concepts of Section 2 to specific subroutines in EISPACK and provide execution timing information on the CRAY-1 for subroutines provided in EISPACK 3, the current version of EISPACK [3], (The appendix contains execution timing information on the Hitachi S-810/20, and Fujitsu VP-200 (Amdahl 1200)). (In [4] we presented reprogramming of selected subroutines that are radically different from the original or are representative of a class of changes that might be applied to several subroutines.) .EQ delim $$ .EN .SH 2. Underlying Ideas .PP In this section we outline some of the underlying methods that occur throughout the algorithms used in the EISPACK package. We also discuss how they can be implemented to decrease vector memory references, without significantly increasing the number of floating point operations. .SH 2.1 Transformations .PP Many of the algorithms implemented in EISPACK have the following form: .sp .nf .KS .B Algorithm B: .R For i = 1,.... Generate matrix $T sub i$ Perform transformation $A sub i+1 ~<-~ T sub i A sub i T sub i sup -1$ End . .KE .fi .sp Because we are applying similarity transformations, the eigenvalues of $A sub i+1$ are those of $A sub i$. In this section we examine various types of transformation matrices $T sub i$. .SH 2.1.1 Stabilized elementary transformations .PP Stabilized elementary transformation matrices have the form $T~=~PL$, where $P$ is a permutation matrix, required to maintain numerical stability [12], and $L$ has the form .sp .KS .EQ left ( ~ matrix { ccol { 1 above nothing above nothing above nothing above nothing above nothing above nothing } ccol { nothing above . 
above nothing above nothing above nothing above nothing above nothing } ccol { nothing above nothing above 1 above nothing above nothing above nothing above nothing } ccol { nothing above nothing above nothing above 1 above * above . above * } ccol { nothing above nothing above nothing above nothing above 1 above nothing above nothing } ccol { nothing above nothing above nothing above nothing above nothing above . above nothing } ccol { nothing above nothing above nothing above nothing above nothing above nothing above 1 } } ~ right ) . .EN .KE The inverse of $L$ has the same structure as $L$. When $L sup -1 $ is applied on the right of a matrix, one has a subalgorithm with the exact form of Algorithm A, which can be implemented using SMXPY. Unfortunately, when applying $L$ on the left as in Algorithm B, one does not get the same situation. The vector $y$ changes, but the vector $x$ remains the same. However, in the sequence of transformations in Algorithm B, $T sub i$ consists of a matrix $L sub i$ whose off-diagonals are nonzero, only in the $i sup th$ column, and at the $i sup th$ step, one might apply transformations $T sub 1 $ through $T sub i$ only to the $i sup th $ row of the matrix. Subsequent row transformations from the left will not affect this row, and one can implement this part of the algorithm using SXMPY. .PP This idea was incorporated in the subroutine ELMHES, which will be discussed in Section 3. .SH 2.1.2 Householder transformations .PP In most of the algorithms the transformation matrices $T sub i$ are Householder matrices of the form .EQ Q ~=~ I - beta u u sup T , ~where ~beta u sup T u ~=~ 2, .EN so that $Q$ is orthogonal. To apply $Q$ from the left to a matrix $A$, one would proceed as follows: .KS .TS center; l. \f3Algorithm C\f1: 1. $v sup T ~=~ u sup T A$ 2. Replace $A$ by $A - beta uv sup T$ .TE .KE Naturally the first step in Algorithm C can be implemented using SXMPY, but the second step, the rank-one update, does not fall into the form of Algorithm A. However, when applying a sequence of Householder transformations, one may mitigate the circumstances somewhat by combining more than one transformation and thus performing a higher than rank one update to $A$. This is somewhat akin to the technique of loop unrolling discussed earlier. We give two illustrative examples. .PP Firstly suppose that we wish to form $(I- alpha ww sup T )A(I- beta uu sup T )$, where for a similarity transformation $alpha ~=~ beta$ and $w ~=~ u$. This is normally formed by first applying the left hand transformation as in Algorithm C, and then similarly applying the right hand transformation. But we may replace the two rank one updates by a single rank two update using the following algorithm. .KS .TS center; l. \f3Algorithm D.1 1. $v sup T ~=~ w sup T A$ 2. $x ~=~ Au$ 3. $y sup T ~=~ v sup T -( beta w sup T x ) u sup T$ 4. Replace $A $ by $ A - beta xu sup T - alpha w y sup T$ .TE .KE As a second example suppose that we wish to form $(I- alpha ww sup T )(I- beta uu sup T )A$, then as with Algorithm D.1 we might proceed as follows. .KS .TS center; l. \f3Algorithm D.2\f1: 1. $v sup T ~=~ w sup T A$ 2. $x sup T ~=~ u sup T A$ 3. $y sup T ~=~ v sup T -( beta w sup T u ) x sup T$ 4. Replace $A $ by $ A - beta ux sup T - alpha w y sup T$ .TE .KE In both cases we can see that Steps 1 and 2 can be achieved by calls to SXMPY and SMXPY. 
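For concreteness, a minimal Fortran sketch of Algorithm D.1, which forms $(I- alpha ww sup T )A(I- beta uu sup T )$, is given below; the subroutine name, the work arrays, and the explicit loops are ours, and in a production code Steps 1 and 2 would be calls to SXMPY and SMXPY while the rank-two update of Step 4 would itself be unrolled to reduce memory references.
.sp
.KS
.nf
      SUBROUTINE HSRK2(N, A, LDA, W, U, ALPHA, BETA, V, X, Y)
C     Sketch of Algorithm D.1: overwrite A with
C     (I - alpha*w*w') * A * (I - beta*u*u')
C     using a single rank-two update.  V, X and Y are work vectors
C     of length N supplied by the caller.
      INTEGER N, LDA, I, J
      REAL A(LDA,N), W(N), U(N), V(N), X(N), Y(N), ALPHA, BETA, S
C     Step 1:  v' = w'A   (in practice a call to SXMPY)
      DO 20 J = 1, N
         V(J) = 0.0
         DO 10 I = 1, N
            V(J) = V(J) + W(I)*A(I,J)
   10    CONTINUE
   20 CONTINUE
C     Step 2:  x = Au     (in practice a call to SMXPY)
      DO 40 I = 1, N
         X(I) = 0.0
         DO 30 J = 1, N
            X(I) = X(I) + A(I,J)*U(J)
   30    CONTINUE
   40 CONTINUE
C     Step 3:  y' = v' - (beta*w'x)*u'
      S = 0.0
      DO 50 I = 1, N
         S = S + W(I)*X(I)
   50 CONTINUE
      DO 60 J = 1, N
         Y(J) = V(J) - BETA*S*U(J)
   60 CONTINUE
C     Step 4:  A <- A - beta*x*u' - alpha*w*y'  (rank-two update)
      DO 80 J = 1, N
         DO 70 I = 1, N
            A(I,J) = A(I,J) - BETA*X(I)*U(J) - ALPHA*W(I)*Y(J)
   70    CONTINUE
   80 CONTINUE
      RETURN
      END
.fi
.KE
.sp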
Step 3 is a simple vector operation and Step 4 is now a rank-two correction, and one gets four vector memory references for each four vector floating point operations (rather than the three vector memory references for every two vector floating point operations, as in Step 2 of Algorithm C). The increased saving is not as much as is realized with the initial substitution of SXMPY for the inner products in Step 1 of Algorithm C, but it more than pays for the additional $2n$ operations incurred at Step 3 and exemplifies a technique that might pay off in certain situations. This technique was used to speed up a number of routines that require Householder transformations.
.SH 2.1.3 Plane Rotations
.PP Some of the most time-consuming subroutines in EISPACK, e.g., HQR2, QZIT, IMTQL2, and TQL2, spend most of their time applying transformations in 2 or 3 planes to rows or columns of matrices. We have been able to speed up the application of these transformations by only about 15%, but when 90% of the computation time is spent here, the overall effect is greater than that of a larger improvement to a part contributing only 10% of the total computation time.
.PP First of all, we should mention that on the CRAY-1 the time required by a 3-multiply Householder transformation in 2 planes is hardly less than that required by a 4-multiply Givens transformation [12]. Thus once again the computation time is influenced more by the number of vector memory references than by the number of floating point operations. We were able to eliminate several vector loads and stores by noticing that one of the planes used in one transformation is usually present in the next. Thus a typical Givens code which originally looked like .sp .KS .nf For $i=1$ to $n-1$ Compute $c sub i$ and $s sub i$ For $j=1$ to $n$ $t ~<-~h sub ji$ $h sub ji ~<-~ c sub i t + s sub i h sub j,i+1$ $h sub j,i+1 ~<-~ s sub i t - c sub i h sub j,i+1$ End End .fi .KE .sp would become .sp .nf .KS For $j=1$ to $n$ $t sub j ~<-~ h sub j1$ End For $i=1$ to $n-1$ Compute $c sub i$ and $s sub i$ For $j=1$ to $n$ $h sub ji ~<-~ c sub i t sub j + s sub i h sub j,i+1$ $t sub j ~<-~ s sub i t sub j - c sub i h sub j,i+1$ End End For $j=1$ to $n$ $h sub jn ~<-~t sub j$ End .KE .fi .sp and a typical Householder code which looked like .sp .KS .nf For $i=1$ to $n-1$ Compute $q sub i , x sub i$ and $ y sub i$ For $j=1$ to $n$ $p ~<-~ h sub ji + q sub i h sub j,i+1$ $h sub ji ~<-~ h sub ji + p x sub i$ $h sub j,i+1 ~<-~ h sub j,i+1 + p y sub i$ End End .KE .sp .fi would become .sp .nf .KS For $i=1$ to $ n-2$ in steps of 2 Compute $q sub i , x sub i , y sub i , q sub i+1 , x sub i+1 ,$ and $y sub i+1$ For $j=1$ to $n$ $p ~<-~ h sub ji + q sub i h sub j,i+1$ $r ~<-~ h sub j,i+1 + q sub i+1 h sub j,i+2 + y sub i p$ $h sub ji ~<-~ h sub ji + p x sub i$ $h sub j,i+1 ~<-~ h sub j,i+1 + p y sub i + r x sub i+1 $ $h sub j,i+2 ~<-~ h sub j,i+2 + r y sub i+1 $ End End .fi .KE .sp Notice that for the Householder transformations we have actually increased the total number of multiplications, but the execution time has still decreased. For a three-plane Householder transformation, like that found in HQR2, unrolling the loop twice causes about a 10% drop in execution time.
.PP Inserting the modified Givens code into a routine like TQL2 is an easy task. Changing codes like HQR2 to use the unrolled Householder transformations is rather unpleasant.
.SH 2.2 Triangular Solvers
.PP Assume one has an $n times n$ nonsingular lower triangular matrix $L$ and an $n times m$ matrix $B$, and wishes to solve .EQ (2.1) LY~=~B .
.EN .PP If $m~=~1$ one might normally proceed as follows: .sp .KS .nf \f3Algorithm E\f1: $y ~<-~ b$ For $i ~=~1$ to $n$ $y sub i ~<-~ y sub i /l sub ii$ For $j~=~ i+1$ to $n$ $y sub j ~<-~ y sub j ~-~l sub ji y sub i$ (2.2) End End .fi .KE .sp Equation (2.2) almost looks like Algorithm A, but the length of the $y$ vector decreases. Unrolling the $i$ loop once decreases the number of vector memory references from 3 for every 2 vector floating point operations to 4 for every 4 vector floating point operations. The unrolled code would be of the following form: .sp .KS .nf \f3Algorithm F\f1: $y ~<-~ b$ For $i~=~1$ to $n-1$ in steps of 2 $y sub i ~<-~y sub i /l sub ii$ $y sub i+1 ~<-~ (y sub i+1 ~-~ l sub i+1,i y sub i ) /l sub i+1,i+1$ For $j~=~i+2,...,n$ $y sub j ~<-~ y sub j ~-~y sub i l sub ji ~-~ y sub i+1 l sub j,i+1$ End End If ($n~mod~2 ~!=~ 0$) $y sub n ~<-~ y sub n /l sub nn$ .fi .KE .sp On the CRAY-1 the ratio of the execution time of Algorithm E to that of Algorithm F is about 1.5, as Table 2.1 indicates.
.PP However, when $m$ is sufficiently large that it makes computational sense to treat vectors of length $m$, one can do much, much better by computing $Y$ by rows rather than repeating either Algorithm E or Algorithm F for each column. Let $Y sub j$ denote the first $j$ rows of the matrix $Y$, $y sub j sup T$ denote its $j sup th$ row, and $l sub j sup T$ denote the $j sup th$ row of $L$. Then one might proceed as follows: .sp .KS .nf \f3Algorithm G\f1: $Y ~<-~ B$ For $j~=~1$ to $n$ $y sub j sup T ~<-~ b sub j sup T ~-~ l sub j sup T Y sub j-1$ (2.3) $y sub j sup T ~<-~ y sub j sup T / l sub jj$ End .fi .KE .sp Step (2.3) can be implemented using SXMPY. Obviously, working by rows is superior if $m$ is sufficiently large. Since Algorithm G uses vectors of length $m$, when $m$ is small one should use Algorithm F. We have discussed triangular solvers using a lower triangular matrix. One can implement the last three algorithms for an upper triangular matrix, in which case Algorithm G would determine the last row first and work backwards. Triangular solvers occur in the EISPACK subroutines REDUC and REBAK used in the solution of the symmetric generalized eigenvalue problem $Ax~=~ lambda B x$. Inserting calls to SXMPY and SMXPY decreases the time required by these subroutines to such an extent that the time required for the generalized eigenvalue problem is not appreciably more than that required for the standard eigenvalue problem on the high-performance computers under discussion. .KS .TS center; c s s s s c s s s s c c c c c n n n n n. Table 2.1: CRAY-1 Times in $10 sup -3$ Seconds for Triangular Solvers _ $n$ $m$ Algorithm E Algorithm F Algorithm G _ 100 1 .506 .340 5.14 25 12.5 8.32 6.92 100 49.9 33.3 14.4 _ 200 1 1.55 1.02 17.2 25 38.6 25.5 24.1 100 151 102 52.2 200 308 202 93.7 _ 300 1 3.15 2.05 35.9 25 78.5 51.1 51.8 150 472 306 162 300 940 613 290 .TE .KE .EQ delim $$ .EN
.SH 2.3 Matrix Multiplication with Symmetric Packed Storage
.EQ delim $$ .EN .PP The algorithms in EISPACK that deal with symmetric matrices permit the user to specify only the lower triangular part of the matrix. There are routines that require a two-dimensional array and use only the information in its lower portion, and routines that accept the matrix packed into a one-dimensional array. In either case one frequently needs to form the product of the symmetric matrix with a vector, $y ~<-~ Ax ~+~ y$.
The normal scheme for doing this matrix vector product would be .sp .KS .nf \f3Algorithm H\f1 For $j$=1 to $n$ $t ~<-~ y sub j$ For $i=j+1$ to $n$ $y sub i ~<-~ y sub i ~+~ a sub ij x sub j $ $t ~<-~ t ~+~ a sub ij x sub i $ End $y sub j ~<-~ t ~+~ a sub jj x sub j$ End .fi .KE .sp Certainly one might consider stepping the outer loop by 2, doing two inner products followed by a rank-two correction. Another alternative in the same vein, which unfortunately would not be amenable to a subroutine that packed the symmetric matrix into a one-dimensional array, is the following: .sp .KS .nf \f3Algorithm I\f1 For $i$ = 1 to $n-1$ in steps of 2 For $j$ = 1 to $i-1$ $y sub j ~<-~ y sub j ~+~ a sub ij x sub i ~+~ a sub i+1,j x sub i+1$ End For $j$ = $i+1$ to $n$ $y sub j ~<-~ y sub j ~+~ a sub ji x sub i ~+~ a sub j,i+1 x sub i+1$ End $y sub i ~<-~ y sub i ~+~ a sub i+1,i x sub i+1 ~+~ a sub ii x sub i$ End .KE .fi .sp .PP A less obvious technique is a divide-and-conquer approach. If we consider referencing a symmetric matrix in a matrix vector product where the matrix is specified in the lower triangular matrix, we have .EQ left ( matrix { ccol { y sub 1 above y sub 2 } } right ) mark ~=~ left ( matrix { ccol { y sub 1 above y sub 2 } } right ) ~+~ left ( matrix { ccol { T sub 1 above B } ccol { B sup T above T sub 2 } } right ) ~ times ~ left ( matrix { ccol { x sub 1 above x sub 2 } } right ) , .EN where $T sub 1 $ and $ T sub 2 $ are symmetric matrices stored in the lower portion and $ B$ is full. This can be written as .sp .KS .nf \f3Algorithm J\f1: Set $ y sub 1 ~=~ T sub 1 x sub 1 ~+~ B sup T x sub 2 $ Set $ y sub 2 ~=~ B x sub 1 ~+~ T sub 2 x sub 2 $ . .fi .KE .sp .PP In writing the matrix multiply this way, two things should be noted. There are two square $ n over 2 ~ times n over 2 $ full matrix vector multiplications, and two symmetric matrix vector products. .PP Table 2.2 gives a comparison on the CRAY 1 of Algorithm H (a standard approach), to Algorithm I, Algorithm HJ (where $T sub 1 x sub 1$ and $T sub 2 x sub 2$ of Algorithm J are done according to Algorithm H), and Algorithm IJ (where these are done according to Algorithm I). .KS .ce Table 2.2 .ce Comparison of Execution Times on the CRAY 1 .ce for Symmetric Matrix Multiply .TS center; l c s s l l l l n n n n. Order Ratio of Execution Times Alg. H/Alg. I Alg. H/Alg. HJ Alg. H/Alg. IJ _ 100 2.14 1.24 2.13 200 1.87 1.43 2.12 300 1.78 1.53 2.08 .TE .KE .PP When the matrix is packed in a one-dimensional array stored by column, the same divide and conquer approach can be applied. .EQ delim $$ .EN .SH 3. Subroutines in EISPACK .SH 3.1 The Unsymmetric Eigenvalue Problem .PP In this section we investigate methods for the efficient implementation of the algorithms that deal with the standard eigenvalue problem .EQ A x ~=~ lambda x , .EN where $A$ is a real general matrix. The algorithms for dealing with this problem follow the form .sp .IP (1) Reduce A to upper Hessenberg form (ELMHES or ORTHES). .IP (2) Find the eigensystem of the upper Hessenberg matrix. .IP (3) If eigenvectors are requested, back transform the eigenvectors of the Hessenberg matrix to form the eigenvectors of the original system (ELMBAK or ORTBAK). .in For this particular problem, most of the time is spent in the second step but as was discussed in section 2.1.3 this part is not easy to vectorize so we concentrate our discussion on Steps 1 and 3. 
.EQ delim $$ .EN .SH 3.1.1 ELMHES and ORTHES .PP In the subroutine ELMHES, which reduces a matrix to upper Hessenberg form while preserving the eigenvalues of the original matrix, a sequence of stabilized elementary transformations are used. The transformations are of the form .EQ T sub k ... T sub 2 T sub 1 A T sub 1 sup -1 T sub 2 sup -1 ... T sub k sup -1 . .EN This set of transformations has the effect of reducing the first $k$ columns of $A$ to upper Hessenberg form. .PP The transformations can be applied in such a way that matrix-vector operations are used in the time consuming part. At the $k sup th$ stage of the reduction we apply the previous $k-1$ transformations on the left side of the reduced $A$ to the last $(n-k)$ elements of the $k+1 sup st$ row. Then, on the right the inverse of the $k sup th$ transformation is applied to the reduced matrix $A$ followed by the application on the left of all $k$ transformations to the elements below the diagonal of the $k sup th$ column. Because of the structure of the transformations (see 2.1.1) both these steps are simple matrix-vector multiplication. The application of transformations from the left follows essentially the algorithm given in Dongarra and Eisenstat [2] for the LU decomposition of a matrix. In the original EISPACK codes at the $k sup th$ stage permutations from the left are applied to only the last $n-k+1$ columns of the matrix. In our new code in order to use SMXPY and SXMPY, we must apply these permutations to the whole matrix. Thus the elements below the subdiagonal of the $A$ matrix which are necessary for finding the eigenvectors might be slightly scrambled and hence the user must use the modified ELMBAK given in section 3.1.3. .PP The subroutine ORTHES uses Householder orthogonal similarity transformations to reduce $A$ to upper Hessenberg form. At the $k sup th$ stage we perform the operation .EQ Q sub k A Q sub k sup T , .EN where $Q sub k ~=~ I ~-~ beta u u sup T $. As shown in Algorithm D.1, the usual two rank one updates may be replaced by a rank one update to the first $k$ rows of $A$ followed by a rank two update to rows $k+1$ through $n$. In this case Algorithm D.1 becomes .KS .EQ I 1.~~~ v sup T ~=~ u sup T A .EN .EQ I 2.~~~ x ~=~ Au .EN .EQ I 3.~~~ y sup T ~=~ v sup T -( beta u sup T x ) u sup T .EN .EQ I 4.~~~ Replace ~A ~by ~ A - beta ( xu sup T + u y sup T ) .EN .KE .PP Seeing the transformations applied in this way leads to a straightforward matrix-vector implementation. Table 3.1 reports the comparison between the EISPACK implementations and the ones just described. Significant speedups are accomplished using these constructs. .sp .KS .ce Table 3.1 .ce Comparison of Execution on the CRAY 1 .ce for Routine ELMHES and ORTHES .TS center; l l s s l l s s l c c s l c c c n n n n. Order Ratio of Execution Times EISPACK / MV version ELMHES ORTHES rank 1 only rank 2 _ _ _ _ 50 1.5 2.0 2.5 100 2.2 1.9 2.5 150 2.4 1.8 2.4 .TE .KE .SH 3.1.2 ELTRAN and ORTRAN .PP If all the eigenvectors are requested, one might choose to use either ELTRAN or ORTRAN (depending on whether one used ELMHES or ORTHES) followed by a call to HQR2 rather than finding the eigenvectors using INVIT and then back transforming using ELMBAK or ORTBAK. ELTRAN requires no floating point operations, but because of the use of stabilized elementary transformations in ELMHES, it may require swapping of various rows of the partial eigenvector matrix being constructed. Because ELMHES has changed, the swapping in ELTRAN is slightly different. 
ORTRAN applies the Householder transformations determined in ORTHES to the identity matrix. By combining two Householder transformations we can perform a rank two update to $I$ using the technique described in section 2.1.2, and this realizes a cut in the execution time for this routine by a factor of two. .SH 3.1.3 ELMBAK and ORTBAK .PP Both ELMBAK and ORTBAK compute the eigenvectors of the original matrix given the eigenvectors of the upper Hessenberg matrix and the transformations used to reduce the original matrix. This requires that a set of transformations be applied on the left to the matrix of eigenvectors in reverse order. The reduction is of the form $T A T sup -1 T X ~=~ lambda T X ,$ where $~ T ~=~ T sub n-1 ... T sub 2 T sub 1 $. The eigenvectors, say $Y$, of the reduced matrix $H$ are found using, .EQ I H ~=~ T A T sup -1 ~~ and ~~H Y ~=~ lambda Y , .EN .sp then the eigenvectors for the original problem are computed as, .sp .EQ I X ~=~ T sup -1 Y ~=~ T sub 1 sup -1 T sub 2 sup -1 ... T sub n sup -1 Y . .EN .PP The original EISPACK subroutines use $T$ as a product of transformations as given above. For ELMBAK we use a slightly different approach. As in Section 2.1.1, each $T sub i$ may be written as $L sub i P sub i$, where $P sub i$ is a permutation matrix and $L sub i$ is a lower triangular matrix. On output from the new ELMHES, let $B$ be the $(n-1) times (n-1)$ lower triangular matrix below the subdiagonal of the reduced $A$. Let $C$ be the unit lower triangular matrix .EQ C ~=~ left ( matrix { ccol { 1 above 0 above . above . above . above 0 } ccol { nothing above 1 above nothing above nothing above nothing above nothing } ccol { nothing above nothing above . above nothing above B above nothing } ccol { nothing above nothing above nothing above . above nothing above nothing } ccol { nothing above nothing above nothing above nothing above . above nothing } ccol { nothing above nothing above nothing above nothing above nothing above 1 } } right ) . .EN Then one can show that $T sup -1 ~=~P sub 1 P sub 2 ......P sub n-1 C$. .PP Since ORTBAK involves a product of Householder transformations, reducing the number of vector memory references is again a straightforward task. Dramatic improvements are seen in these back transformation routines, as shown in Tables 3.2 below. Originally ELMBAK was 2.4 times faster than ORTBAK; in the MV version it only enjoys an advantage of 1.9 over ORTBAK using $(n-1)/2$ rank 2 changes. .sp .KS .ce Table 3.2 .ce Comparison of Execution on the CRAY 1 .ce for EISPACK Routines ORTBAK and ELMBAK .TS center; l c s s l c s s l c c s l c c c n n n n. Order Ratio of Execution Times EISPACK/ MV Version ELMBAK ORTBAK Rank 1 Rank 2 _ _ _ _ 50 2.2 2.8 3.6 100 2.6 2.5 3.3 150 2.7 2.3 3.0 .TE .KE .EQ delim $$ .EN .SH 3.2 The Symmetric Eigenvalue Problem .EQ delim $$ .EN .PP In this section we look at the methods for efficient implementation of the algorithms that deal with the symmetric eigenvalue problem .EQ A x ~=~ lambda x , .EN where $A$ is a real symmetric matrix. The algorithms for dealing with this problem have two possible paths .sp .B Path 1 .sp .IP (1) Transform $A$ to tridiagonal form (TRED1). .IP (2) Find the eigenvalues of the tridiagonal matrix (IMTQLV). .IP (3) If eigenvectors are requested, find the eigenvectors of the tridiagonal matrix by inverse iteration (TINVIT). .IP (4) If eigenvectors are requested, back transform the vectors of the tridiagonal matrix to form the eigenvectors of the original system (TRBAK1). 
.in or .br .B Path 2 .sp .IP (1) Transform $A$ to tridiagonal form, accumulating the transformations (TRED2). .IP (2) Find the eigenvalues of the tridiagonal matrix and accumulate the transformations to give the eigenvectors of the original matrix (IMTQL2). .sp .in
.PP On conventional serial machines, Path 2 typically requires nearly twice as much time as Path 1. On vector machines, however, we do not see this relationship. For the EISPACK codes Path 2 is slightly faster, and after the modification described below the two paths require roughly the same amount of time. This is the result of two problems in routine TINVIT. First, TINVIT has not been modified to induce vectorization at any level. One can achieve an increase in performance by vectorizing across the eigenvectors being computed. We have not presented an algorithm of this form since it requires a different technique to achieve performance and cannot run at super-vector rates. The time spent in TINVIT on serial machines is inconsequential with respect to the total time to execute Path 1. However, on vector machines TINVIT becomes a significant contributor to the total execution time of the path. The second factor influencing performance for Path 1 is that the current version of TINVIT has a call to an auxiliary routine, PYTHAG, in an inner loop of the algorithm. PYTHAG is used to safely and portably compute the square root of the sum of squares. If TINVIT is modified to replace the call to PYTHAG by a simple square root, the time for TINVIT improves by about 30%.
.PP We note that the advantage of Path 2 is that near orthogonality of the eigenvectors is guaranteed, while with Path 1 one may see some degradation in this property for eigenvectors corresponding to close eigenvalues. Both paths give excellent eigensystems in the sense that they are eigensystems of a problem close to the original problem [12,13].
.PP We will now describe the implementation of routines TRED1 and TRED2 using matrix vector operations.
.SH 3.2.1 TRED1, TRED2 and TRBAK1
.PP Routines TRED1 and TRED2 reduce a real symmetric matrix to a symmetric tridiagonal matrix using orthogonal similarity transformations. An $n times n$ matrix requires $n-2$ transformations, each of which introduces zeros into a particular row and column of the matrix, while preserving symmetry and preserving the zeros introduced by previous transformations. TRED1 is used just to compute the tridiagonal matrix, while TRED2, in addition to computing the tridiagonal matrix, also returns the orthogonal matrix which would transform the original matrix to this tridiagonal matrix. This orthogonal matrix will later be used in computing the eigenvectors of the original matrix. TRBAK1 forms the eigenvectors of the real symmetric matrix from the eigenvectors of the symmetric tridiagonal matrix determined by TRED1. These subroutines deal with the real symmetric matrix as stored in the lower triangle of an array.
.PP The sequence of transformations applied to the matrix $A$ is of the form .EQ I A sub i+1 ~<-~ Q sub i A sub i Q sub i ~, i ~=~ 1,2,...,n-2 , .EN where each $Q sub i$ is a Householder matrix of the form described in section 2.1.2. Each of the similarity transformations is applied as in Algorithm D.1 with the simplification that $w$ and $u$ are the same, so that the application becomes: .KS .TS center; l. 1. $x ~=~ Au$ 2. $y sup T ~=~ x sup T -( beta u sup T x ) u sup T$ 3.
Replace $A $ by $ A - beta xu sup T - beta u y sup T$ .TE .KE Since the matrix $A$ is symmetric and stored in the lower triangle of the array, the matrix vector operation in step 1 follows the form described in Section 2.3 as implemented in Algorithm IJ. .PP TRED2 differs from TRED1 in that the transformation matrices are accumulated in an array $Z$. The sequence of transformations applied to the matrix $Z$ is of the form .EQ I Z sub n-2 ~=~ Q sub n-2 .EN .EQ I Z sub i ~<-~ Q sub i Z sub i+1 ~, i ~=~ n-3,...,2,1 . .EN This can be implemented in a straightforward manner as in Algorithm C of section 2.1.2 using matrix vector multiply and a rank one update. Since all transformations are available at the time they are to be accumulated, more than one transformation can be accumulated at a time, say two at a time, thus giving a rank two update. This then gives an implementation that has the form of Algorithm D.2 in section 2.1.2. .PP When TRED1 and TRED2 are implemented as described, significant improvements in the execution time can be realized on vector machines. Table 3.3 displays the execution time for the current EISPACK versions of TRED1 and TRED2 as well as the modified matrix vector implementations, referred to as TRED1V and TRED2V. .PP TRBAK1 applies the transformations used by TRED1 to reduce the matrix to tridiagonal form. This can be organized as in TRED2, by matrix vector multiplication and a rank-2 update. The table below shows the improvement in performance when this is implemented. .sp .KS .ce Comparison of Execution on the CRAY 1 .ce for EISPACK Routine TRBAK1 .TS center; l l l l n n. Order Ratio of Execution Times EISPACK/MV Version _ _ 50 4.20 100 3.66 .TE .KE .sp .SH 3.3 The Symmetric Generalized Eigenvalue Problem .PP In this section we consider methods for increasing the efficiency of the subroutines in EISPACK for solving the generalized eigenvalue problem .EQ Ax ~=~ lambda B x , .EN where $A$ and $B$ are symmetric matrices and $B$ is positive definite. In EISPACK this problem is solved in the following steps, with the name of the corresponding subroutine in the package given in parenthesis: .sp .in .5i .IP (1) Factor $B$ into $LL sup T$ and form $C~=~L sup -1 AL sup -T $ (REDUC). .br .IP (2) Solve the symmetric eigenvalue problem $C y ~=~ lambda y $. .br .IP (3) If eigenvectors are requested, transform the eigenvectors of $C$ into those of the original system (REBAK). .sp .in In general the majority of the execution time is spent in REDUC and REBAK, and it will be these routines on which we will concentrate. .SH 3.3.1 REDUC .PP REDUC has three main sections: .KS .TS center; l. 1. Find the Cholesky factors of B, i.e., finding lower triangular $L$ such that $B~=~LL sup T$ . 2. Find the upper triangle of $E~=~L sup -1 A$ . 3. Find the lower triangle of $C ~=~ L sup -1 E sup T$ . .TE .KE .PP Step 1, the Cholesky factorization, was discussed in Dongarra and Eisenstat [2]; its inner loop can be replaced by a call to SMXPY. Step 2 is a lower triangular solve. The original code in EISPACK follows the suggestion in Section 2.2 and computes $E$ by rows. Thus it is a simple matter to replace the inner loop by a call to SMXPY. Step 3 is another lower triangular solve. The EISPACK encoding computes $C$ by columns and uses the fact that $C$ is symmetric. Thus the first $i-1$ elements of the $i sup th$ column of $C$ are already known before the code commences to work on the $i sup th$ column. For the $i sup th$ column REDUC has 2 inner loops. 
The first updates the last $n-i$ elements with the previous known $i-1$ elements. The second does a lower triangle solve with an $(n-1) times (n-1) $ matrix as in Algorithm E of Section 2.2. The first loop can be easily implemented using a SMXPY; the only hope for easily increasing the efficiency of the second loop is using Algorithm F of 2.2. .PP Thus it is straightforward to replace three of the four inner loops of REDUC by SMXPY, and this is accomplished by REDUC3 listed in Table 3.3. The decrease in execution time of REDUC3 from REDUC is quite surprising considering the changes being made affect only how the matrix is accessed. REDUCV replaces the fourth loop of REDUC by Algorithm F of Section 2.2. It produces a further modest saving. .PP REDUC4 replaces the two inner loops of Step 3 of REDUC by a modification of Algorithm G which computes only the first $i$ elements of the $i sup th$ row of $C$ rather than the whole row. Because $C$ is symmetric, these first $i$ elements are also the first $i$ elements of the $i sup th$ column of $C$. Thus by the end of the $i sup th$ stage of Step 3 of REDUC4, the top $i times i$ submatrix of $C$ has been computed while at the same stage of REDUC and REDUC3, the first $i$ columns of $C$ have been computed. REDUC4, as Table 3.3 indicates, is the least expensive of the subroutines, but it has one major drawback. REDUC, REDUC3, and REDUCV overwrite only the lower triangular portions of the A and B matrices while forming $L,~E,$ and $C$. REDUC4 overwrites the whole $A$ matrix. .sp .SH 3.3.2 REBAK .PP The subroutine REBAK takes the eigenvectors $Y$ of the standard symmetric eigenproblem and forms those of the original problem $X$ by multiplying $Y$ by $L sup -T$. Thus it is an upper triangular solve with many righthand sides. The original REBAK computes $X$ one column at a time using inner products. REBAKV uses the upper triangular version of Algorithm G to compute $X$. The difference in computation times given in Table 3.3 for REBAK and REBAKV is really remarkable considering that they require the same number of floating point operations. .EQ delim $$ .EN .KS .TS center; c s s c s s c s s c | c | c l | n | n. Table 3.3 CRAY -1 Times (in $10 sup -2$ sec) for the Symmetric Generalized Eigenvalue Problem _ Subroutine n=100 n=200 _ REDUC 16.9 85.5 REDUC3 4.26 23.6 REDUCV 3.62 19.5 REDUC4 3.00 16.1 _ TRED1 6.94 38.5 TRED1V 4.95 29.7 _ TRED2 14.3 84.5 TRED2V 8.31 51.3 _ TQL1 7.58 29.1 TQL2 19.8 117 _ REBAK 9.79 52.5 REBAKV 2.20 15.3 _ No vectors total old 32.92 165.4 total new 16.15 78.3 _ Vectors total old 60.8 339.6 total new 33.9 203.1 .TE .KE .SH 3.4 The Singular Value Decomposition .PP The singular value decomposition (the SVD) of an $m times n$ matrix $A$ is given by .EQ A~=~U SIGMA V sup T , .EN where $U$ is an $m times m$ orthogonal matrix, $V$ is an $n times n$ orthogonal matrix, and $SIGMA$ is an $m times n$ diagonal matrix containing the singular values of $A$, which are the non-negative square roots of the eigenvalues of $A sup T A$. Amongst the many applications of the SVD algorithm is the solution of least squares problems and the determination of the condition number of a matrix. .PP The algorithm implemented in EISPACK's SVD is usually considered to have two stages: .KS .TS center; l. (1) Determine $Q$ and $Z$ such that $J~=~QAZ$ is bidiagonal (2) Iteratively reduce $J$ to a diagonal matrix .TE .KE .PP In typical applications where $m >> n$ the first stage of the SVD calculation is the most time-consuming. 
The matrices $Q$ and $Z$ are products of Householder transformations and, as described in Section 2.1, the number of vector memory references can be reduced by replacing all vector matrix multiplications by calls to SXMPY, as is done in the subroutine SVDI given in Table 3.4. Moreover, since each Householder transformation from the left is followed by a Householder transformation from the right, one may use Algorithm D.1, and this is implemented in subroutine SVDV in Table 3.4. The second stage of the SVD calculation involves plane rotations which, when only the singular values are requested, do not involve any vector operations.
.PP When the singular vectors are requested, the Householder transformations which form $Q$ and $Z$ are accumulated in reverse order. Here again we can use the techniques described earlier. In the second stage, plane rotations are applied to vectors, and it is not easy to decrease the number of vector memory references, as was discussed in section 2.1.3.
.PP Chan [1] has described a modification of SVD which, when $m>>n$, requires up to 50% fewer floating point operations. Chan suggests first applying Householder transformations from the left to reduce $A$ to triangular form before applying the traditional SVD algorithm. Thus the Householder transformations applied from the right, which are designed to annihilate elements above the super-diagonal, would be applied to vectors of length $n$ rather than to vectors of length $m$. Unfortunately, on the CRAY-1 Chan's suggestion seems to produce at most a 10% speedup in execution time. When the inner product loops in all Householder transformation applications are replaced by calls to SMXPY, the execution times are still about the same as those of SVDV. .KS .TS center; c s s s s s s c s s s s s s c s s s s s s c| c s s | c s s c| n n n | n n n l| n n n | n n n. Table 3.4 CRAY-1 Times (in $10 sup -2$ sec) for the Singular Value Decomposition _ m 100 200 n 10 50 100 10 50 100 _ No singular vectors: SVD 1.02 10.8 30.9 1.92 18.4 54.7 SVDI 0.57 7.57 23.1 0.94 11.5 37.3 SVDV 0.38 6.1 19.4 0.55 8.2 28.0 _ With singular vectors: SVD 1.31 19.8 70.1 2.41 32.6 115 SVDI 0.85 16.5 61.8 1.43 25.6 97.3 SVDV 0.68 13.5 51.0 1.09 20.6 79.5 .TE .KE
.SH Conclusions
.PP In this paper we have shown how to make some of the subroutines in EISPACK more efficient on vector machines such as the CRAY-1. We have concentrated our efforts on speeding up programs that already run at vector speed but, because of bottlenecks caused by referencing vectors in memory, do not run at super-vector speeds. We have not considered subroutines which currently run at scalar speeds, such as BANDR [8] on small-bandwidth problems, TINVIT, and BISECT, all of which can be vectorized.
.PP Our techniques for speeding up the eigenvalue solvers do not significantly change the number of floating point operations, only the number of vector loads and stores. Since we have been able to obtain speedups in the range of 2 to 5, vector loads and stores seem to be the dominant factor in determining the time required by an algorithm. Thus the traditional merit function, the number of floating point operations, seems not to be as relevant for these machines as for scalar machines.
.PP For the most part we have been able to isolate computationally intensive sections of code into well-defined modules. This has made some of the programs shorter and has made their mathematical function clearer.
Some of the techniques used to gain better performance could be applied by an extremely clever vectorizing compiler. However, this is not usually the case. Certainly a clever compiler would not know that one could delay transformations as is done in the new ELMHES.
.PP Our techniques will always produce faster code, even on machines with conventional architecture. We have never resorted to assembly language. Thus our programs are transportable. Moreover, there is still room for some improvement by using assembly language modules in critical places.
.SH References
.IP [1] T. Chan, "An Improved Algorithm for Computing the Singular Value Decomposition", .I ACM Transactions on Mathematical Software .R \fB8\fP (1982), 72-89.
.IP [2] .R J.J. Dongarra and S.C. Eisenstat, "Squeezing the Most out of an Algorithm in CRAY Fortran," .I ACM Trans. Math. Software, .R Vol. 10, No. 3, 221-230, (1984).
.IP [3] J.J. Dongarra and C. B. Moler, .I EISPACK - A Package for Solving Matrix Eigenvalue Problems, .R in Sources and Development of Mathematical Software, Ed. W. R. Cowell, Prentice-Hall, New Jersey, (1984).
.IP [4] Jack J. Dongarra, Linda Kaufman, and Sven Hammarling, .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Argonne National Laboratory, ANL-MCSD-TM/46, January 1985.
.IP [5] Kirby Fong and Thomas L. Jordan, .I Some Linear Algebra Algorithms and Their Performance on the CRAY-1, .R Los Alamos Scientific Laboratory Report, UC-32, June 1977.
.IP [6] B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977.
.IP [7] R.W. Hockney and C.R. Jesshope, .I Parallel Computers, .R J.W. Arrowsmith Ltd, Bristol, Great Britain, 1981.
.IP [8] L. Kaufman, "Banded Eigenvalue Solvers on Vector Machines", .I ACM Transactions on Mathematical Software .R Vol. 10, No. 1, 73-86, (1984).
.IP [9] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 308-371.
.IP [10] R.M. Russell, .I The CRAY-1 Computer System, .R CACM, 21, 1 (January 1978), 63-72.
.IP [11] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976.
.IP [12] J.H. Wilkinson, .I The Algebraic Eigenvalue Problem, .R Oxford University Press, London (1965).
.IP [13] J.H. Wilkinson and C. Reinsch, eds., .I Handbook for Automatic Computation, Vol II, Linear Algebra, .R Springer-Verlag, New York (1971).
.bp
.SH Appendix
.PP This appendix contains timing information in the form of ratios of increased performance over the existing EISPACK routines on the Fujitsu VP-200 and Hitachi S-810/20 as they existed in September 1984. The matrix vector multiply routines SMXPY and SXMPY have been unrolled to a depth of eight for both the Fujitsu and Hitachi machines. A depth of eight gives the best performance on these machines. Subsequent hardware and software changes may affect the timing information to some extent. $n$ refers to the order of the matrix, and ratio is the execution time for the current version of the EISPACK routine divided by the time for the modified version. .TS center; l s s l l|l c c|c n n|n.
ELMHES Hitachi S-810/20 Fujitsu VP-200 n ratio ratio EISPACK/MV Version EISPACK/MV Version _ _ _ 50 1.1 1.1 100 1.6 1.6 150 1.9 1.8 200 2.0 1.8 250 2.1 1.8 300 2.2 1.9 .TE .TS center; l s s|s s l l s|l s l l l|l l c c c|c c n n n|n n. ORTHES ORTBAK Hitachi S-810/20 Fujitsu VP-200 ORTHES ORTBAK ORTHES ORTBAK n ratio ratio ratio ratio EISPACK/MV EISPACK/MV EISPACK/MV EISPACK/MV _ _ _ _ _ 50 1.8 3.6 1.9 3.2 100 2.1 4.6 2.3 3.2 150 2.2 4.9 2.5 3.6 200 2.2 4.6 2.7 3.6 250 2.2 4.0 2.8 3.9 300 2.2 3.8 2.9 4.0 .TE (Routines ORTHES and ORTBAK here are implemented using rank 1 updates only.) .KS .TS center; l l l s l l |l s c c |c c n n |n n. REDUC Hitachi S-810/20 Fujitsu VP-200 n ratio ratio EISPACK/MV Version EISPACK/MV Version _ _ _ _ 50 1.7 1.8 100 2.1 2.2 150 2.3 2.4 200 2.4 2.5 250 2.5 2.6 300 2.5 2.6 .TE .KE .TS center; l l s s s s s l l s l s s s c c c c c c c l|n n|n n n n. SVD Hitachi S-810/20 m = 100 m = 200 n=50 n=100 n=50 n=100 n=150 n=200 _ _ _ _ _ _ _ novect 1.7 1.6 2.0 1.9 1.8 1.7 vect 2.0 1.7 3.0 2.5 2.2 2.0 .TE .TS center; l l s s s s s l l s l s s s c c c c c c c l|n n|n n n n. SVD Fujitsu VP-200 m = 100 m = 200 n=50 n=100 n=50 n=100 n=150 n=200 _ _ _ _ _ _ _ novect 1.6 1.5 1.9 1.8 1.7 1.7 vect 1.9 1.6 2.7 2.4 2.1 1.7 .TE ($no~vect$ refers to computing just the singular values and $vect$ refers to computing both the singular values and left and right singular vectors. $m$ is the number of rows and $n$ the number of columns in the matrix.)