is a block Arnoldi reduction of length $m$ when $H_m$ is a band upper Hessenberg matrix of order $mb$, $V_m^{*}V_m = I_{mb}$, and $V_m^{*}F_m = 0$. For us, a band upper Hessenberg matrix is an upper triangular matrix with $b$ subdiagonals. The columns of $V_m$ are an orthogonal basis for the block Krylov subspace

$$\mathcal{K}_m(A,V_1) = \operatorname{span}\{V_1, AV_1, A^2V_1, \ldots, A^{m-1}V_1\}.$$

If $mb = n$, then $F_m = 0$ and $H_m = V_m^{*}AV_m$ is the orthogonal reduction of $A$ into banded upper Hessenberg form. We assume, for the moment, that $F_m$ is of full rank and further suppose that the elements on the diagonal of each (triangular) subdiagonal block of $H_m$ are positive. Thus, a straightforward extension of the implicit Q theorem [198, pp. 367-368] gives that the reduction is (uniquely) specified by the starting block $V_1$. Note that if $A = A^{*}$, then $H_m$ is a block tridiagonal matrix. Algorithm 7.11 lists an algorithm to compute a block Arnoldi reduction.
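Algorithm 7.11 is not reproduced here, but the reduction it computes can be illustrated with a minimal NumPy sketch for a dense $A$. This is a simplified illustration, not the algorithm itself: it performs an unconditional second Gram-Schmidt pass and omits the rank-deficiency safeguards discussed below; the function name and interface are ours.

```python
import numpy as np

def block_arnoldi(A, V1, m):
    """Sketch of a block Arnoldi reduction: returns V (n x (m+1)b) with
    orthonormal columns and a band upper Hessenberg H ((m+1)b x mb)
    satisfying A @ V[:, :m*b] = V @ H."""
    n, b = V1.shape
    V = np.zeros((n, (m + 1) * b))
    H = np.zeros(((m + 1) * b, m * b))
    V[:, :b], _ = np.linalg.qr(V1)                 # orthonormal starting block
    for j in range(m):
        cols = slice(j * b, (j + 1) * b)
        W = A @ V[:, cols]                         # apply A to a block of vectors
        # one step of block classical Gram-Schmidt (bCGR) ...
        H[: (j + 1) * b, cols] = V[:, : (j + 1) * b].T @ W
        W -= V[:, : (j + 1) * b] @ H[: (j + 1) * b, cols]
        # ... plus a second pass to maintain orthogonality
        C = V[:, : (j + 1) * b].T @ W
        H[: (j + 1) * b, cols] += C
        W -= V[:, : (j + 1) * b] @ C
        # QR factorization of the residual block yields the next basis block
        Q, R = np.linalg.qr(W)
        V[:, (j + 1) * b : (j + 2) * b] = Q
        H[(j + 1) * b : (j + 2) * b, cols] = R     # triangular subdiagonal block
    return V, H
```

Note that the triangular factors of the residual blocks land on the subdiagonal blocks of $H$, which is what makes $H$ band upper Hessenberg with $b$ subdiagonals.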

We now comment on some steps of Algorithm 7.11.

**(1)**- Here the QR factorization is computed via an iterated
CGS algorithm with a possible correction step.
See [96] for details and the simple test used to determine
whether a correction step is necessary.
One benefit of this scheme is that it allows the use of the Level 2
BLAS [133] matrix-vector multiplication
subroutine
`xGEMV` (also see §10.2). Moreover, this scheme also gives a simple way to fill out a rank-deficient block. For instance, if a third step of orthogonalization is needed when generating column $k$ of $V_{j+1}$, then the corresponding column of $F_j$ is linearly dependent on the previous columns. The $k$th diagonal element of the triangular factor is set to zero, and a random unit vector is orthogonalized against $V_j$ and the first $k-1$ columns of $V_{j+1}$.

**(3)**- This step allows the application of $A$ to a group of vectors. This might
prove essential when accessing $A$ is expensive. Clearly, the goal is
to amortize the cost of applying $A$ over several vectors.
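The point of (3) can be made concrete with a toy operator that counts how often $A$ is accessed; one blocked application replaces $b$ separate accesses. The class and names below are illustrative, not part of Algorithm 7.11.

```python
import numpy as np

class CountingOperator:
    """Toy stand-in for an expensive A (e.g., one read from slow storage
    per application); counts how many times A is accessed."""
    def __init__(self, A):
        self.A = A
        self.accesses = 0

    def apply(self, V):
        # V may hold a single vector or a block of b vectors
        self.accesses += 1
        return self.A @ V

A = CountingOperator(np.arange(16.0).reshape(4, 4))
V = np.eye(4)[:, :3]                  # a block of b = 3 vectors

W_block = A.apply(V)                  # as in (3): one access for the whole block
cols = [A.apply(V[:, [k]]) for k in range(3)]   # column at a time: 3 accesses
assert np.allclose(W_block, np.hstack(cols))    # same result either way
assert A.accesses == 4                # 1 blocked + 3 columnwise
```

The products agree; only the number of accesses to $A$ differs, which is the cost being amortized.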
**(4)**- This allows the use of the Level 3 BLAS [134]
matrix-matrix multiplication subroutine
`xGEMM` for computing the matrix of coefficients.

**(6)**- This is one step of block classical Gram-Schmidt (bCGR).
The Level 3 BLAS [134]
matrix-matrix multiplication subroutine
`xGEMM` is used for computing the rank-$jb$ update needed for the computation of $F_j$. To ensure the orthogonality of $F_j$ with $V_j$, a second step of bCGR is performed except when $b = 1$. In this latter case, the simple test in DGKS [96] is used to determine whether a second orthogonalization step is needed. See [295] for details.
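A common form of the DGKS-style test just mentioned reorthogonalizes only when the first pass cancels away most of the vector's norm. The sketch below assumes the conventional threshold $1/\sqrt{2}$ and classical Gram-Schmidt; it is a schematic of the idea, not code from [96] or [295].

```python
import numpy as np

def cgs_with_dgks(Q, v, eta=1.0 / np.sqrt(2.0)):
    """Orthogonalize v against the orthonormal columns of Q by classical
    Gram-Schmidt, with a DGKS-style test deciding on a second pass.
    Returns the residual w and coefficients h with v = Q @ h + w."""
    norm_before = np.linalg.norm(v)
    h = Q.T @ v                       # coefficients via one GEMV/GEMM
    w = v - Q @ h
    if np.linalg.norm(w) < eta * norm_before:
        # Severe cancellation occurred: take one corrective pass
        c = Q.T @ w
        w = w - Q @ c
        h = h + c
    return w, h
```

When the first pass already removes little of the norm (the common case), the corrective pass, and its extra matrix-vector products, is skipped.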

The scheme given for computing the reduction is equivalent to the one
proposed by Ruhe [375] (for symmetric matrices) and is the one
used by the implicitly restarted block Lanczos code [24].
Although the approach in [24] cleanly deals with the problem
of a rank-deficient $F_j$, the implementation does not exploit the
ability to apply $A$ to a group of vectors as in (3) above. Instead, as proposed
in [375], $A$ is applied to each column of $V_j$, followed by
computing the corresponding columns of $H_j$ and $F_j$. Our
implementation further reorganizes Ruhe's approach so that the
computation of the matrix of coefficients $H_j$ is
separated from the QR factorization of $F_j$. The advantage is that
steps (3)-(5) reduce the cost of I/O by a factor of the block size and
increase the number of floating point operations per memory reference.