To obtain the elementary Householder vector $v$,
the Euclidean norm of the vector $x$, $\|x\|_2$, is required.
The sequential LAPACK routine, `DLARFG`, calls the Level-1 BLAS
routine, `DNRM2`, which computes the norm while guarding against
underflow and overflow.
In the corresponding parallel ScaLAPACK routine, `PDLARFG`,
each process in the column of processes which holds the vector $x$
computes the global norm safely using the `PDNRM2` routine.
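The scaling idea behind `DNRM2` can be sketched as follows (a minimal Python illustration of the technique, not ScaLAPACK's actual Fortran; `PDNRM2` additionally combines the per-process partial sums across the process column):

```python
import math

def safe_nrm2(x):
    """Euclidean norm of x, guarding against overflow and underflow
    in the spirit of the Level-1 BLAS routine DNRM2: scale every
    entry by the largest magnitude before squaring."""
    scale = max((abs(v) for v in x), default=0.0)
    if scale == 0.0:
        return 0.0
    # Each scaled entry lies in [-1, 1], so the sum of squares
    # cannot overflow, and tiny entries are not squared to zero.
    ssq = sum((v / scale) ** 2 for v in x)
    return scale * math.sqrt(ssq)

# A naive sqrt(sum(v*v)) would overflow here; the scaled version does not.
print(safe_nrm2([3e200, 4e200]))  # ~5e200
```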

For consistency with LAPACK, we have chosen to store $V$ and $\tau$, and generate $T$ when necessary. Although storing $T$ might save us some redundant computation, we felt that consistency was more important.
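To make the trade-off concrete, here is a sketch of how the triangular factor $T$ of the compact WY representation $Q = I - V T V^T$ can be regenerated from the stored Householder vectors and their scalars $\tau$, following the recurrence used by LAPACK's `DLARFT` (a plain-Python illustration; the function name and list-of-lists layout are ours, not LAPACK's):

```python
def form_T(V, tau):
    """Form the k x k upper triangular factor T of Q = I - V T V^T
    from the m x k matrix of Householder vectors V (one per column)
    and their scalars tau, column by column as in DLARFT."""
    m, k = len(V), len(V[0])
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        T[i][i] = tau[i]
        # w = V(:, 0:i)^T * v_i
        w = [sum(V[r][j] * V[r][i] for r in range(m)) for j in range(i)]
        # T(0:i, i) = -tau_i * T(0:i, 0:i) * w  (T is upper triangular,
        # so the inner sum runs only over l >= j)
        for j in range(i):
            T[j][i] = -tau[i] * sum(T[j][l] * w[l] for l in range(j, i))
    return T
```

Because each column of $T$ depends only on $V$, $\tau$, and earlier columns of $T$, nothing beyond $V$ and $\tau$ need ever be stored.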

The lower trapezoidal part of $A$, which is a sequence of the Householder vectors, will be accessed in the form
$$ V = \begin{pmatrix} V_1 \\ V_2 \end{pmatrix}, $$
where $V_1$ is unit lower triangular, and $V_2$ is rectangular.
In the sequential routine, the multiplication involving $V$ is divided
into two steps: `DTRMM` with $V_1$ and `DGEMM` with $V_2$.
However, in the parallel implementation, $V$ is contained in one column of
processes. Let $\tilde{V}$ be a unit lower trapezoidal matrix containing
the strictly lower trapezoidal part of $V$. $\tilde{V}$ is broadcast
rowwise to the other process columns so that every column of processes
has its own copy. This allows us to perform the operations involving
$\tilde{V}$ in one step (`DGEMM`), as illustrated in
Figure 7, and not worry about the upper
triangular part of the panel of $A$.
This one-step multiplication not only simplifies the implementation of
the routine (`PDLARFB`), but may, depending upon the BLAS
implementation, increase the overall performance
of the routine (`PDGEQRF`) as well.
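The effect of the broadcast copy can be illustrated with a toy example (plain Python lists standing in for the distributed panel; all names here are ours): zeroing the entries above the diagonal of the top block and placing ones on its diagonal yields a matrix for which a single general matrix product reproduces the `DTRMM` + `DGEMM` pair.

```python
def make_unit_lower_trapezoidal(V, nb):
    """Copy the m x nb panel V and make its top nb x nb block
    explicitly unit lower triangular (ones on the diagonal, zeros
    above it), leaving the rest of the panel untouched."""
    W = [row[:] for row in V]
    for i in range(nb):
        for j in range(i, nb):
            W[i][j] = 1.0 if j == i else 0.0
    return W

def matmul_T(A, B):
    """A^T @ B for list-of-lists matrices (stands in for one DGEMM)."""
    m, k, n = len(A), len(A[0]), len(B[0])
    return [[sum(A[r][i] * B[r][j] for r in range(m)) for j in range(n)]
            for i in range(k)]

# Panel as stored: the upper triangle of the top 2x2 block holds
# entries of R (9.0 and 8.0 here), not Householder vector data.
V = [[9.0, 9.0],
     [1.0, 8.0],
     [2.0, 3.0],
     [4.0, 5.0]]
W = make_unit_lower_trapezoidal(V, 2)
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
print(matmul_T(W, C))  # [[7.0, 3.0], [8.0, 4.0]], without touching R
```

The same product computed the sequential way would split $C$ into its top `nb` rows and the rest, apply a triangular multiply to the first part and a general multiply to the second, and add the results.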

**Figure 7:** The storage scheme of the lower trapezoidal matrix
in ScaLAPACK QR factorization.

**Figure 8:** Performance of the QR factorization on the Intel iPSC/860,
Delta, and Paragon.

Figure 8 shows the performance of the QR factorization routine on the Intel family of concurrent computers. The same block size was used on all of the machines, and best performance was attained by tuning the aspect ratio of the process grid. The highest performance obtained was 3.1 Gflops on the iPSC/860, 14.6 Gflops on the Delta, and 21.0 Gflops on the Paragon. Generally, the QR factorization routine has the best performance of the three factorizations, since the updating of the trailing matrix is rich in matrix-matrix operations, and its floating-point operation count ($\frac{4}{3} n^3$) is the largest of the three.

**Figure 9:** Performance of the Cholesky factorization
as a function of the block size for two process grid sizes
on the Intel Delta.
