To obtain the elementary Householder vector , the Euclidean norm of the vector, , is required. The sequential LAPACK routine, DLARFG, calls the Level-1 BLAS routine, DNRM2, which computes the norm while guarding against underflow and overflow. In the corresponding parallel ScaLAPACK routine, PDLARFG, each process in the column of processes, which holds the vector, , computes the global norm safely using the PDNRM2 routine.
For consistency with LAPACK, we have chosen to store and , and generate when necessary. Although storing might save us some redundant computation, we felt that consistency was more important.
The lower trapezoidal part of , which is a sequence of the Householder vectors, will be accessed in the form,
where is unit lower triangular, and is . In the sequential routine, the multiplication involving is divided into two steps: DTRMM with and DGEMM with . However, in the parallel implementation, is contained in one column of processes. Let be a unit lower trapezoidal matrix containing the strictly lower trapezoidal part of . is broadcast rowwise to the other process columns so that every column of processes has its own copy. This allows us to perform the operations involving in one step (DGEMM), as illustrated in Figure 7, and not worry about the upper triangular part of . This one step multiplication not only simplifies the implementation of the routine (PDLARFB), but may, depending upon the BLAS implementation, increase the overall performance of the routine (PDGEQRF) as well.
Figure 7: The storage scheme of the lower trapezoidal matrix in ScaLAPACK QR factorization.
Figure 8: Performance of the QR factorization on the Intel iPSC/860, Delta, and Paragon.
Figure 8 shows the performance of the QR factorization routine on the Intel family of concurrent computers. The block size of was used on all of the machines. Best performance was attained with an aspect ratio of . The highest performances of 3.1 Gflops for was obtained on the iPSC/860; 14.6 Gflops for on the Delta; and 21.0 Gflops for on the Paragon. Generally, the QR factorization routine has the best performance of the three factorizations since the updating process of is rich in matrix-matrix operation, and the number of floating point operations is the largest ().
Figure 9: Performance of the Cholesky factorization as a function of the block size on and processes of the Intel Delta ().