HPL Algorithm

This page provides a high-level description of the algorithm used in this package. As indicated below, HPL contains in fact many possible variants for various operations. Defaults could have been chosen, or even variants could be selected during the execution. Due to the performance requirements, it was decided to leave the user with the opportunity of choosing, so that an "optimal" set of parameters could easily be experimentally determined for a given machine configuration. From a numerical accuracy point of view, all possible combinations are rigorously equivalent to each other even though the result may slightly differ (bit-wise).


Main Algorithm

This software package solves a linear system of order n: A x = b by first computing the LU factorization with row partial pivoting of the n-by-n+1 coefficient matrix [A b] = [[L,U] y]. Since the lower triangular factor L is applied to b as the factorization progresses, the solution x is obtained by solving the upper triangular system U x = y. The lower triangular matrix L is left unpivoted and the array of pivots is not returned.

The data is distributed onto a two-dimensional P-by-Q grid of processes according to the block-cyclic scheme to ensure "good" load balance as well as the scalability of the algorithm. The n-by-n+1 coefficient matrix is first logically partitioned into nb-by-nb blocks, that are cyclically "dealt" onto the P-by-Q process grid. This is done in both dimensions of the matrix.
The right-looking variant has been chosen for the main loop of the LU factorization. This means that at each iteration of the loop a panel of nb columns is factorized, and the trailing submatrix is updated. Note that this computation is thus logically partitioned with the same block size nb that was used for the data distribution.

Panel Factorization

At a given iteration of the main loop, and because of the cartesian property of the distribution scheme, each panel factorization occurs in one column of processes. This particular part of the computation lies on the critical path of the overall algorithm. The user is offered the choice of three (Crout, left- and right-looking) matrix-multiply based recursive variants. The software also allows the user to choose in how many sub-panels the current panel should be divided into during the recursion. Furthermore, one can also select at run-time the recursion stopping criterium in terms of the number of columns left to factorize. When this threshold is reached, the sub-panel will then be factorized using one of the three Crout, left- or right-looking matrix-vector based variant. Finally, for each panel column the pivot search, the associated swap and broadcast operation of the pivot row are combined into one single communication step. A binary-exchange (leave-on-all) reduction performs these three operations at once.

Panel Broadcast

Once the panel factorization has been computed, this panel of columns is broadcast to the other process columns. There are many possible broadcast algorithms and the software currently offers 6 variants to choose from. These variants are described below assuming that process 0 is the source of the broadcast for convenience. "->" means "sends to". The rings variants are distinguished by a probe mechanism that activates them. In other words, a process involved in the broadcast and different from the source asynchronously probes for the message to receive. When the message is available the broadcast proceeds, and otherwise the function returns. This allows to interleave the broadcast operation with the update phase. This contributes to reduce the idle time spent by those processes waiting for the factorized panel. This mechanism is necessary to accomodate for various computation/communication performance ratio.


Look-ahead

Once the panel has been broadcast or say during this broadcast operation, the trailing submatrix is updated using the last panel in the look-ahead pipe: as mentioned before, the panel factorization lies on the critical path, which means that when the kth panel has been factorized and then broadcast, the next most urgent task to complete is the factorization and broadcast of the k+1 th panel. This technique is often refered to as "look-ahead" or "send-ahead" in the literature. This package allows to select various "depth" of look-ahead. By convention, a depth of zero corresponds to no lookahead, in which case the trailing submatrix is updated by the panel currently broadcast. Look-ahead consumes some extra memory to essentially keep all the panels of columns currently in the look-ahead pipe. A look-ahead of depth 1 (maybe 2) is likely to achieve the best performance gain.


Update

The update of the trailing submatrix by the last panel in the look-ahead pipe is made of two phases. First, the pivots must be applied to form the current row panel U. U should then be solved by the upper triangle of the column panel. U finally needs to be broadcast to each process row so that the local rank-nb update can take place. We choose to combine the swapping and broadcast of U at the cost of replicating the solve. Two algorithms are available for this communication operation. The user can select any of the two variants above. In addition, a mix is possible as well. The "binary-exchange" algorithm will be used when U contains at most a certain number of columns. Choosing at least the block size nb as the threshold value is clearly recommended when look-ahead is on.


Backward Substitution

The factorization has just now ended, the back-substitution remains to be done. For this, we choose a look-ahead of depth one variant. The right-hand-side is forwarded in process rows in a decreasing-ring fashion, so that we solve Q * nb entries at a time. At each step, this shrinking piece of the right-hand-side is updated. The process just above the one owning the current diagonal block of the matrix A updates first its last nb piece of x, forwards it to the previous process column, then broadcast it in the process column in a decreasing-ring fashion as well. The solution is then updated and sent to the previous process column. The solution of the linear system is left replicated in every process row.


Checking the Solution

To verify the result obtained, the input matrix and right-hand side are regenerated. Three residuals are computed and a solution is considered as "numerically correct" when all of these quantities are less than a threshold value of the order of 1.0. In the expressions below, eps is the relative (distributed-memory) machine precision.
[Home] [Contact] [Copyright and Licensing Terms] [Algorithm] [Scalability] [Performance Results] [Documentation] [Software] [FAQs] [Tuning] [Errata-Bugs] [References] [Related Links]