- Main Algorithm
- Panel Factorization
- Panel Broadcast
- Look-ahead
- Update
- Backward Substitution
- Checking the Solution

The data is distributed onto a two-dimensional P-by-Q grid of processes according to the block-cyclic scheme to ensure "good" load balance as well as the scalability of the algorithm. The n-by-n+1 coefficient matrix is first logically partitioned into nb-by-nb blocks, which are cyclically "dealt" onto the P-by-Q process grid. This is done in both dimensions of the matrix.
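As a concrete illustration, the block-cyclic mapping can be sketched in a few lines of Python; the helper names below are illustrative only and are not part of the HPL code base:

```python
# Sketch of the two-dimensional block-cyclic mapping described above.
# Global block (I, J) of the nb-by-nb partitioned matrix is owned by
# process (I mod P, J mod Q) of the P-by-Q grid.

def owner(I, J, P, Q):
    """Grid coordinates of the process owning global block (I, J)."""
    return (I % P, J % Q)

def local_block(I, J, P, Q):
    """Local block coordinates of global block (I, J) on its owner."""
    return (I // P, J // Q)

# With a 2-by-3 grid, the first row of blocks is dealt to process
# columns 0, 1, 2, 0, 1, 2, ...
P, Q = 2, 3
print([owner(0, J, P, Q) for J in range(6)])
```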

The right-looking variant has been chosen for the main loop of the LU factorization. This means that at each iteration of the loop a panel of nb columns is factorized, and the trailing submatrix is updated. Note that this computation is thus logically partitioned with the same block size nb that was used for the data distribution.
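The right-looking loop can be sketched serially in Python. Pivoting and the distributed data layout are deliberately omitted here, so this is only an outline of the computation under those simplifying assumptions, not HPL's implementation:

```python
# Serial sketch of the right-looking blocked LU loop: at each
# iteration a panel of nb columns is factorized, then the trailing
# submatrix is updated. No pivoting, so it is only safe for matrices
# such as diagonally dominant ones.

def lu_right_looking(A, nb):
    """In-place blocked LU (no pivoting): A becomes L\\U."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization: unblocked LU of columns k .. k+kb-1.
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # Update, part 1: triangular solve for the U12 block.
        for j in range(k + kb, n):
            for r in range(k, k + kb):
                for i in range(k, r):
                    A[r][j] -= A[r][i] * A[i][j]
        # Update, part 2: rank-kb update of the trailing submatrix.
        for i in range(k + kb, n):
            for j in range(k + kb, n):
                for c in range(k, k + kb):
                    A[i][j] -= A[i][c] * A[c][j]
    return A
```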

At a given iteration of the main loop, and because of the Cartesian property of the distribution scheme, each panel factorization occurs in one column of processes. This particular part of the computation lies on the critical path of the overall algorithm. The user is offered the choice of three matrix-multiply based recursive variants (Crout, left- and right-looking). The software also allows the user to choose how many sub-panels the current panel should be divided into during the recursion. Furthermore, one can also select at run-time the recursion stopping criterion in terms of the number of columns left to factorize. When this threshold is reached, the sub-panel is factorized using one of the three matrix-vector based variants (Crout, left- or right-looking). Finally, for each panel column the pivot search, the associated swap, and the broadcast operation of the pivot row are combined into one single communication step. A binary-exchange (leave-on-all) reduction performs these three operations at once.

**Increasing-ring**: 0 -> 1; 1 -> 2; 2 -> 3 and so on. This algorithm is the classic one; it has the caveat that process 1 has to send a message.

**Increasing-ring (modified)**: 0 -> 1; 0 -> 2; 2 -> 3 and so on. Process 0 sends two messages and process 1 only receives one message. This algorithm is almost always better, if not the best.

**Increasing-2-ring**: The Q processes are divided into two parts: 0 -> 1 and 0 -> Q/2; then processes 1 and Q/2 act as the sources of two rings: 1 -> 2, Q/2 -> Q/2+1; 2 -> 3, Q/2+1 -> Q/2+2 and so on. This algorithm has the advantage of reducing the time by which the last process receives the panel, at the cost of process 0 sending two messages.

**Increasing-2-ring (modified)**: As one may expect, first 0 -> 1; then the Q-1 remaining processes are divided into two equal parts: 0 -> 2 and 0 -> Q/2; processes 2 and Q/2 then act as the sources of two rings: 2 -> 3, Q/2 -> Q/2+1; 3 -> 4, Q/2+1 -> Q/2+2 and so on. This algorithm is probably the most serious competitor to the increasing-ring (modified) variant.

**Long (bandwidth reducing)**: as opposed to the previous variants, this algorithm and the next one synchronize all processes involved in the operation. The message is chopped into Q equal pieces that are scattered across the Q processes.

**Long (bandwidth reducing modified)**: same as above, except that 0 -> 1 first, and then the Long variant is used on processes 0,2,3,4 .. Q-1.
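The send patterns of the two ring variants above can be written down explicitly; these helper functions are illustrative only and not part of HPL:

```python
# Ordered (source, destination) pairs for broadcasting a panel within
# one row of Q processes, for the two increasing-ring variants.

def increasing_ring(Q):
    # 0 -> 1, 1 -> 2, 2 -> 3, ...
    return [(i, i + 1) for i in range(Q - 1)]

def increasing_ring_modified(Q):
    # 0 -> 1 first, then 0 -> 2, 2 -> 3, ...; process 0 sends two
    # messages and process 1 only receives one.
    if Q < 3:
        return increasing_ring(Q)
    return [(0, 1), (0, 2)] + [(i, i + 1) for i in range(2, Q - 1)]
```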

**Binary-exchange**: this is a modified variant of the binary-exchange (leave-on-all) reduction operation. Every process column performs the same operation. The algorithm essentially works as follows. It pretends to reduce the row panel U, but at the beginning the only valid copy is owned by the current process row. The other process rows contribute rows of A they own that should be copied into U, and replace them with rows that were originally in the current process row. The complete operation is performed in log(P) steps. For the sake of simplicity, let us assume that P is a power of two. At step k, process row p exchanges a message with process row p+2^k. There are essentially two cases. First, one of these two process rows has received U in a previous step: the exchange occurs, one process swaps its local rows of A into U, and both processes copy remote rows of A into U. Second, neither of these process rows has received U: the exchange occurs, and both processes simply add those remote rows to the list they have accumulated so far. At each step, a message of the size of U is exchanged by at least one pair of process rows.
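Assuming P is a power of two, the log(P) pairing pattern can be sketched as follows; at step k, process row p talks to p XOR 2^k, which is the p+2^k (or p-2^k) partner described above. This helper is illustrative, not HPL code:

```python
# Pairing schedule of a binary-exchange operation over P process rows
# (P a power of two): one list of disjoint partner pairs per step.

def exchange_partners(P):
    steps = []
    k = 1
    while k < P:
        # Each pair (p, p ^ k) is listed once, with the lower rank first.
        steps.append([(p, p ^ k) for p in range(P) if p < (p ^ k)])
        k *= 2
    return steps
```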

**Long**: this is a bandwidth-reducing variant that accomplishes the same task. The row panel is first spread (using a tree) among the process rows with respect to the pivot array. This is a scatter (the V variant, for MPI users). Locally, every process row then swaps these rows with the rows of A it owns that belong to U. These buffers are then rolled (P-1 steps) to finish the broadcast of U. Every process row permutes U and proceeds with the computational part of the update. A couple of notes: process rows are logarithmically sorted before spreading, so that processes receiving the largest number of rows come first in the tree. This makes the communication volume optimal for this phase. Finally, before rolling and after the local swap, an equilibration phase occurs during which the local pieces of U are uniformly spread across the process rows. A tree-based algorithm is used. This operation is necessary to keep the rolling phase optimal even when the pivot rows are not equally distributed among the process rows. The communication-volume complexity of this algorithm depends solely on the size of U; in particular, the number of process rows only impacts the number of messages exchanged. It will thus outperform the previous variant for large problems on large machine configurations.

To verify the result, the following scaled residual is computed; the solution is considered numerically correct when this quantity is of order one:

- || Ax - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )
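A minimal Python illustration of this check, using plain lists and the machine epsilon from sys.float_info (illustrative only, not HPL's driver code):

```python
import sys

def inf_norm_vec(v):
    # || v ||_oo : largest absolute entry.
    return max(abs(x) for x in v)

def inf_norm_mat(A):
    # || A ||_oo : largest absolute row sum.
    return max(sum(abs(x) for x in row) for row in A)

def scaled_residual(A, x, b):
    """|| Ax - b ||_oo / ( eps * ( ||A||_oo * ||x||_oo + ||b||_oo ) * n )"""
    n = len(A)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    eps = sys.float_info.epsilon
    denom = eps * (inf_norm_mat(A) * inf_norm_vec(x) + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / denom
```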