HPL Scalability Analysis

The machine model used for the analysis is first described. This crude model is then used to first estimate the parallel running time of the various phases of the algorithm namely Finally the parallel efficiency of the entire algorithm is estimated according to this machine model. We show that for a given set of parameters HPL is scalable not only with respect to the amount of computation, but also with respect to the communication volume.

The Machine Model

Distributed-memory computers consist of processors that are connected using a message passing interconnection network. Each processor has its own memory called the local memory, which is accessible only to that processor. As the time to access a remote memory is longer than the time to access a local one, such computers are often referred to as Non-Uniform Memory Access (NUMA) machines.

The interconnection network of our machine model is static, meaning that it consists of point-to-point communication links among processors. This type of network is also referred to as a direct network as opposed to dynamic networks. The latter are constructed from switches and communication links. These links are dynamically connected to one another by the switching elements to establish, at run time, the paths between processors memories.

The interconnection network of the two-dimensional machine model considered here is a static, fully connected physical topology. It is also assumed that processors can be treated equally in terms of local performance and that the communication rate between two processors depends on the processors considered.

Our model assumes that a processor can send or receive data on only one of its communication ports at a time (assuming it has more than one). In the literature, this assumption is also referred to as the one-port communication model.

The time spent to communicate a message between two given processors is called the communication time Tc. In our machine model, Tc is approximated by a linear function of the number L of double precision (64-bits) items communicated. Tc is the sum of the time to prepare the message for transmission (alpha) and the time (beta * L) taken by the message of length L to traverse the network to its destination, i.e.,

Tc = alpha + beta L.

Finally, the model assumes that the communication links are bi-directional, that is, the time for two processors to send each other a message of length L is also Tc. A processor can send and/or receive a message on only one of its communication links at a time. In particular, a processor can send a message while receiving another message from the processor it is sending to at the same time.

Since this document is only concerned with regular local dense linear algebra operations, the time taken to perform one floating point operation is assumed to be summarized by three constants gam1, gam2 and gam3. These quantitites are flop rates approximations of the vector-vector, matrix-vector and matrix-matrix operations for each processor. This very crude approximation summarizes all the steps performed by a processor to achieve such a computation. Obviously, such a model neglects all the phenomena occurring in the processor components, such as cache misses, pipeline startups, memory load or store, floating point arithmetic and so on, that may influence the value of these constants as a function of the problem size for example.

Similarly, the model does not make any assumption on the amount of physical memory per node. It is assumed that if a process has been spawn on a processor, one has ensured that enough memory was available on that processor. In other words, swapping will not occur during the modeled computation.

This machine model is a very crude approximation that is designed specifically to illustrate the cost of the dominant factors of our particular case.

Panel Factorization and Broadcast

Let consider an M-by-N panel distributed over a P-process column. Because of the recursive formulation of the panel factorization, it is reasonable to consider that the floating point operations will be performed at matrix-matrix multiply "speed". For every column in the panel a binary-exchange is performed on 2*N data items. When this panel is broadcast, what matters is the time that the next process column will spend in this communication operation. Assuming one chooses the increasing-ring (modified) variant, only one message needs to be taken into account. The execution time of the panel factorization and broadcast can thus be approximated by:

Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + beta 2 N ) + alpha + beta M N / P.

Trailing Submatrix Update

Let consider the update phase of an N-by-N trailing submatrix distributed on a P-by-Q process grid. From a computational point of view one has to (triangular) solve N right-hand-sides and perform a local rank-NB update of this trailing submatrix. Assuming one chooses the long variant, the execution time of the update operation can be approximated by:

Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) + alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.

The constant "3" in front of the "beta" term is obtained by counting one for the (logarithmic) spread phase and two for the rolling phase; In the case of bi-directional links this constant 3 should therefore be only a 2.

Backward Substitution

The number of floating point operations performed during the backward substitution in given by N^2 / (P*Q). Because of the lookahead, the communication cost can be approximated at each step by two messages of length NB, i.e., the time to communicate the NB-piece of the solution vector from one diagonal block of the matrix to another. It follows that the execution time of the backward substitution can be approximated by:

Tbacks( N, NB ) = gam2 N^2 / (P Q) + N ( alpha / NB + 2 beta ).

Putting it All Together

The total execution time of the algorithm described above is given by

Sum(k=0,N,NB)[Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB )] + Tbacks( N, NB ).

That is, by only considering only the dominant term in alpha, beta and gam3:

Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) + alpha N ((NB + 1) log(P) + P) / NB.

The serial execution time is given by Tser = 2 gam3 N^3 / 3. If we define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we obtain:

E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) + 3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).

This last equality shows that when the memory usage per processor N^2 / (P Q) is maintained constant, the parallel efficiency slowly decreases only because of the alpha term. The communication volume (the beta term) however remains constant. Due to these results, HPL is said to be scalable not only with respect to the amount of computation, but also with respect to the communication volume.

[Home] [Copyright and Licensing Terms] [Algorithm] [Scalability] [Performance Results] [Documentation] [Software] [FAQs] [Tuning] [Errata-Bugs] [References] [Related Links]