# 11.1 Load Balancing as an Optimization Problem

We have seen many times that parallel computing involves breaking problems into parts which execute concurrently. In the simple, regular problems of the early chapters, especially Chapters 4 and 6, it was usually reasonably obvious how to perform this breakup to optimize the performance of the program on a parallel machine. However, in Chapter 9, and even more so in Chapter 12, we will find problems where the nature of the parallelism is as clear as before, but where the breakup is nontrivial to implement efficiently.

Figure 11.1: Two Possible ``Energy Landscapes'' for an Optimization Problem

A few general remarks are necessary; we use the phrases ``load balancing'' and ``data decomposition'' interchangeably. One needs both ab initio and dynamic distribution and redistribution of data on the parallel machine. We can also examine load balancing at the level of data, or of tasks that encapsulate the data and algorithm. In elegant (but currently inefficient) software models with one datum per task, these formulations are equivalent. Our examples will do load balancing at the level of data values, but the task and data distribution problems are essentially equivalent.

Our methods are applicable to general loosely synchronous problems and indeed can be applied to arbitrary problem classes. However, we will choose a particular finite-element problem to illustrate the issues, where one needs to distribute a mesh such as that illustrated in Figure 11.2. Each triangle or element represents a task which communicates with its three neighboring triangles. In a simulation of fluid flow on the mesh, for example, each element communicates regularly with its neighbors, and this pattern may be repeated thousands of times.
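The repeated neighbor communication described above can be mimicked by a minimal sequential sketch, a simple Jacobi-style relaxation on the element graph. The function `relax` and its representation of the mesh as a neighbor list are illustrative assumptions, not code from the text; on a parallel machine, each processor would hold a subset of elements and exchange only boundary values with neighboring processors.

```python
def relax(values, neighbors, steps):
    """Jacobi-style relaxation on an element graph.

    values:    per-element value, one entry per element
    neighbors: neighbors[i] lists the indices of element i's neighbors
    steps:     number of relaxation sweeps

    On each sweep, every element replaces its value with the average
    of its neighbors' values -- the regular neighbor-communication
    pattern repeated many times in a fluid-flow simulation.
    """
    for _ in range(steps):
        values = [sum(values[j] for j in nbrs) / len(nbrs)
                  for nbrs in neighbors]
    return values
```

The cost of distributing such a mesh well comes from exactly this pattern: elements on the same processor communicate for free, while neighbor pairs split across processors generate message traffic on every sweep.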

Figure 11.2: An Unstructured Triangular Mesh Surrounding a Four-Element Airfoil. The mesh is distributed among 16 processors, with divisions shown by heavy lines.

We may classify load-balancing strategies into four broad types, depending on when the optimization is made and whether the cost of the optimization is included in the optimization itself:

• By Inspection: The load-balancing strategy may be determined by inspection, such as with a rectangular lattice of grid points split into smaller rectangles, so that the load-balancing problem is solved before the program is written. This is illustrated by the QCD decomposition of Figure 4.3.

• Static: The optimization is nontrivial, but may be done by a sequential machine before starting the parallel program, so that the load-balancing problem is solved before the parallel program begins.

• Quasi-Dynamic: The circumstances determining the optimal balance change during program execution, but discretely and infrequently. Because the change is discrete, the load-balancing problem, and hence its solution, remain the same until the next change. If these changes are infrequent enough, any savings made in the subsequent computation make up for the time spent solving the load-balancing problem. The difference between this and the static case is that the load balancing must be carried out in parallel to prevent a sequential bottleneck.

Koller calls these problems adiabatic [Fox:90nn], using a physical analogy in which the load balancer can be viewed as a heat bath keeping the problem in equilibrium. In adiabatic systems, changes are sufficiently slow that the heat bath can ``keep up,'' and the system evolves from equilibrium state to equilibrium state.

• Dynamic: The circumstances determining the optimal balance change frequently or continuously during execution, so that the cost of the load-balancing calculation after each change should be minimized in addition to optimizing the splitting of the actual calculation. This means a decision must be made periodically as to whether load balancing is necessary, and how much time to spend on it. The chess program in Section 14.3 shows very irregular dynamic behavior, so that statistical load-balancing methods similar to the scattered decomposition of Section 11.2 must be used.
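The simplest of the four cases, balancing by inspection, can be made concrete with a short sketch. The helper below (a hypothetical function, not from the text) splits a rectangular lattice of grid points into equal sub-rectangles, one per processor, exactly the kind of decomposition that can be fixed before the program is written:

```python
def block_decompose(nx, ny, px, py):
    """Split an nx-by-ny lattice of grid points into px*py rectangular
    blocks, one per processor.

    Returns a dict mapping processor number -> list of (i, j) points.
    Each point is assigned by integer division, so the blocks are as
    equal as the lattice dimensions allow.
    """
    assign = {}
    for i in range(nx):
        for j in range(ny):
            proc = (i * px // nx) * py + (j * py // ny)
            assign.setdefault(proc, []).append((i, j))
    return assign

# An 8x8 lattice on a 2x2 processor grid: four blocks of 16 points each.
blocks = block_decompose(8, 8, 2, 2)
```

For the static and quasi-dynamic cases, the partitioner itself becomes a nontrivial program; for the dynamic case, even deciding when to invoke it is part of the optimization.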

If the mesh is solution-adaptive, that is, if the mesh, and hence the load-balancing problem, change discretely during execution of the code, then it is most efficient to decide the optimal mesh distribution in parallel. In this section, three parallel algorithms, orthogonal recursive bisection (ORB), eigenvector recursive bisection (ERB) and a simple parallelization of simulated annealing (SA), are discussed for load-balancing a dynamic unstructured triangular mesh on 16 processors of an nCUBE machine.
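To make the first of these algorithms concrete, here is a minimal sequential sketch of orthogonal recursive bisection. It assumes each element is represented by its centroid coordinates and that the number of parts is a power of two; the recursive splitting at the median, alternating between the coordinate axes, is the essence of ORB, though a real parallel implementation would also perform the median search in parallel.

```python
def orb(points, n_parts, axis=0):
    """Orthogonal recursive bisection of a set of 2-D points.

    points:  list of (x, y) element centroids
    n_parts: number of pieces to produce (assumed a power of two)
    axis:    coordinate (0 or 1) on which to make the current cut

    Sorts the points along the current axis, cuts at the median, and
    recurses on each half with the axes swapped, so successive cuts
    are orthogonal to one another.
    """
    if n_parts == 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (orb(pts[:mid], n_parts // 2, 1 - axis) +
            orb(pts[mid:], n_parts // 2, 1 - axis))

# 64 centroids split into 16 parts of 4 points each.
parts = orb([(x, y) for x in range(8) for y in range(8)], 16)
```

Because each cut halves the point count, the loads are balanced by construction; the quality of the resulting communication pattern is what distinguishes ORB from the more expensive ERB and SA methods compared below.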

The test problem is a solution-adaptive Laplace solver, with an initial mesh of 280 elements, refined in seven stages to 5772 elements. We present execution times for the solver resulting from the mesh distributions using the three algorithms, as well as results on imbalance, communication traffic, and element migration.

In this section, we shall consider the quasi-dynamic case with observations on the time taken to do the load balancing that bear on the dynamic case. The testbed is an unstructured-mesh finite-element code, where the elements are the atoms of the problem, which are to be assigned to processors. The mesh is solution-adaptive, meaning that it becomes finer in places where the solution of the problem dictates refinement.

We shall show that a class of finite-element applications share common load-balancing requirements, and formulate load balancing as a graph-coloring problem. We shall discuss three methods for solving this graph-coloring problem: one based on statistical physics, one derived from a computational neural net, and one cheap and simple method.
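In such a graph-coloring formulation, each element (graph node) is assigned a color (processor), and the quality of a coloring is scored by a cost function trading off load imbalance against cut edges. The sketch below shows one common form of such a cost, with a sum of squared per-processor loads plus a weighted count of edges whose endpoints have different colors; the function name, the particular imbalance term, and the weight `mu` are illustrative assumptions, not the exact cost used by the methods in this chapter.

```python
def balance_cost(assignment, edges, n_procs, mu=1.0):
    """Score a candidate graph coloring for load balancing.

    assignment: dict mapping element -> processor (its "color")
    edges:      list of (a, b) pairs of neighboring elements
    n_procs:    number of processors
    mu:         weight of communication relative to imbalance

    Cost = sum of squared per-processor loads (minimized when loads
    are equal) + mu * number of cut edges (inter-processor messages).
    """
    load = [0] * n_procs
    for proc in assignment.values():
        load[proc] += 1
    imbalance = sum(n * n for n in load)
    cut = sum(1 for a, b in edges if assignment[a] != assignment[b])
    return imbalance + mu * cut
```

A cost of this shape is what a statistical-physics method such as simulated annealing minimizes directly, while the cheaper recursive-bisection methods can be seen as heuristics for the same objective.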

We present results from running these three load-balancing methods, both in terms of the quality of the graph-coloring solution (machine-independent results), and in terms of the particular machine (16 processors of an nCUBE) on which the test was run.