We wish to distribute the elements among the processors of the machine to minimize both load imbalance (one processor having more elements than another) and communication between elements.

Our approach here is to write down a cost function which is minimized when the total running time of the code is minimized and is reasonably simple and independent of the details of the code. We then minimize this cost function and distribute the elements accordingly.

The load-balancing problem [Fox:88a;88mm] may be stated as a graph-coloring problem: Given an
undirected graph of **N** nodes (finite elements), color these nodes with
**P** colors (processors) to minimize a cost function **H** which is
related to the time taken to execute the program for a given coloring.
For DIME applications, it is the finite elements which are to be
distributed among the processors, so the graph to be colored is
actually the dual graph to the mesh, where each graph node corresponds
to an element of the mesh and has (if it is not at a boundary) three neighbors.

We may construct the cost function as the sum of a part that minimizes load imbalance and one that minimizes communication:

\[
H = H_{calc} + \mu \, H_{comm}
\]

where **H_calc** is the part of the cost function which is minimized when
each processor has equal work, **H_comm** is minimal when communication time
is minimized, and **μ** is a parameter expressing the balance between the
two, with **μ** related to the number **c** discussed above. If **H_calc**
and **H_comm** were proportional to the times taken for calculation and
communication, then **μ** should be inversely proportional to **c**. For
programs with a great deal of calculation compared to communication,
**μ** should be small, and vice versa.

As **μ** is increased, the number of processors in use will decrease until eventually the communication is so costly that the entire calculation must be done on a single processor.

Let **e**, **f**, ... label the nodes of the graph, and **q(e)** be the
color (or processor assignment) of graph node **e**. Then the number of graph
nodes of color **q** is:

\[
N_q = \sum_{e} \delta_{q(e),\,q}
\]

and **H_calc** is proportional to the maximum value of **N_q**, because the whole calculation runs at the speed of the slowest processor, and the slowest processor is the one with the most graph nodes. This ignores node and link (node-to-node communication) contention, which also contributes to idle time.

The formulation as a maximum of **N_q** is, however, not satisfactory when a perturbation is added to the cost function, such as that from the communication cost function **H_comm**. If, for example, we were to add a linear forcing term proportional to one of the occupation numbers, say **N_1**, the cost function would be:

\[
H_{calc} = \max_q N_q + \epsilon \, N_1
\]

and the minimum of this perturbed cost function is either the balanced state, with every **N_q** equal to **N/P**, if **ε** is less than **1/(P-1)**, or the state with **N_1 = 0**, if **ε** is larger than this. This discontinuous behavior as a result of perturbations is undesirable, so we use a sum of squares instead, whose minima change smoothly with the magnitude of a perturbation:

\[
H_{calc} = \alpha \sum_{q} N_q^2
\]

where **α** is a scaling constant to be determined.
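
As a purely illustrative sketch, the sum-of-squares term can be tallied in one pass over a coloring. The names below (`color`, `calc_cost`, and so on) are ours, not DIME's; the scaling constant **α** is left to the caller.

```c
#include <stdlib.h>

/* Illustrative sketch: given a coloring color[0..nnodes-1] with values
   in 0..nprocs-1, return the unscaled load term  sum_q N_q * N_q.      */
double calc_cost(const int *color, int nnodes, int nprocs)
{
    int e, q;
    double h = 0.0;
    int *nq = calloc(nprocs, sizeof(int));   /* N_q, nodes of color q */

    for (e = 0; e < nnodes; e++)
        nq[color[e]]++;
    for (q = 0; q < nprocs; q++)
        h += (double) nq[q] * nq[q];

    free(nq);
    return h;
}
```

For a perfectly balanced coloring the returned value is **N²/P**.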

We now consider the communication part of the cost function. Let us define the matrix

\[
C_{qr} = \sum_{e \leftrightarrow f} \delta_{q(e),\,q} \, \delta_{q(f),\,r}
\]

which is the amount of communication between processors **q** and
**r**, and the notation **e ↔ f** means that the graph nodes
**e** and **f** are connected by an edge of the graph.
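
The matrix **C_qr** can likewise be accumulated in a single pass over the edge list of the dual graph. The sketch below assumes the edges are stored in two hypothetical arrays `edge_e[i]` and `edge_f[i]`; each edge contributes one unit of communication between the processors owning its endpoints.

```c
/* Illustrative sketch: accumulate C[q][r], the number of graph edges
   whose endpoints are colored q and r, into an nprocs x nprocs array. */
void comm_matrix(const int *color,
                 const int *edge_e, const int *edge_f, int nedges,
                 int nprocs, int *C /* nprocs*nprocs, zeroed by caller */)
{
    int i;
    for (i = 0; i < nedges; i++) {
        int q = color[edge_e[i]];
        int r = color[edge_f[i]];
        C[q * nprocs + r]++;
        if (q != r)
            C[r * nprocs + q]++;   /* keep the matrix symmetric */
    }
}
```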

The cost of communication from processor **q** to processor **r** depends on the
machine architecture; for some parallel machines it may be possible to write
down this metric explicitly. For example, with the early hypercubes, the cost
is the number of bits which are different in the binary representations of
the processor numbers **q** and **r** (see the short sketch after this paragraph). The metric may also depend on the
message-passing software, or even on the activities of other users for a
shared machine. A truly portable load balancer would have no option but to
send sample messages around and measure the machine metric, then distribute
the graph appropriately. In this book, however, we shall avoid the question
of the machine metric by simply assuming that all pairs of processors are
equally far apart, except of course a processor may communicate with itself
at no cost.
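
For the hypercube example, this metric is just the Hamming distance between the two processor numbers. The helper below is only an illustration of that metric, not part of any load balancer.

```c
/* Hamming distance between processor numbers q and r: the number of
   hypercube links a message traverses on an early hypercube.          */
int hypercube_distance(unsigned q, unsigned r)
{
    unsigned x = q ^ r;          /* bits in which q and r differ */
    int dist = 0;
    while (x) {
        dist += x & 1u;
        x >>= 1;
    }
    return dist;
}
```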

The cost of sending the quantity of data also depends on the programming: the cost will be much less if it is possible for the messages to be bundled together and sent as one, rather than separately. The major problem is latency: The cost to send a message in any distributed system is the sum of an initial fixed price and one proportional to the size of the message. This is also the case for the pricing of telephone calls, freight shipping, mail service, and many other examples from the everyday world. If the message is large enough, we may ignore latency: For the nCUBE used in Section 11.1.7 of this book, latency may be ignored if the message is longer than a hundred bytes or so. In the tests of Section 11.1.7, most of the messages are indeed long enough to neglect latency, though there is certainly further work needed on load balancing in the presence of this important effect. We also ignore blocking (idling) due to needed resources being unavailable due to contention.
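
In other words, the communication time is well described by a linear model. Writing **t₀** for the fixed start-up (latency) cost and **B** for the asymptotic bandwidth (our notation, not the book's), the time to send a message of **m** bytes is roughly

\[
T(m) \approx t_0 + \frac{m}{B}
\]

and latency may be neglected once **m/B** is much larger than **t₀**.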

The result of this discussion is that we shall assume that the cost of
communicating the quantity of data **C_qr** is proportional to **C_qr**,
unless **q = r**, in which case the cost is zero. This is a good
assumption on many new machines, such as the Intel Touchstone series.

We shall now make the assumption that the total communication cost is the sum of the individual communications between processors:

\[
H_{comm} = \beta \sum_{q \neq r} C_{qr}
\]

where **β** is a constant to be determined. Notice that any overlap between calculation and communication is ignored. Here, we have ignored ``global'' contributions to **H_comm**, such as collective communication (global sums or reductions) mentioned in Section 11.1.1.
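
With the uniform metric assumed here, **H_comm** is, up to the constant **β**, just the number of dual-graph edges whose endpoints are colored differently, so it can be evaluated without forming **C_qr** explicitly. The sketch below (again our own illustration, reusing the hypothetical edge-list arrays from before) counts those cut edges.

```c
/* Illustrative sketch: unscaled communication term, the number of
   graph edges whose two endpoints have different colors.              */
double comm_cost(const int *color,
                 const int *edge_e, const int *edge_f, int nedges)
{
    int i;
    double h = 0.0;
    for (i = 0; i < nedges; i++)
        if (color[edge_e[i]] != color[edge_f[i]])
            h += 1.0;
    return h;
}
```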

Substituting the expression for **N_q**, the expression for the load-balance cost function simplifies to

\[
H_{calc} = \alpha \sum_{e,\,f} \delta_{q(e),\,q(f)}
\]

The assumptions made to derive this cost function are significant. The most serious deviation from reality is neglecting the parallelism of communication: since only the total communication is minimized, a minimum of this cost function could in principle have grossly unbalanced communication loads. In practice this does not happen, because when the mesh is equally balanced there is a lower limit to the amount of boundary each processor can have, analogous to a bubble having minimal surface area for a fixed volume; if we then minimize the sum of the surface areas of a set of bubbles of equal volumes, each surface must be separately minimized, and they come out equal.

We may now choose the scaling constants **α** and **β**. A
convenient choice is such that the optimal **H_calc** and **H_comm**
have contributions of about unit size from each processor, which gives
**α = (P/N)²**; the form of the scaling constant **β** is
**(P/N)^((d-1)/d)** because the surface area of a
compact shape in **d** dimensions varies as the **d-1** power of the size,
while volume varies as the **d** power. The final form for **H** is

\[
H = \left(\frac{P}{N}\right)^{2} \sum_{e,\,f} \delta_{q(e),\,q(f)}
  + \mu \left(\frac{P}{N}\right)^{(d-1)/d} \sum_{e \leftrightarrow f} \left(1 - \delta_{q(e),\,q(f)}\right)
\tag{11.9}
\]

where **d** is the dimensionality of the mesh from which the graph came.
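
As a final sketch, Equation 11.9 can be evaluated directly from a coloring and an edge list, using the equivalent **Σ_q N_q²** and cut-edge forms of the two terms. Here `mu` is the balance parameter **μ**, `d` is the mesh dimensionality, and all of the names are ours rather than DIME's.

```c
#include <math.h>
#include <stdlib.h>

/* Illustrative sketch of the scaled cost function of Equation 11.9:
   H = (P/N)^2 * sum_q N_q^2
     + mu * (P/N)^((d-1)/d) * (number of cut edges).                    */
double total_cost(const int *color, int nnodes,
                  const int *edge_e, const int *edge_f, int nedges,
                  int nprocs, double mu, int d)
{
    int e, q, i;
    double hcalc = 0.0, hcomm = 0.0;
    double pn = (double) nprocs / (double) nnodes;
    int *nq = calloc(nprocs, sizeof(int));

    for (e = 0; e < nnodes; e++)          /* N_q for each processor */
        nq[color[e]]++;
    for (q = 0; q < nprocs; q++)
        hcalc += (double) nq[q] * nq[q];
    free(nq);

    for (i = 0; i < nedges; i++)          /* edges cut by the coloring */
        if (color[edge_e[i]] != color[edge_f[i]])
            hcomm += 1.0;

    return pn * pn * hcalc
         + mu * pow(pn, (double)(d - 1) / (double) d) * hcomm;
}
```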

The formalism of this section has a simple physical interpretation
[Fox:86a;88kk;88mm;88tt;88uu], which we introduce
here and discuss further in Section 11.2. The data points
(tasks) to be distributed can be thought of as particles moving around
in the discrete space formed by the processors. This physical system
is controlled by the Hamiltonian (energy function) given in
Equation 11.9. The two terms in the Hamiltonian have simple
physical meanings illustrated in Figure 11.3. The first term
in Equation 11.9 ensures equal work per node and is a
short-range repulsive force trying to push particles away if they land
in the same node. The second term in Equation 11.9 is a
long-range attractive force which links ``particles'' (data points)
which communicate with each other. This force tries to pull particles
together (into the same node) with a strength proportional to the
information needed to be communicated between them. In general, this
communication force depends on the architecture of the interconnect of
the parallel machine, although Equation 11.9 has assumed a
simple form for this. The analogy is preserved in general with the MPP
interconnect architecture translating into a topology for the discrete
space formed by the processors in the analogy. This topology implies a
distance dependence force for the communication term in **H**.
We can also extend the discussion to include the cost of moving data
between processors to rebalance a dynamically changing problem. This
migration cost becomes a third force attracting each particle to the
processor in which it currently resides. Figure 11.3
illustrates these three forces.
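
In this picture, an incremental balancer evaluates a trial move by asking how **H** changes when a single particle hops to another processor, which needs only the local occupation counts and the particle's neighbor list. The sketch below is our own illustration of such a ΔH evaluation, assuming the scaled form of Equation 11.9 above; the penalty `nu` stands in for the third, migration force and is likewise only illustrative.

```c
#include <math.h>

/* Illustrative sketch: change in H if graph node e moves from its
   current processor a = color[e] to trial processor b.  nq[] holds the
   current counts N_q; nbr[] lists e's nnbr neighbors in the graph;
   nu is an optional migration penalty (the "stay-put" force).          */
double delta_cost(int e, int b, const int *color, const int *nq,
                  const int *nbr, int nnbr,
                  int nnodes, int nprocs, double mu, int d, double nu)
{
    int a = color[e], k;
    double pn = (double) nprocs / (double) nnodes;
    double dcalc, dcomm = 0.0;

    if (a == b)
        return 0.0;

    /* load term: (N_a - 1)^2 - N_a^2 + (N_b + 1)^2 - N_b^2 */
    dcalc = 2.0 * (nq[b] - nq[a]) + 2.0;

    /* communication term: edges to neighbors in a become cut,
       edges to neighbors in b become internal                          */
    for (k = 0; k < nnbr; k++) {
        if (color[nbr[k]] == a)      dcomm += 1.0;
        else if (color[nbr[k]] == b) dcomm -= 1.0;
    }

    return pn * pn * dcalc
         + mu * pow(pn, (double)(d - 1) / (double) d) * dcomm
         + nu;   /* cost of moving e off its present processor */
}
```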

**Figure:** Sixteen Data Points Distributed Optimally on Four Processors,
Illustrating the Physical Analogy of Section 11.3.
We take a simple two-dimensional mesh connection for the particles.

Note that the load-balancing problem becomes that of finding the equilibrium state of a system of particles with a ``conflict'' between short-range repulsive (hardcore) and long-range attractive forces. This scenario is qualitatively similar to classical atomic physics problems and leads one to expect that the physically based optimization methods could be effective. This physical analogy is extended in Section 11.2 where we show that the physical system exhibits effects that can be associated with temperature and phase transitions. We also indicate how it needs to be extended for problems with microscopic structure in their temporal properties.
