
Back to Example

In Figure 1, we sketch a valid tiling for Example 1. The matrix $H$ that defines the tiles is the one derived using the scalable communication-to-computation criterion of Boulet et al. [3].
We check that the tiling is valid, i.e., that $Hd \geq 0$ componentwise for every dependence vector $d$. Note that the volume of the tile, which represents the number of computations per tile, is given by the determinant of $H^{-1}$: $V_{\mathrm{tile}} = \det(H^{-1}) = 1/\det(H)$. As for communications, each tile sends one message to its right neighbor, one message to the tile below it, and a third message to its diagonal neighbor.

Note that the third message (the diagonal communication) may be routed horizontally and then vertically, or the other way around; it may even be combined with either of the first two messages.
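
As a small illustration of the validity check and of the tile-volume formula, here is a sketch in Python; the matrices H and D below are hypothetical placeholders, not the actual matrices of Example 1, and the per-face communication volumes are not computed.

import numpy as np

# Hypothetical tiling matrix H (its rows are the normals to the tile faces)
# and dependence matrix D (its columns are the dependence vectors).
# These values are illustrative only, not those of Example 1.
H = np.array([[0.25, 0.0],
              [0.0,  0.25]])
D = np.array([[1, 0, 1],
              [0, 1, 1]])

# Validity: every dependence must have a nonnegative component along every
# face normal, i.e. H D >= 0 componentwise.
assert np.all(H @ D >= 0), "invalid tiling: some dependence crosses a face backwards"

# Tile volume = number of computations per tile = det(H^{-1}) = 1 / |det(H)|.
tile_volume = 1.0 / abs(np.linalg.det(H))
print("tile volume =", round(tile_volume), "computations")   # 16 for this H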

Figure 1: Optimal tiling for a given computation volume $V_{\mathrm{tile}}$.

It is important to point out that the dependences between tiles are summarized by the vector pair
\[ \left\{ \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\}. \]
In other words, the computation of a tile cannot be started before both its left and upper neighbor tiles have been executed.
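
As a minimal sketch of this precedence constraint, the code below computes the earliest start time of each tile, assuming one time unit per tile and no communication delay (both assumptions are illustrative simplifications); the result is the familiar wavefront in which all tiles on the same anti-diagonal can run in parallel.

# Tile (I, J) may start only after tiles (I-1, J) and (I, J-1) have completed.
def earliest_start(num_rows, num_cols):
    start = {}
    for i in range(num_rows):
        for j in range(num_cols):
            left  = start.get((i, j - 1), -1)   # left neighbor, if any
            upper = start.get((i - 1, j), -1)   # upper neighbor, if any
            start[(i, j)] = max(left, upper) + 1
    return start

times = earliest_start(4, 4)
assert all(t == i + j for (i, j), t in times.items())   # wavefront schedule
print(times[(3, 3)])   # 6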

As stated above, the atomicity constraint implies that inter-processor communications take place only at the end of the processing of each tile. Note that current architectures do allow communications and computations to overlap, and the atomicity constraint does not prevent a processor from communicating the boundary data of one tile (whose execution it has just completed) while starting the computation of another tile. Also, minimizing communication start-up overheads is a ``sine qua non'' condition for achieving good performance. Even though sophisticated routing strategies are designed and implemented in hardware, communication start-up costs remain very expensive compared with the time to transfer one data item (and even more so compared with the time to perform one computation). Frequent exchanges of short messages should therefore be replaced by fewer sends and receives of longer messages. To summarize, in the context of distributed-memory architectures, tiling is a technique that optimizes communications while increasing the granularity of computations.
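
To make the argument about start-up overheads concrete, here is a small sketch under the commonly used linear communication-cost model (a fixed start-up latency plus a per-item transfer time); the values of beta and tau below are hypothetical, not measurements, but they reflect the regime in which start-up costs dominate.

beta = 100.0   # hypothetical start-up cost, in units of the per-item time
tau  = 1.0     # per-item transfer time

def send_cost(num_messages, items_per_message):
    """Total cost of sending equally sized messages under a linear model."""
    return num_messages * (beta + items_per_message * tau)

total_items = 1024
many_short = send_cost(num_messages=64, items_per_message=total_items // 64)
one_long   = send_cost(num_messages=1,  items_per_message=total_items)
print("64 short messages:", many_short)   # 64 * (100 + 16)  = 7424.0
print("1 long message:   ", one_long)     # 1  * (100 + 1024) = 1124.0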

While well suited to a data-parallel approach, the atomicity constraint is unnecessarily restrictive when targeting VLSI processor arrays. In this context, tiling is closer to a data-partitioning approach, and tiles are entities that are mapped to processors. Communications can occur as soon as the data are available, even while the last computational points of the current tile are being processed. Each tile still executes the same program, up to a translation in time. This approach has been chosen by several authors, including [9, 8, 7, 20, 10, 19]. In several of these papers, resource constraints on the number of communication links and the fixed topology of the communication network are introduced, leading to complex optimization problems that are solved via Integer Linear Programming (ILP) techniques. We do not present further details of these results, since we target distributed-memory machines and therefore want to enforce the atomicity constraint.


