Next: Literature overview Up: Tiling with limited resources Previous: Cover Page Information

Introduction

Tiling is a widely used technique to increase the granularity of computations and the locality of data references. This technique is restricted to perfect loop nests with uniform dependences, which we define as in Banerjee [2] (see also the examples below). The basic idea is to group elemental computation points into tiles that will be viewed as computational units. The larger the tiles, the more efficient the computations are performed using state-of-the-art processors with pipelined arithmetic units and a multi-level memory hierarchy (this is illustrated by recasting numerical linear algebra algorithms in terms of blocked Level 3 BLAS kernels [11, 6]). Also, another advantage of tiling is the decrease of the communication time (which is proportional to the surface of the tile) relatively to the computation time (which is proportional to the volume of the tile). The price to pay for tiling may be an increased latency (if there are data dependences, say, we need to wait for the first processor to complete the whole execution of the first tile before another processor can start the execution of the second one, and so on), as well as some load-imbalance problems (the larger the tile, the more difficult to distribute computations equally among the processors).

Tiling has been studied by several authors and in different contexts. We provide a short review of the existing literature in Section 2.1. Basically, most of the work amounts to partitioning the computation domain of a uniform loop nest into tiles whose shape and size are optimized according to some criteria. Little attention has been paid to distributing the tiles to physical processors and to computing the final scheduling. For example, if each physical processor is assigned several tiles, what should be the computation ordering of these tiles? An in-depth study has been presented by Ohta et al [14], who have extended results of Hiranandani et al. [12] on fine grain pipelining for DOACROSS loops. We survey their work in Section 2.2.

In this paper, we build upon the work of Ohta et al [14]. We reformulate the problem of tiling with limited resources using more realistic assumptions on data dependences and communication-computation overlap than those used in [14], and we provide new results on the ``optimal'' tile size. We also derive an optimal mapping to assign tiles to physical processors. All these results are presented in Section 3. Finally, we state some conclusions in Section 4.

Next: Literature overview Up: Tiling with limited resources Previous: Cover Page Information

Jack Dongarra
Sat Feb 8 08:17:58 EST 1997