Speedup is defined as the ratio of sequential running time to parallel running time. We measure the speedup of our program by timing it directly with different numbers of processors on a standard suite of test searches. These searches are done from the even-numbered Bratko-Kopec positions [Bratko:82a], a well-known set of positions for testing chess programs. Our benchmark consists of doing two successive searches from each position and adding up the total search time for all 24 searches. By varying the depth of search, we can control the average search time of each benchmark.
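As a minimal sketch of this definition, the speedup on p processors can be computed as the total benchmark time on one processor divided by the total benchmark time on p processors. The function names and the timing values below are hypothetical placeholders chosen for illustration; they are not measurements from our program.

```python
def benchmark_total(search_times):
    """Total benchmark time: the sum of the individual search times."""
    return sum(search_times)


def speedup(sequential_times, parallel_times):
    """Ratio of sequential running time to parallel running time."""
    return benchmark_total(sequential_times) / benchmark_total(parallel_times)


# Placeholder timings in seconds for the 24 searches (illustrative only,
# not measured data).
seq = [10.0] * 24   # hypothetical single-processor times
par = [2.5] * 24    # hypothetical multi-processor times

print(speedup(seq, par))  # 4.0
```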
The speedups we measured are shown in Figure 14.8. Each curve corresponds to a different average search time. We find that speedup depends strongly on the time of the search (or equivalently, its depth), because deeper search trees have more potential parallelism and hence yield more speedup. Our main result is that at tournament speed (the uppermost curve of the figure), our program achieves a speedup of 101 out of a possible 256. Not shown in this figure is our later result: a speedup estimated to be 170 on a 512-node machine.
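To put these figures in perspective, one can compute the parallel efficiency, conventionally defined as speedup divided by the number of processors. The short sketch below applies that standard definition to the two speedups quoted above; the helper name `efficiency` is ours, and the 512-node speedup is the estimate stated in the text, not a separately measured value.

```python
def efficiency(speedup, processors):
    """Parallel efficiency: the fraction of ideal (linear) speedup achieved."""
    return speedup / processors


# Speedups quoted in the text, keyed by machine size.
reported = {256: 101, 512: 170}

for processors, s in reported.items():
    print(f"{processors} processors: speedup {s}, "
          f"efficiency {efficiency(s, processors):.2f}")
# 256 processors: speedup 101, efficiency 0.39
# 512 processors: speedup 170, efficiency 0.33
```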
Figure 14.8: The Speedup of the Parallel Chess Program as a Function of Machine Size and Search Depth. The results are averaged over a representative test set of 24 chess positions. The speedup increases dramatically with search depth, corresponding to the fact that there is more parallelism available in larger searches. The uppermost curve corresponds to tournament play: the program runs more than 100 times faster on 256 nodes than on a single nCUBE node when playing at tournament speed.
The "double hump" shape of the curves is also understood. The first dip, at 16 processors, occurs where the chess tree would favor a one-level processor hierarchy at some times and a two-level hierarchy at others. We always use a one-level hierarchy for 16 processors, so we are suboptimal here. Perhaps this indicates that a more flexible processor allocation scheme could do somewhat better.