The classic quicksort algorithm is a divide-and-conquer sorting method ([Hoare:62a], [Knuth:73a] pp.118-23). As such, it would seem to be amenable to a concurrent implementation, and with a slight modification (actually an improvement of the standard algorithm) this turns out to be the case.

The standard algorithm begins by picking some item from the list and using this as the splitting key. A loop is entered which takes the splitting key and finds the point in the list where this item will ultimately end up once the sort is completed. This is the first splitting point. While this is being done, all items in the list which are less than the splitting key are placed on the low side of the splitting point, and all higher items are placed on the high side. This completes the first divide. The list has now been broken into two independent lists, each of which still needs to be sorted.
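The partitioning step just described can be sketched as follows. This is a minimal single-machine illustration, not the authors' code; the choice of the last item as the splitting key and the function names are assumptions made for the example.

```python
# A minimal sketch of the standard sequential quicksort described above.
# The splitting key is taken from the list; after partitioning, it sits
# at its final sorted position (the splitting point), with all smaller
# items on the low side and all larger items on the high side.
def quicksort(items, lo=0, hi=None):
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:
        return
    pivot = items[hi]          # pick the splitting key
    split = lo                 # candidate splitting point
    for i in range(lo, hi):
        if items[i] < pivot:   # low items go to the low side
            items[i], items[split] = items[split], items[i]
            split += 1
    items[split], items[hi] = items[hi], items[split]  # place the key
    quicksort(items, lo, split - 1)   # sort the two independent lists
    quicksort(items, split + 1, hi)

data = [5, 2, 9, 1, 7, 3]
quicksort(data)
print(data)  # [1, 2, 3, 5, 7, 9]
```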

The essential idea of the concurrent (hypercube) quicksort is the same. The
first splitting key is chosen (a global step to be described below) and then
the entire list is split, in parallel, between two halves of the hypercube.
All items higher than the splitting key are sent in one direction in the
hypercube, and all items less are sent the other way. The procedure is then
called recursively, splitting each of the subcubes' lists further. As in
Shellsort, the ring-based labelling of the hypercube is used to define global
order. Once **d** splits occur, there remain no further interprocessor splits
to do, and the algorithm continues by switching to the internal quicksort
mentioned earlier. This is illustrated in Figure 12.33.

**Figure 12.33:** An Illustration of the Parallel Quicksort
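The hypercube splitting can be sketched schematically in a single process. The names, the median splitting key, and the even redistribution within each subcube are illustrative assumptions, not the authors' implementation; real message passing is replaced by list manipulation.

```python
# Schematic single-process sketch of the hypercube quicksort splits.
# A subcube of processors pools its items around a splitting key:
# items below the key go to the low half of the subcube, the rest go
# the other way. Each half recurses until single processors remain,
# which then finish with an internal sort.
import statistics

def hypercube_quicksort(sublists):
    def split(procs):
        if len(procs) == 1:
            sublists[procs[0]].sort()          # internal quicksort
            return
        pool = [x for p in procs for x in sublists[p]]
        key = statistics.median(pool) if pool else 0  # idealized splitting key
        low = [x for x in pool if x < key]
        high = [x for x in pool if x >= key]
        half = len(procs) // 2
        for side, items in ((procs[:half], low), (procs[half:], high)):
            for p in side:
                sublists[p] = []
            for i, x in enumerate(items):      # spread items over the subcube
                sublists[side[i % len(side)]].append(x)
        split(procs[:half])                    # recurse on each subcube
        split(procs[half:])
    split(list(range(len(sublists))))
    return sublists

lists = [[5, 9], [2, 7], [8, 1], [3, 6]]       # d = 2: four processors
hypercube_quicksort(lists)
print(lists)  # concatenating in processor order yields the sorted list
```

After **d** levels of splits, reading the sublists in processor-label order gives the globally sorted list, which is the role the ring-based labelling plays in the text.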

So far, we have concentrated on standard quicksort. For quicksort to work well, even on sequential machines, it is essential that the splitting points land somewhere near the median of the list. If this isn't true, quicksort behaves poorly, the usual example being the quadratic time that standard quicksort takes on almost-sorted lists. To counteract this, it is a good idea to choose the splitting keys with some care so as to make evenhanded splits of the list.
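One common sequential remedy, offered here as an illustration rather than the text's own prescription, is the median-of-three rule: take the median of the first, middle, and last items as the splitting key, which defeats the quadratic behaviour on almost-sorted lists.

```python
# Median-of-three splitting-key selection (an illustrative assumption):
# sample the first, middle, and last items and use their median, so an
# almost-sorted list still yields a near-median splitting key.
def median_of_three(items, lo, hi):
    mid = (lo + hi) // 2
    a, b, c = items[lo], items[mid], items[hi]
    return sorted([a, b, c])[1]   # middle value of the three samples

print(median_of_three([1, 2, 3, 4, 5], 0, 4))  # 3, near the true median
```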

**Figure 12.34:** Efficiency Data for the Parallel Quicksort described in the
Text. The curves are labelled as in Figure 12.30 and
plotted against the logarithm of the number of items to be sorted.

This becomes much more important on the concurrent computer. In this case,
if the splits are done haphazardly, not only will an excessive number of
operations be necessary, but large load imbalances will also occur.
Therefore, in the concurrent algorithm, the splitting keys are chosen with
some care. One reasonable way to do this is to randomly sample a subset of
the entire list (giving an estimate of the true distribution of the list) and
then pick splitting keys based upon this sample. To save time, all
splitting keys are found at once. This modified algorithm should perhaps be
called *samplesort* and consists of the following steps:

- each processor picks a sample of **l** items at random;
- sort the sample of items using the parallel Shellsort;
- choose splitting keys as if this sample were the entire list;
- broadcast the splitting keys so that all processors know all splitting keys;
- perform the splits in the **d** directions of the hypercube;
- each processor quicksorts its sublist.
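The key-selection steps above can be sketched as follows. This is a single-process illustration under stated assumptions: the function name is invented, the combined sample is sorted with the built-in sort rather than the parallel Shellsort, and keys are read off at evenly spaced positions in the sample.

```python
# Samplesort key selection (illustrative sketch). Each "processor"
# contributes a random sample of l items; the combined sample is
# sorted, and P-1 splitting keys are read off at evenly spaced
# positions, as if the sample were the entire list.
import random

def choose_splitting_keys(sublists, l):
    sample = []
    for sub in sublists:                  # each processor samples l items
        sample.extend(random.sample(sub, min(l, len(sub))))
    sample.sort()                         # in the text: parallel Shellsort
    P = len(sublists)
    step = len(sample) / P
    # P-1 keys, one per boundary between the P processors
    return [sample[int((i + 1) * step)] for i in range(P - 1)]

sublists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(choose_splitting_keys(sublists, 4))  # [5, 9, 13]
```

In the full algorithm these keys would then be broadcast to all processors and used to route each item along the **d** hypercube directions to its destination.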

Times and efficiencies for the parallel quicksort algorithm are shown in
Table 12.4. The efficiencies are also plotted in
Figure 12.34. In some cases, the parallel quicksort outperforms
the already high performance of the parallel Shellsort discussed earlier.
There are two main sources of inefficiency in this algorithm. The first is a
result of the time wasted sorting the sample. The second is due to remaining
load imbalance in the splitting phases. By varying the sample size **l**, we
achieve a trade-off between these two sources of inefficiency. Chapter 18 of
[Fox:88a] contains more details regarding the choice of **l** and other
ways to compute splitting points.

Before closing, it may be noted that there is another way of thinking about the parallel quicksort/samplesort algorithm. It can be regarded as a bucketsort, in which each processor of the hypercube constitutes one bucket. In the splitting phase, one attempts to determine reasonable limits for the buckets so that approximately equal numbers of items will end up in each bucket. The splitting process can be thought of as an optimal routing scheme on the hypercube which brings each item to its correct bucket. So, our version of quicksort is also a bucketsort in which the bucket limits are chosen dynamically to match the properties of the particular input list.

The sorting work began as a collaboration between Steve Otto and summer students Ed Felten and Scott Karlin. Ed Felten invented the parallel Shellsort; Felten and Otto developed the parallel Quicksort.
