Clusters

The adoption of clusters, collections of workstations or PCs connected by a local network, has grown explosively since the introduction of the first Beowulf cluster in 1994. The attraction lies in the (potentially) low cost of both hardware and software and in the control that builders and users have over their system. The interest in clusters can be seen, for instance, from the active IEEE Task Force on Cluster Computing (TFCC), which reviews the current status of cluster computing on a regular basis [43]. Books on how to build and maintain clusters have also greatly added to their popularity (see, e.g., [41] and [33]). As the cluster scene has become relatively mature and an attractive market, large HPC vendors as well as many start-up companies have entered the field and offer more or less ready out-of-the-box cluster solutions for those groups that do not want to build their cluster from scratch.

The number of vendors that sell cluster configurations has become so large that it is not sensible to include all these products in this report. In addition, there is generally a large difference in usage between clusters and the more integrated counterparts that we discuss in the following sections: clusters are mostly used for capability computing, while the integrated machines are primarily used for capacity computing. The first mode of usage means that the system is employed for one or a few programs for which no alternative is readily available in terms of computational capabilities. The second means that the system is employed to the full, with most of its available cycles used by many, often very demanding, applications and users. Traditionally, vendors of large supercomputer systems have learned to provide for this last mode of operation, as the precious resources of their systems had to be used as effectively as possible. By contrast, Beowulf clusters are mostly operated through the Linux operating system (a small minority using Microsoft Windows), and these operating systems either lack the tools needed to use a cluster well for capacity computing or offer tools that are still relatively immature. However, as clusters become on average both larger and more stable, there is a trend to use them as computational capacity servers as well.

Reference [39] looks at some of the necessary conditions for this kind of use, such as the available cluster management tools and batch systems. The same study also assessed the performance on an application workload, both on a RISC-based (Compaq Alpha) configuration and on Intel Pentium III based systems. An important, but not very surprising, conclusion was that the speed of the network is very important in all but the most compute-bound applications. Another notable observation was that using compute nodes with more than one CPU may be attractive from the point of view of compactness and (possibly) energy and cooling, but that performance can be severely degraded because more CPUs have to draw on a common node memory. The memory bandwidth of the nodes is in this case not up to the demands of memory-intensive applications.
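The effect of several CPUs sharing one node memory can be illustrated with a very rough model. The sketch below is not taken from [39]; it assumes that the node memory bandwidth is divided evenly among the CPUs and that a CPU can never run faster than its share of the bandwidth allows. The function name `per_cpu_rate` and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def per_cpu_rate(peak_flops, node_bandwidth, cpus, bytes_per_flop):
    """Attainable flop rate of one CPU when `cpus` CPUs share one node memory.

    peak_flops      peak rate of a single CPU (flop/s)
    node_bandwidth  total memory bandwidth of the node (bytes/s)
    cpus            number of CPUs drawing on that memory
    bytes_per_flop  memory traffic the application needs per flop
    """
    # Assumption: the bandwidth is split evenly over the CPUs.
    share = node_bandwidth / cpus
    # A CPU is limited either by its own peak or by what memory can feed it.
    return min(peak_flops, share / bytes_per_flop)


if __name__ == "__main__":
    PEAK = 1e9        # hypothetical: 1 Gflop/s per CPU
    BANDWIDTH = 1e9   # hypothetical: 1 GB/s per node
    for label, intensity in (("memory-intensive", 8.0), ("compute-bound", 0.1)):
        for cpus in (1, 2):
            rate = per_cpu_rate(PEAK, BANDWIDTH, cpus, intensity)
            print(f"{label:17s} {cpus} CPU(s): "
                  f"{rate / 1e6:7.1f} Mflop/s per CPU, "
                  f"{cpus * rate / 1e6:7.1f} Mflop/s per node")
```

In this model a memory-intensive code gains nothing at the node level from the second CPU, because the per-CPU rate halves, while a compute-bound code scales with the number of CPUs. Real nodes behave less simply (caches, memory controllers), so the sketch only indicates the direction of the effect.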

Fortunately, there is nowadays a fair choice of communication networks available for clusters. Of course, 100 Mb/s Ethernet or Gigabit Ethernet is always possible, which is attractive for economic reasons but has the drawback of a high latency (≅ 100 µs). Alternatively, there are networks that operate from user space, such as Myrinet [24,25], Infiniband [32], and SCI [20]. The first two have maximum bandwidths in the order of 200 MB/s and 850 MB/s, respectively, and a latency in the range of 7–9 µs. SCI has a theoretical bandwidth of 400–500 MB/s and a latency under 3 µs. The latter solution is more costly but is nevertheless employed in some cluster configurations. The network speeds shown by Myrinet and, certainly, QsNET and SCI are more or less on par with those of some integrated parallel systems discussed later. So, possibly apart from the speed of the processors and the software provided by the vendors of DM-MIMD supercomputers, the distinction between clusters and this class of machines is becoming rather small and will undoubtedly decrease further in the coming years.
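To get a feeling for what these latency and bandwidth figures mean in practice, one can use the common linear model in which sending a message costs the latency plus the message size divided by the bandwidth. The sketch below applies that model to the figures quoted above. Several values are assumptions rather than statements from the text: the Gigabit Ethernet bandwidth is its nominal line rate of 125 MB/s, the 8 µs latency is the midpoint of the quoted 7–9 µs range, and the SCI entry uses the 3 µs upper bound and the midpoint of 400–500 MB/s. The function names are illustrative only.

```python
# Linear cost model for one point-to-point message:
#     time = latency + size / bandwidth
# name: (latency in seconds, bandwidth in bytes per second)
NETWORKS = {
    "Gigabit Ethernet": (100e-6, 125e6),  # latency from the text; nominal line rate (assumed)
    "Myrinet":          (8e-6, 200e6),    # midpoint of 7-9 us (assumed)
    "Infiniband":       (8e-6, 850e6),    # midpoint of 7-9 us (assumed)
    "SCI":              (3e-6, 450e6),    # "under 3 us"; midpoint of 400-500 MB/s (assumed)
}


def transfer_time(nbytes, latency, bandwidth):
    """Estimated time in seconds to deliver a message of `nbytes` bytes."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency, bandwidth):
    """Bandwidth actually achieved for a message of `nbytes` bytes."""
    return nbytes / transfer_time(nbytes, latency, bandwidth)


def half_performance_length(latency, bandwidth):
    """Message size at which half of the maximum bandwidth is reached."""
    return latency * bandwidth


if __name__ == "__main__":
    for name, (lat, bw) in NETWORKS.items():
        small_us = transfer_time(8, lat, bw) * 1e6
        large_ms = transfer_time(1_000_000, lat, bw) * 1e3
        n_half = half_performance_length(lat, bw)
        print(f"{name:17s} 8 B: {small_us:7.2f} us   "
              f"1 MB: {large_ms:6.2f} ms   n_1/2: {n_half:8.0f} B")
```

Under these assumptions a short message is dominated entirely by latency, which is why the ≅ 100 µs of Ethernet weighs so heavily for applications that exchange many small messages, while for large messages the bandwidth term takes over.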

The best starting point for the state of the art in cluster computing is the TFCC White Paper [43] already mentioned. It gives pointers to available products, both hardware and software, to open questions, and to the focus of present research regarding these questions.

Aad van der Steen
Thu Oct 7 15:46:16 CEST 2004