
5.1 Multicomputer Operating Systems

As already noted in Chapter 2, the initial software used by CP was called CrOS, although its modest functionality hardly justified calling CrOS an operating system. This is actually an interesting issue. In our original model, the ``real'' operating system (UNIX in our case) ran on the ``host,'' which was connected to the hypercube either directly or indirectly via a network. The nodes of the parallel machine needed to provide only the minimal services necessary to support user programs. This is the natural mode for all SIMD systems and is still offered by several important MIMD multicomputers. However, systems such as the IBM SP-1, Intel's Paragon series, and Meiko's CS-1 and CS-2 offer a full UNIX (or an equivalent, such as Mach) on each node. This has many advantages, including the ability to configure the system arbitrarily; in particular, we can consider a multicomputer with N nodes as ``just'' N ``real'' computers connected by a high-performance network. It also gives particularly good performance on remote disk I/O, such as that needed for the Network File System (NFS).

The design of an operating system for the node is based partly on the programming usage paradigm and partly on the hardware. The original multicomputers all had small node memories (128 Kbytes on the Cosmic Cube) and could not possibly hold UNIX on a node. Current multicomputers such as the CM-5, Paragon, and Meiko CS-2 treat tens of megabytes as a normal minimum node memory. This is easily sufficient to hold a full UNIX implementation with the extra functionality needed to support parallel programming. There are some, such as IBM Owego (Execube), Seitz at Caltech (MOSAIC) [Seitz:90a;92a], and Dally at MIT (J Machine) [Dally:90a;92a], who are developing very interesting families of highly parallel ``small node'' multicomputers for which a full UNIX on each node may be inappropriate.

Essentially all the applications described in this book are insensitive to these issues, which affect only the convenience of program development and the operating environment. CP's applications were all developed using a simple message-passing system involving C (and, less often, Fortran) node programs that sent messages to each other via subroutine calls. The key function of CrOS and Express, described in Section 5.2, was to provide this subroutine library.
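To make this programming style concrete, the following is a minimal sketch of a node program in the message-passing idiom. The routine names and signatures (node_id, num_nodes, msg_send, msg_recv) are illustrative stand-ins for a CrOS/Express-style library, not the actual Express calls; the point is only that node programs exchanged data through explicit subroutine calls.

/* Hypothetical node program: every node computes a partial result and
 * node 0 accumulates the global sum.  The extern routines below are
 * illustrative stand-ins for a CrOS/Express-style message library,
 * not the real interface. */
#include <stdio.h>

extern int node_id(void);                         /* this node's number   */
extern int num_nodes(void);                       /* size of the ensemble */
extern int msg_send(void *buf, int len, int dest, int type);
extern int msg_recv(void *buf, int len, int src,  int type);

#define SUM_TYPE 17                               /* arbitrary message tag */

int main(void)
{
    int    me = node_id(), p = num_nodes(), src;
    double partial = (double)(me + 1);            /* stand-in for local work */

    if (me != 0) {
        /* Every node except node 0 ships its partial result to node 0. */
        msg_send(&partial, sizeof(partial), 0, SUM_TYPE);
    } else {
        double total = partial, incoming;
        for (src = 1; src < p; src++) {
            msg_recv(&incoming, sizeof(incoming), src, SUM_TYPE);
            total += incoming;
        }
        printf("global sum = %g\n", total);       /* node 0 reports result */
    }
    return 0;
}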

There are some important capabilities that a parallel computing environment needs in addition to message passing and UNIX services; a prominent example is high-speed input/output.

We did not perform any major computations in CP that required high-speed input/output capabilities. This reflects both our applications mix and the poor I/O performance of the early hypercubes. The applications described in Chapter 18 needed significant, but not high-bandwidth, input/output during computation, as did our analysis of radio astronomy data. The other applications, however, used input/output for the interchange of parameters between user and program and, in greatest volume, for checkpoint and restart. This input/output was typically performed between the host and (node 0 of) the parallel ensemble. Section 5.2.7 and, in greater detail, [Fox:88a] describe the Cubix system, which we developed to make this input/output more convenient. The CP community overwhelmingly preferred this system to the conventional host-node programming style. Curiously, Cubix seems to have made no impact on the ``real world''; we are not aware of any other group that has adopted it.
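The appeal of Cubix is easiest to see in code. In the conventional host-node style, a separate host program reads parameters, forwards them to node 0 by message, and collects results the same way; under Cubix, the node program itself calls ordinary C stdio and the requests are serviced by a file server on the host. The sketch below shows only the node-program side, under the assumption that standard C I/O is available on the nodes (as Cubix provides); node_id() is again an illustrative stand-in, and the singular/multiple mode remarks are a simplified description of Cubix semantics.

/* Cubix-style node program: parameters are read and results written with
 * ordinary C stdio rather than by exchanging messages with a separate
 * host program.  Under Cubix these calls are forwarded to a file server
 * on the host; node_id() is an illustrative stand-in. */
#include <stdio.h>

extern int node_id(void);

int main(void)
{
    int    steps;
    double result;

    /* In singular-mode I/O all nodes issue the same read and receive the
     * same data, so the program reads like an ordinary sequential one. */
    if (scanf("%d", &steps) != 1) {
        fprintf(stderr, "bad input\n");
        return 1;
    }

    result = steps * 0.5;                 /* stand-in for the real computation */

    /* Only node 0 reports, to avoid printing p copies of the same line. */
    if (node_id() == 0)
        printf("result after %d steps: %g\n", steps, result);

    return 0;
}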





