The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the newest computing technology difficult. The Scientific and Engineering Computation series focuses on rapid advances in computing technologies and attempts to facilitate transferring these technologies to applications in science and engineering. It will include books on theories, methods, and original applications in such areas as parallelism, large-scale simulations, time-critical computing, computer-aided design and engineering, use of computers in manufacturing, visualization of scientific data, and human-machine interface technology.
The series will help scientists and engineers to understand the current world of advanced computation and to anticipate future developments that will impact their computing environments and open up new capabilities and modes of computation.
This volume presents a software package for developing parallel programs executable on networked Unix computers. The tool called Parallel Virtual Machine (PVM) allows a heterogeneous collection of workstations and supercomputers to function as a single high-performance parallel machine. PVM is portable and runs on a wide variety of modern platforms. It has been well accepted by the global computing community and used successfully for solving large-scale problems in science, industry, and business.
Janusz S. Kowalik
Preface
In this book we describe the Parallel Virtual Machine (PVM) system and how to develop programs using PVM. PVM is a software system that permits a heterogeneous collection of Unix computers networked together to be viewed by a user's program as a single parallel computer. PVM is the mainstay of the Heterogeneous Network Computing research project, a collaborative venture between Oak Ridge National Laboratory, the University of Tennessee, Emory University, and Carnegie Mellon University.
The PVM system has evolved in the past several years into a viable technology for distributed and parallel processing in a variety of disciplines. PVM supports a straightforward but functionally complete message-passing model.
PVM is designed to link computing resources and provide users with a parallel platform for running their computer applications, irrespective of the number of different computers they use and where the computers are located. When PVM is correctly installed, it is capable of harnessing the combined resources of typically heterogeneous networked computing platforms to deliver high levels of performance and functionality.
In this book, we describe the architecture of the PVM system and discuss its computing model; the programming interface it supports; auxiliary facilities for process groups; the use of PVM on highly parallel systems such as the Intel Paragon, Cray T3D, and Thinking Machines CM-5; and some of the internal implementation techniques employed. Performance issues, dealing primarily with communication overheads, are analyzed, and recent findings as well as enhancements are presented. To demonstrate the viability of PVM for large-scale scientific supercomputing, we also provide some example programs.
This book is not a textbook; rather, it is meant to provide a fast entrance to the world of heterogeneous network computing. We intend this book to be used by two groups of readers: students and researchers working with networks of computers. As such, we hope this book can serve both as a reference and as a supplement to a teaching text on aspects of network computing.
This guide will familiarize readers with the basics of PVM and the concepts used in programming on a network. The information provided here will help with the following PVM tasks:
Stand-alone workstations delivering several tens of millions of operations per second are commonplace, and continuing increases in power are predicted. When these computer systems are interconnected by an appropriate high-speed network, their combined computational power can be applied to solve a variety of computationally intensive applications. Indeed, network computing may even provide supercomputer-level computational power. Further, under the right circumstances, the network-based approach can be effective in coupling several similar multiprocessors, resulting in a configuration that might be economically and technically difficult to achieve with supercomputer hardware.
To be effective, distributed computing requires high communication speeds.
In the past fifteen years or so, network speeds have increased by several orders
of magnitude (see Figure
).
Among the most notable advances in computer networking technology are the following:
ATM - Asynchronous Transfer Mode. ATM is the technique for transport, multiplexing, and switching that provides a high degree of flexibility required by B-ISDN. ATM is a connection-oriented protocol employing fixed-size packets with a 5-byte header and 48 bytes of information.
These advances in high-speed networking promise high throughput with low latency and make it possible to utilize distributed computing for years to come. Consequently, increasing numbers of universities, government and industrial laboratories, and financial firms are turning to distributed computing to solve their computational problems. The objective of PVM is to enable these institutions to use distributed computing efficiently.
Four functions
handle all packet traffic into and out of libpvm.
mroute()
is called by higher-level functions
such as pvm_send() and pvm_recv()
to copy messages into and out of the task.
It establishes any necessary routes before calling mxfer().
mxfer()
polls for messages,
optionally blocking until one is received
or until a specified timeout.
It calls mxinput() to copy
fragments into the task and reassemble messages.
In the generic version of PVM,
mxfer()
uses select() to poll all routes (sockets) in order to find
those ready for input or output.
pvmmctl()
is called by mxinput()
when a control message (Section
)
is received.
Direct routing allows one task to send messages to another through a TCP link, avoiding the overhead of forwarding through the pvmds. It is implemented entirely in libpvm, using the notify and control message facilities. By default, a task routes messages to its pvmd, which forwards them on. If direct routing is enabled (PvmRouteDirect) when a message (addressed to a task) is passed to mroute(), it attempts to create a direct route if one doesn't already exist. The route may be granted or refused by the destination task, or fail (if the task doesn't exist). The message is then passed to mxfer().
Libpvm maintains a protocol control block (struct ttpcb)
for each active or denied connection,
in list ttlist.
The state diagram for a ttpcb is shown in
Figure
.
To request a connection,
mroute()
makes a ttpcb and socket,
then
sends a
TC_CONREQ
control message to the destination via the default route.
At the same time,
it sends a TM_NOTIFY message to the pvmd,
to be notified if the destination task exits,
with closure (message tag)
TC_TASKEXIT.
Then it
puts the ttpcb in
state TTCONWAIT,
and calls
mxfer() in blocking mode repeatedly
until the state changes.
When the destination task enters mxfer() (for example, to receive a message), it receives the TC_CONREQ message. The request is granted if its routing policy (pvmrouteopt != PvmDontRoute) and implementation allow a direct connection, it has resources available, and the protocol version (TDPROTOCOL) in the request matches its own. It makes a ttpcb with state TTGRNWAIT, creates and listens on a socket, and then replies with a TC_CONACK message. If the destination denies the connection, it nacks, also with a TC_CONACK message. The originator receives the TC_CONACK message, and either opens the connection (state = TTOPEN) or marks the route denied (state = TTDENY). Then, mroute() passes the original message to mxfer(), which sends it. Denied connections are cached in order to prevent repeated negotiation.
If the destination doesn't exist, the TC_CONACK message never arrives because the TC_CONREQ message is silently dropped. However, the TC_TASKEXIT message generated by the notify system arrives in its place, and the ttpcb state is set to TTDENY.
This connect scheme also works if both ends try to establish a connection at the same time. They both enter TTCONWAIT, and when they receive each other's TC_CONREQ messages, they go directly to the TTOPEN state.
Figure: Task-task connection state diagram
The libpvm function pvm_mcast() sends a message to multiple destinations simultaneously. The current implementation only routes multicast messages through the pvmds. It uses a 1:N fanout to ensure that failure of a host doesn't cause the loss of any messages (other than ones to that host). The packet routing layer of the pvmd cooperates with the libpvm to multicast a message.
To form a multicast address TID (GID)
,
the G bit is set
(refer to Figure
).
The L field is assigned by a counter that is incremented for
each multicast,
so
a new multicast address is used for each message,
then recycled.
To initiate a multicast, the task sends a TM_MCA message to its pvmd, containing a list of recipient TIDs. The pvmd creates a multicast descriptor (struct mca) and GID. It sorts the addresses, removes bogus ones, and duplicates and caches them in the mca. To each destination pvmd (ones with destination tasks), it sends a DM_MCA message with the GID and destinations on that host. The GID is sent back to the task in the TM_MCA reply message.
The task sends the multicast message to the pvmd, addressed to the GID. As each packet arrives, the routing layer copies it to each local task and foreign pvmd. When a multicast packet arrives at a destination pvmd, it is copied to each destination task. Packet order is preserved, so the multicast address and data packets arrive in order at each destination. As it forwards multicast packets, each pvmd eavesdrops on the header flags. When it sees a packet with EOM flag set, it flushes the mca.
Experience seems to indicate that inherited environment (Unix environ) is useful to an application. For example, environment variables can be used to distinguish a group of related tasks or to set debugging variables.
PVM makes increasing use of environment, and may eventually support it even on machines where the concept is not native. For now, it allows a task to export any part of environ to tasks spawned by it. Setting variable PVM_EXPORT to the names of other variables causes them to be exported through spawn. For example, setting
PVM_EXPORT=DISPLAY:SHELLexports the variables DISPLAY and SHELL to children tasks (and PVM_EXPORT too).
The following environment variables are used by PVM. The user may set these:
----------------------------------------------------------------------- PVM_ROOT Root installation directory PVM_EXPORT Names of environment variables to inherit through spawn PVM_DPATH Default slave pvmd install path PVM_DEBUGGER Path of debugger script used by spawn -----------------------------------------------------------------------
The following variables are set by PVM and should not be modified:
------------------------------------------------------------------- PVM_ARCH PVM architecture name PVMSOCK Address of the pvmd local socket; see Section 7.4.2 PVMEPID Expected PID of a spawned task PVMTMASK Libpvm Trace mask -------------------------------------------------------------------
Each task spawned through PVM has /dev/null opened for stdin. From its parent, it inherits a stdout sink, which is a (TID, code) pair. Output on stdout or stderr is read by the pvmd through a pipe, packed into PVM messages and sent to the TID, with message tag equal to the code. If the output TID is set to zero (the default for a task with no parent), the messages go to the master pvmd, where they are written on its error log.
Children spawned by a task inherit its stdout sink. Before the spawn, the parent can use pvm_setopt() to alter the output TID or code. This doesn't affect where the output of the parent task itself goes. A task may set output TID to one of three settings: the value inherited from its parent, its own TID, or zero. It can set output code only if output TID is set to its own TID. This means that output can't be assigned to an arbitrary task.
Four types of messages are sent to an stdout sink. The message body formats for each type are as follows:
------------------------------------------------------------
Spawn: (code) { Task has been spawned
int tid, Task id
int -1, Signals spawn
int ptid TID of parent
}
Begin: (code) { First output from task
int tid, Task id
int -2, Signals task creation
int ptid TID of parent
}
Output: (code) { Output from a task
int tid, Task id
int count, Length of output fragment
data[count] Output fragment
}
End: (code) { Last output from a task
int tid, Task id
int 0 Signals EOF
}
------------------------------------------------------------
The first two items in the message body are always the task id and output count, which allow the receiver to distinguish between different tasks and the four message types. For each task, one message each of types Spawn, Begin, and End is sent, along with zero or more messages of class Output, (count > 0). Classes Begin, Output and End will be received in order, as they originate from the same source (the pvmd of the target task). Class Spawn originates at the (possibly different) pvmd of the parent task, so it can be received in any order relative to the others. The output sink is expected to understand the different types of messages and use them to know when to stop listening for output from a task (EOF) or group of tasks (global EOF).
The messages are designed so as to prevent race conditions
when a task spawns another task,
then immediately exits.
The
output sink might
get the End
message from the parent task
and decide the group is finished,
only to receive more output later from the child task.
According to these rules, the Spawn
message for the second task
must
arrive before
the End message from the first task.
The Begin message itself is necessary because the Spawn
message for a task may arrive after the End message
for the same task.
The state transitions of a task as observed by the receiver of
the output messages
are shown in
Figure
.
Figure: Output states of a task
The libpvm function pvm_catchout() uses this output collection feature to put the output from children of a task into a file (for example, its own stdout). It sets output TID to its own task id, and the output code to control message TC_OUTPUT. Output from children and grandchildren tasks is collected by the pvmds and sent to the task, where it is received by pvmmctl() and printed by pvmclaimo().
The libpvm library
has a tracing system
that can record the parameters and results of all calls to interface
functions.
Trace data is sent as messages to a trace sink task
just as output is sent to an stdout sink (Section
).
If the trace output TID is set to zero (the default),
tracing is disabled.
Besides the trace sink, tasks also inherit a trace mask, used to enable tracing function-by-function. The mask is passed as a (printable) string in environment variable PVMTMASK. A task can manipulate its own trace mask or the one to be inherited from it. A task's trace mask can also be set asynchronously with a TC_SETTMASK control message.
Constants related to trace messages are defined in public header file pvmtev.h. Trace data from a task is collected in a manner similar to the output redirection discussed above. Like the type Spawn, Begin, and End messages which bracket output from a task, TEV_SPNTASK, TEV_NEWTASK and TEV_ENDTASK trace messages are generated by the pvmds to bracket trace messages.
The tracing system was introduced in version 3.3 and is still expected to change somewhat.
PVM provides a simple but extensible debugging facility. Tasks started by hand could just as easily be run under a debugger, but this procedure is cumbersome for those spawned by an application, since it requires the user to comment out the calls to pvm_spawn() and start tasks manually. If PvmTaskDebug is added to the flags passed to pvm_spawn(), the task is started through a debugger script (a normal shell script), $PVM_ROOT/lib/debugger.
The pvmd passes the name and parameters of the task to the debugger script, which is free to start any sort of debugger. The script provided is very simple. In an xterm window, it runs the correct debugger according to the architecture type of the host. The script can be customized or replaced by the user. The pvmd can be made to execute a different debugger via the bx= host file option or the PVM_DEBUGGER environment variable.
The PVM console is used to manage the virtual machine-to reconfigure it or start and stop processes. In addition, it's an example program that makes use of most of the libpvm functions.
pvm_getfds() and select() are used to check for input from the keyboard and messages from the pvmd simultaneously. Keyboard input is passed to the command interpreter, while messages contain notification (for example, HostAdd) or output from a task.
The console can collect output or trace messages from spawned tasks,
using the redirection mechanisms described
in Section
and Section
,
and
write them to the screen or a file.
It uses the begin and end messages
from child tasks to maintain groups of tasks (or jobs),
related by common ancestors.
Using the PvmHostAdd notify event,
it informs the user when the virtual machine is reconfigured.
Resource limits imposed by the operating system and available hardware are in turn passed to PVM applications. Whenever possible, PVM avoids setting explicit limits; instead, it returns an error when resources are exhausted. Competition between users on the same host or network affects some limits dynamically.
The PVM software provides a unified framework within which parallel programs can be developed in an efficient and straightforward manner using existing hardware. PVM enables a collection of heterogeneous computer systems to be viewed as a single parallel virtual machine. PVM transparently handles all message routing, data conversion, and task scheduling across a network of incompatible computer architectures.
The PVM computing model is simple yet very general, and accommodates a wide variety of application program structures. The programming interface is deliberately straightforward, thus permitting simple program structures to be implemented in an intuitive manner. The user writes his application as a collection of cooperating tasks. Tasks access PVM resources through a library of standard interface routines. These routines allow the initiation and termination of tasks across the network as well as communication and synchronization between tasks. The PVM message-passing primitives are oriented towards heterogeneous operation, involving strongly typed constructs for buffering and transmission. Communication constructs include those for sending and receiving data structures as well as high-level primitives such as broadcast, barrier synchronization, and global sum.
PVM tasks may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, any task in existence may start or stop other tasks or add or delete computers from the virtual machine. Any process may communicate and/or synchronize with any other. Any specific control and dependency structure may be implemented under the PVM system by appropriate use of PVM constructs and host language control-flow statements.
Owing to its ubiquitous nature (specifically, the virtual machine concept) and also because of its simple but complete programming interface, the PVM system has gained widespread acceptance in the high-performance scientific computing community.
How many tasks each pvmd can manage is limited by two factors: the number of processes allowed a user by the operating system, and the number of file descriptors available to the pvmd. The limit on processes is generally not an issue, since it doesn't make sense to have a huge number of tasks running on a uniprocessor machine.
Each task consumes one file descriptor in the pvmd, for the pvmd-task TCP stream. Each spawned task (not ones connected anonymously) consumes an extra descriptor, since its output is read through a pipe by the pvmd (closing stdout and stderr in the task would reclaim this slot). A few more file descriptors are always in use by the pvmd for the local and network sockets and error log file. For example, with a limit of 64 open files, a user should be able to have up to 30 tasks running per host.
The pvmd may become a bottleneck if all these tasks try to talk to one another through it.
The pvmd uses dynamically allocated memory to store message packets en route between tasks. Until the receiving task accepts the packets, they accumulate in the pvmd in an FIFO procedure. No flow control is imposed by the pvmd: it will happily store all the packets given to it, until it can't get any more memory. If an application is designed so that tasks can keep sending even when the receiving end is off doing something else and not receiving, the system will eventually run out of memory .
As with the pvmd, a task may have a limit on the number of others it can connect to directly. Each direct route to a task has a separate TCP connection (which is bidirectional), and so consumes a file descriptor. Thus, with a limit of 64 open files, a task can establish direct routes to about 60 other tasks. Note that this limit is in effect only when using task-task direct routing. Messages routed via the pvmds use only the default pvmd-task connection.
The maximum size of a PVM message is limited by the amount of memory available to the task. Because messages are generally packed using data existing elsewhere in memory, and they must be reside in memory between being packed and sent, the largest possible message a task can send should be somewhat less than half the available memory. Note that as a message is sent, memory for packet buffers is allocated by the pvmd, aggravating the situation. In-place message encoding alleviates this problem somewhat, because the data is not copied into message buffers in the sender. However, on the receiving end, the entire message is downloaded into the task before the receive call accepts it, possibly leaving no room to unpack it.
In a similar vein, if many tasks send to a single destination all at once, the destination task or pvmd may be overloaded as it tries to store the messages. Keeping messages from being freed when new ones are received by using pvm_setrbuf() also uses up memory.
These problems can sometimes be avoided by rearranging the application code, for example, to use smaller messages, eliminate bottlenecks, and process messages in the order in which they are generated.
Developed initially as a parallel programming environment for Unix workstations, PVM has gained wide acceptance and become a de facto standard for message-passing programming. Users want the same programming environment on multiprocessor computers so they can move their applications onto these systems. A common interface would also allow users to write vendor-independent programs for parallel computers and to do part or most of the development work on workstations, freeing up the multiprocessor supercomputers for production runs.
With PVM, multiprocessor systems can be included in the same configuration with workstations. For example, a PVM task running on a graphics workstation can display the results of computations carried out on a massively parallel processing supercomputer. Shared-memory computers with a small number of processors can be linked to deliver supercomputer performance.
The virtual machine hides the configuration details from the programmer. The physical processors can be a network of workstations, or they can be the nodes of a multicomputer. The programmer doesn't have to know how the tasks are created or where they are running; it is the responsibility of PVM to schedule user's tasks onto individual processors. The user can, however, tune the program for a specific configuration to achieve maximum performance, at the expense of its portability.
Multiprocessor systems can be divided into two main categories: message passing and shared memory. In the first category, PVM is now supported on Intel's iPSC/860 and Paragon , as well as Thinking Machine's CM-5 . Porting PVM to these platforms is straightforward, because the message-passing functions in PVM map quite naturally onto the native system calls. The difficult part is the loading and management of tasks. In the second category, message passing can be done by placing the message buffers in shared memory. Access to these buffers must be synchronized with mutual exclusion locks. PVM 3.3 shared memory ports include SGI multiprocessor machines running IRIX 5.x and Sun Microsystems, Inc., multiprocessor machines running Solaris 2.3 (This port also runs on the Cray Research, Inc., CS6400 ). In addition, CRAY and DEC have created PVM ports for their T3D and DEC 2100 shared memory multiprocessors, respectively.
Figure:
PVM daemon and tasks on MPP host
A typical MPP system has one or more service nodes for user logins and a large number of compute nodes for number crunching. The PVM daemon runs on one of the service nodes and serves as the gateway to the outside world. A task can be started on any one of the service nodes as a Unix process and enrolls in PVM by establishing a TCP socket connection to the daemon. The only way to start PVM tasks on the compute nodes is via pvm_spawn(). When the daemon receives a request to spawn new tasks, it will allocate a set of nodes if necessary, and load the executable onto the specified number of nodes.
The way PVM allocates nodes is system dependent. On the CM-5, the entire partition is allocated to the user. On the iPSC/860, PVM will get a subcube big enough to accommodate all the tasks to be spawned. Tasks created with two separate calls to pvm_spawn() will reside in different subcubes, although they can exchange messages directly by using the physical node address. The NX operating system limits the number of active subcubes system-wide to 10. Pvm_spawn will fail when this limit is reached or when there are not enough nodes available. In the case of the Paragon, PVM uses the default partition unless a different one is specified when pvmd is invoked. Pvmd and the spawned tasks form one giant parallel application. The user can set the appropriate NX environment variables such as NX_DFLT_SIZE before starting PVM, or he can specify the equivalent command-line arguments to pvmd (i.e., pvmd -sz 32).
Figure:
Packing: breaking data into fixed-size fragments
PVM message-passing functions are implemented in terms of
the native send and receive system calls.
The ``address" of a task is encoded in the task id, as illustrated
in Figure
.
Figure: How TID is used to distinguish tasks on MPP
This enables the messages to be sent directly to the target task, without any help from the daemon. The node number is normally the logical node number, but the physical address is used on the iPSC/860 to allow for direct intercube communication. The instance number is used to distinguish tasks running on the same node.
Figure:
Buffering: buffering one fragment by receiving
task until pvm_recv() is called
PVM normally uses asynchronous send primitives to send
messages.
The operating system can run out of
message handles very quickly if a lot of small messages or several
large messages are sent at once.
PVM will be forced to switch to synchronous send when there are no more
message handles left or when the system buffer gets filled up.
To improve performance, a task
should call pvm_send() as soon as the data becomes available,
so (one hopes) when the other task calls pvm_recv(), the message will
already be in its buffer. PVM buffers one incoming packet between
calls to pvm_send()/pvm_recv(). A large message,
however, is broken up into
many fixed-size fragments during packing, and each piece is sent
separately.
Buffering one of these fragments
is not sufficient unless pvm_send() and pvm_recv() are synchronized.
Figures
and
illustrate this process.
The front end of an MPP system is treated as a regular workstation. Programs to be run there should be linked with the regular PVM library, which relies on Unix sockets to transmit messages. Normally one should avoid running processes on the front end, because communication between those processes and the node processes must go through the PVM daemon and a TCP socket link. Most of the computation and communication should take place on the compute nodes in order to take advantage of the processing power of these nodes and the fast interconnects between them.
Since the PVM library for the front end is different from the one for the nodes, the executable for the front end must be different from the one compiled for the nodes. An SPMD program, for example, has only one source file, but the object code must be linked with the front end and node PVM libraries separately to produce two executables if it is to be started from the front end. An alternative would be a ``hostless" SPMD program , which could be spawned from the PVM console.
Table
shows the native system calls used by the corresponding
PVM functions on various platforms.
Table: Implementation of PVM system calls
The CM-5 is somewhat different from the Intel systems because it requires a special host process for each group of tasks spawned. This process enrolls in PVM and relays messages between pvmd and the node programs. This, needless to say, adds even more overhead to daemon-task communications.
Another restrictive feature of the CM-5 is that all nodes in the same partition are scheduled as a single unit. The partitions are normally configured by the system manager and each partition must contain at least 16 processors. User programs are run on the entire partition by default. Although it is possible to idle some of the processors in a partition, as PVM does when fewer nodes are called for, there is no easy way to harness the power of the idle processors. Thus, if PVM spawns two groups of tasks, they will time-share the partition, and any intergroup traffic must go through pvmd.
Additionally, CMMD has no support for multicasting. Thus, pvm_mcast() is implemented with a loop of CMMD_async_send().
The shared-memory architecture provides a very efficient medium for processes to exchange data. In our implementation, each task owns a shared buffer created with the shmget() system call. The task id is used as the ``key" to the shared segment. If the key is being used by another user, PVM will assign a different id to the task. A task communicates with other tasks by mapping their message buffers into its own memory space.
To enroll in PVM, the task first writes its Unix process id into
pvmd's incoming box. It then looks for the assigned task id in
pvmd's pid
TID table.
The message buffer is divided into pages, each of which holds one fragment
(Figure
).
PVM's page size can be a multiple of the system page size.
Each page has a header, which contains the lock and
the reference count.
The first few pages are used as the incoming box, while the rest of the pages
hold outgoing fragments (Figure
). To send a message,
the task first packs the
message body into its buffer, then delivers the message header (which
contains the sender's TID and the location of the data) to the incoming
box of the intended recipient. When pvm_recv() is called, PVM checks
the incoming box, locates and unpacks the messages (if any), and
decreases the reference count so the space can be reused. If a task
is not able to deliver the header directly because the receiving box
is full, it will block until the other task is ready.
Figure:
Structure of a PVM page
Figure:
Structures of shared message buffers
Inevitably some overhead will be incurred when a message is packed into and unpacked from the buffer, as is the case with all other PVM implementations. If the buffer is full, then the data must first be copied into a temporary buffer in the process's private space and later transferred to the shared buffer.
Memory contention is usually not a problem. Each process has its own buffer, and each page of the buffer has its own lock. Only the page being written to is locked, and no process should be trying to read from this page because the header has not been sent out. Different processes can read from the same page without interfering with each other, so multicasting will be efficient (they do have to decrease the counter afterwards, resulting in some contention). The only time contention occurs is when two or more processes trying to deliver the message header to the same process at the same time. But since the header is very short (16 bytes), such contention should not cause any significant delay.
To minimize the possibility of page faults, PVM attempts to use only a small number of pages in the message buffer and recycle them as soon as they have been read by all intended recipients.
Once a task's buffer has been mapped, it will not be unmapped unless the system limits the number of mapped segments. This strategy saves time for any subsequent message exchanges with the same process.
In the original implementation, all user messages are buffered by PVM. The user must pack the data into a PVM buffer before sending it, and unpack the data after it has been received into an internal buffer. This approach works well on systems with relatively high communication latency, such as the Ethernet. On MPP systems the packing and unpacking introduce substantial overhead. To solve this problem we added two new PVM functions, namely pvm_psend() and pvm_precv(). These functions combine packing/unpacking and sending/receiving into one single step. They could be mapped directly into the native message passing primitives available on the system, doing away with internal buffers altogether. On the Paragon these new functions give almost the same performance as the native ones.
Although the user can use both pvm_psend() and pvm_send() in the same program, on MPP the pvm_psend() must be matched with pvm_precv(), and pvm_send() with pvm_recv().
Several research groups have developed software packages that like PVM assist programmers in using distributed computing. Among the most well known efforts are P4 [1], Express [], MPI [], and Linda []. Various other systems with similar capabilities are also in existence; a reasonably comprehensive listing may be found in [13].
It is often useful and always reassuring to be able to see the present configuration of the virtual machine and the status of the hosts. It would be even more useful if the user could also see what his program is doing-what tasks are running, where messages are being sent, etc. The PVM GUI called XPVM was developed to display this information, and more.
XPVM combines the capabilities of the PVM console, a performance monitor, and a call-level debugger into a single, easy-to-use X-Windows interface. XPVM is available from netlib in the directory pvm3/xpvm. It is distributed as precompiled, ready-to-run executables for SUN4, RS6K, ALPHA, SUN4SOL2, HPPA, and SGI5. The XPVM source is also available for compiling on other machines.
XPVM is written entirely in C using the TCL/TK [8] toolkit and runs just like another PVM task. If a user wishes to build XPVM from the source, he must first obtain and install the TCL/TK software on his system. TCL and TK were developed by John Ousterhout at Berkeley and can be obtained by anonymous ftp to sprite.berkeley.edu The TCL and XPVM source distributions each contain a README file that describes the most up-to-date installation procedure for each package respectively.
Figure
shows a snapshot of XPVM in use.
Figure: XPVM interface - snapshot during use
- figure not available -
Like the PVM console, XPVM will start PVM if PVM is not already running, or will attach to the local pvmd if it is. The console can take an optional hostfile argument whereas XPVM always reads $HOME/.xpvm_hosts as its hostfile. If this file does not exist, then XPVM just starts PVM on the local host (or attaches to the existing PVM). In typical use, the hostfile .xpvm_hosts contains a list of hosts prepended with an &. These hostnames then get added to the Hosts menu for addition and deletion from the virtual machine by clicking on them.
The top row of buttons perform console-like functions. The Hosts button displays a menu of hosts. Clicking on a host toggles whether it is added or deleted from the virtual machine. At the bottom of the menu is an option for adding a host not listed. The Tasks button brings up a menu whose most-used selection is spawn. Selecting spawn brings up a window where one can set the executable name, spawn flags, start position, number of copies to start, etc. By default, XPVM turns on tracing in all tasks (and their children) started inside XPVM. Clicking on Start in the spawn window starts the task, which will then appear in the space-time view. The Reset button has a menu for resetting PVM (i.e., kill all PVM tasks) or resetting different parts of XPVM. The Quit button exits XPVM while leaving PVM running. If XPVM is being used to collect trace information, the information will not be collected if XPVM is stopped. The Halt button is used when one is through with PVM. Clicking on this button kills all running PVM tasks, shuts down PVM cleanly, and exits the XPVM interface. The Help button brings up a menu of topics the user can get help about.
During startup, XPVM joins a group called xpvm. The intention is that tasks started outside the XPVM interface can get the TID of XPVM by doing tid = pvm_gettid( xpvm, 0 ). This TID would be needed if the user wanted to manually turn on tracing inside such a task and pass the events back to XPVM for display. The expected TraceCode for these events is 666.
While an application is running, XPVM collects and displays the information in real time. Although XPVM updates the views as fast as it can, there are cases when XPVM cannot keep up with the events and it falls behind the actual run time.
In the middle of the XPVM interface are tracefile controls. It is here that the user can specify a tracefile-a default tracefile in /tmp is initially displayed. There are buttons to specify whether the specified tracefile is to be played back or overwritten by a new run. XPVM saves trace events in a file using the ``self defining data format'' (SDDF) described in Dan Reed's Pablo [11] trace playing package. The analysis of PVM traces can be carried out on any of a number of systems such as Pablo.
XPVM can play back its own SDDF files. The tape-player-like buttons allow the user to rewind the tracefile, stop the display at any point, and step through the execution. A time display specifies the number of seconds from when the trace display began.
The Views button allows the user to open or close any of several views presently supplied with XPVM. These views are described below.
The Network view displays the present virtual machine configuration and the activity of the hosts. Each host is represented by an icon that includes the PVM_ARCH and host name inside the icon. In the initial release of XPVM, the icons are arranged arbitrarily on both sides of a bus network. In future releases the view will be extended to visualize network activity as well. At that time the user will be able to specify the network topology to display.
These icons are illuminated in different colors to indicate their status in executing PVM tasks. Green implies that at least one task on that host is busy executing useful work. Yellow indicates that no tasks are executing user computation, but at least one task is busy executing PVM system routines. When there are no tasks on a given host, its icon is left uncolored or white. The specific colors used in each case are user customizable.
The user can tell at a glance how well the virtual machine is being utilized by his PVM application. If all the hosts are green most of the time, then machine utilization is good. The Network view does not display activity from other users' PVM jobs or other processes that may be running on the hosts.
In future releases the view will allow the user to click on a multiprocessor icon and get information about the number of processors, number of PVM tasks, etc., that are running on the host.
The Space-Time view displays the activities of individual PVM tasks that are running on the virtual machine. Listed on the left-hand side of the view are the executable names of the tasks, preceded by the host they are running on. The task list is sorted by host so that it is easy to see whether tasks are being clumped on one host. This list also shows the task-to-host mappings (which are not available in the Network view).
The Space-Time view combines three different displays. The first is like a Gantt chart . Beside each listed task is a horizontal bar stretching out in the ``time'' direction. The color of this bar at any time indicates the state of the task. Green indicates that user computations are being executed. Yellow marks the times when the task is executing PVM routines. White indicates when a task is waiting for messages. The bar begins at the time when the task starts executing and ends when the task exits normally. The specific colors used in each case are user customizable.
The second display overlays the first display with the communication activity among tasks. When a message is sent between two tasks, a red line is drawn starting at the sending task's bar at the time the message is sent and ending at the receiving task's bar when the message is received. Note that this is not necessarily the time the message arrived, but rather the time the task returns from pvm_recv(). Visually, the patterns and slopes of the red lines combined with white ``waiting'' regions reveal a lot about the communication efficiency of an application.
The third display appears only when a user clicks on interesting features of the Spac