Contents



next up previous index
Next: A Bit of Up: PVM: Parallel Virtual Machine Previous: PVM: Parallel Virtual Machine

Contents

Series Foreword

The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the newest computing technology difficult. The Scientific and Engineering Computation series focuses on rapid advances in computing technologies and attempts to facilitate transferring these technologies to applications in science and engineering. It will include books on theories, methods, and original applications in such areas as parallelism, large-scale simulations, time-critical computing, computer-aided design and engineering, use of computers in manufacturing, visualization of scientific data, and human-machine interface technology.

The series will help scientists and engineers to understand the current world of advanced computation and to anticipate future developments that will impact their computing environments and open up new capabilities and modes of computation.

This volume presents a software package for developing parallel programs executable on networked Unix computers. The tool called Parallel Virtual Machine (PVM) allows a heterogeneous collection of workstations and supercomputers to function as a single high-performance parallel machine. PVM is portable and runs on a wide variety of modern platforms. It has been well accepted by the global computing community and used successfully for solving large-scale problems in science, industry, and business.

Janusz S. Kowalik

Preface

In this book we describe the Parallel Virtual Machine (PVM) system and how to develop programs using PVM. PVM is a software system that permits a heterogeneous collection of Unix computers networked together to be viewed by a user's program as a single parallel computer. PVM is the mainstay of the Heterogeneous Network Computing research project, a collaborative venture between Oak Ridge National Laboratory, the University of Tennessee, Emory University, and Carnegie Mellon University.

The PVM system has evolved in the past several years into a viable technology for distributed and parallel processing in a variety of disciplines. PVM supports a straightforward but functionally complete message-passing model.

PVM is designed to link computing resources and provide users with a parallel platform for running their computer applications, irrespective of the number of different computers they use and where the computers are located. When PVM is correctly installed, it is capable of harnessing the combined resources of typically heterogeneous networked computing platforms to deliver high levels of performance and functionality.

In this book, we describe the architecture of the PVM system and discuss its computing model; the programming interface it supports; auxiliary facilities for process groups; the use of PVM on highly parallel systems such as the Intel Paragon, Cray T3D, and Thinking Machines CM-5; and some of the internal implementation techniques employed. Performance issues, dealing primarily with communication overheads, are analyzed, and recent findings as well as enhancements are presented. To demonstrate the viability of PVM for large-scale scientific supercomputing, we also provide some example programs.

This book is not a textbook; rather, it is meant to provide a fast entrance to the world of heterogeneous network computing. We intend this book to be used by two groups of readers: students and researchers working with networks of computers. As such, we hope this book can serve both as a reference and as a supplement to a teaching text on aspects of network computing.

This guide will familiarize readers with the basics of PVM and the concepts used in programming on a network. The information provided here will help with the following PVM tasks:





next up previous index
Next: A Bit of Up: PVM: Parallel Virtual Machine Previous: PVM: Parallel Virtual Machine




Footnotes

...
Currently, the pvmd generates an acknowledgement packet for each data packet.

...
This was once implemented, but was removed while the code was updated and hasn't been reintroduced.

Jack Dongarra
Thu Sep 15 21:00:17 EDT 1994

Trends in Distributed Computing



next up previous contents index
Next: PVM Overview Up: Introduction Previous: Heterogeneous Network Computing

Trends in Distributed Computing

Stand-alone workstations delivering several tens of millions of operations per second are commonplace, and continuing increases in power are predicted. When these computer systems are interconnected by an appropriate high-speed network, their combined computational power can be applied to solve a variety of computationally intensive applications. Indeed, network computing may even provide supercomputer-level computational power. Further, under the right circumstances, the network-based approach can be effective in coupling several similar multiprocessors, resulting in a configuration that might be economically and technically difficult to achieve with supercomputer hardware.

To be effective, distributed computing requires high communication speeds. In the past fifteen years or so, network speeds have increased by several orders of magnitude (see Figure gif ).

   
Figure: Networking speeds

Among the most notable advances in computer networking technology are the following:

These advances in high-speed networking promise high throughput with low latency and make it possible to utilize distributed computing for years to come. Consequently, increasing numbers of universities, government and industrial laboratories, and financial firms are turning to distributed computing to solve their computational problems. The objective of PVM is to enable these institutions to use distributed computing efficiently.



next up previous contents index
Next: PVM Overview Up: Introduction Previous: Heterogeneous Network Computing




Libpvm<A NAME=1386>  </A>



next up previous contents index
Next: Direct Message Routing Up: Message Routing Previous: Pvmd and Foreign

Libpvm  

Four functions handle all packet traffic into and out of libpvm. mroute() is called by higher-level functions such as pvm_send() and pvm_recv() to copy messages into and out of the task. It establishes any necessary routes before calling mxfer(). mxfer() polls for messages, optionally blocking until one is received or until a specified timeout. It calls mxinput() to copy fragments into the task and reassemble messages. In the generic version of PVM, mxfer() uses select() to poll all routes (sockets) in order to find those ready for input or output. pvmmctl() is called by mxinput() when a control message (Section gif ) is received.






Direct Message Routing<A NAME=1398>  </A>



next up previous contents index
Next: Multicasting Up: Libpvm Previous: Libpvm

Direct Message Routing  

 

Direct routing allows one task to send messages to another through a TCP link, avoiding the overhead of forwarding through the pvmds. It is implemented entirely in libpvm, using the notify and control message facilities. By default, a task routes messages to its pvmd, which forwards them on. If direct routing is enabled (PvmRouteDirect) when a message (addressed to a task) is passed to mroute(), it attempts to create a direct route if one doesn't already exist. The route may be granted or refused by the destination task, or fail (if the task doesn't exist). The message is then passed to mxfer().

Libpvm maintains a protocol control block (struct ttpcb) for each active or denied connection, in list ttlist. The state diagram for a ttpcb is shown in Figure gif . To request a connection, mroute() makes a ttpcb and socket, then sends a TC_CONREQ control message to the destination via the default route. At the same time, it sends a TM_NOTIFY message to the pvmd, to be notified if the destination task exits, with closure (message tag) TC_TASKEXIT. Then it puts the ttpcb in state TTCONWAIT, and calls mxfer() in blocking mode repeatedly until the state changes.

When the destination task enters mxfer() (for example, to receive a message), it receives the TC_CONREQ message. The request is granted if its routing policy (pvmrouteopt != PvmDontRoute) and implementation allow a direct connection, it has resources available, and the protocol version (TDPROTOCOL) in the request matches its own. It makes a ttpcb with state TTGRNWAIT, creates and listens on a socket, and then replies with a TC_CONACK message. If the destination denies the connection, it nacks, also with a TC_CONACK message. The originator receives the TC_CONACK message, and either opens the connection (state = TTOPEN) or marks the route denied (state = TTDENY). Then, mroute() passes the original message to mxfer(), which sends it. Denied connections are cached in order to prevent repeated negotiation.

If the destination doesn't exist, the TC_CONACK message never arrives because the TC_CONREQ message is silently dropped. However, the TC_TASKEXIT message generated by the notify system arrives in its place, and the ttpcb state is set to TTDENY.

This connect scheme also works if both ends try to establish a connection at the same time. They both enter TTCONWAIT, and when they receive each other's TC_CONREQ messages, they go directly to the TTOPEN state.

   
Figure: Task-task connection state diagram



next up previous contents index
Next: Multicasting Up: Libpvm Previous: Libpvm




Multicasting<A NAME=1436>  </A>



next up previous contents index
Next: Task Environment Up: Message Routing Previous: Direct Message Routing

Multicasting  

 

The libpvm function pvm_mcast() sends a message to multiple destinations simultaneously. The current implementation only routes multicast messages through the pvmds. It uses a 1:N fanout to ensure that failure of a host doesn't cause the loss of any messages (other than ones to that host). The packet routing layer of the pvmd cooperates with the libpvm to multicast a message.

To form a multicast address TID (GID)  , the G bit is set (refer to Figure gif ). The L field is assigned by a counter that is incremented for each multicast, so a new multicast address is used for each message, then recycled.

To initiate a multicast, the task sends a TM_MCA message to its pvmd, containing a list of recipient TIDs. The pvmd creates a multicast descriptor (struct mca) and GID. It sorts the addresses, removes bogus ones, and duplicates and caches them in the mca. To each destination pvmd (ones with destination tasks), it sends a DM_MCA message with the GID and destinations on that host. The GID is sent back to the task in the TM_MCA reply message.

The task sends the multicast message to the pvmd, addressed to the GID. As each packet arrives, the routing layer copies it to each local task and foreign pvmd. When a multicast packet arrives at a destination pvmd, it is copied to each destination task. Packet order is preserved, so the multicast address and data packets arrive in order at each destination. As it forwards multicast packets, each pvmd eavesdrops on the header flags. When it sees a packet with EOM flag set, it flushes the mca.




Task Environment



next up previous contents index
Next: Environment Variables Up: How PVM Works Previous: Multicasting

Task Environment






Environment Variables<A NAME=1447>  </A>



next up previous contents index
Next: Standard Input and Up: Task Environment Previous: Task Environment

Environment Variables  

Experience seems to indicate that inherited environment (Unix environ) is useful to an application. For example, environment variables can be used to distinguish a group of related tasks or to set debugging variables.

PVM makes increasing use of environment, and may eventually support it even on machines where the concept is not native. For now, it allows a task to export any part of environ to tasks spawned by it. Setting variable PVM_EXPORT to the names of other variables causes them to be exported through spawn. For example, setting

    PVM_EXPORT=DISPLAY:SHELL
exports the variables DISPLAY and SHELL to children tasks (and PVM_EXPORT too).

The following environment variables are used by PVM. The user may set these:

-----------------------------------------------------------------------
PVM_ROOT        Root installation directory
PVM_EXPORT      Names of environment variables to inherit through spawn
PVM_DPATH       Default slave pvmd install path
PVM_DEBUGGER    Path of debugger script used by spawn
-----------------------------------------------------------------------

The following variables are set by PVM and should not be modified:

-------------------------------------------------------------------
PVM_ARCH        PVM architecture name
PVMSOCK         Address of the pvmd local socket; see Section 7.4.2
PVMEPID         Expected PID of a spawned task
PVMTMASK        Libpvm Trace mask
-------------------------------------------------------------------




Standard Input and Output<A NAME=1480>  </A>



next up previous contents index
Next: Tracing Up: Task Environment Previous: Environment Variables

Standard Input and Output  

 

Each task spawned through PVM has /dev/null opened for stdin. From its parent, it inherits a stdout sink, which is a (TID, code) pair. Output on stdout or stderr is read by the pvmd through a pipe, packed into PVM messages and sent to the TID, with message tag equal to the code. If the output TID is set to zero (the default for a task with no parent), the messages go to the master pvmd, where they are written on its error log.

Children spawned by a task inherit its stdout sink. Before the spawn, the parent can use pvm_setopt() to alter the output TID or code. This doesn't affect where the output of the parent task itself goes. A task may set output TID to one of three settings: the value inherited from its parent, its own TID, or zero. It can set output code only if output TID is set to its own TID. This means that output can't be assigned to an arbitrary task.

Four types of messages are sent to an stdout sink. The message body formats for each type are as follows:

------------------------------------------------------------
Spawn:     (code) {                  Task has been spawned
                      int tid,       Task id
                      int -1,        Signals spawn
                      int ptid       TID of parent
                  }
Begin:     (code) {                  First output from task
                      int tid,       Task id
                      int -2,        Signals task creation
                      int ptid       TID of parent
                  }
Output:    (code) {                  Output from a task
                      int tid,       Task id
                      int count,     Length of output fragment
                      data[count]    Output fragment
                  }
End:       (code) {                  Last output from a task
                      int tid,       Task id
                      int 0          Signals EOF
                  }
------------------------------------------------------------

The first two items in the message body are always the task id and output count, which allow the receiver to distinguish between different tasks and the four message types. For each task, one message each of types Spawn, Begin, and End is sent, along with zero or more messages of class Output, (count > 0). Classes Begin, Output and End will be received in order, as they originate from the same source (the pvmd of the target task). Class Spawn originates at the (possibly different) pvmd of the parent task, so it can be received in any order relative to the others. The output sink is expected to understand the different types of messages and use them to know when to stop listening for output from a task (EOF) or group of tasks (global EOF).

The messages are designed so as to prevent race conditions when a task spawns another task, then immediately exits. The output sink might get the End message from the parent task and decide the group is finished, only to receive more output later from the child task. According to these rules, the Spawn message for the second task must arrive before the End message from the first task. The Begin message itself is necessary because the Spawn message for a task may arrive after the End message for the same task. The state transitions of a task as observed by the receiver of the output messages are shown in Figure gif .

   
Figure: Output states of a task

The libpvm function pvm_catchout() uses this output collection feature to put the output from children of a task into a file (for example, its own stdout). It sets output TID to its own task id, and the output code to control message TC_OUTPUT. Output from children and grandchildren tasks is collected by the pvmds and sent to the task, where it is received by pvmmctl() and printed by pvmclaimo().



next up previous contents index
Next: Tracing Up: Task Environment Previous: Environment Variables




Tracing<A NAME=1542>  </A>



next up previous contents index
Next: Debugging Up: Task Environment Previous: Standard Input and

Tracing  

 

The libpvm library has a tracing system that can record the parameters and results of all calls to interface functions. Trace data is sent as messages to a trace sink task just as output is sent to an stdout sink (Section gif ). If the trace output TID is set to zero (the default), tracing is disabled.

Besides the trace sink, tasks also inherit a trace mask, used to enable tracing function-by-function. The mask is passed as a (printable) string in environment variable PVMTMASK. A task can manipulate its own trace mask or the one to be inherited from it. A task's trace mask can also be set asynchronously with a TC_SETTMASK control message.

Constants related to trace messages are defined in public header file pvmtev.h. Trace data from a task is collected in a manner similar to the output redirection discussed above. Like the type Spawn, Begin, and End messages which bracket output from a task, TEV_SPNTASK, TEV_NEWTASK and TEV_ENDTASK trace messages are generated by the pvmds to bracket trace messages.

The tracing system was introduced in version 3.3 and is still expected to change somewhat.




Debugging<A NAME=1554>  </A>



next up previous contents index
Next: Console Program Up: Task Environment Previous: Tracing

Debugging  

 

PVM provides a simple but extensible debugging facility. Tasks started by hand could just as easily be run under a debugger, but this procedure is cumbersome for those spawned by an application, since it requires the user to comment out the calls to pvm_spawn() and start tasks manually. If PvmTaskDebug is added to the flags passed to pvm_spawn(), the task is started through a debugger script (a normal shell script), $PVM_ROOT/lib/debugger.

The pvmd passes the name and parameters of the task to the debugger script, which is free to start any sort of debugger. The script provided is very simple. In an xterm window, it runs the correct debugger according to the architecture type of the host. The script can be customized or replaced by the user. The pvmd can be made to execute a different debugger via the bx= host file option or the PVM_DEBUGGER environment variable.




Console Program<A NAME=1563>  </A>



next up previous contents index
Next: Resource Limitations Up: How PVM Works Previous: Debugging

Console Program  

The PVM console is used to manage the virtual machine-to reconfigure it or start and stop processes. In addition, it's an example program that makes use of most of the libpvm functions.

pvm_getfds() and select() are used to check for input from the keyboard and messages from the pvmd simultaneously. Keyboard input is passed to the command interpreter, while messages contain notification (for example, HostAdd) or output from a task.

The console can collect output or trace messages from spawned tasks, using the redirection mechanisms described in Section gif and Section gif , and write them to the screen or a file. It uses the begin and end messages from child tasks to maintain groups of tasks (or jobs), related by common ancestors. Using the PvmHostAdd notify event, it informs the user when the virtual machine is reconfigured.




Resource Limitations<A NAME=1572>  </A>



next up previous contents index
Next: In the PVM Up: How PVM Works Previous: Console Program

Resource Limitations  

 

Resource limits imposed by the operating system and available hardware are in turn passed to PVM applications. Whenever possible, PVM avoids setting explicit limits; instead, it returns an error when resources are exhausted. Competition between users on the same host or network affects some limits dynamically.






PVM Overview



next up previous contents index
Next: Other Packages Up: Introduction Previous: Trends in Distributed

PVM Overview

The PVM software provides a unified framework within which parallel programs can be developed in an efficient and straightforward manner using existing hardware. PVM enables a collection of heterogeneous computer systems to be viewed as a single parallel virtual machine. PVM transparently handles all message routing, data conversion, and task scheduling across a network of incompatible computer architectures.

The PVM computing model is simple yet very general, and accommodates a wide variety of application program structures. The programming interface is deliberately straightforward, thus permitting simple program structures to be implemented in an intuitive manner. The user writes his application as a collection of cooperating tasks. Tasks access PVM resources through a library of standard interface routines. These routines allow the initiation and termination of tasks across the network as well as communication and synchronization between tasks. The PVM message-passing primitives are oriented towards heterogeneous operation, involving strongly typed constructs for buffering and transmission. Communication constructs include those for sending and receiving data structures as well as high-level primitives such as broadcast, barrier synchronization, and global sum.

PVM tasks may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, any task in existence may start or stop other tasks or add or delete computers from the virtual machine. Any process may communicate and/or synchronize with any other. Any specific control and dependency structure may be implemented under the PVM system by appropriate use of PVM constructs and host language control-flow statements.

Owing to its ubiquitous nature (specifically, the virtual machine concept) and also because of its simple but complete programming interface, the PVM system has gained widespread acceptance in the high-performance scientific computing community.




In the PVM Daemon



next up previous contents index
Next: In the Task Up: Resource Limitations Previous: Resource Limitations

In the PVM Daemon

How many tasks each pvmd can manage is limited by two factors: the number of processes allowed a user by the operating system, and the number of file descriptors available to the pvmd. The limit on processes   is generally not an issue, since it doesn't make sense to have a huge number of tasks running on a uniprocessor machine.

Each task consumes one file descriptor   in the pvmd, for the pvmd-task   TCP stream. Each spawned task (not ones connected anonymously) consumes an extra descriptor, since its output is read through a pipe by the pvmd (closing stdout and stderr in the task would reclaim this slot). A few more file descriptors are always in use by the pvmd for the local and network sockets and error log file. For example, with a limit of 64 open files, a user should be able to have up to 30 tasks running per host.

The pvmd may become a bottleneck   if all these tasks try to talk to one another through it.

The pvmd uses dynamically allocated memory   to store message packets en route between tasks. Until the receiving task accepts the packets, they accumulate in the pvmd in an FIFO procedure. No flow control is imposed by the pvmd: it will happily store all the packets given to it, until it can't get any more memory. If an application is designed so that tasks can keep sending even when the receiving end is off doing something else and not receiving, the system will eventually run out of memory  .




In the Task



next up previous contents index
Next: Multiprocessor Systems Up: Resource Limitations Previous: In the PVM

In the Task

As with the pvmd, a task may have a limit on the number of others it can connect   to directly. Each direct route to a task has a separate TCP connection (which is bidirectional), and so consumes a file descriptor. Thus, with a limit of 64 open files, a task can establish direct routes to about 60 other tasks. Note that this limit is in effect only when using task-task   direct routing. Messages routed via the pvmds use only the default pvmd-task   connection.

The maximum size of a PVM message   is limited by the amount of memory available to the task. Because messages are generally packed using data existing elsewhere in memory, and they must be reside in memory between being packed and sent, the largest possible message a task can send should be somewhat less than half the available memory. Note that as a message is sent, memory for packet buffers is allocated by the pvmd, aggravating the situation. In-place message   encoding alleviates this problem somewhat, because the data is not copied into message buffers in the sender. However, on the receiving end, the entire message is downloaded into the task before the receive call accepts it, possibly leaving no room to unpack it.

In a similar vein, if many tasks send to a single destination all at once, the destination task or pvmd may be overloaded as it tries to store the messages. Keeping messages from being freed when new ones are received by using pvm_setrbuf() also uses up memory.

These problems can sometimes be avoided by rearranging the application code, for example, to use smaller messages, eliminate bottlenecks, and process messages in the order in which they are generated.

 




Multiprocessor Systems<A NAME=1589>  </A>



next up previous contents index
Next: Message-Passing Architectures Up: How PVM Works Previous: In the Task

Multiprocessor Systems  

 

Developed initially as a parallel programming environment for Unix workstations, PVM has gained wide acceptance and become a de facto standard for message-passing programming. Users want the same programming environment on multiprocessor computers so they can move their applications onto these systems. A common interface would also allow users to write vendor-independent programs for parallel computers and to do part or most of the development work on workstations, freeing up the multiprocessor supercomputers for production runs.

With PVM, multiprocessor systems can be included in the same configuration with workstations. For example, a PVM task running on a graphics workstation can display the results of computations carried out on a massively parallel processing supercomputer. Shared-memory computers with a small number of processors can be linked to deliver supercomputer performance.

The virtual machine hides the configuration details from the programmer. The physical processors can be a network of workstations, or they can be the nodes of a multicomputer. The programmer doesn't have to know how the tasks are created or where they are running; it is the responsibility of PVM to schedule user's tasks onto individual processors. The user can, however, tune the program for a specific configuration to achieve maximum performance, at the expense of its portability.

Multiprocessor systems can be divided into two main categories: message passing and shared memory. In the first category, PVM is now supported on Intel's iPSC/860   and Paragon  , as well as Thinking Machine's CM-5  . Porting PVM to these platforms is straightforward, because the message-passing functions in PVM map quite naturally onto the native system calls. The difficult part is the loading and management of tasks. In the second category, message passing can be done by placing the message buffers in shared memory. Access to these buffers must be synchronized with mutual exclusion locks. PVM 3.3 shared memory ports include SGI multiprocessor machines running IRIX 5.x and Sun Microsystems, Inc.,   multiprocessor machines running Solaris 2.3   (This port also runs on the Cray Research, Inc., CS6400  ). In addition, CRAY and DEC have created PVM ports for their T3D and DEC 2100 shared memory multiprocessors, respectively.





next up previous contents index
Next: Message-Passing Architectures Up: How PVM Works Previous: In the Task




Message-Passing Architectures



next up previous contents index
Next: Shared-Memory Architectures Up: Multiprocessor Systems Previous: Multiprocessor Systems

Message-Passing Architectures

 
Figure:  PVM daemon and tasks on MPP host

A typical MPP system has one or more service nodes for user logins and a large number of compute nodes for number crunching. The PVM daemon   runs on one of the service nodes   and serves as the gateway to the outside world. A task can be started on any one of the service nodes as a Unix process and enrolls in PVM by establishing a TCP socket connection to the daemon. The only way to start PVM tasks on the compute nodes is via pvm_spawn(). When the daemon receives a request to spawn new tasks, it will allocate a set of nodes if necessary, and load the executable onto the specified number of nodes.

The way PVM allocates nodes is system dependent. On the CM-5, the entire partition is allocated to the user. On the iPSC/860, PVM will get a subcube big enough to accommodate all the tasks to be spawned. Tasks created with two separate calls to pvm_spawn() will reside in different subcubes, although they can exchange messages directly by using the physical node address. The NX operating system   limits the number of active subcubes system-wide to 10. Pvm_spawn will fail when this limit is reached or when there are not enough nodes available. In the case of the Paragon,   PVM uses the default partition unless a different one is specified when pvmd is invoked. Pvmd and the spawned tasks form one giant parallel application. The user can set the appropriate NX environment variables such as NX_DFLT_SIZE before starting PVM, or he can specify the equivalent command-line arguments to pvmd (i.e., pvmd -sz 32).

 
Figure:  Packing: breaking data into fixed-size fragments

PVM message-passing functions are implemented in terms of the native send and receive system calls. The ``address" of a task is encoded in the task id, as illustrated in Figure gif .

   
Figure: How TID is used to distinguish tasks on MPP

This enables the messages to be sent directly to the target task, without any help from the daemon. The node number is normally the logical node number, but the physical address is used on the iPSC/860   to allow for direct intercube communication. The instance number is used to distinguish tasks running on the same node.

 
Figure:  Buffering: buffering one fragment by receiving task until pvm_recv() is called

PVM normally uses asynchronous send primitives to send messages. The operating system can run out of message handles very quickly if a lot of small messages or several large messages are sent at once. PVM will be forced to switch to synchronous send when there are no more message handles left or when the system buffer gets filled up. To improve performance, a task should call pvm_send() as soon as the data becomes available, so (one hopes) when the other task calls pvm_recv(), the message will already be in its buffer. PVM buffers one incoming packet between calls to pvm_send()/pvm_recv(). A large message, however, is broken up into many fixed-size fragments during packing, and each piece is sent separately. Buffering one of these fragments is not sufficient unless pvm_send() and pvm_recv() are synchronized. Figures gif and gif illustrate this process.

The front end of an MPP   system is treated as a regular workstation. Programs to be run there should be linked with the regular PVM library, which relies on Unix sockets to transmit messages. Normally one should avoid running processes on the front end, because communication between those processes and the node processes must go through the PVM daemon and a TCP socket link. Most of the computation and communication should take place on the compute nodes in order to take advantage of the processing power of these nodes and the fast interconnects between them.

Since the PVM library for the front end is different from the one for the nodes, the executable for the front end must be different from the one compiled for the nodes. An SPMD program, for example, has only one source file, but the object code must be linked with the front end and node PVM libraries separately to produce two executables if it is to be started from the front end. An alternative would be a ``hostless"   SPMD program  , which could be spawned from the PVM console.

Table gif shows the native system calls used by the corresponding PVM functions on various platforms.

   
Table: Implementation of PVM system calls

The CM-5 is somewhat different from the Intel systems because it requires a special host process for each group of tasks spawned. This process enrolls in PVM and relays messages between pvmd and the node programs. This, needless to say, adds even more overhead to daemon-task communications.

Another restrictive feature of the CM-5 is that all nodes in the same partition are scheduled as a single unit. The partitions are normally configured by the system manager and each partition must contain at least 16 processors. User programs are run on the entire partition by default. Although it is possible to idle some of the processors in a partition, as PVM does when fewer nodes are called for, there is no easy way to harness the power of the idle processors. Thus, if PVM spawns two groups of tasks, they will time-share the partition, and any intergroup traffic must go through pvmd.

Additionally, CMMD has no support for multicasting. Thus, pvm_mcast() is implemented with a loop of CMMD_async_send().



next up previous contents index
Next: Shared-Memory Architectures Up: Multiprocessor Systems Previous: Multiprocessor Systems




Shared-Memory Architectures<A NAME=1767>  </A>



next up previous contents index
Next: Optimized Send and Up: Multiprocessor Systems Previous: Message-Passing Architectures

Shared-Memory Architectures  

The shared-memory architecture provides a very efficient medium for processes to exchange data. In our implementation, each task owns a shared buffer created with the shmget() system call. The task id is used as the ``key" to the shared segment. If the key is being used by another user, PVM will assign a different id to the task. A task communicates with other tasks by mapping their message buffers into its own memory space.

To enroll in PVM, the task first writes its Unix process id into pvmd's incoming box. It then looks for the assigned task id in pvmd's pid TID table.

The message buffer is divided into pages, each of which holds one fragment (Figure gif ). PVM's page size can be a multiple of the system page size. Each page has a header, which contains the lock and the reference count. The first few pages are used as the incoming box, while the rest of the pages hold outgoing fragments (Figure gif ). To send a message, the task first packs the message body into its buffer, then delivers the message header (which contains the sender's TID and the location of the data) to the incoming box of the intended recipient. When pvm_recv() is called, PVM checks the incoming box, locates and unpacks the messages (if any), and decreases the reference count so the space can be reused. If a task is not able to deliver the header directly because the receiving box is full, it will block until the other task is ready.

 
Figure:  Structure of a PVM page

 
Figure:  Structures of shared message buffers

Inevitably some overhead will be incurred when a message is packed into and unpacked from the buffer, as is the case with all other PVM implementations. If the buffer is full, then the data must first be copied into a temporary buffer in the process's private space and later transferred to the shared buffer.

Memory contention is usually not a problem. Each process has its own buffer, and each page of the buffer has its own lock. Only the page being written to is locked, and no process should be trying to read from this page because the header has not been sent out. Different processes can read from the same page without interfering with each other, so multicasting will be efficient (they do have to decrease the counter afterwards, resulting in some contention). The only time contention occurs is when two or more processes trying to deliver the message header to the same process at the same time. But since the header is very short (16 bytes), such contention should not cause any significant delay.

To minimize the possibility of page faults, PVM attempts to use only a small number of pages in the message buffer and recycle them as soon as they have been read by all intended recipients.

Once a task's buffer has been mapped, it will not be unmapped unless the system limits the number of mapped segments. This strategy saves time for any subsequent message exchanges with the same process.



next up previous contents index
Next: Optimized Send and Up: Multiprocessor Systems Previous: Message-Passing Architectures




Optimized Send and Receive on MPP



next up previous contents index
Next: Advanced Topics Up: Multiprocessor Systems Previous: Shared-Memory Architectures

Optimized Send and Receive on MPP

In the original implementation, all user messages are buffered by PVM. The user must pack the data into a PVM buffer before sending it, and unpack the data after it has been received into an internal buffer. This approach works well on systems with relatively high communication latency, such as the Ethernet. On MPP systems the packing and unpacking introduce substantial overhead. To solve this problem we added two new PVM functions, namely pvm_psend() and pvm_precv(). These functions combine packing/unpacking and sending/receiving into one single step. They could be mapped directly into the native message passing primitives available on the system, doing away with internal buffers altogether. On the Paragon these new functions give almost the same performance as the native ones.

Although the user can use both pvm_psend() and pvm_send() in the same program, on MPP the pvm_psend() must be matched with pvm_precv(), and pvm_send() with pvm_recv().




Other Packages



next up previous contents index
Next: The p4 System Up: Introduction Previous: PVM Overview

Other Packages

Several research groups have developed software packages that like PVM assist programmers in using distributed computing. Among the most well known efforts are P4   [1], Express   [], MPI   [], and Linda   []. Various other systems with similar capabilities are also in existence; a reasonably comprehensive listing may be found in [13].






Advanced Topics



next up previous contents index
Next: XPVM Up: PVM: Parallel Virtual Machine Previous: Optimized Send and

Advanced Topics






XPVM<A NAME=1860>  </A>



next up previous contents index
Next: Network View Up: Advanced Topics Previous: Advanced Topics

XPVM  

It is often useful and always reassuring to be able to see the present configuration of the virtual machine and the status of the hosts. It would be even more useful if the user could also see what his program is doing-what tasks are running, where messages are being sent, etc. The PVM GUI   called XPVM was developed to display this information, and more.

XPVM combines the capabilities of the PVM console, a performance monitor, and a call-level debugger into a single, easy-to-use X-Windows interface. XPVM is available from netlib   in the directory pvm3/xpvm. It is distributed as precompiled, ready-to-run executables for SUN4, RS6K, ALPHA, SUN4SOL2, HPPA, and SGI5. The XPVM source is also available for compiling on other machines.

XPVM is written entirely in C using the TCL/TK [8] toolkit   and runs just like another PVM task. If a user wishes to build XPVM from the source, he must first obtain and install the TCL/TK software on his system. TCL and TK were developed by John Ousterhout   at Berkeley and can be obtained by anonymous ftp to sprite.berkeley.edu The TCL and XPVM source distributions each contain a README file that describes the most up-to-date installation procedure for each package respectively.

Figure gif shows a snapshot of XPVM in use.

   
Figure: XPVM interface - snapshot during use

- figure not available -

Like the PVM console, XPVM will start PVM if PVM is not already running, or will attach to the local pvmd if it is. The console can take an optional hostfile argument whereas XPVM always reads $HOME/.xpvm_hosts as its hostfile. If this file does not exist, then XPVM just starts PVM on the local host (or attaches to the existing PVM). In typical use, the hostfile .xpvm_hosts contains a list of hosts prepended with an &. These hostnames then get added to the Hosts menu for addition and deletion from the virtual machine by clicking on them.

The top row of buttons perform console-like functions. The Hosts button displays a menu of hosts. Clicking on a host toggles whether it is added or deleted from the virtual machine. At the bottom of the menu is an option for adding a host not listed. The Tasks button brings up a menu whose most-used selection is spawn. Selecting spawn brings up a window where one can set the executable name, spawn flags, start position, number of copies to start, etc. By default, XPVM turns on tracing in all tasks (and their children) started inside XPVM. Clicking on Start in the spawn window starts the task, which will then appear in the space-time view. The Reset button has a menu for resetting PVM (i.e., kill all PVM tasks) or resetting different parts of XPVM. The Quit button exits XPVM while leaving PVM running. If XPVM is being used to collect trace information, the information will not be collected if XPVM is stopped. The Halt button is used when one is through with PVM. Clicking on this button kills all running PVM tasks, shuts down PVM cleanly, and exits the XPVM interface. The Help button brings up a menu of topics the user can get help about.

During startup, XPVM joins a group called xpvm. The intention is that tasks started outside the XPVM interface can get the TID of XPVM by doing tid = pvm_gettid( xpvm, 0 ). This TID would be needed if the user wanted to manually turn on tracing inside such a task and pass the events back to XPVM for display. The expected TraceCode for these events is 666.

While an application is running, XPVM collects and displays the information in real time. Although XPVM updates the views as fast as it can, there are cases when XPVM cannot keep up with the events and it falls behind the actual run time.

In the middle of the XPVM interface are tracefile controls. It is here that the user can specify a tracefile-a default tracefile in /tmp is initially displayed. There are buttons to specify whether the specified tracefile is to be played back or overwritten by a new run. XPVM saves trace events in a file using the ``self defining data format'' (SDDF)     described in Dan Reed's   Pablo [11]   trace playing package. The analysis of PVM traces can be carried out on any of a number of systems such as Pablo.

XPVM can play back its own SDDF files. The tape-player-like buttons allow the user to rewind the tracefile, stop the display at any point, and step through the execution. A time display specifies the number of seconds from when the trace display began.

The Views button allows the user to open or close any of several views presently supplied with XPVM. These views are described below.





next up previous contents index
Next: Network View Up: Advanced Topics Previous: Advanced Topics




Network View<A NAME=1890>  </A>



next up previous contents index
Next: Space-Time View Up: XPVM Previous: XPVM

Network View  

The Network view displays the present virtual machine configuration and the activity of the hosts. Each host is represented by an icon that includes the PVM_ARCH and host name inside the icon. In the initial release of XPVM, the icons are arranged arbitrarily on both sides of a bus network. In future releases the view will be extended to visualize network activity as well. At that time the user will be able to specify the network topology to display.

These icons are illuminated in different colors to indicate their status in executing PVM tasks. Green implies that at least one task on that host is busy executing useful work. Yellow indicates that no tasks are executing user computation, but at least one task is busy executing PVM system routines. When there are no tasks on a given host, its icon is left uncolored or white. The specific colors used in each case are user customizable.

The user can tell at a glance how well the virtual machine is being utilized by his PVM application. If all the hosts are green most of the time, then machine utilization is good. The Network view does not display activity from other users' PVM jobs or other processes that may be running on the hosts.

In future releases the view will allow the user to click on a multiprocessor icon and get information about the number of processors, number of PVM tasks, etc., that are running on the host.  




Space-Time View<A NAME=1893>  </A>



next up previous contents index
Next: Other Views Up: XPVM Previous: Network View

Space-Time View  

The Space-Time view displays the activities of individual PVM tasks that are running on the virtual machine. Listed on the left-hand side of the view are the executable names of the tasks, preceded by the host they are running on. The task list is sorted by host so that it is easy to see whether tasks are being clumped on one host. This list also shows the task-to-host mappings (which are not available in the Network view).

The Space-Time view combines three different displays. The first is like a Gantt chart  . Beside each listed task is a horizontal bar stretching out in the ``time'' direction. The color of this bar at any time indicates the state of the task. Green indicates that user computations are being executed. Yellow marks the times when the task is executing PVM routines. White indicates when a task is waiting for messages. The bar begins at the time when the task starts executing and ends when the task exits normally. The specific colors used in each case are user customizable.

The second display overlays the first display with the communication activity among tasks. When a message is sent between two tasks, a red line is drawn starting at the sending task's bar at the time the message is sent and ending at the receiving task's bar when the message is received. Note that this is not necessarily the time the message arrived, but rather the time the task returns from pvm_recv(). Visually, the patterns and slopes of the red lines combined with white ``waiting'' regions reveal a lot about the communication efficiency of an application.

The third display appears only when a user clicks on interesting features of the Spac