Pvmd-Pvmd




PVM daemons communicate with one another through UDP sockets. UDP is an unreliable delivery service which can lose, duplicate or reorder packets, so an acknowledgment and retry mechanism is used. UDP also limits packet length, so PVM fragments long messages.

We considered TCP, but three factors make it inappropriate. First is scalability. In a virtual machine of N hosts, each pvmd must have connections to the other N - 1. Each open TCP connection consumes a file descriptor in the pvmd, and some operating systems limit the number of open files to as few as 32, whereas a single UDP socket can communicate with any number of remote UDP sockets. Second is overhead. N pvmds need N(N - 1)/2 TCP connections, which would be expensive to set up, whereas the PVM/UDP protocol is initialized with no communication. Third is fault tolerance. The communication system detects when foreign pvmds have crashed or the network has gone down, so we need to set timeouts in the protocol layer. The TCP keepalive option might work, but it's not always possible to get adequate control over the parameters.
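As a rough illustration of the scalability and overhead arguments, the helpers below compute the full-mesh TCP connection count and the number of descriptors each pvmd would hold open. The function names are hypothetical and are not part of PVM.

```c
#include <assert.h>

/* TCP connections needed for a full mesh of n pvmds: n(n - 1)/2. */
static unsigned long tcp_mesh_connections(unsigned long n)
{
    assert(n >= 1);
    return n * (n - 1) / 2;
}

/* Descriptors one pvmd would hold open under TCP: one per peer. */
static unsigned long tcp_fds_per_pvmd(unsigned long n)
{
    assert(n >= 1);
    return n - 1;
}
```

For a 33-host virtual machine, tcp_fds_per_pvmd(33) gives 32 and tcp_mesh_connections(33) gives 528, so a system limited to 32 open files would be exhausted by peer connections alone.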

The packet header is shown in the figure below. Multibyte values are sent in (Internet) network byte order (most significant byte first).

Figure: Pvmd-pvmd packet header

The source and destination fields hold the TIDs of the true source and final destination of the packet, regardless of the route it takes. Sequence and acknowledgment numbers start at 1 and increment to 65535, then wrap to zero.
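The following sketch shows one way to serialize such a header in network byte order and to advance a 16-bit sequence number. The field widths, the 16-byte layout, the padding, and all names (pkt_hdr, hdr_pack, seq_next) are assumptions made for illustration; the actual layout is the one given in the header figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 16  /* assumed header size in bytes */

/* Hypothetical in-memory form of the header fields named in the text. */
struct pkt_hdr {
    uint32_t dst;    /* TID of final destination */
    uint32_t src;    /* TID of true source */
    uint16_t seq;    /* sequence number */
    uint16_t ack;    /* acknowledgment number */
    uint8_t  flags;  /* SOM, EOM, DAT, ACK, FIN bits */
};

/* Store a 16-bit value most significant byte first. */
static void put16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xff);
}

/* Store a 32-bit value most significant byte first. */
static void put32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)((v >> 16) & 0xff);
    p[2] = (unsigned char)((v >> 8) & 0xff);
    p[3] = (unsigned char)(v & 0xff);
}

/* Write the header into buf using the assumed layout. */
static void hdr_pack(const struct pkt_hdr *h, unsigned char buf[HDR_LEN])
{
    assert(h != NULL && buf != NULL);
    put32(buf + 0, h->dst);
    put32(buf + 4, h->src);
    put16(buf + 8, h->seq);
    put16(buf + 10, h->ack);
    buf[12] = h->flags;
    buf[13] = buf[14] = buf[15] = 0;  /* padding (assumed) */
}

/* Next sequence number: 65535 wraps to zero by 16-bit arithmetic. */
static uint16_t seq_next(uint16_t s)
{
    return (uint16_t)(s + 1);
}
```

Writing the bytes by hand avoids any dependence on host byte order, and the unsigned 16-bit type gives the wrap from 65535 to zero without a special case.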

SOM (EOM) - Set for the first (last) fragment of a message. Intervening fragments have both bits cleared. They are used by tasks and pvmds to delimit message boundaries.

DAT - If set, data is contained in the packet, and the sequence number is valid. The packet, even if zero length, must be delivered.

ACK - If set, the acknowledgment number field is valid. This bit may be combined with the DAT bit to piggyback an acknowledgment on a data packet.

FIN - The pvmd is closing down the connection. A packet with the FIN bit set (and DAT cleared) begins an orderly shutdown. When an acknowledgment arrives (ACK bit set and ack number matching the sequence number from the FIN packet), a final packet is sent with both FIN and ACK set. If the pvmd panics (for example, on a trapped segmentation violation), it tries to send a packet with FIN and ACK set to every peer before it exits.
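The flag descriptions above can be illustrated with a small fragmentation sketch that decides how many packets a message needs and which bits each fragment carries. The bit values and the names FF_SOM, frag_count, and frag_flags are hypothetical; the sketch also assumes that a message fitting in one packet carries both SOM and EOM, since that fragment is both first and last.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit assignments for the header flags. */
enum {
    FF_SOM = 1,   /* first fragment of a message */
    FF_EOM = 2,   /* last fragment of a message */
    FF_DAT = 4,   /* packet carries data, sequence number valid */
    FF_FIN = 8,   /* connection is closing */
    FF_ACK = 16   /* acknowledgment number valid */
};

/* Fragments needed for len bytes with at most payload bytes per packet.
 * A zero-length message still needs one packet, which must be delivered. */
static size_t frag_count(size_t len, size_t payload)
{
    assert(payload > 0);
    return len == 0 ? 1 : (len + payload - 1) / payload;
}

/* Flags for fragment i (0-based) of a message split into n fragments. */
static unsigned frag_flags(size_t i, size_t n)
{
    unsigned f = FF_DAT;

    assert(i < n);
    if (i == 0)
        f |= FF_SOM;
    if (i + 1 == n)
        f |= FF_EOM;
    return f;
}
```

With a 100-byte payload limit, a 250-byte message yields three fragments: the first has DAT and SOM, the middle one has DAT only, and the last has DAT and EOM.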

The state of a connection to another pvmd is kept in its host table entry. The protocol driver uses the following fields of struct hostd:

Field          Meaning
hd_hostpart    TID of pvmd
hd_mtu         Max UDP packet length to host
hd_sad         IP address and UDP port number
hd_rxseq       Expected next packet number from host
hd_txseq       Next packet number to send to host
hd_txq         Queue of packets to send
hd_opq         Queue of packets sent, awaiting ack
hd_nop         Number of packets in hd_opq 
hd_rxq         List of out-of-order received packets
hd_rxm         Buffer for message reassembly
hd_rtt         Estimated smoothed round-trip time

The figure below shows the host send and outstanding-packet queues. Packets waiting to be sent to a host are queued in the FIFO hd_txq. Packets are appended to this queue by the routing code, described in a separate section. No receive queues are used; incoming packets are passed immediately through to other send queues or reassembled into messages (or discarded). Incoming messages are delivered to a pvmd entry point, also described in a separate section.

Figure: Host descriptors with send queues

The protocol allows multiple outstanding packets to improve performance over high-latency networks, so two more queues are required. hd_opq holds a per-host list of unacknowledged packets, and global opq lists all unacknowledged packets, ordered by time to retransmit. hd_rxq holds packets received out of sequence until they can be accepted.
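A receiver has to decide, for each arriving data packet, whether it is the one expected in hd_rxseq, a later one to hold in hd_rxq, or an old duplicate to discard. The sketch below makes that decision with a half-window comparison on wrapping 16-bit sequence numbers. The comparison rule and the names rx_action and rx_classify are assumptions of this sketch, not taken from the pvmd source.

```c
#include <stdint.h>

enum rx_action {
    RX_DELIVER,  /* in sequence: accept now */
    RX_HOLD,     /* ahead of sequence: keep until the gap fills */
    RX_DROP      /* behind sequence: duplicate, discard */
};

/* Classify an incoming sequence number against the expected one.
 * The difference is taken modulo 65536, so wraparound is handled:
 * distances below 32768 count as "ahead", the rest as "behind". */
static enum rx_action rx_classify(uint16_t expected, uint16_t seq)
{
    uint16_t d = (uint16_t)(seq - expected);

    if (d == 0)
        return RX_DELIVER;
    return d < 0x8000 ? RX_HOLD : RX_DROP;
}
```

For example, with 65535 expected, an arriving packet numbered 0 is one ahead and would be held, while with 0 expected, a packet numbered 65535 is one behind and would be dropped.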

The difference in time between sending a packet and getting the acknowledgment is used to estimate the round-trip time to the foreign host. Each update is filtered into the estimate according to the formula hd_rtt(n) = 0.75 * hd_rtt(n-1) + 0.25 * Δt, where Δt is the newly measured round-trip time.
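A minimal sketch of such a filter, written as an exponentially weighted average with a weight of 0.75 on the previous estimate and 0.25 on the new sample. The weights, the microsecond representation, and the name rtt_update are assumptions of this sketch.

```c
#include <stdint.h>

/* Fold one measured round-trip sample into the smoothed estimate.
 * Both values are in microseconds. Integer form of
 * 0.75 * estimate + 0.25 * sample. */
static uint64_t rtt_update(uint64_t est_us, uint64_t sample_us)
{
    return (3 * est_us + sample_us) / 4;
}
```

Starting from an estimate of one second, a single two-second sample moves the estimate to 1.25 seconds, so one slow acknowledgment does not dominate the timer.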

When the acknowledgment for a packet arrives, the packet is removed from hd_opq and opq and discarded. Each packet has a retry timer and count, and each is resent until acknowledged by the foreign pvmd. The timer starts at 3 * hd_rtt, and doubles for each retry up to 18 seconds. hd_rtt is limited to nine seconds, and backoff is bounded in order to allow at least 10 packets to be sent to a host before giving up. After three minutes of resending with no acknowledgment, a packet expires.
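The retry schedule described above can be sketched as follows. The constants come from the text (nine-second RTT limit, 18-second timer ceiling, three-minute expiry); the function names, the microsecond units, and the way elapsed time is accumulated are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RTT_MAX_US     9000000ULL   /* hd_rtt limited to nine seconds */
#define RETRY_MAX_US  18000000ULL   /* timer doubles up to 18 seconds */
#define EXPIRE_US    180000000ULL   /* packet expires after three minutes */

/* Timer value for a packet that has already been resent `retries` times. */
static uint64_t retry_timeout_us(uint64_t rtt_us, unsigned retries)
{
    uint64_t t;

    if (rtt_us > RTT_MAX_US)
        rtt_us = RTT_MAX_US;
    t = 3 * rtt_us;
    while (retries-- > 0 && t < RETRY_MAX_US)
        t *= 2;
    return t > RETRY_MAX_US ? RETRY_MAX_US : t;
}

/* Number of transmissions of one packet before it expires,
 * assuming no acknowledgment ever arrives. */
static unsigned sends_before_expiry(uint64_t rtt_us)
{
    uint64_t elapsed = 0;
    unsigned n = 0;

    assert(rtt_us > 0);  /* a zero timer would never advance */
    while (elapsed < EXPIRE_US) {
        elapsed += retry_timeout_us(rtt_us, n);
        n++;
    }
    return n;
}
```

Under these assumptions the slowest case, an RTT estimate at the nine-second limit, gives a constant 18-second timer and sends_before_expiry(9000000) returns 10, which agrees with the stated bound of at least 10 packets before giving up.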

If a packet expires as a result of timeout, the foreign pvmd is assumed to be down or unreachable, and the local pvmd gives up on it, calling hostfailentry().
