Representation of Video Signals
Tom Duff
October 13, 1994

1. Introduction

This note goes over the basics of signal representation in video systems. It starts at the very beginning, with phosphor dots on the CRT face, and discusses timing, sync, analog component video, analog composite video, digital component video and compressed digital video. The last section briefly compares the video quality that can be expected from each signal representation.

2. CRT Basics

Color Cathode Ray Tubes (CRTs) work by bombarding a pattern of red, green and blue (R, G and B) phosphor dots or stripes painted on the CRT's face with three beams of electrons (one each for red, green and blue). The phosphors emit light in direct proportion to the intensity of bombardment.

The electron beams scan across the CRT face in a pattern of parallel left-to-right lines. Successive lines are drawn below those that precede them until we reach the bottom of the screen. There are 15750 lines drawn per second, amounting to 60 complete scans of the screen. Every second scan is offset vertically by half the inter-line spacing, so that if nothing is moving we have double the effective vertical resolution at half the scan rate, while moving images are still effectively sampled at 60 Hz. This also accounts for 15750 not being a multiple of 60 -- each vertical scan, or `field', is 262.5 lines; the extra half-line is eaten up producing the offset between the two fields.

(Note: where I quote specific numbers, I'm referring to the US, Canadian & Japanese standard for ordinary television. In other parts of the world, the scan rate and small details of signal encoding are slightly different. HDTV, of course, requires much higher signal rates, but the principles are the same. Also, for extremely obscure reasons, Black & White (B&W) TV signals are transmitted at rates that differ from Color signals by about 0.1%. I've quoted the B&W numbers because they're rounder -- for example, the actual Color field rate is 59.940/sec.)

Any representation of a color TV signal must allow us to recover the beam position and the red, green and blue intensities as functions of time. The lowest-level representation used inside a television is just the five signals (Hdrive, Vdrive, R, G, B) that drive the monitor. Here Hdrive and Vdrive are the voltages applied to the horizontal and vertical deflection plates in the CRT.

3. Sync

Since Hdrive and Vdrive are periodic signals that increase linearly from one edge of the CRT face to the other at known rates, they can easily be recovered from appropriate sync pulses. This representation, five wires carrying (Vsync, Hsync, R, G, B), is used by many computer monitors.

The two sync signals are related in a simple way that allows them to be combined on a single wire. A short horizontal sync pulse is sent during each `horizontal retrace' interval (when the electron beam is moving from right to left to set up for the next scan line). During the `vertical interval', when the beam is moving from bottom to top to trace the next field, `equalizing pulses' are transmitted. Where horizontal sync pulses are quite short, the equalizing pulses have a long duty cycle. Also, their frequency is twice the horizontal rate, and there is an odd number of them. They serve two purposes. First, because of their width, they can be separated from the horizontal pulses by a low-pass filter to reconstruct vertical sync. Second, the horizontal oscillator must shift by 1/2 line during the vertical interval to effect the 1/2-line offset between the two fields of a frame; because the equalizing pulses come at twice the horizontal rate, they provide a timing reference at both phases.

This combined sync signal is called composite sync. It can be presented as a separate signal, giving a 4-wire representation: (Csync, R, G, B). Since the sync pulses always occur at times when the beam is invisibly moving to set up for the next scan-line or field, we can save a wire by adding negative-going composite sync to a positive-going color signal. When needed, they can be separated with a diode. This is the representation of B&W television signals used for broadcast. Computer monitors often accept `sync on green', that is, three signals of the form (R, G+Csync, B).
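To make the filtering idea concrete, here is a toy sync separator in Python. All the numbers (samples per line, pulse widths, the moving-average window) are made up for illustration, and a real receiver does the job with an analog integrator rather than a moving average, but the principle is the same:

    # Toy vertical sync recovery by low-pass filtering a pulse train.
    # Pulse widths and counts are illustrative, not real NTSC timings.

    H = 100                  # samples per scan line (arbitrary)
    NARROW, WIDE = 8, 40     # widths of horizontal and equalizing pulses

    def composite_sync(lines=20, vertical=(8, 12)):
        """Narrow pulses at the line rate; wide pulses at twice the
        line rate during the toy vertical interval."""
        s = []
        for n in range(lines):
            if vertical[0] <= n < vertical[1]:
                for _ in range(2):   # two wide pulses per line
                    s += [1] * WIDE + [0] * (H // 2 - WIDE)
            else:
                s += [1] * NARROW + [0] * (H - NARROW)
        return s

    def vsync(s, window=2 * NARROW):
        """Moving-average low-pass; only the long-duty-cycle pulses
        push the average over threshold."""
        out, acc = [], 0
        for i, x in enumerate(s):
            acc += x
            if i >= window:
                acc -= s[i - window]
            out.append(1 if acc > window // 2 else 0)
        return out

    s = composite_sync()
    v = vsync(s)
    print(min(i for i, x in enumerate(v) if x) // H)

Running it prints 8, the line at which the toy vertical interval begins; the narrow pulses never make it through the filter.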
4. Composite Video

For television transmission, the composite sync, red, green and blue signals must somehow be multiplexed onto a single wire. In the early 1950s, when the US National Television Standards Committee (NTSC) designed the encoding scheme, they worked under the constraint that the signal be receivable by existing B&W televisions. For the phosphors that NTSC designed for, the appropriate B&W level for a given RGB signal (its `luminance', usually called Y) is 0.299*R + 0.587*G + 0.114*B. NTSC encodes the rest of the color information into two `chroma' signals, I and Q, by a linear transformation:

    [ Y ]   [ 0.299  0.587  0.114 ] [ R ]
    [ I ] = [ 0.596 -0.274 -0.322 ] [ G ]
    [ Q ]   [ 0.212 -0.523  0.311 ] [ B ]

These coefficients were chosen to minimize the bandwidth required to encode the I and Q signals of typical television signals, bearing in mind that the human visual system is much more sensitive to luminance detail than to chroma.

The I and Q signals are used to modulate a 3.58 MHz subcarrier that is added to the luminance. The color subcarrier frequency is 455/2 times the horizontal frequency, and was chosen to be high enough to be hard to see on B&W televisions, and to fit into a mostly-unused part of the B&W signal's spectrum, whose energy is concentrated at multiples of the horizontal scan rate and away from its half-multiples. The actual chroma signal is the sum of two sinusoids, 90 degrees out of phase, one modulated by I and the other by Q. Demodulation is a complicated process that requires a phase-reference signal that is synchronized to a short 3.58 MHz `color burst' transmitted between each horizontal sync pulse and the subsequent start of active video.

In much of the world, a slightly different modulation scheme called PAL (for Phase Alternating Line -- you don't want to know why) is used. In France and the (former) Soviet Union a very different scheme called SECAM is standard. Its subcarrier has only a single modulator, either I or Q on alternate lines. The recovered modulator is preserved in a delay line for reuse on the next line.
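For concreteness, here is a sketch of the encoding arithmetic in Python. The function name is mine, R, G and B are assumed to lie in [0, 1], and everything about sync, blanking, the color burst and the band-limiting of I and Q is omitted:

    import math

    # Sketch of NTSC color encoding for one sample at time t (seconds).
    # Omits sync, blanking, burst, and the band-limiting of I and Q.

    FSC = 3.579545e6   # color subcarrier: 455/2 times the color line rate

    def ntsc_encode(r, g, b, t):
        y = 0.299 * r + 0.587 * g + 0.114 * b
        i = 0.596 * r - 0.274 * g - 0.322 * b
        q = 0.212 * r - 0.523 * g + 0.311 * b
        # quadrature modulation: two subcarriers 90 degrees apart
        chroma = (i * math.cos(2 * math.pi * FSC * t)
                  + q * math.sin(2 * math.pi * FSC * t))
        return y + chroma

Because the two sinusoids are orthogonal, a receiver that knows the burst phase can recover I and Q independently of one another.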
5. Digital Component Video

The main digital video encoding standard is CCIR Recommendation 601-2. As above, we encode three linear combinations of the RGB signal, in this case called Y, Cr and Cb. These are eight-bit quantities, computed roughly thus:

    Y  = 0.299*R + 0.587*G + 0.114*B    (just as in NTSC)
    Cr = R - Y
    Cb = B - Y

(This is a slight lie. In fact, Cr and Cb are carefully scaled to fully use the available 8-bit range, except that the values 0 and 255 are never transmitted, those values being used for framing. The details are unimportant.)

Y is sampled 720 times per scan line; Cr and Cb are low-pass filtered and sampled 360 times, half the Y rate. Every second transmitted sample is a Y value, with Cr and Cb values alternating in the remaining slots, like this:

    Y Cr Y Cb Y Cr Y Cb ...

CCIR 601-2 actually allows other sub-sampling rates for Cr and Cb, but this scheme, called 4:2:2 to indicate the ratio of the Y:Cr:Cb sample rates, is almost universal. (Other possibilities are 4:1:1 and 4:4:4.)

6. Video Compression

There are three important digital video compression standards: CCITT H.261 (also called p*64, because the bit-rates it supports are multiples of 64K bit/sec), MPEG-I and MPEG-II. MPEG stands for Moving Picture Experts Group. H.261 is designed for video telephony, MPEG-I for CD-ROM video playback. MPEG-II is an ambitious standard intended to cover the full range of applications; it is upward compatible with MPEG-I. Despite their varied audiences, the three standards are pretty similar. I'll describe H.261 and then mention the (small) differences in the MPEG standards.

First, RGB images are converted to (Y,Cr,Cb) form and the Cr and Cb components are sub-sampled by a factor of 2. Now 8x8 pixel blocks of each component are coded separately. Frames are divided into two categories, called I-frames (for intra-coded frames) and P-frames (predicted frames). I-frames are coded without reference to previous frames. The code for each block of a P-frame contains a motion vector specifying where to look in the previously decoded frame to find a predictor for it. CCITT recommends that an I-frame be sent once in every 132 frames.

Now, for each block, we subtract its predictor from it, unless we're doing an I-frame. Then we compute the Discrete Cosine Transform (DCT) of the block. Information is concentrated in the low-frequency components of the DCT, so when we quantize the coefficients (in a mildly complicated way; the details aren't important) many of them will go to zero. Then we run-length code the quantized coefficients to collapse runs of zeros, and Huffman code the result.

The MPEG standards differ mainly in having a third frame category -- blocks of B-frames (bidirectionally predicted frames) can be predicted from either I- or P-frames preceding and following them, whichever works better. H.261 can't do this because it is intended for on-line applications, where waiting for frames from the future before coding the current one would add intolerable delay.
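Here is a toy version of the transform-coding steps in Python. The quantizer is a single uniform step and the coefficients are scanned row by row rather than in the standards' zigzag order, so this is a sketch of the idea, not of either standard:

    import math

    # Toy transform coding of one 8x8 block of residuals (block minus
    # predictor, or raw samples in an I-frame).

    N = 8

    def dct2(block):
        """2-D DCT-II of an NxN block (direct form, O(N^4))."""
        def c(k):
            return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out = [[0.0] * N for _ in range(N)]
        for u in range(N):
            for v in range(N):
                s = 0.0
                for x in range(N):
                    for y in range(N):
                        s += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
                out[u][v] = c(u) * c(v) * s
        return out

    def quantize(coeffs, step=16):
        """Uniform quantizer; small coefficients go to zero."""
        return [[round(c / step) for c in row] for row in coeffs]

    def run_length(coeffs):
        """Collapse zero runs into (run, level) pairs for the Huffman
        coder (not shown); trailing zeros are simply dropped, where
        the standards send an end-of-block code."""
        pairs, run = [], 0
        for row in coeffs:
            for c in row:
                if c == 0:
                    run += 1
                else:
                    pairs.append((run, c))
                    run = 0
        return pairs

    # A smooth block: nearly all its energy lands in a few low frequencies.
    block = [[x + y for y in range(N)] for x in range(N)]
    print(run_length(quantize(dct2(block))))

Running it on the smooth ramp block reduces 64 samples to three (run, level) pairs, which is the whole point of the transform.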
7. Evaluation

Obviously, we will obtain the highest-quality video by picking a representation as close as possible to the input expected by the CRT. For digital video at ordinary scan rates, this probably means 8-bit samples for each of R, G and B at about 640 samples per scan line.

CCIR 601-2 is the standard digital video format used in television studios, where the concern for video quality is traditionally the highest. They do not regard the signal degradation due to chroma sub-sampling as important. Nor do I.

NTSC and its relatives PAL and SECAM are notable first as inspired engineering solutions to ridiculous political problems. Their quality is limited by the small bandwidth available to the chroma subcarrier. In most consumer TV systems, image quality is limited not by the intrinsic problems of NTSC encoding, but by cost-cutting in the receiver. NTSC can deliver much better signals than most of us are used to seeing.

H.261 and MPEG-I are both intended for low bit-rate applications. The lowest plausible H.261 bit-rate is 128K bit/sec (two ISDN channels). MPEG-I is highly tuned for 1.5M bit/sec -- the rate at which an hour of video can be stored on a CD-ROM. Neither of them produces what I would call adequate video -- they both operate at 120 scan-line resolution, compared to the native NTSC resolution of 486 scan-lines. MPEG-II is not targeted at a particular bit-rate, and can produce video of arbitrary quality, although the bit-rate may be prohibitive.
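Some back-of-the-envelope arithmetic puts these rates in perspective, using the CCIR 601-2 sampling numbers from section 5 and assuming 486 active lines at 30 frames/sec (the rounded B&W rate again):

    # Rough uncompressed rate of 4:2:2 CCIR 601-2 active video,
    # assuming 486 active lines at 30 frames/sec.
    samples_per_line = 720 + 360 + 360        # Y + Cr + Cb
    bits_per_sec = samples_per_line * 8 * 486 * 30
    print(bits_per_sec / 1e6)                 # about 168 Mbit/sec
    print(bits_per_sec / 1.5e6)               # about 112:1 for MPEG-I

Squeezing out a factor of more than a hundred goes a long way toward explaining why MPEG-I gives up so much resolution.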