Representation of Video Signals
Tom Duff
October 13, 1994

1. Introduction

This note goes over the basics of signal representation in video systems. It starts at the very beginning, with phosphor dots on the CRT face, and discusses timing, sync, analog component video, analog composite video, digital component video and compressed digital video. The last section briefly compares the video quality that can be expected from each signal representation.

2. CRT Basics

Color Cathode Ray Tubes (CRTs) work by bombarding a pattern of red, green and blue (R, G and B) phosphor dots or stripes painted on the CRT's face with three beams of electrons (one each for red, green and blue). The phosphors emit light in direct proportion to the intensity of bombardment.

The electron beams scan across the CRT face in a pattern of parallel left-to-right lines. Successive lines are drawn below those that precede them until we reach the bottom of the screen. There are 15750 lines drawn per second, amounting to 60 complete scans of the screen. Every second scan is offset vertically by half the inter-line spacing, so that if nothing is moving we have double the effective vertical resolution at half the scan rate, while moving images are still effectively sampled at 60 Hz. This also accounts for 15750 not being a multiple of 60 -- each vertical scan, or `field', is 262.5 lines; the extra half-line is eaten up producing the offset between the two fields.

(Note: where I quote specific numbers, I'm referring to the US, Canadian & Japanese standard for ordinary television. In other parts of the world, the scan rate and small details of signal encoding are slightly different. HDTV, of course, requires much higher signal rates, but the principles are the same. Also, for extremely obscure reasons, Black & White (B&W) TV signals are transmitted at rates that differ from Color signals by about 0.1%. I've quoted the B&W numbers because they're rounder -- for example, the actual Color field rate is 59.940/sec.)

Any representation of a color TV signal must allow us to recover the beam position and the red, green and blue intensities as functions of time. The lowest-level representation used inside a television is just the five signals (Hdrive, Vdrive, R, G, B) that drive the monitor. Here Hdrive and Vdrive are the voltages applied to the horizontal and vertical deflection plates in the CRT.

3. Sync

Since Hdrive and Vdrive are periodic signals that increase linearly from one edge of the CRT face to the other at known rates, they can easily be recovered from appropriate sync pulses. This representation, five wires carrying (Vsync, Hsync, R, G, B), is used by many computer monitors.

The two sync signals are related in a simple way that allows them to be combined on a single wire. A short horizontal sync pulse is sent during each `horizontal retrace' interval (when the electron beam is moving from right to left to set up for the next scan line). During the `vertical interval', when the beam is moving from bottom to top to trace the next field, `equalizing pulses' are transmitted. Where horizontal sync pulses are quite short, the equalizing pulses have a long duty cycle. Also, their frequency is twice the horizontal rate, and there is an odd number of them. They serve two purposes. First, because of their width, they can be separated from the horizontal pulses by a low-pass filter to reconstruct vertical sync. Second, the horizontal oscillator must shift by 1/2 line during the vertical interval to effect the 1/2-line offset between the two fields of a frame; because the equalizing pulses come at twice the horizontal rate, they provide a timing reference at both phases.

This combined sync signal is called composite sync. It can be presented as a separate signal, giving a 4-wire representation: (Csync, R, G, B). Since the sync pulses always occur at times when the beam is invisibly moving to set up for the next scan-line or field, we can save a wire by adding negative-going composite sync to a positive-going color signal. When needed, they can be separated with a diode. This is the representation of B&W television signals used for broadcast. Computer monitors often accept `sync on green', that is, three signals of the form (R, G+Csync, B).
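To make the filtering idea concrete, here is a toy sync separator in Python. All the numbers (samples per line, pulse widths, the moving-average window) are made up for illustration, and a real receiver does the job with an analog integrator rather than a moving average, but the principle is the same:

    # Toy vertical sync recovery by low-pass filtering a pulse train.
    # Pulse widths and counts are illustrative, not real NTSC timings.

    H = 100                  # samples per scan line (arbitrary)
    NARROW, WIDE = 8, 40     # widths of horizontal and equalizing pulses

    def composite_sync(lines=20, vertical=(8, 12)):
        """Narrow pulses at the line rate; wide pulses at twice the
        line rate during the toy vertical interval."""
        s = []
        for n in range(lines):
            if vertical[0] <= n < vertical[1]:
                for _ in range(2):   # two wide pulses per line
                    s += [1] * WIDE + [0] * (H // 2 - WIDE)
            else:
                s += [1] * NARROW + [0] * (H - NARROW)
        return s

    def vsync(s, window=2 * NARROW):
        """Moving-average low-pass; only the long-duty-cycle pulses
        push the average over threshold."""
        out, acc = [], 0
        for i, x in enumerate(s):
            acc += x
            if i >= window:
                acc -= s[i - window]
            out.append(1 if acc > window // 2 else 0)
        return out

    s = composite_sync()
    v = vsync(s)
    print(min(i for i, x in enumerate(v) if x) // H)

Running it prints 8, the line at which the toy vertical interval begins; the narrow pulses never make it through the filter.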
4. Composite Video

For television transmission, the composite sync, red, green and blue signals must somehow be multiplexed onto a single wire. In the early 1950s, when the US National Television Standards Committee (NTSC) designed the encoding scheme, they worked under the constraint that the signal be receivable by existing B&W televisions. For the phosphors that NTSC designed for, the appropriate B&W level for a given RGB signal (its `luminance', usually called Y) is 0.299*R + 0.587*G + 0.114*B. NTSC encodes the rest of the color information into two `chroma' signals, I and Q, by a linear transformation:

    [ Y ]   [ 0.299  0.587  0.114 ] [ R ]
    [ I ] = [ 0.596 -0.274 -0.322 ] [ G ]
    [ Q ]   [ 0.212 -0.523  0.311 ] [ B ]

These coefficients were chosen to minimize the bandwidth required to encode the I and Q signals of typical television signals, bearing in mind that the human visual system is much more sensitive to luminance detail than to chroma.

The I and Q signals are used to modulate a 3.58 MHz subcarrier that is added to the luminance. The color subcarrier frequency is 455/2 times the horizontal frequency, and was chosen to be high enough to be hard to see on B&W televisions, and to fit into a mostly-unused part of the B&W signal's spectrum, whose energy is concentrated at multiples of the horizontal scan rate and away from its half-multiples. The actual chroma signal is the sum of two sinusoids, 90 degrees out of phase, one modulated by I and the other by Q. Demodulation is a complicated process that requires a phase-reference signal that is synchronized to a short 3.58 MHz `color burst' transmitted between each horizontal sync pulse and the subsequent start of active video.

In much of the world, a slightly different modulation scheme called PAL (for Phase Alternating Line -- you don't want to know why) is used. In France and the (former) Soviet Union a very different scheme called SECAM is standard. Its subcarrier has only a single modulator, either I or Q on alternate lines. The recovered modulator is preserved in a delay line for reuse on the next line.
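For concreteness, here is a sketch of the encoding arithmetic in Python. The function name is mine, R, G and B are assumed to lie in [0, 1], and everything about sync, blanking, the color burst and the band-limiting of I and Q is omitted:

    import math

    # Sketch of NTSC color encoding for one sample at time t (seconds).
    # Omits sync, blanking, burst, and the band-limiting of I and Q.

    FSC = 3.579545e6   # color subcarrier: 455/2 times the color line rate

    def ntsc_encode(r, g, b, t):
        y = 0.299 * r + 0.587 * g + 0.114 * b
        i = 0.596 * r - 0.274 * g - 0.322 * b
        q = 0.212 * r - 0.523 * g + 0.311 * b
        # quadrature modulation: two subcarriers 90 degrees apart
        chroma = (i * math.cos(2 * math.pi * FSC * t)
                  + q * math.sin(2 * math.pi * FSC * t))
        return y + chroma

Because the two sinusoids are orthogonal, a receiver that knows the burst phase can recover I and Q independently of one another.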
5. Digital Component Video

The main digital video encoding standard is CCIR Recommendation 601-2. As above, we encode three linear combinations of the RGB signal, in this case called Y, Cr and Cb. These are eight-bit quantities, computed roughly thus:

    Y  = 0.299*R + 0.587*G + 0.114*B    (just as in NTSC)
    Cr = R - Y
    Cb = B - Y

(This is a slight lie. In fact, Cr and Cb are carefully scaled to fully use the available 8-bit range, except that the values 0 and 255 are never transmitted, those values being used for framing. The details are unimportant.)

Y is sampled 720 times per scan line; Cr and Cb are low-pass filtered and sampled 360 times, half the Y rate. Every second transmitted sample is a Y value, with Cr and Cb values alternating in the remaining slots, like this:

    Y Cr Y Cb Y Cr Y Cb ...

CCIR 601-2 actually allows other sub-sampling rates for Cr and Cb, but this scheme, called 4:2:2 to indicate the ratio of the Y:Cr:Cb sample rates, is almost universal. (Other possibilities are 4:1:1 and 4:4:4.)

6. Video Compression

There are three important digital video compression standards: CCITT H.261 (also called p*64, because the bit-rates it supports are multiples of 64K bit/sec), MPEG-I and MPEG-II. MPEG stands for Moving Picture Experts Group. H.261 is designed for video telephony, MPEG-I for CD-ROM video playback. MPEG-II is an ambitious standard intended to cover the full range of applications; it is upward compatible with MPEG-I. Despite their varied audiences, the three standards are pretty similar. I'll describe H.261 and then mention the (small) differences in the MPEG standards.

First, RGB images are converted to (Y,Cr,Cb) form and the Cr and Cb components are sub-sampled by a factor of 2. Now 8x8 pixel blocks of each component are coded separately. Frames are divided into two categories, called I-frames (for intra-coded frames) and P-frames (predicted frames). I-frames are coded without reference to previous frames. The code for each block of a P-frame contains a motion vector specifying where to look in the previously decoded frame to find a predictor for it. CCITT recommends that an I-frame be sent once in every 132 frames.

Now, for each block, we subtract its predictor from it, unless we're doing an I-frame. Then we compute the Discrete Cosine Transform (DCT) of the block. Information is concentrated in the low-frequency components of the DCT, so when we quantize the coefficients (in a mildly complicated way; the details aren't important) many of them will go to zero. Then we run-length code the quantized coefficients to collapse runs of zeros, and Huffman code the result.

The MPEG standards differ mainly in having a third frame category -- blocks of B-frames (bidirectionally predicted frames) can be predicted from either I- or P-frames preceding and following them, whichever works better. H.261 can't do this because it is intended for on-line applications, where waiting for frames from the future before coding the current one would add intolerable delay.
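Here is a toy version of the transform-coding steps in Python. The quantizer is a single uniform step and the coefficients are scanned row by row rather than in the standards' zigzag order, so this is a sketch of the idea, not of either standard:

    import math

    # Toy transform coding of one 8x8 block of residuals (block minus
    # predictor, or raw samples in an I-frame).

    N = 8

    def dct2(block):
        """2-D DCT-II of an NxN block (direct form, O(N^4))."""
        def c(k):
            return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out = [[0.0] * N for _ in range(N)]
        for u in range(N):
            for v in range(N):
                s = 0.0
                for x in range(N):
                    for y in range(N):
                        s += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
                out[u][v] = c(u) * c(v) * s
        return out

    def quantize(coeffs, step=16):
        """Uniform quantizer; small coefficients go to zero."""
        return [[round(c / step) for c in row] for row in coeffs]

    def run_length(coeffs):
        """Collapse zero runs into (run, level) pairs for the Huffman
        coder (not shown); trailing zeros are simply dropped, where
        the standards send an end-of-block code."""
        pairs, run = [], 0
        for row in coeffs:
            for c in row:
                if c == 0:
                    run += 1
                else:
                    pairs.append((run, c))
                    run = 0
        return pairs

    # A smooth block: nearly all its energy lands in a few low frequencies.
    block = [[x + y for y in range(N)] for x in range(N)]
    print(run_length(quantize(dct2(block))))

Running it on the smooth ramp block reduces 64 samples to three (run, level) pairs, which is the whole point of the transform.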
7. Evaluation

Obviously, we will obtain the highest-quality video by picking a representation as close as possible to the input expected by the CRT. For digital video at ordinary scan rates, this probably means 8-bit samples for each of R, G and B at about 640 samples per scan line.

CCIR 601-2 is the standard digital video format used in television studios, where the concern for video quality is traditionally the highest. They do not regard the signal degradation due to chroma sub-sampling as important. Nor do I.

NTSC and its relatives PAL and SECAM are notable first as inspired engineering solutions to ridiculous political problems. Their quality is limited by the small bandwidth available to the chroma subcarrier. In most consumer TV systems, image quality is limited not by the intrinsic problems of NTSC encoding, but by cost-cutting in the receiver. NTSC can deliver much better signals than most of us are used to seeing.

H.261 and MPEG-I are both intended for low bit-rate applications. The lowest plausible H.261 bit-rate is 128K bit/sec (two ISDN channels). MPEG-I is highly tuned for 1.5M bit/sec -- the rate at which an hour of video can be stored on a CD-ROM. Neither of them produces what I would call adequate video -- they both operate at 120 scan-line resolution, compared to the native NTSC resolution of 486 scan-lines. MPEG-II is not targeted at a particular bit-rate, and can produce video of arbitrary quality, although the bit-rate may be prohibitive.
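Some back-of-the-envelope arithmetic puts these rates in perspective, using the CCIR 601-2 sampling numbers from section 5 and assuming 486 active lines at 30 frames/sec (the rounded B&W rate again):

    # Rough uncompressed rate of 4:2:2 CCIR 601-2 active video,
    # assuming 486 active lines at 30 frames/sec.
    samples_per_line = 720 + 360 + 360        # Y + Cr + Cb
    bits_per_sec = samples_per_line * 8 * 486 * 30
    print(bits_per_sec / 1e6)                 # about 168 Mbit/sec
    print(bits_per_sec / 1.5e6)               # about 112:1 for MPEG-I

Squeezing out a factor of more than a hundred goes a long way toward explaining why MPEG-I gives up so much resolution.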