In order to understand the notion of digitizing information, it must first be understood that everything in nature, including the sounds and images one wishes to record or transmit, is originally analog. The second thing to be understood is that analog works very well. In fact, a first-generation analog recording can be a better representation of the original images than a first-generation digital recording. This is because digital is a coded approximation of analog. With enough bandwidth, a first-generation analog videotape recorder (VTR) can record the more "perfect" copy.
Digital is a binary language represented by zeros (an "off" state) and ones (an "on" state), so the signal either exists ("on") or does not exist ("off"). Even with low signal power, if the transmitted digital signal is higher than the background noise level, a perfect picture and sound can be obtained: "on" is "on" no matter what the signal strength.
Digital uses its own language that is based on the terms "bits" and "bytes." Bit is short for binary digit and is the smallest data unit in a digital system. A bit is either a single 1 or a single 0. A byte consists of a series of bits. The most common length for a byte "word" is 8 bits, although there can be other lengths (e.g., 1, 2, 10, 16, 24, 32).
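In an 8-bit system, there are 256 discrete values, ranging from 0 to 255. The mathematics are simple because the number of discrete values is equal to the number 2 (as in binary) raised to the power of the number of bits. In this case, 2 raised to the power of 8 equals 256. The two extreme bytes in the 8-bit system are 00000000 and 11111111, which are calculated as follows:
00000000 = (0 × 128) + (0 × 64) + (0 × 32) + (0 × 16) + (0 × 8) + (0 × 4) + (0 × 2) + (0 × 1) = 0
and
11111111 = (1 × 128) + (1 × 64) + (1 × 32) + (1 × 16) + (1 × 8) + (1 × 4) + (1 × 2) + (1 × 1) = 255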
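An example of an 8-bit value between the two extremes is 10100011, which is calculated as follows:
10100011 = (1 × 128) + (0 × 64) + (1 × 32) + (0 × 16) + (0 × 8) + (0 × 4) + (1 × 2) + (1 × 1) = 163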
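The place-value arithmetic described above can be checked with a few lines of Python. This is only an illustrative sketch; the function names `levels` and `byte_value` are invented for this example and do not come from any standard.

```python
def levels(bits: int) -> int:
    """Number of discrete values available with the given number of bits."""
    return 2 ** bits


def byte_value(bit_string: str) -> int:
    """Add up the place values (1, 2, 4, ... from the right) of every 1 bit."""
    total = 0
    for position, bit in enumerate(reversed(bit_string)):
        total += int(bit) * (2 ** position)
    return total


print(levels(8))               # 256
print(byte_value("00000000"))  # 0
print(byte_value("11111111"))  # 255
print(byte_value("10100011"))  # 163
```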
The more bits in the byte, the more distinct the values. For example, a gray scale can be represented by 1 bit. This would give the scale two values (2 raised to the power of 1): 0 or 1. Therefore, the gray scale would consist of white and black. A 2-bit gray scale has four values (2 raised to the power of 2): 0, 1, 2, and 3. In this case, 0 = 0 percent white (black), 1 = 33 percent white, 2 = 67 percent white, and 3 = 100 percent white. As the number of bits is increased, a more accurate gray scale is obtained. For example, a 10-bit system has 1,024 discrete values (2 raised to the power of 10), providing a more detailed gray scale. With each additional bit, the number of discrete values is doubled, as is the number of values for the gray scale.
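As a rough illustration of the gray-scale example, the sketch below lists the percentage of white for every value at a given bit depth. The helper name `gray_scale` is hypothetical, and the even spacing between black and white is the assumption used in the 2-bit example above.

```python
def gray_scale(bits: int) -> list[int]:
    """Percent white (rounded) for each value, from 0 (black) to the top value (white)."""
    top = 2 ** bits - 1  # highest value, e.g. 3 for 2 bits
    return [round(100 * value / top) for value in range(top + 1)]


print(gray_scale(1))        # [0, 100]
print(gray_scale(2))        # [0, 33, 67, 100]
print(len(gray_scale(10)))  # 1024
```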
Digital Video and Audio
In digital video, black is not at value 0 and white is neither at value 255 for 8-bit video nor 1,023 for 10-bit video. To add some buffer space and to allow for "superblack" (which is at 0 IRE while regular black is at 7.5 IRE), black is at value 16 while white is at value 235 for 8-bit video. For 10-bit video, black is at value 64 while white is at value 940.
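The black and white code values quoted above can be turned into a small mapping from a normalized picture level to a digital value. This is a sketch under stated assumptions: the linear mapping, the 0.0 to 1.0 input scale, and the names `VIDEO_RANGE` and `to_code` are illustrative choices, not part of any standard's text.

```python
# Black and white code values for each bit depth, as given in the text.
VIDEO_RANGE = {8: (16, 235), 10: (64, 940)}


def to_code(level: float, bits: int) -> int:
    """Map a normalized level (0.0 = black, 1.0 = white) to a video code value."""
    black, white = VIDEO_RANGE[bits]
    return round(black + level * (white - black))


print(to_code(0.0, 8), to_code(1.0, 8))    # 16 235
print(to_code(0.0, 10), to_code(1.0, 10))  # 64 940
```

Values below black or above white remain available as buffer space, which is the purpose of the offset described in the text.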
While digital is an approximation of the analog world—the actual analog value is assigned to its closest digital value—human perception has a hard time recognizing the fact that it is being cheated. While a few expert observers might be able to tell that something did not look right in 8-bit video, 10-bit video looks perfect to the human eye. Digitizing audio, however, is a different story. Human ears are not as forgiving as human eyes; in audio, most people require at least 16-bit resolution, while some experts argue that 20-bit, or ultimately even 24-bit, technology needs to become standard before recordings will be able to match the sensitivity of human hearing.
To transform a signal from analog to digital, the analog signal must go through the processes of sampling and quantization. The better the sampling and quantization, the better the digital image will represent the analog image.
Sampling refers to how often a device (such as an analog-to-digital converter) samples (or looks at) an analog signal. The sampling rate is usually given in a figure such as 48 kHz (48,000 samples per second) for audio and 13.5 MHz (13.5 million samples per second) for video. For television pictures, 8-bit or 10-bit sampling systems are normally used; for audio, 16-bit or 20-bit sampling systems are common, though 24-bit sampling systems are also used. The International Telecommunication Union Radiocommunication Sector (ITU-R) 601 standard defines the sampling of video components based on 13.5 MHz, and the Audio Engineering Society/European Broadcasting Union (AES/EBU) defines sampling based on 44.1 and 48 kHz for audio.
Quantization, which involves assigning each sample to one of a limited set of values, usually occurs after the signal has been sampled. The number of bits per sample defines how many levels the analog signal must be forced into to produce a digital approximation of the original signal. As noted earlier, a 10-bit digital signal has more levels (thus higher resolution) than an 8-bit signal.
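A minimal sketch of the two steps follows, assuming a 1 kHz test tone, a 48 kHz sampling rate, and a signal range of -1.0 to +1.0. All function names are hypothetical, and the uniform rounding used here is only one simple way to quantize.

```python
import math


def tone(t: float) -> float:
    """Stand-in for the analog signal: a 1 kHz sine wave (an assumed test input)."""
    return math.sin(2 * math.pi * 1000 * t)


def sample(signal, sample_rate_hz: float, duration_s: float) -> list[float]:
    """Sampling: look at the signal at evenly spaced instants."""
    count = round(sample_rate_hz * duration_s)
    return [signal(n / sample_rate_hz) for n in range(count)]


def quantize(value: float, bits: int) -> int:
    """Quantization: assign a value in [-1, 1] to the closest of 2**bits levels."""
    top = 2 ** bits - 1
    code = round((value + 1.0) / 2.0 * top)
    return max(0, min(top, code))


def dequantize(code: int, bits: int) -> float:
    """Convert a code back to the signal level it stands for."""
    top = 2 ** bits - 1
    return code / top * 2.0 - 1.0


def max_error(samples: list[float], bits: int) -> float:
    """Largest difference between a sample and its digital approximation."""
    return max(abs(s - dequantize(quantize(s, bits), bits)) for s in samples)


samples = sample(tone, 48_000, 0.001)  # one millisecond of audio
print(len(samples))                    # 48
print(max_error(samples, 8), max_error(samples, 10))
```

With this scheme the error can never exceed half the spacing between levels, so each added bit halves the worst-case error.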
Errors at this stage of digitizing (called quantization errors) occur because quantizing a signal only results in a digital approximation of the original signal. Errors can also occur because of loss of signal or unintended changes to a signal, such as when a bit changes its state from "off" to "on" or from "on" to "off." Just how large the error will be is determined by when that change occurred and how long the change lasted. An error can last briefly enough not to even affect one bit, or it can last long enough to affect a number of bits, entire bytes, multiple bytes, or even seconds of video and audio.
In the 8-bit byte 10100011 (value 163), for example, the 1 on the far right represents the value 1. It is the least significant bit (LSB). If there is an error that changes this bit from 1 ("on") to 0 ("off"), the value of the byte changes from 163 to 162, a very minor difference. The error grows as problems occur with bits farther toward the left of the byte word.
In contrast, the 1 on the left that represents the value 128 is called the most significant bit (MSB). An error that changes this bit from 1 (on) to 0 (off) changes the value of the byte from 163 to 35—a very major difference. If this represented the gray scale, the sample has changed from 64-percent white to only 14-percent white.
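The effect of a single-bit error can be reproduced directly. The sketch below flips one bit of the byte 10100011 at a time; `flip_bit` and `percent_white` are names made up for this illustration, and the percentage assumes an 8-bit gray scale running from 0 to 255.

```python
def flip_bit(value: int, position: int) -> int:
    """Invert the bit at the given position (0 = least significant bit)."""
    return value ^ (1 << position)


def percent_white(value: int) -> int:
    """Gray level as a rounded percentage of full white on a 0-255 scale."""
    return round(100 * value / 255)


original = 0b10100011              # 163
lsb_error = flip_bit(original, 0)  # far-right bit changes from 1 to 0
msb_error = flip_bit(original, 7)  # far-left bit changes from 1 to 0

print(original, lsb_error, msb_error)  # 163 162 35
print(percent_white(original), percent_white(msb_error))  # 64 14
```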
If the error occurs in the LSB, chances are that the effect will be lost in the noise and will not even be noticed. An MSB error may result in a pop in the sound or an unwanted dot in the picture. If the error occurs in a sync word (i.e., the part of the digital signal that controls how a picture is put together), a whole line or frame could be lost. With compressed video, an error in just the right place could disrupt not only one frame but a long string of frames.
One of the benefits of digital is that through a process called "error management," large errors can become practically invisible. When things go wrong in the digital world, bits are corrupted and the message can become distorted. The effect of these distortions varies with the nature of the digital system. With computers, there is a huge sensitivity to errors, particularly in instructions. A single error in the right place, and it becomes time to reboot. With video and audio, the effect is more subjective. Error management can be broken down into four stages: error avoidance, error detection, error correction, and error concealment.
Error avoidance and redundancy coding constitute a sort of preprocessing in anticipation of the errors to come. Much of error avoidance is simply good engineering, a kind of preventative maintenance against errors. For example, technicians check to make sure that there is enough transmit power and a strong enough antenna to ensure an adequate signal-to-noise ratio at the receiver.
Next comes redundancy coding, without which error detection would be impossible. Detection is one of the most important steps in error management. It must be very reliable, because if an error is undetected, it does not matter how effective the other error management techniques are.
Redundancy codes can be extremely complex, but the simple parity check illustrates the principle. As with all redundancy codes, the parity check adds bits to the original data in such a way that errors can be recognized at the receiver. Certain bits in a byte, when their representative values (1 for "on" or 0 for "off") are added together, must always be an odd or an even number. If the receiver sees that the redundancy code is incorrect (i.e., odd when it should be even, or vice versa), the receiver can request a retransmission of that part of the data.
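The parity principle can be sketched as follows, assuming even parity computed over all eight data bits (a real system may cover a different set of bits or use odd parity). The function names are hypothetical.

```python
def add_even_parity(data_bits: list[int]) -> list[int]:
    """Append one bit so that the total number of 1s in the word is even."""
    parity = sum(data_bits) % 2
    return data_bits + [parity]


def check_even_parity(word: list[int]) -> bool:
    """Return True when the word still contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 0, 0, 0, 1, 1])
print(check_even_parity(word))  # True: arrived intact

damaged = list(word)
damaged[0] ^= 1                    # one bit corrupted in transit
print(check_even_parity(damaged))  # False: request retransmission

damaged[1] ^= 1                    # a second bit corrupted
print(check_even_parity(damaged))  # True: the double error goes unnoticed
```

The last case shows why practical systems rely on more elaborate redundancy codes: a single parity bit cannot detect an even number of flipped bits.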
Of course, every system has its limits. Large errors cannot be corrected. However, it is possible to interleave data (i.e., send the data out of sequence) during transmission or recording to improve the chances of a system to correct any errors.
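One simple way to send data out of sequence is a block interleaver, sketched below. It writes the data into a grid row by row and transmits it column by column; the text does not specify this particular scheme, so it is an assumed illustration, as are the function names and grid size.

```python
def interleave(symbols: list, rows: int, cols: int) -> list:
    """Write row by row, read column by column."""
    if len(symbols) != rows * cols:
        raise ValueError("symbol count must equal rows * cols")
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(symbols: list, rows: int, cols: int) -> list:
    """Undo interleave(): restore the original order."""
    if len(symbols) != rows * cols:
        raise ValueError("symbol count must equal rows * cols")
    return [symbols[c * rows + r] for r in range(rows) for c in range(cols)]


data = list("ABCDEFGHIJKL")  # three 4-symbol words: ABCD, EFGH, IJKL
sent = interleave(data, rows=3, cols=4)
print("".join(sent))         # AEIBFJCGKDHL

sent[3:6] = ["?"] * 3        # a burst error destroys three symbols in a row
received = deinterleave(sent, rows=3, cols=4)
print("".join(received))     # A?CDE?GHI?KL
```

After deinterleaving, the burst is spread out so that each word has lost only one symbol, which is a far easier case for an error-correcting code than three losses in a single word.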
No matter how elegant the coding, errors will occur that cannot be corrected. The only option is to conceal them. With digital audio, the simple fix is to approximate a lost sample by interpolating (averaging) a value from samples on either side. A more advanced method makes a spectral analysis of the sound and inserts samples with the same spectral characteristics. If there are too many errors to conceal, the only choice is to mute.
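The simple interpolation fix might look like the sketch below. Lost samples are marked with `None` (an assumed convention), each gap is filled with the average of the nearest good samples on either side, and the output falls back to silence when nothing usable remains. The function name `conceal` is hypothetical.

```python
def conceal(samples: list) -> list:
    """Replace each lost sample (None) with the average of its nearest good neighbors."""
    repaired = list(samples)
    for i, sample in enumerate(samples):
        if sample is not None:
            continue
        before = next((s for s in reversed(samples[:i]) if s is not None), None)
        after = next((s for s in samples[i + 1:] if s is not None), None)
        if before is None and after is None:
            repaired[i] = 0            # nothing to interpolate from: mute
        elif before is None:
            repaired[i] = after        # gap at the start: hold the next value
        elif after is None:
            repaired[i] = before       # gap at the end: hold the previous value
        else:
            repaired[i] = (before + after) / 2
    return repaired


print(conceal([10, None, 30, 40]))  # [10, 20.0, 30, 40]
print(conceal([None, None]))        # [0, 0]
```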
With digital video, missing samples can be approximated from adjacent samples in the same line or adjacent lines, or from samples in previous and succeeding fields. The technique works because there is a lot of redundancy in a video image. If the video is compressed, there will be less redundancy, so concealment may not work as well. When both correction and concealment capabilities are exceeded in video, the options are either to freeze the last frame or to drop to black.
To make digital video more affordable for both professionals and consumers, compression is used. The trade-off is quality because compression "throws away" some of the signal. For example, high definition is compressed to approximately 18 Mbits per second (18 million bits per second) for digital television transmission, a compression ratio of almost 55:1.
There are two general types of compression algorithms: lossless and lossy. As the name suggests, a lossless algorithm gives back the original data bit-for-bit on decompression. Lossless processes can be applied safely to a checkbook accounting program, but their compression ratios are usually low—on the order of 2:1. In practice, these ratios are unpredictable and depend heavily on the type of data in the files. Alas, pictures are not as predictable as text and bank records, and lossless techniques have only limited effectiveness with video.
Virtually all video compression uses lossy video compression systems. These use lossless techniques where they can, but the really big savings come from throwing things away. To do this, the image is processed or "transformed" into two groups of data. One group will, ideally, contain all the important information. The other gets all of the unimportant information. Only the important data needs to be kept and transmitted.
Lossy compression systems take the performance of the human eye into account as they decide what information to place in the important pile and which to discard in the unimportant pile. They throw away things that the eye does not notice or will not be too upset about losing. Because human perception of fine color details is limited, for example, chroma resolution can be reduced by factors of two, four, eight, or more, depending on the application.
Video compression also relies heavily on the correlation between adjacent picture elements. If television pictures consisted entirely of randomly valued pixels (noise), compression would not be possible. Fortunately, adjoining picture elements are more likely to be the same than they are to be different. Predictive coding relies on making an estimate of the value of the current pixel based on previous values for that location and other neighboring areas. The rules of the estimating game are stored in the decoder, and, for any new pixel, the encoder need only send the difference or error value between what the rules would have predicted and the actual value of the new element. The more accurate the prediction, the less data needs to be sent.
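Predictive coding can be illustrated with the simplest possible rule, assumed here purely for demonstration: predict that each pixel equals the one before it. Real encoders use more elaborate predictors, and the function names are invented for this sketch.

```python
def encode(pixels: list[int]) -> list[int]:
    """Send only the difference between each pixel and its predicted value."""
    prediction = 0  # both sides agree to start from 0
    residuals = []
    for pixel in pixels:
        residuals.append(pixel - prediction)
        prediction = pixel  # rule: the next pixel is predicted to match this one
    return residuals


def decode(residuals: list[int]) -> list[int]:
    """Apply the same rule in the decoder and add back each difference."""
    prediction = 0
    pixels = []
    for residual in residuals:
        pixel = prediction + residual
        pixels.append(pixel)
        prediction = pixel
    return pixels


line = [100, 101, 101, 102, 104, 104]
residuals = encode(line)
print(residuals)                  # [100, 1, 0, 1, 2, 0]
print(decode(residuals) == line)  # True
```

Because neighboring pixels are similar, most of the differences are small numbers near zero, and small numbers can be coded with fewer bits than the full pixel values.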
The motion of objects or the camera from one frame to the next complicates predictive coding, but it also opens up new compression possibilities. Fortunately, moving objects in the real world are somewhat predictable. They tend to move with inertia and in a continuous fashion. With the Moving Picture Experts Group (MPEG) standard, where picture elements are processed in blocks, quite a few bits can be saved if it can be predicted how a given block of pixels has moved from one frame to the next. By sending commands (motion vectors) that simply tell the decoder how to move a block of pixels that is already in its memory, resending of all the data associated with that block is avoided.
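The idea of a motion vector can be sketched with an exhaustive block search. This is an illustrative toy, not the MPEG procedure: the tiny 6 x 6 frames, the 2 x 2 block, the sum-of-absolute-differences cost, and every function name are assumptions made for the example.

```python
def block(frame, top, left, size):
    """Cut a size x size block out of a frame (a list of pixel rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences: 0 means the two blocks are identical."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def find_motion_vector(previous, current, top, left, size, search):
    """Find where the block at (top, left) in `current` came from in `previous`."""
    target = block(current, top, left, size)
    height, width = len(previous), len(previous[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= height - size and 0 <= x <= width - size:
                cost = sad(target, block(previous, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best[1], best[2], best[0]  # (dy, dx, remaining difference)


def make_frame(obj_top, obj_left, size=6):
    """Black frame containing one bright 2 x 2 object."""
    frame = [[0] * size for _ in range(size)]
    for y in range(obj_top, obj_top + 2):
        for x in range(obj_left, obj_left + 2):
            frame[y][x] = 200
    return frame


previous = make_frame(1, 1)
current = make_frame(2, 3)  # the object moved down 1 line and right 2 pixels
print(find_motion_vector(previous, current, top=2, left=3, size=2, search=2))
# (-1, -2, 0): fetch the block from 1 line up and 2 pixels left; nothing else to send
```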
As long as compressed pictures are only going to be transmitted and viewed, compression encoders can assign lots of bits into the unimportant pile by exploiting the redundancy in successive frames. This is called "interframe" coding. If, on the other hand, the video is destined to undergo further processing such as enlargement or chromakey, some of those otherwise unimportant details may suddenly become important, and it may be necessary to spend more bits to accommodate what postproduction equipment can "see." To facilitate editing and other postprocessing, compression schemes that are intended for postproduction usually confine their efforts within a single frame and are called "intraframe." It takes more bits, but it is worth it.
Ratios such as 4:2:2 and 4:1:1 are an accepted part of the jargon of digital video, a shorthand that is taken for granted and sometimes not adequately explained. With single-channel composite signals, such as the National Television System Committee (NTSC) and Phase Alternating Line (PAL) signals, digital sampling rates are synchronized at either two, three, or four times the subcarrier frequency. The shorthand for these rates is 2fsc, 3fsc, and 4fsc, respectively.
With three-channel component signals, the sampling shorthand becomes a ratio. The first number usually refers to the sampling rate that is used for the luminance signal, while the second and third numbers refer to the rates for the red and blue color-difference signals, respectively. Thus, a 14:7:7 system would be one in which a wideband luminance signal is sampled at 14 MHz and the narrower bandwidth color-difference signals are each sampled at 7 MHz.
As work on component digital systems evolved, the shorthand changed. At first, 4:2:2 referred to sampling luminance at 4fsc (about 14.3 MHz for NTSC) and the color-difference signals at half that rate, or 2fsc. Sampling schemes based on multiples of NTSC or PAL subcarrier frequency were soon abandoned in favor of a single sampling standard for both 525- and 625-line component systems. Nevertheless, the 4:2:2 shorthand remained.
In current usage, "4" usually represents the internationally agreed upon sampling frequency of 13.5 MHz. Other numbers represent corresponding fractions of that frequency. Thus, a 4:1:1 ratio describes a system with luminance sampled at 13.5 MHz and color-difference signals sampled at 3.375 MHz.
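The current usage amounts to simple arithmetic, sketched below: "4" stands for 13.5 MHz, so each unit in the ratio is worth 3.375 MHz. The function name `sampling_rates_mhz` is hypothetical, and the rule applies only to ratios that follow this convention; 4:2:0, for instance, is a special case that does not.

```python
LUMA_MHZ = 13.5          # the frequency represented by "4"
UNIT_MHZ = LUMA_MHZ / 4  # 3.375 MHz per unit of the ratio


def sampling_rates_mhz(ratio: str) -> tuple[float, float, float]:
    """Convert a ratio such as '4:2:2' into (luminance, red diff., blue diff.) rates in MHz."""
    luma, red_diff, blue_diff = (int(part) for part in ratio.split(":"))
    return (luma * UNIT_MHZ, red_diff * UNIT_MHZ, blue_diff * UNIT_MHZ)


print(sampling_rates_mhz("4:2:2"))  # (13.5, 6.75, 6.75)
print(sampling_rates_mhz("4:1:1"))  # (13.5, 3.375, 3.375)
```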
The shorthand continues to evolve. Contrary to what one might expect from the discussion above, the 4:2:0 ratio that is frequently seen in discussions of MPEG compression does not indicate a system without a blue color-difference component. Here, the shorthand describes a video stream in which there are only two color-difference samples (one red, one blue) for every four luminance samples. Unlike 4:1:1, however, the samples in 525-line systems do not come from the same line as luminance; they are averaged from two adjacent lines in the field. The idea was to provide a more even and averaged distribution of the reduced color information over the picture.
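The line averaging described above can be sketched in a few lines. This is a deliberate simplification: it assumes the color-difference samples have already been reduced horizontally, pairs lines strictly two at a time, and ignores the exact sample positions that a real encoder must respect. The function name is invented for the example.

```python
def average_adjacent_lines(chroma_lines: list[list[float]]) -> list[list[float]]:
    """Merge each pair of adjacent lines into one line of averaged color-difference samples."""
    upper_lines = chroma_lines[0::2]
    lower_lines = chroma_lines[1::2]
    return [
        [(a + b) / 2 for a, b in zip(upper, lower)]
        for upper, lower in zip(upper_lines, lower_lines)
    ]


red_difference = [
    [10, 20],
    [30, 40],
    [50, 60],
    [70, 80],
]
print(average_adjacent_lines(red_difference))  # [[20.0, 30.0], [60.0, 70.0]]
```

Four lines of color information become two, which halves the vertical color resolution while spreading what remains evenly over the picture.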
Mark J. Pescatore