Coding Techniques


Representation of information is a fundamental aspect of all communication, from bird songs to human language to modern telecommunications. In the case of digital storage and transmission of information, mathematical analysis has led to principles that drive the design of symbolic representations. For example, a binary code can be defined as a mapping from a set of source symbols, or source alphabet, to unique bit strings. A familiar example is the American Standard Code for Information Interchange (ASCII) code (see Table 1), in which each character from a standard keyboard is represented as a 7-bit sequence.

ASCII is an example of a block code, in which each symbol is represented by a fixed-length "block" of n bits. Given a number of symbols (K) to encode in binary form, a number of bits (n) is chosen such that there are enough binary patterns of that length to encode all K symbols. With n bits, 2^n unique strings exist, and so the smallest integer n satisfying 2^n ≥ K is chosen. Thus a 3-bit code can represent up to eight symbols, a 4-bit code can be used for a set of up to 16 symbols, and so on.
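This block-size rule amounts to choosing n as the smallest integer with 2^n ≥ K. A minimal Python sketch of the calculation (the function name block_code_bits is illustrative, not from any particular library):

```python
def block_code_bits(num_symbols):
    """Smallest n such that 2**n >= num_symbols, i.e. the block size in bits."""
    n = 1
    while 2 ** n < num_symbols:
        n += 1
    return n

# Examples from the text: 8 symbols -> 3 bits, 16 symbols -> 4 bits,
# and the 128-character ASCII set -> 7 bits.
print(block_code_bits(8), block_code_bits(16), block_code_bits(128))  # 3 4 7
```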

Because of its universal use, the ASCII code has great advantages as a means of storing textual data and communicating between machines. On the face of it, the ASCII design seems perfectly reasonable; after all, a common language is central to communication. However, ASCII lacks certain properties that are desirable in a code, two of which are efficiency and robustness.

Table 1. The 7-bit ASCII character code.

Symbol  Code      Symbol  Code      Symbol  Code      Symbol  Code
NUL     0000000   space   0100000   @       1000000   `       1100000
SOH     0000001   !       0100001   A       1000001   a       1100001
STX     0000010   "       0100010   B       1000010   b       1100010
ETX     0000011   #       0100011   C       1000011   c       1100011
EOT     0000100   $       0100100   D       1000100   d       1100100
ENQ     0000101   %       0100101   E       1000101   e       1100101
ACK     0000110   &       0100110   F       1000110   f       1100110
BEL     0000111   '       0100111   G       1000111   g       1100111
BS      0001000   (       0101000   H       1001000   h       1101000
TAB     0001001   )       0101001   I       1001001   i       1101001
LF      0001010   *       0101010   J       1001010   j       1101010
VT      0001011   +       0101011   K       1001011   k       1101011
FF      0001100   ,       0101100   L       1001100   l       1101100
CR      0001101   -       0101101   M       1001101   m       1101101
SO      0001110   .       0101110   N       1001110   n       1101110
SI      0001111   /       0101111   O       1001111   o       1101111
DLE     0010000   0       0110000   P       1010000   p       1110000
DC1     0010001   1       0110001   Q       1010001   q       1110001
DC2     0010010   2       0110010   R       1010010   r       1110010
DC3     0010011   3       0110011   S       1010011   s       1110011
DC4     0010100   4       0110100   T       1010100   t       1110100
NAK     0010101   5       0110101   U       1010101   u       1110101
SYN     0010110   6       0110110   V       1010110   v       1110110
ETB     0010111   7       0110111   W       1010111   w       1110111
CAN     0011000   8       0111000   X       1011000   x       1111000
EM      0011001   9       0111001   Y       1011001   y       1111001
SUB     0011010   :       0111010   Z       1011010   z       1111010
ESC     0011011   ;       0111011   [       1011011   {       1111011
FS      0011100   <       0111100   \       1011100   |       1111100
GS      0011101   =       0111101   ]       1011101   }       1111101
RS      0011110   >       0111110   ^       1011110   ~       1111110
US      0011111   ?       0111111   _       1011111   DEL     1111111

Efficiency

Knowledge of symbol probabilities can be exploited to make a code more efficient. Morse code, the system of dots and dashes used for telegraph transmission in the early days of electric communication, made use of such knowledge. By representing the more frequent letters in common English with shorter dash-dot sequences, the average time to transmit a character is reduced in a message whose character statistics are consistent with the assumed frequencies (see Table 2).

Consider codes in which the number of bits assigned to each symbol is not fixed, and let l_i denote the number of bits in the string for the i-th symbol s_i. In such a variable-length code, it makes sense to assign shorter bit strings to symbols that tend to occur more frequently than average in typical use. Hence, an efficient code can be designed by making l_i a function of a symbol's probability, p_i. Let the efficiency of a code be measured as the average number of bits per symbol, L_avg, weighted by the probabilities:

L_avg = p_1 l_1 + p_2 l_2 + … + p_K l_K.   (Equation 1)

Table 2. Morse code.

Symbol  Code      Symbol  Code      Symbol  Code
A       .-        N       -.        0       -----
B       -...      O       ---       1       .----
C       -.-.      P       .--.      2       ..---
D       -..       Q       --.-      3       ...--
E       .         R       .-.       4       ....-
F       ..-.      S       ...       5       .....
G       --.       T       -         6       -....
H       ....      U       ..-       7       --...
I       ..        V       ...-      8       ---..
J       .---      W       .--       9       ----.
K       -.-       X       -..-      period  .-.-.-
L       .-..      Y       -.--      comma   --..--
M       --        Z       --..

The example that follows illustrates the increase in efficiency offered by a variable-length code. Consider a set of four symbols a, b, c, and d with corresponding probabilities p_a = 0.5, p_b = 0.25, p_c = p_d = 0.125. Two codes are listed in Table 3, with the average length of each computed according to Equation 1. Note that the average length computed for Code II depends on the probability distribution, whereas the average number of bits per symbol for an n-bit block code is obviously n, regardless of the probabilities. Which code would be more efficient if the four symbols all had the same probability? (Code I would be: with each probability equal to 0.25, Code I still averages 2 bits per symbol, while Code II averages 0.25(1 + 2 + 3 + 3) = 2.25 bits.)

Table 3. A 2-bit block code (Code I) and a variable-length code (Code II) for four symbols.

Symbol   Probability   Code I   Code II
a        0.5           00       0
b        0.25          01       10
c        0.125         10       110
d        0.125         11       111
Average length         2        1.75
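To make the comparison concrete, the short Python sketch below encodes a sample message whose symbol frequencies match the assumed probabilities and reproduces the average lengths shown in Table 3 (code_i, code_ii, and encode are illustrative names, not part of any library):

```python
# Code I (2-bit block code) and Code II (variable-length code) from Table 3.
code_i  = {"a": "00", "b": "01", "c": "10", "d": "11"}
code_ii = {"a": "0",  "b": "10", "c": "110", "d": "111"}

def encode(message, code):
    """Concatenate the codeword for each symbol in the message."""
    return "".join(code[sym] for sym in message)

# A sample message matching the assumed frequencies: 4 a's, 2 b's, 1 c, 1 d.
message = "aabacabd"
print(len(encode(message, code_i)))   # 16 bits (2 bits per symbol)
print(len(encode(message, code_ii)))  # 14 bits (1.75 bits per symbol on average)
```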

A potential problem with variable-length codes is that an encoded string of symbols may not be uniquely decodable; that is, there may be more than one interpretation for a sequence of bits. For example, let the symbol set {a, b, c} be encoded as a = 0, b = 10, c = 01. The sequence 010 could be interpreted as either ab or ca, so this code is not uniquely decodable. This problem does not occur with all variable-length codes. Huffman codes are uniquely decodable codes that are generated based on symbol probabilities.
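As an illustration, the sketch below builds a Huffman code for the four-symbol example above using the standard greedy merge of the two least-probable entries. It is a minimal sketch rather than a production encoder; with the tie-breaking used here it happens to reproduce Code II, and in any case the code lengths (1, 2, 3, 3) and the 1.75-bit average are the same.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code from {symbol: probability}.

    Repeatedly merges the two least-probable subtrees; the resulting
    prefix code is uniquely decodable.
    """
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Prepend 0 to one subtree's codewords and 1 to the other's.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
print(code)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'} with this tie-breaking
print(sum(probs[s] * len(code[s]) for s in probs))  # 1.75 bits per symbol
```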

The entropy of a probability distribution (denoted H and defined in Equation 2) is the lower bound for L_avg:

H = −(p_1 log_2 p_1 + p_2 log_2 p_2 + … + p_K log_2 p_K).   (Equation 2)

That is, for a given set of probabilities p_1, p_2, …, p_K, the most efficient uniquely decodable code must satisfy L_avg ≥ H.
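For the distribution in Table 3, Equation 2 gives H = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75 bits, so Code II actually attains the entropy bound. A minimal check in Python (the helper name entropy is illustrative):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]
code_ii_lengths = [1, 2, 3, 3]   # codeword lengths of Code II from Table 3
avg_len = sum(p * l for p, l in zip(probs, code_ii_lengths))

print(entropy(probs))  # 1.75
print(avg_len)         # 1.75 -- equals H, so Code II meets the lower bound
```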

Robustness

A principle of redundancy underlies the design of error-correcting codes. By using more bits than are actually required to represent a set of symbols uniquely, a more robust code can be generated. If the code is designed such that any two legal codewords differ in at least 3 bits, then "flipping" the value of any single bit (that is, converting a 1 to a 0 or vice versa) produces a string that remains closer to the original codeword than it is to any other codeword. Similarly, if the minimum distance is 5 bits, double errors can be reliably corrected; with a 7-bit minimum distance, triple errors can be corrected, and so on. A very simple illustration of this principle is the case of two symbols. In each of the four codes in Table 4, the symbol A is represented by a block of 0s, and B is represented by a block of 1s. For codes of increasing block size, more errors can be tolerated. For example, in the case of the 5-bit double-error-correcting code, the received sequence 10100 would be interpreted as an A.
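The decoding rule implied here is nearest-codeword (minimum Hamming distance) decoding. A minimal sketch, assuming the 5-bit double-error-correcting repetition code of Table 4 (hamming_distance and nearest_codeword are illustrative names):

```python
def hamming_distance(x, y):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))

def nearest_codeword(received, codewords):
    """Decode by choosing the legal codeword closest to the received string."""
    return min(codewords, key=lambda cw: hamming_distance(received, cw))

# 5-bit double-error-correcting code from Table 4.
codewords = {"00000": "A", "11111": "B"}
received = "10100"  # two bits flipped from 00000
print(codewords[nearest_codeword(received, codewords)])  # A
```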

Table 4. Two-symbol codes of increasing redundancy.

Symbol   Unique encoding   Single-error correcting   Double-error correcting   Triple-error correcting
A        0                 000                       00000                     0000000
B        1                 111                       11111                     1111111

Table 5. A 7-bit single-error-correcting Hamming code for 16 symbols.

Symbol  Codeword   Symbol  Codeword
A       0000000    I       1000101
B       0001011    J       1001110
C       0010110    K       1010011
D       0011101    L       1011000
E       0100111    M       1100010
F       0101100    N       1101001
G       0110001    O       1110100
H       0111010    P       1111111

The codes noted in Table 4 are inefficient in that they require many bits per symbol. Even the single-error correcting code in Table 4 uses three times as many bits as are necessary without error correction. Much more efficient error-correcting codes are possible. The code in Table 5 is an example of a family of codes developed by Richard Hamming. It represents a set of 16 symbols using 7 bits per symbol. While 4 bits would be sufficient to represent each of the symbols uniquely, this 7-bit code designed by Hamming guarantees the ability to correct a single bit error in any 7-bit block. The Hamming code is designed such that any two codewords differ in at least 3 bits. Hence, if one bit is altered by a storage or transmission error, the resulting bit string is still closer to the original codeword than it is to any of the other 15 codewords. Thus, this code is robust to single errors. Note that if two errors occur in the same block, the code fails.
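A minimal sketch of single-error correction with the Table 5 code, again using exhaustive nearest-codeword search over the 16 codewords rather than the parity-check (syndrome) procedure a practical Hamming decoder would use (HAMMING_7_4, distance, and decode are illustrative names):

```python
# The sixteen 7-bit codewords from Table 5, keyed by symbol.
HAMMING_7_4 = {
    "A": "0000000", "B": "0001011", "C": "0010110", "D": "0011101",
    "E": "0100111", "F": "0101100", "G": "0110001", "H": "0111010",
    "I": "1000101", "J": "1001110", "K": "1010011", "L": "1011000",
    "M": "1100010", "N": "1101001", "O": "1110100", "P": "1111111",
}

def distance(x, y):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))

def decode(received):
    """Return the symbol whose codeword is closest to the received block."""
    return min(HAMMING_7_4, key=lambda s: distance(received, HAMMING_7_4[s]))

sent = HAMMING_7_4["J"]   # 1001110
print(decode("1011110"))  # J -- one flipped bit is corrected
print(decode("1111110"))  # P -- two flipped bits are miscorrected; the code fails
```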

Conclusion

The primary framework for symbolic representation of information is human language, which has evolved over a period spanning more than 100,000 years. But only the past century has seen the application of mathematical principles to the design of encoding schemes. In combination with high-speed electronic signal transmission, the result is a system that enables communication with efficiency, reliability, and range that would have been inconceivable a few generations ago. Ongoing improvements in high-density magnetic and optical storage media have brought about a tremendous reduction in the physical space required to store information, thus amplifying the utility of these recently developed encoding techniques.

see also Bell Labs; Binary Number System; Codes.

Paul Munro

Bibliography

Lucky, Robert W. Silicon Dreams. New York: St. Martin's Press, 1989.

Wells, Richard B. Applied Coding and Information Theory for Engineers. Upper Saddle River, NJ: Prentice Hall, 1999.

Coding and Decoding


Coding and Decoding. Cryptography, the art of creating or deciphering secret writing, is an ancient military practice with a rich history in the American military experience. U.S. coding and decoding expertise trailed that of European nations, particularly Britain, until World War II, but America became the premier cryptographic power during the Cold War and has maintained a lead in this field ever since. While military cryptography has been a powerful tool for uniformed leaders in obtaining information about an enemy's capabilities, limitations, and intentions, it is just as important to the commander in masking his own powers, vulnerabilities, and plans. In rare cases, such as the naval Battle of Midway in 1942, American deciphering abilities have proven decisive. Cryptography normally supplies only partial solutions for military intelligence and counterintelligence problems. Coding and decoding is and has always been a “cat and mouse” game, the coder occasionally gaining a temporary advantage over those who intercept and decode, only to experience the shock of a role reversal at other times.

From the outset of U.S. military operations, cryptography was practiced, but the security of American codes and the ability to read an enemy's secret writing lagged behind the U.S. Army and U.S. Navy's mentors, the British. Codes and ciphers were personally used in the Revolutionary War by both Gen. George Washington and the Continental Congress's Secret Committee. However, British agents were quite successful in penetrating Washington's headquarters as well as gaining knowledge of Benjamin Franklin's diplomatic operations in Paris. American cryptographic skills made little difference in the outcome of the Revolutionary War.

During the nineteenth century, American military cryptography suffered from the same ills that plagued U.S. military and political intelligence in general. There would be a flurry of coding and decoding activity in time of war, but with the coming of peace, cryptographic knowledge and skills would atrophy and have to be relearned at the next outbreak of hostilities. The entire U.S. intelligence capability in this era can best be described as primitive; those Americans who engaged in the craft were invariably amateurs.

This cycle was broken during the twentieth century through the efforts of Herbert O. Yardley, a State Department code clerk who demonstrated a capability to break foreign ciphers before World War I. During that war, Yardley served as an instructor and organizer for U.S. military cryptography. Afterward, he resumed his State Department work in the 1920s and, much to the advantage of U.S. negotiators, broke the Japanese diplomatic code during the Washington Conference that led to the Washington Naval Arms Limitation Treaty of 1922. When the State Department discontinued this work, Yardley retired and wrote The American Black Chamber (1931), exposing his feats and prompting foreign nations to adopt ciphers that were far more difficult to break.

The next master American codebreaker was the War Department's William F. Friedman, who managed to create a machine that could decipher much of the Japanese Foreign Office's “Purple” code in 1940. Army and navy intelligence officers coordinated the placement of radio intercept stations, exchanged information, and produced signals intelligence known as MAGIC even before the Japanese attack on Pearl Harbor in December 1941. However, the Japanese main naval code was not broken until early 1942. During World War II, the army and navy became adept both at signals intelligence and at creating codes that were nearly impossible for the Axis powers to decipher. But American intercept and deciphering capabilities were no panacea; for example, in late 1944 there was a rapid decline in the quality of U.S. Army intelligence as American forces approached the German border. Telephonic communications of the German Army had been monitored and reported to the Allies by the French Resistance. Learning or suspecting this, the Germans defending France were forced to use their radios more often than they would have liked, and these coded radio messages were intercepted (as they had been since 1940) by the British and decoded through the process called ULTRA. But as the German forces withdrew into Germany in late 1944, they traded radio communications for the comparatively secure German telephone system and other land lines. The concentration of troops that led to the rapid and initially successful German thrust into Belgium in December 1944 in the Battle of the Bulge was not detected by a U.S. intelligence system that had grown too reliant on communications intelligence.

Following World War II, the Department of Defense (DoD) combined army and navy cryptography and in 1952 designated the resulting organization the National Security Agency (NSA). Headed by a military officer and headquartered at Fort Meade, Maryland, NSA kept a low profile during the Cold War. By the 1990s, it had created over 2,000 air, land, sea, and space‐based intercept sites. During this period, it gained the largest budget and the most personnel of any element of the U.S. intelligence community, including the Central Intelligence Agency. Much of the reason for this size and expense is that NSA's work is not only dependent on the latest technology but also labor‐intensive. Cryptanalysis, particularly work in breaking some of America's adversaries' high‐level codes, requires large numbers of people who must toil endlessly to decipher critical foreign communications for the use of U.S. decision makers. The same applies to the creation of secure communications for the U.S. government, which likewise demands manpower and equipment. And NSA's work is not limited to creating or deciphering “secure” communications between people. As the missile age developed from the 1950s on, telemetry between instruments, guidance systems, and detection systems was increasingly deciphered or encoded.

Since most industrialized nations have created sophisticated codes for use in their most sensitive communications, NSA cannot quickly decipher an opponent's high‐level messages. Lower‐level codes, those associated with typical military units, are somewhat easier to break; but here some of the best information may be which units are communicating with a particular headquarters. This “traffic analysis,” the art of associating one organization with another in time and space, is a specialty of military intelligence analysts and has contributed to several American military successes, particularly before and during the Persian Gulf War, 1990–91. But as U.S. cryptographic achievements have become known, opponents have avoided radio communications, relying on face‐to‐face meetings or the simple use of messengers. Electronic intercept is only one of several components the American military intelligence community uses to provide their commanders with the best information about an adversary.
[See also Cold War: External Course; Cold War: Domestic Course.]

Bibliography

Herbert O. Yardley , The American Black Chamber, 1931.
Fletcher Pratt , Secret and Urgent: The Story of Codes and Ciphers, 1939.
David Kahn , The Codebreakers: The Story of Secret Writing, 1967.
Ronald W. Clark , The Man Who Broke Purple: The Life of Colonel William F. Friedman, 1977.
U.S. Army Security Agency , The History of Codes and Ciphers in the United States Prior to World War I, 1978.
U.S. Army Security Agency , The History of Codes and Ciphers in the United States During World War I, 1979.
James Bamford , The Puzzle Palace: A Report on NSA, America's Most Secret Agency, 1982.
Thomas Parrish , The American Codebreakers: The U.S. Role in ULTRA, 1986.

Rod Paschall