Coding Techniques


Representation of information is a fundamental aspect of all communication, from bird songs to human language to modern telecommunications. In the case of digital storage and transmission of information, mathematical analysis has led to principles that drive the design of symbolic representations. For example, a binary code can be defined as a mapping from a set of source symbols, or source alphabet, to unique bit strings. A familiar example is the American Standard Code for Information Interchange (ASCII) code (see Table 1), in which each character from a standard keyboard is represented as a 7-bit sequence.

ASCII is an example of a block code, in which each symbol is represented by a fixed-length "block" of n bits. Given a number of symbols (K) to encode in binary form, a number of bits (n) is chosen such that there are enough binary patterns of that length to encode all K symbols. With n bits, 2^n unique strings exist, and so the smallest integer n satisfying 2^n ≥ K is chosen. Thus a 3-bit code can represent up to eight symbols, a 4-bit code can be used for a set of up to 16 symbols, and so on.
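This block-size rule amounts to choosing n as the smallest integer with 2^n ≥ K. A minimal Python sketch of the calculation (the function name block_code_bits is illustrative, not from any particular library):

```python
def block_code_bits(num_symbols):
    """Smallest n such that 2**n >= num_symbols, i.e. the block size in bits."""
    n = 1
    while 2 ** n < num_symbols:
        n += 1
    return n

# Examples from the text: 8 symbols -> 3 bits, 16 symbols -> 4 bits,
# and the 128-character ASCII set -> 7 bits.
print(block_code_bits(8), block_code_bits(16), block_code_bits(128))  # 3 4 7
```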

Because of its universal use, the ASCII code has great advantages as a means of storing textual data and communicating between machines. On the face of it, the ASCII design seems perfectly reasonable; after all, a common language is central to communication. However, ASCII lacks certain properties that are desirable in a code, two of which are efficiency and robustness.

Table 1. The 7-bit ASCII character code.

Symbol  Code      Symbol  Code      Symbol  Code      Symbol  Code
NUL     0000000   space   0100000   @       1000000   `       1100000
SOH     0000001   !       0100001   A       1000001   a       1100001
STX     0000010   "       0100010   B       1000010   b       1100010
ETX     0000011   #       0100011   C       1000011   c       1100011
EOT     0000100   $       0100100   D       1000100   d       1100100
ENQ     0000101   %       0100101   E       1000101   e       1100101
ACK     0000110   &       0100110   F       1000110   f       1100110
BEL     0000111   '       0100111   G       1000111   g       1100111
BS      0001000   (       0101000   H       1001000   h       1101000
TAB     0001001   )       0101001   I       1001001   i       1101001
LF      0001010   *       0101010   J       1001010   j       1101010
VT      0001011   +       0101011   K       1001011   k       1101011
FF      0001100   ,       0101100   L       1001100   l       1101100
CR      0001101   -       0101101   M       1001101   m       1101101
SO      0001110   .       0101110   N       1001110   n       1101110
SI      0001111   /       0101111   O       1001111   o       1101111
DLE     0010000   0       0110000   P       1010000   p       1110000
DC1     0010001   1       0110001   Q       1010001   q       1110001
DC2     0010010   2       0110010   R       1010010   r       1110010
DC3     0010011   3       0110011   S       1010011   s       1110011
DC4     0010100   4       0110100   T       1010100   t       1110100
NAK     0010101   5       0110101   U       1010101   u       1110101
SYN     0010110   6       0110110   V       1010110   v       1110110
ETB     0010111   7       0110111   W       1010111   w       1110111
CAN     0011000   8       0111000   X       1011000   x       1111000
EM      0011001   9       0111001   Y       1011001   y       1111001
SUB     0011010   :       0111010   Z       1011010   z       1111010
ESC     0011011   ;       0111011   [       1011011   {       1111011
FS      0011100   <       0111100   \       1011100   |       1111100
GS      0011101   =       0111101   ]       1011101   }       1111101
RS      0011110   >       0111110   ^       1011110   ~       1111110
US      0011111   ?       0111111   _       1011111   DEL     1111111

Efficiency

Knowledge of symbol probabilities can be exploited to make a code more efficient. Morse code, the system of dots and dashes used for telegraph transmission in the early days of electric communication, made use of such knowledge. By representing the more frequent letters in common English with shorter dash-dot sequences, the average time to transmit a character is reduced in a message whose character statistics are consistent with the assumed frequencies (see Table 2).

Consider codes in which the number of bits assigned to each symbol is not fixed, and let l_i denote the number of bits in the string for the i-th symbol s_i. In such a variable-length code, it makes sense to assign shorter bit strings to symbols that tend to occur more frequently than average in typical use. Hence, an efficient code can be designed by making l_i a function of a symbol's probability, p_i. Let the efficiency of a code be measured as the average number of bits per symbol, L_avg, weighted by the probabilities:

L_avg = p_1 l_1 + p_2 l_2 + … + p_K l_K.   (Equation 1)

Table 2. Morse code.

Symbol  Code      Symbol  Code      Symbol  Code
A       .-        N       -.        0       -----
B       -...      O       ---       1       .----
C       -.-.      P       .--.      2       ..---
D       -..       Q       --.-      3       ...--
E       .         R       .-.       4       ....-
F       ..-.      S       ...       5       .....
G       --.       T       -         6       -....
H       ....      U       ..-       7       --...
I       ..        V       ...-      8       ---..
J       .---      W       .--       9       ----.
K       -.-       X       -..-      period  .-.-.-
L       .-..      Y       -.--      comma   --..--
M       --        Z       --..

The example that follows illustrates the increase in efficiency offered by a variable-length code. Consider a set of four symbols a, b, c, and d with corresponding probabilities p_a = 0.5, p_b = 0.25, p_c = p_d = 0.125. Two codes are listed in Table 3, with the average length of each computed according to Equation 1. Note that the average length computed for Code II depends on the probability distribution, whereas the average number of bits per symbol for an n-bit block code is obviously n, regardless of the probabilities. Which code would be more efficient if the four symbols all had the same probability? (Code I would be: with each probability equal to 0.25, Code I still averages 2 bits per symbol, while Code II averages 0.25(1 + 2 + 3 + 3) = 2.25 bits.)

Table 3. A 2-bit block code (Code I) and a variable-length code (Code II) for four symbols.

Symbol   Probability   Code I   Code II
a        0.5           00       0
b        0.25          01       10
c        0.125         10       110
d        0.125         11       111
Average length         2        1.75
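To make the comparison concrete, the short Python sketch below encodes a sample message whose symbol frequencies match the assumed probabilities and reproduces the average lengths shown in Table 3 (code_i, code_ii, and encode are illustrative names, not part of any library):

```python
# Code I (2-bit block code) and Code II (variable-length code) from Table 3.
code_i  = {"a": "00", "b": "01", "c": "10", "d": "11"}
code_ii = {"a": "0",  "b": "10", "c": "110", "d": "111"}

def encode(message, code):
    """Concatenate the codeword for each symbol in the message."""
    return "".join(code[sym] for sym in message)

# A sample message matching the assumed frequencies: 4 a's, 2 b's, 1 c, 1 d.
message = "aabacabd"
print(len(encode(message, code_i)))   # 16 bits (2 bits per symbol)
print(len(encode(message, code_ii)))  # 14 bits (1.75 bits per symbol on average)
```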

A potential problem with variable-length codes is that an encoded string of symbols may not be uniquely decodable; that is, there may be more than one interpretation for a sequence of bits. For example, let the symbol set {a, b, c} be encoded as a = 0, b = 10, c = 01. The sequence 010 could be interpreted as either ab or ca, so this code is not uniquely decodable. This problem does not occur with all variable-length codes. Huffman codes are uniquely decodable codes that are generated based on symbol probabilities.
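As an illustration, the sketch below builds a Huffman code for the four-symbol example above using the standard greedy merge of the two least-probable entries. It is a minimal sketch rather than a production encoder; with the tie-breaking used here it happens to reproduce Code II, and in any case the code lengths (1, 2, 3, 3) and the 1.75-bit average are the same.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code from {symbol: probability}.

    Repeatedly merges the two least-probable subtrees; the resulting
    prefix code is uniquely decodable.
    """
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Prepend 0 to one subtree's codewords and 1 to the other's.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
print(code)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'} with this tie-breaking
print(sum(probs[s] * len(code[s]) for s in probs))  # 1.75 bits per symbol
```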

The entropy of a probability distribution (denoted H and defined in Equation 2) is the lower bound for L_avg:

H = −(p_1 log_2 p_1 + p_2 log_2 p_2 + … + p_K log_2 p_K).   (Equation 2)

That is, for a given set of probabilities p_1, p_2, …, p_K, the most efficient uniquely decodable code must satisfy L_avg ≥ H.
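For the distribution in Table 3, Equation 2 gives H = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75 bits, so Code II actually attains the entropy bound. A minimal check in Python (the helper name entropy is illustrative):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]
code_ii_lengths = [1, 2, 3, 3]   # codeword lengths of Code II from Table 3
avg_len = sum(p * l for p, l in zip(probs, code_ii_lengths))

print(entropy(probs))  # 1.75
print(avg_len)         # 1.75 -- equals H, so Code II meets the lower bound
```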

Robustness

A principle of redundancy underlies the design of error-correcting codes. By using more bits than are actually required to represent a set of symbols uniquely, a more robust code can be generated. If the code is designed such that any two legal codewords differ in at least 3 bits, then "flipping" the value of any single bit (that is, converting a 1 to a 0 or vice versa) produces a string that remains closer to the original codeword than it is to any other codeword. Similarly, if the minimum distance is 5 bits, double errors can be reliably corrected; with a 7-bit minimum distance, triple errors can be corrected, and so on. A very simple illustration of this principle is the case of two symbols. In each of the four codes in Table 4, the symbol A is represented by a block of 0s, and B is represented by a block of 1s. For codes of increasing block size, more errors can be tolerated. For example, in the case of the 5-bit double-error-correcting code, the received sequence 10100 would be interpreted as an A.
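The decoding rule implied here is nearest-codeword (minimum Hamming distance) decoding. A minimal sketch, assuming the 5-bit double-error-correcting repetition code of Table 4 (hamming_distance and nearest_codeword are illustrative names):

```python
def hamming_distance(x, y):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))

def nearest_codeword(received, codewords):
    """Decode by choosing the legal codeword closest to the received string."""
    return min(codewords, key=lambda cw: hamming_distance(received, cw))

# 5-bit double-error-correcting code from Table 4.
codewords = {"00000": "A", "11111": "B"}
received = "10100"  # two bits flipped from 00000
print(codewords[nearest_codeword(received, codewords)])  # A
```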

Table 4. Two-symbol codes of increasing redundancy.

Symbol   Unique encoding   Single-error correcting   Double-error correcting   Triple-error correcting
A        0                 000                       00000                     0000000
B        1                 111                       11111                     1111111

Table 5. A 7-bit single-error-correcting Hamming code for 16 symbols.

Symbol  Codeword   Symbol  Codeword
A       0000000    I       1000101
B       0001011    J       1001110
C       0010110    K       1010011
D       0011101    L       1011000
E       0100111    M       1100010
F       0101100    N       1101001
G       0110001    O       1110100
H       0111010    P       1111111

The codes noted in Table 4 are inefficient in that they require many bits per symbol. Even the single-error correcting code in Table 4 uses three times as many bits as are necessary without error correction. Much more efficient error-correcting codes are possible. The code in Table 5 is an example of a family of codes developed by Richard Hamming. It represents a set of 16 symbols using 7 bits per symbol. While 4 bits would be sufficient to represent each of the symbols uniquely, this 7-bit code designed by Hamming guarantees the ability to correct a single bit error in any 7-bit block. The Hamming code is designed such that any two codewords differ in at least 3 bits. Hence, if one bit is altered by a storage or transmission error, the resulting bit string is still closer to the original codeword than it is to any of the other 15 codewords. Thus, this code is robust to single errors. Note that if two errors occur in the same block, the code fails.
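A minimal sketch of single-error correction with the Table 5 code, again using exhaustive nearest-codeword search over the 16 codewords rather than the parity-check (syndrome) procedure a practical Hamming decoder would use (HAMMING_7_4, distance, and decode are illustrative names):

```python
# The sixteen 7-bit codewords from Table 5, keyed by symbol.
HAMMING_7_4 = {
    "A": "0000000", "B": "0001011", "C": "0010110", "D": "0011101",
    "E": "0100111", "F": "0101100", "G": "0110001", "H": "0111010",
    "I": "1000101", "J": "1001110", "K": "1010011", "L": "1011000",
    "M": "1100010", "N": "1101001", "O": "1110100", "P": "1111111",
}

def distance(x, y):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))

def decode(received):
    """Return the symbol whose codeword is closest to the received block."""
    return min(HAMMING_7_4, key=lambda s: distance(received, HAMMING_7_4[s]))

sent = HAMMING_7_4["J"]   # 1001110
print(decode("1011110"))  # J -- one flipped bit is corrected
print(decode("1111110"))  # P -- two flipped bits are miscorrected; the code fails
```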

Conclusion

The primary framework for symbolic representation of information is human language, which has evolved over a period spanning more than 100,000 years. But only the past century has seen the application of mathematical principles to the design of encoding schemes. In combination with high-speed electronic signal transmission, the result is a system that enables communication with efficiency, reliability, and range that would have been inconceivable a few generations ago. Ongoing improvements in high-density magnetic and optical storage media have brought about a tremendous reduction in the physical space required to store information, thus amplifying the utility of these recently developed encoding techniques.

see also Bell Labs; Binary Number System; Codes.

Paul Munro

Bibliography

Lucky, Robert W. Silicon Dreams. New York: St. Martin's Press, 1989.

Wells, Richard B. Applied Coding and Information Theory for Engineers. Upper Saddle River, NJ: Prentice Hall, 1999.

Coding and Decoding


Coding and Decoding. Cryptography, the art of creating or deciphering secret writing, is an ancient military practice with a rich history in the American military experience. U.S. coding and decoding expertise trailed that of European nations, particularly Britain, until World War II, but America became the premier cryptographic power during the Cold War and has maintained a lead in this field ever since. While military cryptography has been a powerful tool for uniformed leaders in obtaining information about an enemy's capabilities, limitations, and intentions, it is just as important to the commander in masking his own powers, vulnerabilities, and plans. In rare cases, such as the naval Battle of Midway in 1942, American deciphering abilities have proven decisive. Cryptography normally supplies only partial solutions for military intelligence and counterintelligence problems. Coding and decoding is and has always been a “cat and mouse” game, the coder occasionally gaining a temporary advantage over those who intercept and decode, only to experience the shock of a role reversal at other times.

From the outset of U.S. military operations, cryptography was practiced, but the security of American codes and the ability to read an enemy's secret writing lagged behind the U.S. Army and U.S. Navy's mentors, the British. Codes and ciphers were personally used in the Revolutionary War by both Gen. George Washington and the Continental Congress's Secret Committee. However, British agents were quite successful in penetrating Washington's headquarters as well as gaining knowledge of Benjamin Franklin's diplomatic operations in Paris. American cryptographic skills made little difference in the outcome of the Revolutionary War.

During the nineteenth century, American military cryptography suffered from the same ills that plagued U.S. military and political intelligence in general. There would be a flurry of coding and decoding activity in time of war, but with the coming of peace, cryptographic knowledge and skills would atrophy and have to be relearned at the next outbreak of hostilities. The entire U.S. intelligence capability in this era can best be described as primitive; those Americans who engaged in the craft were invariably amateurs.

This cycle was broken during the twentieth century through the efforts of Herbert O. Yardley, a State Department code clerk who demonstrated a capability to break foreign ciphers before World War I. During that war, Yardley served as an instructor and organizer for U.S. military cryptography. Afterward, he resumed his State Department work in the 1920s and, much to the advantage of U.S. negotiators, broke the Japanese diplomatic code during the Washington Conference that led to the Washington Naval Arms Limitation Treaty of 1922. When the State Department discontinued this work, Yardley retired and wrote The American Black Chamber (1931), exposing his feats and prompting foreign nations to adopt ciphers that were far more difficult to break.

The next master American codebreaker was the War Department's William F. Friedman, who managed to create a machine that could decipher much of the Japanese Foreign Office's “Purple” code in 1940. Army and navy intelligence officers coordinated the placement of radio intercept stations, exchanged information, and produced signals intelligence known as MAGIC even before the Japanese attack on Pearl Harbor in December 1941. However, the Japanese main naval code was not broken until early 1942. During World War II, the army and navy became adept both at signals intelligence and at creating codes that were nearly impossible for the Axis powers to decipher. But American intercept and deciphering capabilities were no panacea; for example, in late 1944 there was a rapid decline in the quality of U.S. Army intelligence as American forces approached the German border. Telephonic communications of the German Army had been monitored and reported to the Allies by the French Resistance. Learning or suspecting this, the Germans defending France were forced to use their radios more often than they would have liked, and these coded radio messages were intercepted (as they had been since 1940) by the British and decoded through the process called ULTRA. But as the German forces withdrew into Germany in late 1944, they traded radio communications for the comparatively secure German telephone system and other land lines. The concentration of troops that led to the rapid and initially successful German thrust into Belgium in December 1944 in the Battle of the Bulge was not detected by a U.S. intelligence system that had grown too reliant on communications intelligence.

Following World War II, the Department of Defense (DoD) combined army and navy cryptography and in 1952 designated the resulting organization the National Security Agency (NSA). Headed by a military officer and headquartered at Fort Meade, Maryland, NSA kept a low profile during the Cold War. By the 1990s, it had created over 2,000 air, land, sea, and space‐based intercept sites. During this period, it gained the largest budget and the most personnel of any element of the U.S. intelligence community, including the Central Intelligence Agency. Much of the reason for this size and expense is that NSA's work is not only dependent on the latest technology but also labor‐intensive. Cryptanalysis, particularly work in breaking some of America's adversaries' high‐level codes, requires large numbers of people who must toil endlessly to decipher critical foreign communications for the use of U.S. decision makers. The same applies to the creation of secure communications for the U.S. government, which likewise demands manpower and equipment. And NSA's work is not limited to creating or deciphering “secure” communications between people. As the missile age developed from the 1950s on, telemetry between instruments, guidance systems, and detection systems was increasingly deciphered or encoded.

Since most industrialized nations have created sophisticated codes for use in their most sensitive communications, NSA cannot quickly decipher an opponent's high‐level messages. Lower‐level codes, those associated with typical military units, are somewhat easier to break; but here some of the best information may be which units are communicating with a particular headquarters. This “traffic analysis,” the art of associating one organization with another in time and space, is a specialty of military intelligence analysts and has contributed to several American military successes, particularly before and during the Persian Gulf War, 1990–91. But as U.S. cryptographic achievements have become known, opponents have avoided radio communications, relying on face‐to‐face meetings or the simple use of messengers. Electronic intercept is only one of several components the American military intelligence community uses to provide their commanders with the best information about an adversary.
[See also Cold War: External Course; Cold War: Domestic Course.]

Bibliography

Herbert O. Yardley , The American Black Chamber, 1931.
Fletcher Pratt , Secret and Urgent: The Story of Codes and Ciphers, 1939.
David Kahn , The Codebreakers: The Story of Secret Writing, 1967.
Ronald W. Clark , The Man Who Broke Purple: The Life of Colonel William F. Friedman, 1977.
U.S. Army Security Agency , The History of Codes and Ciphers in the United States Prior to World War I, 1978.
U.S. Army Security Agency , The History of Codes and Ciphers in the United States During World War I, 1979.
James Bamford , The Puzzle Palace: A Report on NSA, America's Most Secret Agency, 1982.
Thomas Parrish , The American Codebreakers: The U.S. Role in ULTRA, 1986.

Rod Paschall