views updated


The compiler is a program that translates the source language (which takes the form of a written program), into machine code, the language of the machine. There are several portions to the compiler, some of the most prominent being the lexical analyzer , the syntactic analyzer , the code generation phase, and a code optimizer, which is optional. The generation of an intermediate code is also optional, but often done to facilitate the portability of the compiler to various machines.

The lexical analyzer finds the words of the source language. It is similar to a dictionary in a natural language such as English. It determines what symbols constitute words, and what types of word they are. In English, this includes words such as "cat" (N or Noun), "ran" (V or Verb), or "the" (Determinant). In programming languages these are "words" like " -sign" ( ), identifiers (user-defined identifiers or variables), and keywords (while, for, read, print). The words are used to form sentences in programming language, such as "while (a 5) do"

The syntactic analyzer checks the form, or organization, of the sentences in programming languages. In other words, it checks to be sure the programming language's "grammar" is correct. For example, a sentence like "to the store he went" would be considered ungrammatical while "he went to the store" would be considered grammatical. In programming languages, the following statement would be considered grammatical: for( j 0; j MAXNUM;j) printf("%d",j); Compare this to the following statement, which is considered ungrammatical in a language such as C:
for(j=0; printf("%d", j);
j MAXNUM; j)

The syntactic analyzer receives the words and their respective types (number, constant, -sign, keyword, identifier) from the lexical analyzer and sees that they are organized in the proper way according to the rules of the language. The organization, or syntax, of the language is defined by the grammar for the language.

Together, the lexical analyzer and syntactic analyzer determine whether a program is "well-written." If it is, the code generator generates the code for the program. The code is specific to each type of machine the compiler will run on. An example would be the assignment statement:
c = a + b

This adds the values of a and b, and places the result in c. This might be translated as:
movl a, R6
addl b, R6
movl R6, c
This is read as:
"move the value of a into register 6"
"add the value of b to register 6"
"move the value of register 6 to c "

The registers are special locations in the central processing unit where the arithmetic is done. The previous code is assembly language code. The compiler might also be written to generate machine code directly. This might look like:
70 52,6
72 54,6
70 6,56

Here, the code 70 would stand for "movl," 72 for "addl," and 52, 54, and 56 are the addresses of a, b, and c, respectively. The 6 is register 6. Although the example is hypothetical, it is based on the machine language (binary code ) and assembly language (symbolic code) of a real machine, the VAX family of Digital Equipment Corporation. It is because the code is machine-dependent, based on the architecture and mnemonics (symbols or symbolic language) of the machine, that the compiler must be written differently for different machines in this phase.

The lexical and syntactic analyzers are machine-independent. They base their decisions and actions on the grammar of the language being translated, such as C or Pascal. The code generation phase is machine-dependent. The code generated depends on the architecture of the machine.

A fourth phase that is often implemented is the intermediate code generation. This is a machine-independent code that can be used by the compiler writer to generate the machine code. An example might be:
Add A, B
Store C

The first instruction indicates that one should add B to A, and the second indicates that the result should be stored in C. The use of the intermediate code allows for a machine-independent code to be optimized, with the final code being translated into the code of the machine.

The machine-independent phases are sometimes referred to as the front end of the compiler, and the final phase as the back end. The front end is the same for all machines, but the back end is machine-dependent. Thus the front end, which comprises a good portion of the work, can be done once for all machines, while the back end is tailored specifically to the machine at hand. This allows the single compiler for a language, such as C, to be ported, or implemented, to many machines.

see also Binary Number System; Early Computers; Procedural Languages.

Roger R. Flynn


Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques and Tools. Reading, MA: Addison-Wesley, 1986.