CS325 - Compiler Design

A compiler is a program that can read a program in one language, the source language, and translate it into an equivalent program in another language, the target language.

An interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user. It does not produce a target program.

Compiled programs generally run faster than interpreted programs, but an interpreter can give better error diagnostics.

A compiler consists of several phases. The front-end or analysis phase consists of lexical analysis, syntax analysis and semantic analysis. The back-end or synthesis phase consists of intermediate code generation, machine-independent code optimisation, code generation and machine-dependent code optimisation.

The entire process is as follows:

character stream → (lexical analysis) → token stream → (syntax analysis) → syntax tree → (semantic analysis) → syntax tree → (IR generation) → intermediate representation (IR) → (machine-independent optimisation) → IR → (code generation) → target-machine code → (machine-dependent optimisation) → target-machine code

Lexical analysis is the first stage of the compiler. It does the following

  • Scans the input one character at a time
  • Removes whitespace
  • Groups the characters into meaningful sequences called lexemes
  • For each lexeme, produces as output a token of the form <token_name, attribute_value>
  • Builds a symbol table
  • Attempts to find the longest possible legal token
  • Does error recovery

Syntax analysis, or parsing, is the process of extracting the structure of the program. Languages can be expressed using a grammar. Parsing determines whether the program can be derived from the grammar and produces a parse tree. Real compilers rarely produce a full parse tree and store it during compilation.

Semantic analysis verifies that the program is meaningful. It uses the syntax tree and symbol table to check for semantic consistency and gather type information. This stage is where type checking, type coercion and scope checks take place.

In the process of translating a program, a compiler may construct one or more intermediate representations. After semantic analysis, many compilers generate a low-level intermediate representation (IR). The IR represents all the facts the compiler has derived about the program. IR generation is the last platform-independent stage.

The symbol table is an IR.

Lexing

  • Token - A pair consisting of a token name and optional attribute value
  • Lexeme - sequence of characters in the source program which matches the pattern for a token
  • Pattern - description of the form that lexemes of a token may take
  • Recogniser - a program that identifies words in a stream of characters, can be represented as finite automata or regular expressions

For notes about regular expressions and context-free grammars, see CS259.

A parser must infer a derivation for a given input or determine that none exists.

There are two distinct and opposite methods for deriving a parse tree - top-down and bottom-up.

A top-down parser begins at the root and grows the tree towards its leaves. At each step it selects a node for some non-terminal on the lower fringe of the tree and extends it with a subtree that represents the RHS of a production.

A bottom-up parser begins with the leaves and grows to the root. At each step it identifies a contiguous substring of the upper fringe of the parse tree that matches the RHS of some production and builds a node for the LHS.

If at some stage of a top-down parse none of the available productions match, the parser can backtrack and try a different production. If the parser has tried all productions, it can conclude there is a syntax error.

The top-down parser with backtracking works as follows:

root = node for the start symbol
focus = root
push(null)                       // sentinel marking the end of the input
word = NextWord()
while(true){
    if(focus is a non-terminal){
        pick a rule to expand focus (A -> b1 b2 ... bn)
        push(bn, ..., b2)        // queue the rest of the RHS, leftmost symbol on top
        focus = b1
    } else if(word matches focus){
        word = NextWord()        // consume the matched terminal
        focus = pop()
    } else if(word == eof and focus == null){
        accept the input and return root
    } else{
        backtrack                // undo the last expansion and try a different rule
    }
}

Backtracking is expensive, and it is usually possible to transform most grammars into a form that does not require it. Three techniques for making a grammar suitable for backtrack-free top-down parsing are

  1. Eliminating left recursion
  2. Predictive grammars
  3. Left factoring

A grammar is left recursive if it has a non-terminal $A$ such that there is a derivation $A \rightarrow^+ A\alpha$ for some string $\alpha$. Top-down parsers cannot handle left recursion; they will loop indefinitely.

It is possible to transform a grammar from left recursive to right recursive. Right recursive grammars work with top-down parsers.

  • Direct left recursion ($A \rightarrow A\alpha$) is easy to eliminate
  • Indirect left recursion ($A \rightarrow B$, $B \rightarrow C$, $C \rightarrow AD$) is less obvious

A direct left recursive production of the form $X \rightarrow X\alpha \mid \beta$ can be eliminated by replacing it with two new productions, $X \rightarrow \beta X'$ and $X' \rightarrow \alpha X' \mid \epsilon$. The new grammar accepts the same language but uses right recursion.
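
For example, the left recursive grammar $Expr \rightarrow Expr + Term \mid Term$ becomes $Expr \rightarrow Term\ Expr'$ with $Expr' \rightarrow +\ Term\ Expr' \mid \epsilon$, which accepts the same strings but expands to the right.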

The general method to transform a grammar to eliminate direct left recursion is as follows

  1. Eliminate $\epsilon$-productions ($A \rightarrow \epsilon$)
  2. Eliminate cycles (such as $A \rightarrow B, B \rightarrow C, C \rightarrow A$)
  3. Group productions into the form $A \rightarrow A \alpha_1 | A \alpha_2 | \dots | A \alpha_m | \beta_1 | \beta_2 | \dots | \beta_n$ where no $\beta_i$ begins with an $A$.
  4. Replace this production with $A \rightarrow \beta_1 A' | \beta_2 A' | \dots | \beta_n A'$ and $A' \rightarrow \alpha_1 A' | \alpha_2 A' | \dots | \alpha_m A' | \epsilon$

Indirect left recursion occurs when a chain of productions causes left recursion. After the grammar has no cycles or $\epsilon$-productions, indirect left recursion can be eliminated using the following algorithm

arrange the non-terminals in some order A1, A2, ..., An
for each i from 1 to n {
    for each j from 1 to i - 1 {
        replace each production of the form Ai -> Aj y by the productions Ai -> s1 y | s2 y | ... | sk y, where Aj -> s1 | s2 | ... | sk are all the current Aj-productions
    }
    eliminate the immediate left recursion among the Ai-productions
}

Eliminating left recursion stops a top-down parser from entering an infinite loop.
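
A minimal Python sketch of the immediate-left-recursion step, assuming a grammar encoded as a dict from non-terminal to a list of right-hand sides (each RHS a list of symbols, with [] standing for $\epsilon$):

# A sketch, not a reference implementation: eliminate immediate left recursion
# A -> A a1 | ... | A am | b1 | ... | bn for a single non-terminal A.
def eliminate_immediate_left_recursion(grammar, A):
    alphas = [rhs[1:] for rhs in grammar[A] if rhs[:1] == [A]]   # A -> A alpha
    betas = [rhs for rhs in grammar[A] if rhs[:1] != [A]]        # A -> beta
    if not alphas:
        return grammar                                           # nothing to do
    A2 = A + "'"                                                 # fresh non-terminal A'
    grammar[A] = [beta + [A2] for beta in betas]                 # A  -> beta A'
    grammar[A2] = [alpha + [A2] for alpha in alphas] + [[]]      # A' -> alpha A' | epsilon
    return grammar
g = {"Expr": [["Expr", "+", "Term"], ["Term"]]}
print(eliminate_immediate_left_recursion(g, "Expr"))
# {'Expr': [['Term', "Expr'"]], "Expr'": [['+', 'Term', "Expr'"], []]}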

A non-terminal $A$ is nullable if it derives $\epsilon$, that is, every symbol in some RHS of $A$ can itself be expanded to $\epsilon$. This is denoted $\text{nullable}(A)$, which is either true or false.

To determine if a non-terminal $A$ is nullable, apply the following rules (a small fixed-point sketch follows this list)

  • If $A \rightarrow \epsilon$ is a production then $A$ is nullable
  • If $A \rightarrow B_1 B_2 \dots B_k$ is a production and each $B_i$ is nullable then $A$ is nullable
  • $\text{nullable}(\epsilon) = \text{true}$
  • $\text{nullable}(a) = \text{false}$ where $a$ is a terminal
  • $\text{nullable}(\alpha\beta) = \text{nullable}(\alpha) \wedge \text{nullable}(\beta)$ where $\alpha$ and $\beta$ are sequences of grammar symbols
  • $\text{nullable}(N) = \text{nullable}(\alpha_1) \vee \dots \vee \text{nullable}(\alpha_n)$ where $N \rightarrow \alpha_1 \mid \dots \mid \alpha_n$
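
A minimal fixed-point sketch of this computation, assuming the same dict-based grammar encoding as in the sketch above (with [] standing for $\epsilon$):

def compute_nullable(grammar):
    nullable = set()
    changed = True
    while changed:                       # iterate until no rule adds a new fact
        changed = False
        for N, rhss in grammar.items():
            if N in nullable:
                continue
            # N is nullable if some RHS consists only of nullable symbols
            if any(all(sym in nullable for sym in rhs) for rhs in rhss):
                nullable.add(N)
                changed = True
    return nullable
g = {"Z": [["b", "Z", "a"], []],         # Z -> bZa | epsilon
     "Y": [["Z", "Z"]]}                  # Y -> ZZ
print(compute_nullable(g))               # {'Z', 'Y'} (set order may vary)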

To eliminate $\epsilon$-productions, look for nullable non-terminals in the RHS of each production and create new productions by replacing each such occurrence with $\epsilon$. For example, $Z \rightarrow bZa \mid \epsilon$ can be replaced with $Z \rightarrow bZa \mid ba$ by substituting $Z = \epsilon$ in the first production.

Eliminating $\epsilon$-productions can greatly increase the size of a grammar.

One type of top-down parser is the recursive descent parser.

  • There is one parse method for each non-terminal symbol
  • A non-terminal symbol on the right hand side of a rewrite rule leads to a call to the parse method for that non-terminal
  • A terminal symbol on the right hand side of a rewrite rule leads to consuming that token from the input token string
  • Multiple productions for a non-terminal mean the parser must select one of them at each step, backtracking when a choice turns out to be wrong, to complete the parse (a minimal sketch follows this list)
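
A minimal recursive descent sketch for the right-recursive grammar $Expr \rightarrow Term\ Expr'$, $Expr' \rightarrow +\ Term\ Expr' \mid \epsilon$, $Term \rightarrow num \mid (\ Expr\ )$; the token representation and method names are assumptions for illustration, and because this grammar is predictive no backtracking is needed:

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["eof"]
        self.pos = 0
    def peek(self):
        return self.tokens[self.pos]
    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, found {self.peek()}")
        self.pos += 1
    def expr(self):                      # Expr -> Term Expr'
        self.term()
        self.expr_prime()
    def expr_prime(self):                # Expr' -> + Term Expr' | epsilon
        if self.peek() == "+":
            self.eat("+")
            self.term()
            self.expr_prime()
    def term(self):                      # Term -> num | ( Expr )
        if self.peek() == "num":
            self.eat("num")
        elif self.peek() == "(":
            self.eat("(")
            self.expr()
            self.eat(")")
        else:
            raise SyntaxError(f"unexpected token {self.peek()}")
p = Parser(["num", "+", "(", "num", "+", "num", ")"])
p.expr()
p.eat("eof")                             # accepts num + (num + num)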

If a top-down parser picks the wrong production it may need to backtrack. By looking ahead in the input stream and using context, a parser may be able to pick the correct production immediately. A grammar that can be parsed top-down without backtracking is called a predictive grammar.

The FIRST set is defined as follows

  • If $\alpha$ is a terminal or EOF then FIRST($\alpha$) = $\{\alpha\}$
  • For a non-terminal $A$, FIRST($A$) is the set of terminals that can appear first in a string derived from any RHS of $A$; for example, if $A \rightarrow abM$ then $a \in$ FIRST($A$)
  • If $\alpha \rightarrow^+ \epsilon$ then $\epsilon$ is also in FIRST($\alpha$)

If FIRST($\alpha$) and FIRST($\beta$) are disjoint sets then the parser can choose between the productions $A \rightarrow \alpha$ and $A \rightarrow \beta$ by looking ahead only one symbol, except when one of the alternatives can derive $\epsilon$. In that case the parser also needs to know which symbols can follow $A$, which is what the FOLLOW set provides.

The FOLLOW set for a non-terminal $A$ is the set of terminals that can appear immediately to the right of $A$ in some sentential form. Alternatively, FOLLOW($A$) is the set of terminals $a$ such that there exists a derivation of the form $S \rightarrow^+ \alpha A a \beta$ for some $\alpha$ and $\beta$.

The FIRST set can be computed as follows

  1. FIRST($\epsilon$) = $\emptyset$
  2. FIRST($a$) = $\{a\}$ where $a$ is a terminal
  3. If $X \rightarrow \epsilon$ is a production, then add $\epsilon$ to FIRST($X$)
  4. If $X \rightarrow \alpha_1\dots\alpha_n$ is a production then FIRST($X$) contains
    • FIRST($\alpha_1$) if $\alpha_1$ is not nullable
    • (FIRST($\alpha_1$) $\setminus \{\epsilon\}$) $\cup$ FIRST($\alpha_2$) if $\alpha_1$ is nullable and $\alpha_2$ is not, and so on along the body
    • $\epsilon$ if FIRST($\alpha_i$) contains $\epsilon$ for all $i$
  5. FIRST($N$) = FIRST($\alpha_1$) $\cup \dots \cup$ FIRST($\alpha_n$) where $N \rightarrow \alpha_1 | \dots | \alpha_n$

For example, the FIRST set for $X \rightarrow (Expr) | num | name$ is

  • FIRST($(Expr)$) $\cup$ FIRST($num$) $\cup$ FIRST($name$) by rule 5
  • FIRST($($) $\cup$ FIRST($num$) $\cup$ FIRST($name$) by rule 4
  • $\{(, num, name\}$ by rule 2

The FOLLOW set can be computed using the following rules

  1. If $A$ is the rightmost symbol in some sentential form then EOF is in FOLLOW($A$)
  2. For a production $A \rightarrow \alpha B\beta$ then everything in FIRST($\beta$) except $\epsilon$ is in FOLLOW($B$)
  3. For a production $A \rightarrow \alpha B$ or a production $A \rightarrow \alpha B \beta$ where $\beta$ is nullable, everything in FOLLOW($A$) is in FOLLOW($B$)

For example, consider the grammar $$ \begin{aligned} S &\rightarrow AC \\ C &\rightarrow c \mid \epsilon \\ A &\rightarrow aBCd \mid BQ \\ B &\rightarrow bB \mid \epsilon \\ Q &\rightarrow q \mid \epsilon \end{aligned} $$

First, work out which non-terminals are nullable, then compute the FIRST set.

| Non-terminal | Nullable | FIRST |
| --- | --- | --- |
| $S$ | True | FIRST($AC$) = FIRST($A$) $\cup$ FIRST($C$) |
| $A$ | True | $\{a\} \cup$ FIRST($B$) $\cup$ FIRST($Q$) |
| $B$ | True | $\{b, \epsilon\}$ |
| $C$ | True | $\{c, \epsilon\}$ |
| $Q$ | True | $\{q, \epsilon\}$ |

Which gives

| Non-terminal | FIRST |
| --- | --- |
| $S$ | $\{a, b, q, c, \epsilon\}$ |
| $A$ | $\{a, b, q, \epsilon\}$ |
| $B$ | $\{b, \epsilon\}$ |
| $C$ | $\{c, \epsilon\}$ |
| $Q$ | $\{q, \epsilon\}$ |

This can then be used to derive the FOLLOW sets

| Non-terminal | FOLLOW |
| --- | --- |
| $S$ | $\{\text{EOF}\}$ |
| $A$ | $(\operatorname{FIRST}(C) \setminus \{\epsilon\}) \cup \{\text{EOF}\}$ |
| $B$ | $(\operatorname{FIRST}(C) \setminus \{\epsilon\}) \cup \{d\} \cup (\operatorname{FIRST}(Q) \setminus \{\epsilon\}) \cup \operatorname{FOLLOW}(A)$ |
| $C$ | $\{d\} \cup \operatorname{FOLLOW}(S)$ |
| $Q$ | $\operatorname{FOLLOW}(A)$ |

Which gives

| Non-terminal | FOLLOW |
| --- | --- |
| $S$ | $\{\text{EOF}\}$ |
| $A$ | $\{c, \text{EOF}\}$ |
| $B$ | $\{c, d, q, \text{EOF}\}$ |
| $C$ | $\{d, \text{EOF}\}$ |
| $Q$ | $\{c, \text{EOF}\}$ |
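
These sets can also be computed mechanically by iterating the rules until nothing changes. A minimal Python sketch, assuming a dict-based grammar encoding (each RHS a list of symbols, with [] standing for $\epsilon$; here FIRST sets never contain $\epsilon$ because nullability is tracked separately):

GRAMMAR = {"S": [["A", "C"]],
           "C": [["c"], []],
           "A": [["a", "B", "C", "d"], ["B", "Q"]],
           "B": [["b", "B"], []],
           "Q": [["q"], []]}
def first_of_seq(seq, first, nullable):
    out = set()
    for sym in seq:
        out |= first.get(sym, {sym})       # a terminal's FIRST set is just itself
        if sym not in nullable:
            return out, False
    return out, True                        # the whole sequence is nullable
def compute_sets(grammar, start):
    nonterms = set(grammar)
    nullable, first = set(), {N: set() for N in grammar}
    follow = {N: set() for N in grammar}
    follow[start].add("EOF")
    changed = True
    while changed:
        changed = False
        for N, rhss in grammar.items():
            for rhs in rhss:
                f, rhs_nullable = first_of_seq(rhs, first, nullable)
                if not f <= first[N]:
                    first[N] |= f; changed = True
                if rhs_nullable and N not in nullable:
                    nullable.add(N); changed = True
                for i, sym in enumerate(rhs):
                    if sym not in nonterms:
                        continue
                    f, rest_nullable = first_of_seq(rhs[i + 1:], first, nullable)
                    add = f | (follow[N] if rest_nullable else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add; changed = True
    return nullable, first, follow
nullable, first, follow = compute_sets(GRAMMAR, "S")
print(first["S"], follow["Q"])              # {'a','b','q','c'} and {'c','EOF'} (set order may vary)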

If at every point in the derivation a production $N \rightarrow \alpha$ can be chosen on the next input symbol $c$ such that it satisfies either of the following

  • $c \in$ FIRST($\alpha$)
  • $\alpha$ is nullable and $c \in$ FOLLOW($N$)

then the grammar is LL(1). This is called the LL(1) property and means the grammar can be parsed with a lookahead of 1. An LL(1) grammar is backtrack free for a top-down parser and allows it to make a correct choice with a lookahead of exactly one symbol.

LL(1) grammars are also called predictive grammars as the parser can predict the correct production at each point in the parse. These parsers are called predictive parsers.

Not all grammars have the LL(1) property, but some can still be transformed to avoid backtracking using left factoring.

Given a non-terminal $A \rightarrow \alpha \beta_1 | \alpha \beta_2 | \dots | \alpha \beta_n|\gamma_1|\dots|\gamma_m$ where $\alpha$ is the common prefix and $\gamma_i$ represent RHSs which do not begin with $\alpha$, the left-factoring transformation introduces a new non-terminal $B$: $$ \begin{aligned} A &\rightarrow \alpha B | \gamma_1 | \dots | \gamma_m \\ B &\rightarrow \beta_1 | \beta_2 | \dots | \beta_n \end{aligned} $$

Left-factoring can often eliminate the need to backtrack, but some CFLs have no backtrack-free grammar.
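
For example, the classic conditional-statement productions $Stmt \rightarrow \text{if } E \text{ then } S \text{ else } S \mid \text{if } E \text{ then } S$ share the prefix $\text{if } E \text{ then } S$. Left factoring gives $Stmt \rightarrow \text{if } E \text{ then } S\ Stmt'$ and $Stmt' \rightarrow \text{else } S \mid \epsilon$, so the parser can defer the choice until it sees whether an else follows.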

Top-down parsers can be automatically generated for LL(1) grammars. The resulting parser is called an LL(1) parser.

Table-driven parsers are a type of automatically generated LL(1) parser. They use the FIRST and FOLLOW sets to build a parse table which is then used for parsing.
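
A minimal sketch of the table-driven parsing loop, assuming the table has already been filled in from the FIRST and FOLLOW sets of the expression grammar used in the recursive descent sketch above (the table encoding and token names are illustrative assumptions):

TABLE = {("Expr", "num"): ["Term", "Expr'"], ("Expr", "("): ["Term", "Expr'"],
         ("Expr'", "+"): ["+", "Term", "Expr'"],
         ("Expr'", ")"): [], ("Expr'", "eof"): [],
         ("Term", "num"): ["num"], ("Term", "("): ["(", "Expr", ")"]}
NONTERMS = {"Expr", "Expr'", "Term"}
def parse(tokens):
    tokens = tokens + ["eof"]
    stack = ["eof", "Expr"]              # start symbol on top of the EOF sentinel
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"no rule for {top} on {look}")
            stack.extend(reversed(rhs))  # push the RHS so its first symbol is on top
        elif top == look:
            pos += 1                     # matched a terminal (or the EOF sentinel)
        else:
            raise SyntaxError(f"expected {top}, found {look}")
    return True
print(parse(["num", "+", "num"]))        # True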

Overall, LL(1) parsers can be constructed as follows

  1. Eliminate ambiguity
  2. Eliminate left-recursion
  3. Perform left factorisation where required
  4. Add an extra start production to the grammar
  5. Calculate FIRST and FOLLOW sets
  6. Choose correct productions using LL(1) property to generate a parser

Most programming language constructs can be expressed in a backtrack-free grammar and therefore implemented with a top-down predictive parser. Predictive parsers are efficient and can produce useful error messages. The main disadvantage of top-down parsers is their inability to handle left recursion.

Bottom-up parsers build a parse tree starting from its leaves and working toward its root. The parser constructs a leaf node for each word returned by the scanner. These leaves form the lower fringe of the tree. To build a derivation, the parser adds layers of non-terminals on top of the leaves as dictated by the grammar and the partially completed lower portion of the parse tree.

Each step of a reverse derivation is called a reduction. Bottom-up parsing uses left-to-right scanning of the input to construct a rightmost derivation in reverse. These parsers are called LR(k) parsers.

A handle is a substring that matches the body of a production. A reduction of a handle represents one step along the reverse of a rightmost derivation. The most critical part of bottom-up parsing is finding a handle to reduce.

Shift-reduce parsing is a form of bottom-up parsing where a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. The handle always appears at the top of the stack just before it is identified as the handle. The parser works as follows

  1. The parser shifts zero or more input symbols onto the stack until it is ready to reduce a string of grammar symbols on the top of the stack
  2. It reduces this string to the head of the appropriate production
  3. Repeat until the parser detects an error or the stack contains the EOF symbol and the input is empty

Advantages of LR(k) parsers include

  • They can recognise nearly all programming language constructs for which CFGs can be written
  • Can detect syntax errors as soon as they occur
  • They are more general than LL(k) parsers and can parse everything an LL(k) parser can, and more

The main disadvantage is that they are hard to write by hand, so parser generators are usually used to build them.

An LR parser makes shift-reduce decisions by maintaining states. Each state maintained by an LR parser represents a set of items where an item indicates how much of a production has been seen at a given point in the parsing process.

An item of a grammar is a production with a dot ($\cdot$) at some point in the body. This is called an LR(0) item.

A collection of LR(0) items, called a canonical LR(0) collection, provides the basis for constructing a deterministic finite automaton that is used to make parsing decisions. The canonical collection represents all of the possible states of the parser and the transitions between those states.

To derive the canonical collection, first add a new start production $S' \rightarrow S \$$, where $\$$ is the EOF symbol. Next, compute the closure of item sets. The closure of a set of items $I$ is the set of items constructed from $I$ using the following rules

  1. Initially, add every item in $I$ to CLOSURE($I$)
  2. If $A \rightarrow \alpha \cdot B \beta$ is in CLOSURE($I$) and $B \rightarrow \gamma$ is a production, then add the item $B \rightarrow \cdot \gamma$ if it is not already there
  3. Repeat rule 2 until no new items can be added

Once the closure set of items has been computed, compute the GOTO function for the set of items. The GOTO function defines transitions in the LR(0) automaton.

GOTO($I, X$) where $I$ is a set of items and $X$ is a grammar symbol is defined to be the closure of the set of all items $A \rightarrow \alpha X \cdot \beta$ such that $A \rightarrow \alpha \cdot X\beta$ is in $I$.
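
A minimal sketch of CLOSURE and GOTO, assuming items are encoded as (production index, dot position) pairs and using a small bracket grammar chosen for illustration:

PRODS = [("S'", ["S", "$"]), ("S", ["(", "S", ")"]), ("S", ["x"])]
NONTERMS = {"S'", "S"}
def closure(items):
    items = set(items)                   # each item is (production index, dot position)
    changed = True
    while changed:
        changed = False
        for prod, dot in list(items):
            head, body = PRODS[prod]
            if dot < len(body) and body[dot] in NONTERMS:
                B = body[dot]
                for i, (h, _) in enumerate(PRODS):
                    if h == B and (i, 0) not in items:
                        items.add((i, 0))        # add B -> . gamma
                        changed = True
    return frozenset(items)
def goto(items, X):
    moved = {(p, d + 1) for p, d in items        # move the dot over X
             if d < len(PRODS[p][1]) and PRODS[p][1][d] == X}
    return closure(moved)
I0 = closure({(0, 0)})                           # initial state, from S' -> . S $
print(sorted(I0))                                # [(0, 0), (1, 0), (2, 0)]
print(sorted(goto(I0, "(")))                     # [(1, 0), (1, 1), (2, 0)]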

The LR(0) automaton can be used to produce an LR(0) table, which can then be used by the parser. It is possible for a shift/reduce conflict to occur in the parsing table. This can be resolved with the more powerful SLR(1) parsing table which uses a lookahead of 1 and the FOLLOW set.

For grammars which can't be parsed with an SLR(1) parser, there are more powerful LR(1) and LALR parsers.

Semantic analysis

Syntax analysis checks that a program is grammatically correct. Semantic analysis ensures that the program has a well-defined meaning.

There are two methods for implementing semantic analysis

  • Attribute grammars - grammars augmented with rules to do semantic checking, also called Syntax-Directed Definitions (SDDs)
  • Translation schemes - based on attribute grammars but apply ad-hoc code fragments, called semantic actions, within production bodies; also called Syntax-Directed Translations (SDTs)

SDDs are rules attached to productions in a grammar. For example, a production $expr \rightarrow expr' + term$ could have a semantic rule to put it into postfix notation, expr.t = expr'.t || term.t || '+' where || denotes concatenation and t is called an attribute.

Applying semantic rules to a parse tree results in an annotated parse tree which show the attribute values at each node. Attributes may be of any kind, such as numbers, types, references to the symbol table or strings.

A synthesised attribute for a non-terminal $A$ at parse-tree node $N$ is defined by a semantic rule associated with the production at $N$. That production must have $A$ as the head. A synthesised attribute at node $N$ is defined only in terms of attribute values at the children of $N$ and at $N$ itself. Synthesised attributes can be evaluated during a single bottom-up traversal of a parse tree.

An SDD that only involves synthesised attributes is called S-attributed. In an S-attributed SDD, each rule computes an attribute for the non-terminal at the head of a production from attributes taken from the body of the production.

An inherited attribute for a non-terminal $B$ at parse tree node $N$ is defined by a semantic rule associated with the production at the parent of $N$. The production must have $B$ as a symbol in its body. An inherited attribute at $N$ is defined only in terms of attribute values at the parents and siblings of $N$ and $N$ itself.

Dependency graphs are a method of determining evaluation order for the attribute instances in a given parse tree. They depict the flow of information among the attribute instances in a given parse tree. An edge from one attribute instance to another means that the value of the first is needed to compute the second, expressing constraints implied by the semantic rules.

A topological sort of the dependency graph determines the order in which to evaluate the nodes. If the graph contains a cycle there is no topological sort; otherwise there is always at least one. Two classes of attribute grammars that never produce cyclic dependency graphs, and therefore guarantee an evaluation order, are

  • S-attributed grammars
  • L-attributed grammars

An attribute grammar is S-attributed if every attribute is synthesised. The attributes can be evaluated in any bottom-up order of the nodes in the parse tree, such as post-order. A post-order traversal of the parse tree corresponds to the order in which an LR parser reduces production bodies to their heads, so expressions can be evaluated during parsing without explicitly creating tree nodes.

An attribute grammar is L-attributed if every attribute is synthesised or inherited, but dependency graph edges can only go from left to right, not right to left.

Non-local computation in an SDD requires many supporting semantic rules to copy values around the tree, which is not practical in a real compiler. These additional rules can be replaced with a symbol table, but this breaks the dependency graph. These problems are resolved by SDTs.

SDTs are like SDDs but with program fragments, called semantic actions, embedded in production bodies. SDTs are more implementation oriented and indicate the order in which semantic rules are to be evaluated. They are usually implemented during parsing without building a parse tree. SDTs are often implemented alongside a symbol table.

SDTs can be used for type inference.

Any SDT can be implemented by building a parse tree and then performing the actions in a pre-order traversal; however, SDTs are usually implemented during parsing without building a parse tree. A semantic action can be executed as soon as all the grammar symbols to its left have been matched. For example, given a production $B \rightarrow X\{a\}Y$, the action $a$ can be performed as soon as $X$ is recognised. For a bottom-up parse, the action is performed as soon as $X$ appears on the top of the parsing stack. For a top-down parse, it is performed just before $Y$ is expanded.

SDTs that can be implemented during parsing include

  • Postfix SDTs
    • The underlying grammar is LR-parsable
    • The SDT implements an S-attributed SDD
    • Each semantic action is placed at the end of the production
    • The action is executed along with the reduction of the body during bottom-up parsing
  • SDTs that implement L-attributed definitions
    • The underlying grammar is LL-parsable
    • The SDT implements an L-attributed SDD

Not all SDTs can be implemented during parsing but any SDT can be implemented using a parse tree.
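
For example (a standard illustration rather than one taken from the slides), a postfix SDT that prints an expression in postfix notation places each action at the end of its production: $$ \begin{aligned} expr &\rightarrow expr_1 + term \;\{\text{print}(\text{'+'})\} \\ expr &\rightarrow term \\ term &\rightarrow \textbf{num} \;\{\text{print}(\textbf{num}.val)\} \end{aligned} $$ During a bottom-up parse each action is executed when its production body is reduced, so the input 9 + 5 prints 9 5 +.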

Types

A type is a set of values and a set of operations on those values. A programming language comes with a type system which specifies which operations are valid for which types.

Type checking verifies that the rules of the type system are respected. The goal is to ensure that operations are only used with the correct types. Type errors arise when operations are performed on values that don't support them. Type checking also enables some compiler optimisations.

Programming languages can be classified based on their type systems

  • Static - All type checking done at compilation (C, Java)
  • Dynamic - All type checking done at run time (Python, JS)
  • Untyped - No type checking done at all (Machine code)
  • Strong - All type errors are caught (Python, Java)
  • Weak - Operations may be performed on values of the wrong type (C, Perl)

Static typing catches many errors at compile time and avoids runtime type-checking overhead, but the compiler must be able to prove correctness, which makes rapid prototyping more difficult.

Dynamic typing allows for rapid prototyping but can have type errors at run time and have overhead from runtime type checking.

There are several methods of type checking

  • Type synthesis - Builds the type of an expression from the types of its sub-expressions
  • Type inference - Determines the type of a language construct from the way it is used
  • Type conversion - Converting a value of one type to a value of another type
    • Explicit type conversion - Done by the programmer, called a cast
    • Implicit type conversion - Done by the compiler, called a coercion
    • Widening conversions - Preserve information, like int to float
    • Narrowing conversions - Lose information, like float to int
    • Coercions are usually limited to widening conversions

The formalism for type checking is logical inference. For example, $$ \frac{ \begin{aligned} &X \text{ is an identifier} \\ &X \text{ is a variable in scope } S \text{ with type } T \end{aligned} }{S \vdash X: T} $$

Another example is for array access $$ \frac{S \vdash e_1: \text{T[ ]} \quad S \vdash e_2: \text{int}}{S \vdash e_1[e_2]: \text{T}} $$

Types can then be checked by doing a bottom-up pass of the AST.
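
A minimal sketch of such a bottom-up pass over an expression AST; the node classes and the two type rules encoded here (integer addition and array indexing, as above) are assumptions for illustration:

class Num:                               # integer literal
    pass
class Var:                               # variable reference
    def __init__(self, name): self.name = name
class Add:                               # e1 + e2
    def __init__(self, left, right): self.left, self.right = left, right
class Index:                             # e1[e2], array access
    def __init__(self, array, index): self.array, self.index = array, index
def type_of(node, scope):
    if isinstance(node, Num):
        return "int"
    if isinstance(node, Var):
        return scope[node.name]          # S |- X : T, from the symbol table
    if isinstance(node, Add):
        lt, rt = type_of(node.left, scope), type_of(node.right, scope)
        if lt == rt == "int":
            return "int"
        raise TypeError(f"cannot add {lt} and {rt}")
    if isinstance(node, Index):
        at, it = type_of(node.array, scope), type_of(node.index, scope)
        if at.endswith("[]") and it == "int":
            return at[:-2]               # S |- e1 : T[], S |- e2 : int  =>  e1[e2] : T
        raise TypeError(f"cannot index {at} with {it}")
    raise TypeError("unknown node")
scope = {"a": "int[]", "i": "int"}
print(type_of(Index(Var("a"), Add(Var("i"), Num())), scope))   # int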

Formal specification of a type system gives a rigorous, implementation independent definition of types and allows for formal verification of program properties.

Intermediate representation

In the process of translating a source program into target code, a compiler may construct one or more intermediate representations (IRs). IR is the central data structure that encodes the compiler's knowledge of the program. IR code generation is usually the last machine-independent phase of the compiler.

IRs can be classified along three broad dimensions

  1. Structure - Graphical, linear or hybrid
  2. Abstraction - How close the IR is to the source language (high level) or to the underlying machine (low level)
  3. Naming discipline - The scheme used to name values in an IR has a direct effect on the compiler's ability to optimise the IR and generate quality assembly from it

There are three structural categories of IR

  1. Graphical IRs encode the compiler's knowledge in a graph, such as parse trees and ASTs. These use a lot of memory.
  2. Linear IRs resemble pseudocode for an abstract machine. Examples include three-address code and stack machine code.
  3. Hybrid IRs combine elements of both, such as using linear IR to represent blocks of code but a graph to represent control flow.

Parse trees are a graphical IR with a node for each grammar symbol in the derivation. An AST is a parse tree without most of the non-terminals. A Directed Acyclic Graph (DAG) is a graphical IR like an AST with common sub-expressions factored out.

Control Flow Graphs (CFGs) model the flow of control between the basic blocks in a program. A basic block is a maximal-length sequence of branch-free code.

One-address code is a linear IR which models the behaviour of a stack machine. Operands are pushed onto a stack; each operation pops its operands, evaluates, and pushes the result back. For example, stack machine code for $a-2 \times b$ would be

push 2
push b
multiply
push a
subtract

Stack machine code is simple to generate and execute. Java compiles to bytecode, a stack-machine code with compact one-byte opcodes, which is executed by an interpreter (the JVM).

Three-address code is another linear IR. For example, $a-2 \times b$ becomes

T1 = 2
T2 = b
T3 = T1 * T2
T4 = a
T5 = T4 - T3

where Tn is a temporary register name.

Many modern processors implement three-address operations, so it is a good IR for modelling real assembly.

An address in three-address code can be one of the following

  • Name
  • Constant
  • Compiler generated temporary name

Valid three-address code instructions are

  • Binary assignment x = y op z
  • Unary assignment x = op y
  • Copy x = y
  • Unconditional jump goto L
  • Conditional jump if x goto L
  • Procedure call param x1; ...; param xn; call p, n
  • Indexed copy x = y[i] or x[i] = y
  • Address assignments x = &y
  • Pointer assignments x = *y or *x = y

Three-address code can be stored using one of the following representations (an example follows the list)

  • Quadruples - Four fields: op, arg1, arg2 and result. More space but easier to optimise
  • Triples - Three fields: op, arg1, arg2. Less space but harder to optimise
  • Indirect triples - Listing of pointers to triples. Optimisation is easier
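
For example (one possible encoding, shown for illustration), the instruction T3 = T1 * T2 from the three-address code above would be stored as the quadruple (*, T1, T2, T3). As a triple it would be (*, (0), (1)), where (0) and (1) refer to the positions of the earlier instructions that computed T1 and T2, so the result has no explicit name of its own. With indirect triples, an optimiser can reorder instructions by permuting the separate list of pointers without rewriting the position references inside the triples.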

Structural IRs are usually considered high level as they are explicitly related to the source code. Linear IRs are usually considered low level, but not always.

Static single-assignment (SSA) form is an intermediate representation that facilitates certain code optimisations. In SSA form, names correspond uniquely to specific definition points in the code. SSA uses the phi function to combine two definitions of a variable, such as for a conditional statement.
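
For example (an illustrative fragment rather than one from the slides), a conditional assignment in three-address code

x = 1
if c goto L
x = 2
L: y = x + 1

becomes, in SSA form,

x1 = 1
if c goto L
x2 = 2
L: x3 = phi(x1, x2)
y = x3 + 1

where the phi function selects x1 or x2 depending on which path reached L.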

Relative addresses are computed at compile time for local declarations. Types and relative addresses are stored in a symbol table entry for the variable. Data of varying length like arrays are handled by reserving space for a pointer to the start.

Symbol tables are data structures that are used by compilers to hold information about source-program constructs such as identifiers, arrays, structs and functions.

The scope of a declaration is the section of a program to which the declaration applies. Scopes are implemented by setting up a separate symbol table for each.

Runtime environments

The compiler assumes that the executing target program runs in its own logical address space in which each program value has a location. The operating system maps logical addresses into physical addresses which may be spread throughout memory.

When a program is invoked, the OS allocates a block of memory for it, loads the code into that memory and executes a jump to the entry point of the program.

The compiler is responsible for deciding the layout of data and generating code that correctly manipulates the data. There are references to the data in the code, so code generation and data layout need to be designed together.

The storage layout for data is influenced by the addressing constraints of the target machine. Data is word aligned if it begins at a word boundary. Most machines have some alignment restrictions and performance penalties for poor alignment. Aligned memory is also important for optimisations such as automatic vectorisation.

Static values are allocated at compile time whereas dynamic values are allocated at runtime. Compilers can use the stack and heap for dynamic storage allocation.

  • The stack stores data local to a function and supports the call/return policy using activation records.
  • The heap stores dynamically allocated data, for example from malloc

Stack allocation assumes execution is sequential and there is no parallel execution. It also assumes that when a procedure is called control always returns to the point immediately after the call.

When a procedure is called, space for its local variables is pushed onto the stack and popped after the procedure terminates. This allows the stack to be shared by function calls whose durations do not overlap in time. This is made possible by nesting in time.

An invocation of a procedure is called an activation. The lifetime of an activation is all the steps to execute it and all the steps in the procedures it calls.

Lifetimes of activations can be represented by an activation tree.

  • The sequence of procedure calls is a pre-order traversal of the tree
  • The sequence of returns is a post-order traversal of the tree
  • The ancestors of the currently executing node are all live activations.

Procedure calls and returns are managed by the control stack. Each live activation has an activation record (also called a frame) on the control stack. The currently executing procedure has its activation record at the top of the stack.

The activation record contains

  • Temporary data
  • Local data
  • Saved machine status - includes the return address and register contents before the procedure was called
  • Access link - used to locate non-local data (see access links below)
  • Control link - points to the activation record of the caller
  • Return values
  • Actual parameters

Procedure calls are implemented by a calling sequence and a return sequence. The first allocates an activation record on the stack and the second restores the state of the machine.

A calling sequence is divided between the calling procedure and the called procedure.

Caller-saves means before a function is called the caller saves all register values and restores them after the call is complete. Callee-saves means the called function saves the registers when called and restores them before returning.

Caller-saves can be made more efficient by only saving the registers that hold live variables. Callee-saves can be made more efficient by saving only the registers actually used in the function body.

Memory for data that is local to a procedure but whose size cannot be determined at compile time can be allocated on the stack. This memory is freed when the procedure returns, which avoids the garbage collection that allocating it on the heap might require.

There are two methods for finding data that is used within a procedure but defined outside of it.

  • Static scope - the data is found using the program text, usually in the most closely nested enclosing scope
  • Dynamic scope - the data is found at runtime in the closest activation record on the stack that contains it

In a language without nested procedures all variables are defined either in a single function or globally. No procedure declarations occur within another procedure.

Global variables are allocated static storage so have fixed locations known at compile time. Any other name must be local to the activation at the top of the stack.

A language with nested procedures can have nested function declarations and allows functions to take functions as arguments and return functions as values.

Scoping rules can be implemented by adding an access link pointer to the activation record. If a procedure is nested immediately inside another then the access link of the inner procedure points to the most recent activation record of the outer procedure. Access links form a chain from the activation record at the top of the stack to activations lower in the stack.

Access links are inefficient if the nesting depth is too large. Fast access to non-local data can be achieved using an array of pointers to activation records called a display. Each item in the array $d[i]$ is a pointer to the highest activation record on the stack for any procedure at nesting depth $i$. The compiler knows what $i$ is, so it can generate code to access non-local data using $d[i]$ and the offset from the top of the activation record, avoiding the chain of access links.

Under dynamic scope, a new activation inherits the existing bindings of non-local names to storage. A non-local name in the called activation refers to the same storage that it did in the calling activation. New bindings are set up for local names of the called procedures.

Dynamic scope can be implemented with deep or shallow access.

Deep access looks for the first activation record containing storage for the non-local name by following control links. The search may go deep into the stack. It takes longer than shallow access but there is no overhead associated with beginning and ending an activation.

Shallow access holds the current value of each name in statically allocated storage. When a new activation occurs, a local name in the procedure takes over the allocated storage for its previous value. The previous value can be held in the procedure's activation record and restored when it ends. This allows non-locals to be looked up directly but takes time to maintain these values when activations begin and end.

Actual parameters are the parameters used in the call of a procedure. Formal parameters are the parameters used in the procedure definition. There are several methods of parameter passing - call-by-value, call-by-reference, copy-restore and call-by-name. The result of a program can depend on the method used.

In the expression a[i] = a[j], a[i] represents a storage location and a[j] represents a value. How an expression is interpreted is determined by whether it appears on the left or the right of an assignment. An l-value refers to the storage represented by an expression and an r-value refers to the value contained within that storage. Differences between parameter-passing methods are based on whether an actual parameter represents an l-value, an r-value or the text of the actual parameter itself.

Call-by-value evaluates the actual parameters and passes their r-values to the procedure. Call-by-value can be implemented by treating a formal parameter as a local name and storing them in the activation record of the callee. The caller evaluates the actual parameters and places their r-values in the storage for formal parameters. Operations on the formal parameters do not affect values in the activation record of the caller.

A procedure called-by-value can affect its caller either through non-local names or pointers that are explicitly passed as values.

Call-by-reference passes a pointer to the storage location of each actual parameter to the procedure. If an actual parameter is a name or an expression with an l-value then that l-value is passed. If the actual parameter is an expression with no l-value then the expression is evaluated in a new location and the address of that location is passed. This is also known as call-by-address or call-by-location.

A reference to a formal parameter in the called procedure becomes an indirect reference through the pointer passed to the called procedure.

Copy-restore is a hybrid between call-by-value and call-by-reference. It is also known as copy-in-copy-out or value-result. Before control flows to the called procedure the actual parameters are evaluated. The r-values of the actual parameters are passed to the called procedure as in call-by-value. Additionally, the l-values of those actual parameters that have l-values are determined before the call. When control returns, the current r-values of the formal parameters are copied back into the l-values of the actual parameters using the l-values computed before the call.

In call-by-name, the procedure is treated as if it were a macro. Its body is substituted for the call in the caller with the actual parameters literally substituted for the formal parameters. This literal substitution is called macro expansion or inline expansion. The local names of the called procedure are kept distinct from the names of the calling procedure.

Procedure inlining is similar to call-by-name but parameter passing and result passing are converted into assignments and variables are scoped using renaming. Procedure inlining is an optimisation that may improve the execution time of a program by reducing the overhead of calling a procedure.

A value that outlives the procedure that creates it can't be kept in an activation record. The heap is used to store data that lives indefinitely or until it is explicitly deleted. Data on the heap are not tied to the procedure activation that creates them.

The memory manager is the subsystem that allocates and deallocates space within the heap. It serves as an interface between programs and the OS.

Memory allocation reserves a contiguous block of heap memory of a requested size. It first tries to satisfy an allocation using free space on the heap; failing that, it requests more virtual memory from the OS. If no memory is available it reports an error.

Memory deallocation returns memory space back to the heap.

A memory manager should aim for good space and time efficiency with low overhead. Space efficiency means minimising the total heap space needed by a program and is achieved mainly by minimising fragmentation. Memory allocation and deallocation are frequent operations, so they need to be as efficient as possible.

Fragmentation can be reduced by coalescing adjacent free chunks of the heap into larger chunks.

Best-fit object placement allocates the requested memory in the smallest available hole that is large enough. This spares larger holes for subsequent larger allocation requests. It is good for improving memory utilisation but not good for spatial locality.

Next-fit object placement uses the best-fit strategy but allocates in the most recently split hole if enough space is available. This improves spatial locality and tends to improve allocation speed.
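
A minimal sketch of best-fit placement over a free list (the free-list representation is an assumption for illustration, and coalescing is omitted):

def best_fit_alloc(free_list, request):
    candidates = [hole for hole in free_list if hole[1] >= request]
    if not candidates:
        return None, free_list                        # would fall back to asking the OS
    start, size = min(candidates, key=lambda hole: hole[1])
    free_list = [h for h in free_list if h != (start, size)]
    if size > request:
        free_list.append((start + request, size - request))   # keep the remainder as a smaller hole
    return start, free_list
free = [(0, 16), (32, 64), (128, 24)]                 # (start address, size) holes
addr, free = best_fit_alloc(free, 20)
print(addr, free)                                     # 128 [(0, 16), (32, 64), (148, 4)]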

Code written in languages with manual memory management can be vulnerable to many issues including memory leaks, dangling pointers and buffer overflow.

Garbage collection automatically frees memory when it is no longer reachable. When a new object is created unused memory space is automatically allocated.

An object is reachable if and only if a register contains a pointer to it or another reachable object contains a pointer to it. All reachable objects can be found by starting from the registers and following all the pointers. Unreachable objects can never be used again.

The mark and sweep garbage collection strategy finds reachable objects and marks them, then sweeps the unreachable objects and frees them. Every object has a mark bit which denotes whether it is reachable.
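
A minimal mark-and-sweep sketch over an explicit object graph (the heap and root representations are assumptions for illustration; a real collector works over raw memory and machine registers):

class Obj:
    def __init__(self, name, refs=()):
        self.name, self.refs, self.marked = name, list(refs), False
def mark(roots):
    worklist = list(roots)
    while worklist:                       # follow pointers from the roots
        obj = worklist.pop()
        if not obj.marked:
            obj.marked = True
            worklist.extend(obj.refs)
def sweep(heap):
    live = [o for o in heap if o.marked]  # unreachable objects would be freed here
    for o in live:
        o.marked = False                  # reset mark bits for the next collection
    return live
a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)                          # a -> b; c is unreachable
heap = [a, b, c]
mark(roots=[a])                           # registers are modelled by the root list
heap = sweep(heap)
print([o.name for o in heap])             # ['a', 'b']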

Optimisations

Optimisation consists of two phases

  1. Processing intermediate code
  2. Post-processing target code

Basic blocks partition an IR program into maximal sequences of consecutive three-address instructions. Flow of control can only enter a basic block through its first instruction and can only leave through its last. Once a basic block has been entered, every instruction in it executes, in order, without any intervening jumps.

Basic blocks can then be used as the nodes of a Control Flow Graph (CFG) whose edges indicate which blocks can follow which other blocks.

Optimisations have different granularities

  • Local - Apply to a single basic block in isolation
  • Global - Apply to a control flow graph which represents a function body in isolation
  • Interprocedural - Apply across function boundaries by optimising collections of functions as a whole

Optimisations should balance difficulty of implementation, compilation time and performance payoff.

Local optimisations

Algebraic simplification replaces complex operations with simpler ones. Some statements can be deleted, like x = x + 0, and some can be simplified, for example x = x * 0 becomes x = 0. Replacing an expensive operation with a cheaper one, for example y = y ** 2 with the faster y = y * y, or x = x * 8 with the bit shift x = x << 3, is called a reduction in strength optimisation.

Constant folding computes operations on constants at compile time. For example, x = 2 + 3 is replaced with x = 5.

Eliminating unreachable blocks removes code which can never be reached, making the code smaller and faster.

Common subexpression elimination eliminates instructions that compute a value that has already been computed. If basic blocks are in SSA form then two assignments having the same RHS compute the same value.

Copy propagation: after a copy statement such as a = x, later uses of a in the block can be replaced with x. For example, x = y + z; a = x; p = 4 * a can be replaced with x = y + z; a = x; p = 4 * x. This does not improve performance by itself but can enable other optimisations such as constant folding or dead code elimination. Constant propagation is the analogous replacement of a variable with a constant, for example a = 5; x = 3 * a becomes a = 5; x = 3 * 5.

Dead code elimination removes code that does not contribute to the result of the program. If $x = \text{RHS}$ appears and $x$ is not used anywhere else in the program (and the RHS has no side effects) then the statement can be removed.

Pointers make optimisations harder because of aliasing, where two variables with different names can refer to the same storage. In C, restrict can be used to declare that pointers do not overlap, making more optimisations possible, but this claim is not checked by the compiler.

Global optimisations

Common subexpression elimination and copy propagation can also be done globally.

Knowing when the value of a variable will be used next is essential for carrying out optimisations. If statement $i$ is x = value and statement $j$ is z = x + y then it can be said that $j$ uses the value of $x$ computed at statement $i$ and that $x$ is live after statement $i$.

A variable is live at a particular point in the program if its value at that point will be used in the future, otherwise it is dead.

In a loop, a variable whose value is derived from the number of iterations executed by an enclosing loop is called an induction variable. Induction variables can often be optimised by replacing a multiplication with an addition or subtraction per loop iteration. For example, if i = i + 1; x = 4 * i; is in the body of a loop, it can be replaced with i = i + 1; x = x + 4;. This has the same meaning but uses addition, which is less expensive than multiplication. This is a strength reduction.

Loops can also be made more efficient by factoring out loop invariants. This is called code motion. For example, x*x in for(i=0;i<100;i++){ a[i] = i*10 + (x*x) } is invariant and can be computed once outside the loop body.

Data-flow analysis refers to a group of techniques that derive information about the flow of data along program execution paths. The analysis of a complex program is expressed as a combination of simple rules relating the change in information between adjacent statements. Information is transferred from one statement to the next, adhering to constraints based on the semantics of the statements and the flow of control. This produces data-flow values immediately before and after a statement. The relationship between data-flow values before and after a statement is known as a transfer function. Information in a transfer function may propagate in two directions, forward or backward along execution paths.

In a basic block, the data-flow value into a statement is the same as the data-flow value out of the previous statement.

Control-flow edges between basic blocks create more complex constraints between the last statement of one block and the first statement of the next. For example, the set of definitions reaching the first statement of a block is the union of the definitions leaving the last statements of each of its predecessor blocks.

A data-flow schema involves data-flow values for a given property at each point in the program.

A definition of a variable is a statement that assigns a value to it. A definition $d$ reaches a point $p$ if there is a path from the point immediately following $d$ to $p$ such that $d$ is not killed along that path. A definition is killed if there is any other definition of the same variable anywhere along the path. For example, $d:$ u = v + w generates a definition $d$ of variable u, kills all other definitions of u and does not affect the other incoming definitions. The transfer function of $d$ can be expressed as $f_d(x) = \text{gen}_d \cup (x \setminus \text{kill}_d)$. Reaching definitions are used to do global constant and copy propagation.

The transfer function of a basic block can be found by composing the transfer functions of the statements contained within the block.

A basic block generates and kills a set of definitions. The generated set contains all the definitions inside the block that are visible immediately after the block, referred to as downwards exposed. A definition is downwards exposed in a basic block if it is not killed by a subsequent definition to the same variable inside the same basic block.

It is assumed that every CFG has two empty basic blocks, ENTRY and EXIT. No definitions reach the beginning of the graph, so the boundary condition for reaching definitions is $\text{OUT}(\text{ENTRY}) = \emptyset$.
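
Putting these pieces together, reaching definitions can be computed by iterating $\text{IN}(B) = \bigcup_{P \in \text{pred}(B)} \text{OUT}(P)$ and $\text{OUT}(B) = \text{gen}_B \cup (\text{IN}(B) \setminus \text{kill}_B)$ until nothing changes. A minimal sketch (not the slides' algorithm), with an illustrative CFG and hand-written gen/kill sets as assumptions:

PRED = {"ENTRY": [], "B1": ["ENTRY"], "B2": ["B1", "B2"], "EXIT": ["B2"]}
GEN  = {"ENTRY": set(), "B1": {"d1"}, "B2": {"d2"}, "EXIT": set()}
KILL = {"ENTRY": set(), "B1": {"d2"}, "B2": {"d1"}, "EXIT": set()}   # d1 and d2 define the same variable
def reaching_definitions(blocks):
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}                 # boundary condition: OUT(ENTRY) = {}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            IN[b] = set().union(*[OUT[p] for p in PRED[b]])
            new_out = GEN[b] | (IN[b] - KILL[b])     # the block's transfer function
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT
IN, OUT = reaching_definitions(["ENTRY", "B1", "B2", "EXIT"])
print(IN["EXIT"])                                    # {'d2'}: only d2 reaches EXIT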

[missing slide 70-105]

Code generation

Code generation takes the IR produced by the front-end of the compiler and the symbol table and outputs a semantically equivalent target program.

The main tasks of the code generator are

  • Instruction selection - choosing appropriate target machine instructions
  • Register allocation - Deciding which values to keep in which registers
  • Instruction ordering/scheduling - Deciding in which order to schedule the execution of instructions

Heuristic techniques are used to generate good but not necessarily optimal code.

The code generator must map the IR program into a code sequence that can be executed by the target machine. This is a pattern matching problem that depends on the level of the IR, the ISA of the machine and the desired quality of the code. The simplest solution is to translate each IR instruction into one or more machine code instructions but this is not very efficient.

The quality of generated code is determined by its speed and size. Speed means the execution time of the machine instruction sequence. Size means the number of machine instructions. Other measures like energy efficiency could be considered.

An IR program can be implemented by many different code sequences with significant cost differences. Therefore it is important to know instruction costs to design good code sequences.

The target ISA also significantly affects instruction selection. The IR may map easily onto a RISC machine, but several IR operations may need to be combined to make effective use of a CISC machine, and for a stack machine like the JVM the register-based IR needs to be translated into stack-based code.

[missing something]

Statement-by-statement code generation often generates naive code. The generated code can be improved by carrying out optimisations on the target code. These are not guaranteed to produce a better executable but can significantly improve the running time or space of the target program.

Peephole optimisation examines a short sequence of target instructions (called the peephole) and replaces it with a shorter or faster sequence whenever possible. The peephole is a small, sliding window over the program. Each improvement can create new optimisation opportunities, so multiple passes may need to be made.
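
For example (an illustrative pattern, written with LD dst, src and ST dst, src operand conventions), a store straight back to the location that was just loaded is redundant and can be deleted when both instructions lie in the same basic block:

LD R0, a
ST a, R0

The second instruction can be removed because a already contains the value held in R0.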

Optimal code generation for expressions uses the abstract syntax tree of an expression to generate an optimal code sequence; such algorithms can be proven to generate the shortest possible sequence of instructions for the expression.

[missing 3 lectures]