Contents [0/133] |
Overview: Are You In The Right Room? [1/133] |
Course: Compiler Design (CSC448)
Instructor: James Riely
Overview: Overview of Today's Class [2/133] |
Course overview.
Administrative information.
Introduction to parsing and the CUP LALR parser generator.
Overview: Course Overview [3/133] |
We will cover:
The homework assignments will require you to implement parts of a compiler (written in Java) for a non-trivial C-like programming language.
Overview: Why Take This Course? [4/133] |
By the end of this course, you will:
Overview: Course Homepage [5/133] |
Course homepage: http://www.depaul.edu/~jriely/csc448winter2011
Lecture slides are available in two HTML formats: one slide per page and all slides on one page.
Lecture slides are available if a check is shown (click on the check). Lecture slides may not be available before the class.
Overview: Contact Information [6/133] |
Office hours at bottom right of the course homepage: http://www.depaul.edu/~jriely/csc448winter2011
Email: mailto:jriely@cs.depaul.edu
I check email more often than voice-mail. If you leave a phone message, repeat your phone number slowly at the beginning of your message.
Please check that the email address on your CampusConnect record is correct.
Overview: Course Mailing List [7/133] |
Course mailing list: mailto:csc448winter2011@googlegroups.com
Unless your message is personal, send it to the course mailing list!
To subscribe to the list or unsubscribe from it, go to the link on the course homepage.
Last minute information will go to the mailing list.
If necessary, update your spam filter to accept messages from the mailing list.
Overview: Policies [8/133] |
You are required to attend/watch all of the lectures.
Please include your name and the course in your emails:
From: fun_dude@yahoo.com
Subject: Homework

Prof, Where's the homework???
-- fun_dude@yahoo.com
Overview: Assessment [9/133] |
Your final grade will be based on:
Class/List participation includes: asking (on-topic) questions; answering questions well; finding bugs; posting (on-topic) news items; agreeing to use your homework submission for classroom discussion. Grades for participation will be assigned for the periods ending on the third, sixth, and tenth weeks. The grade will be one of: none or very unsatisfactory (0); unsatisfactory (1); satisfactory (2); good (3); excellent (4). The participation score will be the sum of the corresponding scores for each period.
Homework assignment announcements are posted on http://col.cti.depaul.edu.
Assessment for homework assignments will be based on whether they achieve the set task and on the quality of the code.
Homework assignment submissions must be created using "ant submit" unless otherwise specified.
You are expected to complete all of the homework assignments by the deadline. Late homework submissions will be penalized or rejected. Homework assignments must be submitted through the online system. Email submissions will not be accepted.
Project: parser or compiler-related development and project report.
Overview: Plagiarism [10/133] |
See the academic integrity policy in the student handbook.
Homework assignments must be solved individually. You must not look at anyone else's solutions, and you must clearly acknowledge any code that you obtain from other sources (such as books, magazines, or the Internet). If you are in any doubt, contact the instructor for advice.
Your project work and report must consist of your own work. Use of material prepared by others must be minimal (check with the instructor if in any doubt) and clearly indicated by quoting and citation.
Overview: Textbooks [11/133] |
Required:
Overview: Prerequisites [12/133] |
Prerequisites: CSC373 and CSC383 and CSC347/CSC447
You must be familiar with the following topics:
Overview: Development Tools [13/133] |
We will use several development tools:
Overview: Installing MinGW [14/133] |
The compiler produces assembly language output. An assembler is used to produce object files, and a linker is used to produce executable files.
It is also useful to have:
We use MinGW: Minimalist GNU For Windows.
Why MinGW? It is free; a small download; fast; its debugger is powerful but unfriendly.
Install MinGW: download MinGW-3.1.0-1.exe and install to c:\mingw (other directories will not work).
Add c:\mingw\bin to the PATH environment variable (see setup.bat in the examples directory).
To verify the installation, try compiling and running a hello world C program:
gcc -o build\hello.exe src\hello.c
build\hello.exe
For example:
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  printf ("Hello World!\n");
  return 0;
}
Overview: Installing Apache Ant [15/133] |
You will need Apache Ant to compile the Java examples and to create homework submissions.
Download apache-ant-1.6.5-bin.zip from Apache Ant.
Unzip it to c:\l so that the directory c:\l\apache-ant-1.6.5 is created.
Set ANT_HOME, JAVA_HOME, and PATH appropriately (see setup.bat in the examples/java directory).
Overview: Installing CLOGS [16/133] |
Download and unzip csc448source.zip from the course home page.
Edit src\setup.bat to match your directory structure.
You may also want to look at: file:working/
Overview: Tell Me About Yourself [17/133] |
What do you do, and what are you looking for in this course?
Overview: Lexing and Parsing Overview [18/133] |
For the compiler writer:
Lexer and parser generators exist and are normally used.
Based on formalisms including:
Overview: Review of Regular Expressions [19/133] |
A language is a set of strings over some alphabet.
Consider the following operations on languages over the same alphabet:
Regular expressions are a syntax for describing these operations.
See sections 3.1, 3.3, 3.5 of the Dragon book.
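The union, concatenation, and Kleene-star operations map directly onto regular-expression syntax. A minimal Java sketch using java.util.regex (the class name RegexDemo and its methods are ours, for illustration only):

```java
import java.util.regex.Pattern;

public class RegexDemo {
    // Union: the string "Alice" or the string "Bob".
    static final Pattern NAME = Pattern.compile("Alice|Bob");
    // Concatenation and Kleene star: an 'a' followed by zero or more 'b's.
    static final Pattern AB_STAR = Pattern.compile("ab*");

    public static boolean isName(String s) { return NAME.matcher(s).matches(); }
    public static boolean isAbStar(String s) { return AB_STAR.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isName("Alice"));   // true
        System.out.println(isName("Carol"));   // false
        System.out.println(isAbStar("abbb"));  // true
    }
}
```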
Overview: Regular Expression Examples [20/133] |
Build regular expressions for the following examples:
The string "Dragon".
The string "Alice" or "Bob".
The URL "http://fpl.cs.depaul.edu".
Overview: Review of Context-Free Grammars, BNF, and EBNF [21/133] |
Context-free grammars and BNF.
EBNF notation.
See sections 4.1, 4.2, 4.9 of the Dragon book.
Overview: JFlex and CUP [22/133] |
Lexer and parser generators: lex, yacc, bison, JavaCC, ANTLR, CUP.
Strongly recommended: read the JFlex and CUP manuals. Several times.
See file:demo.flex.
Overview: Arithmetic [23/133] |
EBNF/BNF for arithmetic expressions.
Precedence and associativity.
See the JFlex specification file:arithmetic.flex and the CUP specification file:arithmetic.cup.
Ignore semantic actions (code in braces) for now.
Overview: Homework [24/133] |
See col.cti.depaul.edu
for all homework postings.
Parsing: Overview of Today's Class [25/133] |
Grammars and parsing with CUP.
Clogs language and compiler.
Abstract syntax trees and pretty printing.
Parsing: Discussion [26/133] |
Any questions?
Parsing: Clogs [27/133] |
Instructions to compile, assemble, and link Clogs programs.
Discussion of Clogs: syntax, semantics, compiler phases, and runtime system.
Exercise: write an amusing Clogs program.
Parsing: Parsing Refresher [28/133] |
Grammars and productions: a production is a pair of a non-terminal symbol and a sequence (possibly empty) of terminal and non-terminal symbols.
EBNF to BNF: eliminating |, ?, *, and +.
Parse trees:
Ambiguity:
Rewriting grammars to eliminate ambiguity.
Parsing: Coding Conventions [29/133] |
Immutable data structures simplify transformations.
Linked lists and maps.
CAUTION: list.cons (x) does not mutate list!

list.cons (x);         // INCORRECT: DISCARDS RESULT
list2 = list.cons (x); // CORRECT: MAKES USE OF RESULT
list = list.cons (x);  // CORRECT: MAKES USE OF RESULT
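The caution above can be made concrete with a minimal immutable linked list (the class ImmList here is an illustrative sketch, not the course's actual list class):

```java
public class ImmList {
    final int head;
    final ImmList tail; // null represents the empty tail

    ImmList(int head, ImmList tail) { this.head = head; this.tail = tail; }

    // cons returns a NEW list sharing this one as its tail; the receiver is untouched.
    ImmList cons(int x) { return new ImmList(x, this); }

    int length() {
        int n = 0;
        for (ImmList l = this; l != null; l = l.tail) n++;
        return n;
    }

    static ImmList of(int x) { return new ImmList(x, null); }

    public static void main(String[] args) {
        ImmList list = ImmList.of(1);
        list.cons(2);                 // result discarded: list still has 1 element
        ImmList list2 = list.cons(2); // list2 has 2 elements; list is unchanged
        System.out.println(list.length() + " " + list2.length()); // prints "1 2"
    }
}
```

Because nodes are never mutated, list and list2 can safely share structure, which is exactly why immutability simplifies AST transformations later in the course.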
Parsing: Abstract Syntax Trees [30/133] |
Abstract syntax trees (ASTs) are distinct from parse trees.
We use immutable ASTs: shallow vs. deep copy.
AST for arithmetic: no parentheses; visitor design pattern vs. instanceof.
Pretty printer for Arithmetic.
Building the AST during parsing with CUP.
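A minimal sketch of an arithmetic AST with a pretty-printing visitor (all class and method names here are ours; the course AST differs in detail):

```java
public class Arith {
    interface Exp { <T> T accept(Visitor<T> v); }
    interface Visitor<T> { T visitNum(Num n); T visitAdd(Add a); }

    static final class Num implements Exp {
        final int value;
        Num(int value) { this.value = value; }
        public <T> T accept(Visitor<T> v) { return v.visitNum(this); }
    }

    static final class Add implements Exp {
        final Exp left, right;
        Add(Exp left, Exp right) { this.left = left; this.right = right; }
        public <T> T accept(Visitor<T> v) { return v.visitAdd(this); }
    }

    // The AST stores no parentheses; the pretty printer re-inserts them.
    static final class Printer implements Visitor<String> {
        public String visitNum(Num n) { return Integer.toString(n.value); }
        public String visitAdd(Add a) {
            return "(" + a.left.accept(this) + " + " + a.right.accept(this) + ")";
        }
    }

    static String print(Exp e) { return e.accept(new Printer()); }

    public static void main(String[] args) {
        Exp e = new Add(new Num(1), new Add(new Num(2), new Num(3)));
        System.out.println(print(e)); // (1 + (2 + 3))
    }
}
```

The instanceof alternative would replace the Visitor interface with a chain of type tests in print; the visitor version moves each case into its own method, so adding a new traversal does not touch the node classes.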
Parsing: Abstract Syntax Trees for Clogs [31/133] |
AST for Clogs.
Pretty printer for Clogs.
Phases in the Clogs compiler.
Modification of the configuration files to parse only.
Optimizations and further analysis can be carried out as additional phases that operate on the AST, e.g., constant propagation.
Parsing: CUP for Clogs [32/133] |
Develop BNF for Clogs.
Compare with the Clogs CUP file.
Building the Clogs AST during parsing with CUP.
Parsing: Homework [33/133] |
Reminder: see COL website for details about homework assignments.
Type Checking: Overview of Today's Class [34/133] |
Type checking!
Type Checking: Discussion [35/133] |
Any questions?
Type Checking: Type Checking: Overview [36/133] |
Overview of type checking goals: type safety and runtime type errors:
Why type check intermediate AST repeatedly inside a compiler?
Overview of type checking code in Clogs.
Type Checking: Well-Formedness [37/133] |
Analyzing the AST: instanceof or visitors?
Simple properties: no void[] types; no void parameters or variables, as in int foo (void x) or void x;
Type Checking: Integers and Strings [38/133] |
Suppose we want to type check expressions of the following language:
exp ::= <INT>
      | <STRING>
      | exp "+" exp
      | exp "-" exp
      | "length" "(" exp ")"
      | "inttostring" "(" exp ")"
Inductive definition of type checking rules.
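The inductive rules can be turned into code directly: the type of a node is computed from the types of its children. A sketch in Java (class names such as Checker, Lit, and Bin are ours, not the course's):

```java
public class Checker {
    enum Type { INT, STRING }

    interface Exp {}
    // A literal already knows its type: <INT> or <STRING>.
    static final class Lit implements Exp { final Type t; Lit(Type t) { this.t = t; } }
    // "+" and "-" type-check identically, so one node suffices for the sketch.
    static final class Bin implements Exp {
        final Exp l, r; Bin(Exp l, Exp r) { this.l = l; this.r = r; }
    }
    static final class Length implements Exp { final Exp e; Length(Exp e) { this.e = e; } }
    static final class IntToString implements Exp { final Exp e; IntToString(Exp e) { this.e = e; } }

    static Type check(Exp exp) {
        if (exp instanceof Lit) return ((Lit) exp).t;
        if (exp instanceof Bin) {
            Bin b = (Bin) exp;
            if (check(b.l) == Type.INT && check(b.r) == Type.INT) return Type.INT;
            throw new RuntimeException("arithmetic on non-integers");
        }
        if (exp instanceof Length) {
            if (check(((Length) exp).e) == Type.STRING) return Type.INT;
            throw new RuntimeException("length of non-string");
        }
        if (exp instanceof IntToString) {
            if (check(((IntToString) exp).e) == Type.INT) return Type.STRING;
            throw new RuntimeException("inttostring of non-integer");
        }
        throw new RuntimeException("unknown expression");
    }
}
```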
Type Checking: Integers, Strings, AND Variables [39/133] |
What if we add variables:
exp ::= <INT>
      | <STRING>
      | exp "+" exp
      | exp "-" exp
      | "length" "(" exp ")"
      | "inttostring" "(" exp ")"
      | <ID>
Type Checking: Adding Null [40/133] |
What if we add null and allow it to be either an integer or a string:

exp ::= <INT>
      | <STRING>
      | exp "+" exp
      | exp "-" exp
      | "length" "(" exp ")"
      | "inttostring" "(" exp ")"
      | <ID>
      | "null"
Top-down vs bottom-up type checking.
Type Checking: Statements [41/133] |
Similar rules apply to statements; declarations and statement sequencing determine scope behavior.
Type Checking: Functions [42/133] |
Function declarations are added to the context.
Type Checking: Clogs Implementation [43/133] |
Translation of type checking rules into code.
Type Checking: If Time Permits [44/133] |
Type checking for:
AST Transformations: Overview of Today's Class [45/133] |
Type checking continued.
Brief overview of runtime system.
Simplifying compilation units via source to source transformations:
int x = 1; --> int x; x = 1;
if (e) s1 else s2
-->
if (e) goto l1; else goto l2;
l1: s1
goto l3;
l2: s2
goto l3;
l3:
x = e1+f(e2);
-->
int t1; int t2; int t3;
t1 = e1;
t2 = e2;
t3 = f(t2);
x = t1 + t3;
l: if (e) { l: s }
-->
l1: if (e) { l2: s }
Comments:
AST Transformations: Discussion [46/133] |
Any questions?
AST Transformations: Eliminating Initializers I [47/133] |
Implementation in transform
directory.
The implementation must construct a new AST. It cannot modify the existing AST because the AST classes are immutable.
Transformation?
AST Transformations: Eliminating Initializers II [48/133] |
Naive transformation vs:
void foo (int x) {
  int y = x;
  int x = 1;
}
Re-examine Context.java
.
AST Transformations: Fresh Labels/Variables [49/133] |
When we manipulate blocks in source to source transformations, we must not confuse different local variables with the same name.
The semantics of our programming language allows us to add fresh variables and labels.
Yes, the given code is buggy, because it may give incorrect results if the original program contains, e.g., a label called "lab_6_true". Fixes?
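One possible fix, sketched in Java (the class FreshNames is ours): first reserve every label and variable name that occurs in the original program, then generate candidate names until one avoids every reserved name.

```java
import java.util.HashSet;
import java.util.Set;

public class FreshNames {
    private final Set<String> used = new HashSet<>();
    private int counter = 0;

    // Record a label/variable name already present in the program.
    public void reserve(String name) { used.add(name); }

    // Return a name guaranteed not to clash with any reserved or
    // previously generated name.
    public String fresh(String base) {
        String candidate;
        do { candidate = base + "_" + (counter++); } while (used.contains(candidate));
        used.add(candidate);
        return candidate;
    }
}
```

An alternative fix is to build fresh names with a separator character that the lexer forbids in source identifiers, so clashes are impossible by construction.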
AST Transformations: Eliminating High-Level Control Structures [50/133] |
Constructs:
Preservation of compound statements, lifetime of local variables
Peephole optimization
Further constructs:
"goto and Destructors in GNU C++
In C++ programs, you can safely use the goto statement. When you use it to exit a block which contains aggregates requiring destructors, the destructors will run before the goto transfers control. (In ANSI C++, goto is restricted to targets within the current block.)
The compiler still forbids using goto to enter a scope that requires constructors."
AST Transformations: Three Address Code [51/133] |
From complex expressions such as "e1+f(e2, e3*(x=e4))" to three address code.
Transforming an expression results in a simple expression (a literal or a variable), declarations of new temporary variables, and assignment statements for those temporary variables.
Simplifying: "++e", "e++", "e1?e2:e3"
For prefix "++e" and postfix "e++", the expression "e" must be an l-value. Neither "++e" nor "e++" is an l-value. "++x" is equivalent to "x = x + 1".
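The flattening of complex expressions can be sketched as a recursive transformer that returns the simple expression naming the result and emits assignments for intermediate values as a side effect (class Tac and its node names are ours, and the statements are emitted as strings to keep the sketch short):

```java
import java.util.ArrayList;
import java.util.List;

public class Tac {
    interface Exp {}
    static final class Var implements Exp {
        final String name; Var(String n) { name = n; }
        public String toString() { return name; }
    }
    static final class Lit implements Exp {
        final int value; Lit(int v) { value = v; }
        public String toString() { return Integer.toString(value); }
    }
    static final class Add implements Exp {
        final Exp l, r; Add(Exp l, Exp r) { this.l = l; this.r = r; }
    }
    static final class Call implements Exp {
        final String f; final Exp arg;
        Call(String f, Exp arg) { this.f = f; this.arg = arg; }
    }

    private final List<String> code = new ArrayList<>();
    private int temp = 0;

    private String freshTemp() { return "t" + (++temp); }

    // Returns a simple expression (literal or variable) naming the result;
    // emits three-address assignments for every compound subexpression.
    String flatten(Exp e) {
        if (e instanceof Var || e instanceof Lit) return e.toString();
        String t = freshTemp();
        if (e instanceof Add) {
            Add a = (Add) e;
            String l = flatten(a.l), r = flatten(a.r);
            code.add(t + " = " + l + " + " + r + ";");
        } else {
            Call c = (Call) e;
            String arg = flatten(c.arg);
            code.add(t + " = " + c.f + "(" + arg + ");");
        }
        return t;
    }

    List<String> getCode() { return code; }
}
```

For e1 + f(e2) this emits the call before the addition, matching the evaluation order fixed by the transformation.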
Code Generation: Overview of Today's Class [52/133] |
Run-time environments.
x86 assembly language and tools.
Code Generation: Discussion [53/133] |
Any questions?
Code Generation: Run-Time Environment I [54/133] |
Consider a small programming language of expressions and statements:
exp ::= <ID> | <INT> | exp "+" exp | exp "-" exp | exp "<=" exp | exp "=" exp
stat ::= <ID> ":=" exp | "label" <ID> | "if" exp "goto" <ID> | "print" exp
Suppose that integers represent booleans, e.g., an integer represents true if and only if it is non-zero.
For example, the following program prints the numbers from 3 down to 0:

x := 3;
goto l2;
label l1;
print (x);
x := x - 1;
label l2;
if (0 <= x) goto l1;
Code Generation: Run-Time Environment II [55/133] |
Run-time storage: the eip (Instruction Pointer) register.
Execute the program:
01 x := 3;
02 goto l2;
03 label l1;
04 print (x);
05 x := x - 1;
06 label l2;
07 if (0 <= x) goto l1;
08
Code Generation: Run-Time Environment III [56/133] |
As an aside, analysis of a program may reveal that some storage locations for variables can be reused:
x := 3; goto l2; label l1; print (x); x := x - 1; label l2; if (0 <= x) goto l1;
y := 5; goto l4; label l3; print (y); y := y - 1; label l4; if (0 <= y) goto l3;
Code Generation: Run-Time Environment IV [57/133] |
Suppose that the language is extended with procedures that can declare local variables:
exp ::= <ID> | <INT> | exp "+" exp | exp "-" exp | exp "<=" exp | exp "=" exp
stat ::= <ID> ":=" exp | "label" <ID> | "if" exp "goto" <ID> | "print" exp | "var" <ID> | "invoke" <ID> "(" args ")"
proc ::= "proc" <ID> "(" params ")" { stat* }
params ::= ( <ID> ( "," <ID> )* )?
args ::= ( exp ( "," exp )* )?
If the language does not permit recursive calls, how can we organize the run-time storage?
Arguments and local variables for a procedure are grouped together into an activation record or frame.
The position of local variables is known at compile-time.
How can we record information about the calling procedures? Note that the number of calling procedures is bounded by the number of procedures.
A sample program:
proc upto (x) {
  var y;
  y := 0;
  goto l2;
  label l1;
  print (y);
  y := y + 1;
  label l2;
  if (y <= x) goto l1;
}

proc f () {
  var z;
  z := 3;
  goto l4;
  label l3;
  print (z);
  z := z - 1;
  label l4;
  if (0 <= z) goto l3;
}
Code Generation: Run-Time Environment V [58/133] |
What if recursive procedure calls are permitted?
A stack allows fast allocation and deallocation of activation records.
The position of local variables is not known at compile-time.
Generated code expects to find local variables at a fixed offset from the current activation record (in a C-like language).
On x86:
The ebp (Stack-Frame Base Pointer) register points to the current activation record.
-4(%ebp) refers to the memory location that is 4 bytes (one word = one 32-bit integer) below the address held in the base pointer. The following code moves the integer constant 3 into that location:

movl $3, -4(%ebp)

The esp (Stack Pointer) register points to the top element of the stack.
Code Generation: Run-Time Environment VI [59/133] |
Procedures and functions communicate by passing arguments and return values.
There are several calling conventions in common use on x86, but we will use the following convention:
The result is returned in eax. There are other general-purpose registers: ebx, ecx, edx.
For example, f(1,2,3) might compile to:

pushl $3
pushl $2
pushl $1
call f
addl $12, %esp
Note that the activation record of the caller grows temporarily before the call (as arguments are pushed onto the stack), and shrinks after the call (when the stack pointer is moved upwards).
The arguments are reversed to allow procedures or functions with a variable number of arguments.
Code Generation: Run-Time Environment VII [60/133] |
Exercise: extend the language with pointers.
In a language with pointers, one invocation of a function may have a pointer into the activation record of another invocation of a function. This can lead to dangling pointers.
Data objects with a lifetime that does not correspond to the lifetime of an activation record are stored in a separate area of memory called the heap.
Code Generation: Learning via GCC I [61/133] |
A practical way to learn x86 assembly language and to understand the operation of a compiler is to look at the code generated by a C compiler.
Also beneficial when debugging C/C++ programs or identifying performance bottlenecks.
Consider loop.c:
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  int x;
  x = 3;
  goto l2;
l1:
  printf ("%d\n", x);
  x = x - 1;
l2:
  if (0 <= x) goto l1;
  return 0;
}
Compile, link, and run (use a ".exe" extension for loop on Windows):
$ gcc -o loop loop.c
$ ls -l loop*
-rwxrwxr-x    1 cpitcher cpitcher    13663 Feb 25 00:56 loop
-rw-rw-r--    1 cpitcher cpitcher      191 Feb 25 00:48 loop.c
$ ./loop
3
2
1
0
$ rm loop
Code Generation: Learning via GCC II [62/133] |
Compile only:
$ gcc -c loop.c
$ ls -l loop.*
-rw-rw-r--    1 cpitcher cpitcher      191 Feb 25 00:48 loop.c
-rw-rw-r--    1 cpitcher cpitcher      932 Feb 25 00:55 loop.o
$ rm loop.o
Compile to assembly language, assemble, link, and run:
$ gcc -S loop.c
$ ls -l loop*
-rw-rw-r--    1 cpitcher cpitcher      191 Feb 25 00:48 loop.c
-rw-rw-r--    1 cpitcher cpitcher      523 Feb 25 00:57 loop.s
$ gcc -o loop loop.s
$ ls -l loop*
-rwxrwxr-x    1 cpitcher cpitcher    13663 Feb 25 00:58 loop
-rw-rw-r--    1 cpitcher cpitcher      191 Feb 25 00:48 loop.c
-rw-rw-r--    1 cpitcher cpitcher      523 Feb 25 00:57 loop.s
$ ./loop
3
2
1
0
$ rm loop
$ rm loop.s
Code Generation: Learning via GCC III [63/133] |
Unfortunately GCC generates AT&T syntax x86 assembly language. The GNU assembler (as) can accept either AT&T or Intel syntax.
There are differences in operand order and addressing-mode syntax; see the GNU AS documentation for .intel_syntax in /Machine Dependencies/i386-Dependent/i386-Syntax.
Code Generation: Simplified Example [64/133] |
GCC's output can be simplified:
	.global	_main
	.text
	.align	4
_main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	movl	$3, -4(%ebp)
	jmp	lab2
lab1:
	pushl	-4(%ebp)
	pushl	$somenameforthestring
	call	_printf
	addl	$8, %esp
	leal	-4(%ebp), %eax
	decl	(%eax)
lab2:
	cmpl	$0, -4(%ebp)
	jns	lab1
	movl	$0, %eax
	leave
	ret
	.section	.rodata
somenameforthestring:
	.string	"%d\n"
$ gcc -o loop simplified-loop.s
$ ./loop
3
2
1
0
For Linux (note underscore changes):
	.global	main
	.text
	.align	4
main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	movl	$3, -4(%ebp)
	jmp	lab2
lab1:
	pushl	-4(%ebp)
	pushl	$somenameforthestring
	call	printf
	addl	$8, %esp
	leal	-4(%ebp), %eax
	decl	(%eax)
lab2:
	cmpl	$0, -4(%ebp)
	jns	lab1
	movl	$0, %eax
	leave
	ret
	.section	.rodata
somenameforthestring:
	.string	"%d\n"
Code Generation: GNU Debugger (GDB) Walk Through [65/133] |
To use the GNU Debugger, we must tell the assembler to include debugging information:
$ gcc -Wa,--gstabs -o loop simplified-loop.s
$ ./loop
3
2
1
0
A sample session with the debugger:
$ gdb ./loop
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux".
(gdb) help
List of classes of commands:

aliases -- Aliases of other commands
breakpoints -- Making program stop at certain points
data -- Examining data
files -- Specifying and examining files
internals -- Maintenance commands
obscure -- Obscure features
running -- Running the program
stack -- Examining the stack
status -- Status inquiries
support -- Support facilities
tracepoints -- Tracing of program execution without stopping the program
user-defined -- User-defined commands

Type "help" followed by a class name for a list of commands in that class.
Type "help" followed by command name for full documentation.
Command name abbreviations are allowed if unambiguous.
(gdb) run
Starting program: /home/cpitcher/csc448/examples/src/runtime/./loop
3
2
1
0

Program exited normally.
(gdb) quit
$ gdb ./loop
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) list
1       .global main
2
3       .text
4       .align 4
5
6       main:
7       pushl %ebp
8       movl %esp, %ebp
9       subl $8, %esp
10
(gdb) list
11      movl $3, -4(%ebp)
12      jmp lab2
13      lab1:
14      pushl -4(%ebp)
15      pushl $somenameforthestring
16      call printf
17      addl $8, %esp
18      leal -4(%ebp), %eax
19      decl (%eax)
20      lab2:
(gdb) list
21      cmpl $0, -4(%ebp)
22      jns lab1
23
24      movl $0, %eax
25      leave
26      ret
27
28
29      .section .rodata
30
(gdb) break 7
Breakpoint 1 at 0x8048460: file simplified-loop.s, line 7.
(gdb) run
Starting program: /home/cpitcher/csc448/examples/src/runtime/./loop

Breakpoint 1, main () at simplified-loop.s:7
7       pushl %ebp
Current language: auto; currently asm
(gdb) run
Starting program: /home/cpitcher/csc448/examples/src/runtime/./loop

Breakpoint 1, main () at simplified-loop.s:7
7       pushl %ebp
Current language: auto; currently asm
(gdb) step
8       movl %esp, %ebp
(gdb) step
main () at simplified-loop.s:9
9       subl $8, %esp
(gdb) step
11      movl $3, -4(%ebp)
(gdb) step
12      jmp lab2
(gdb) step
21      cmpl $0, -4(%ebp)
(gdb) step
22      jns lab1
(gdb) step
14      pushl -4(%ebp)
(gdb) step
15      pushl $somenameforthestring
(gdb) step
16      call printf
(gdb) step
printf (format=0x8048508 "%d\n") at printf.c:32
32      printf.c: No such file or directory.
        in printf.c
Current language: auto; currently c
(gdb) bt
#0  printf (format=0x8048508 "%d\n") at printf.c:32
#1  0x0804847c in lab1 () at simplified-loop.s:16
#2  0x40043177 in __libc_start_main (main=0x8048460 <main>, argc=1, ubp_av=0xbffff8a4,
    init=0x80482e4 <_init>, fini=0x80484e0 <_fini>, rtld_fini=0x4000e184 <_dl_fini>,
    stack_end=0xbffff89c) at ../sysdeps/generic/libc-start.c:129
(gdb) up
#1  0x0804847c in lab1 () at simplified-loop.s:16
16      call printf
Current language: auto; currently asm
(gdb) list
11      movl $3, -4(%ebp)
12      jmp lab2
13      lab1:
14      pushl -4(%ebp)
15      pushl $somenameforthestring
16      call printf
17      addl $8, %esp
18      leal -4(%ebp), %eax
19      decl (%eax)
20      lab2:
(gdb) break 17
Breakpoint 2 at 0x804847c: file simplified-loop.s, line 17.
(gdb) continue
Continuing.
3

Breakpoint 2, lab1 () at simplified-loop.s:17
17      addl $8, %esp
(gdb) clear 17
Deleted breakpoint 2
(gdb) continue
Continuing.
2
1
0

Program exited normally.
(gdb) quit
Code Generation: GNU Debugger (GDB) [66/133] |
Commands: step (into function); next (step over function); help; disassemble *0x...; p/x $eax; p/x ((int*) $eax)[0]; p/x ((int*) $eax)[1]; p/x ((int*) $eax)[-1].
In gdb, note the difference between vtable_Foo and &vtable_Foo.
Code Generation: Learning via GCC IV [67/133] |
Try compiling other programs and looking at the output:
You are likely to encounter these instructions: movl, leal; addl, subl; pushl, popl; cmpl; jmp, jl, call, ret.
Code Generation: Code Generation [68/133] |
Corin did not show
Code generation produces an x86 assembly language file.
The output of the code generation phase is assembled, along with the x86 assembly language implementations of native methods, and then linked to produce an executable file.
The majority of the code generation can be described by the following pseudo-code:
for (every non-native method block and the entry block) {
  for (every statement in the block) {
    output one or more lines of assembly language to implement the statement;
  }
}
There are some other details such as allocating storage for global variables and creating vtables for each clazz.
Code Generation: Run-Time Storage Location I [69/133] |
Corin did not show
In the type-checking phase we maintained a structure that maps a variable to its type.
In the code generation phase we maintain a structure that maps a variable to its run-time location.
A Frame is created for every block (bodies of non-native methods and the entry block). It is initialized with the block and the method signature (or null for the entry block).
With this information, the Frame can return an assembly language expression for the run-time location of a variable (the getLocation method). There are four cases:
Parameters: 12(%ebp), 16(%ebp), 20(%ebp).
The this variable: 8(%ebp).
Local variables: -4(%ebp), -8(%ebp), -12(%ebp).
Global variables: referenced via an assembly language symbol, e.g., mystdout.
Recall that:
The return address is at 4(%ebp).
The old ebp is at 0(%ebp).
A diagram:
Location  | Value
----------|----------------
20(%ebp)  | argument 3
16(%ebp)  | argument 2
12(%ebp)  | argument 1
8(%ebp)   | this
4(%ebp)   | return address
0(%ebp)   | old ebp
-4(%ebp)  | local variable 1
-8(%ebp)  | local variable 2
-12(%ebp) | local variable 3
Global variables are stored in a fixed location that can be referenced via an assembly language symbol (with the same name as the global variable).
Code Generation: Run-Time Storage Location II [70/133] |
Corin did not show
In addition, we need to know how much room to create for local variables in each block. The Frame class has this information: getLocalAllocInst returns an instruction that allocates room for the local variables of a block.
For example, if there are 5 local variables, getLocalAllocInst returns subl $20, %esp.
The Frame class has the form:

class Frame {
  Frame (List stats, MethodSig methodSig) { ... }
  String getLocation (String varName) { ... }
  String getLocalAllocInst () { ... }
}
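The offset arithmetic behind getLocation and getLocalAllocInst can be sketched as follows. This is a simplified, hypothetical Frame (here called SimpleFrame, taking name lists instead of the block and method signature) that reproduces the layout in the table on this slide: this at 8(%ebp), parameter i at 12+4i(%ebp), local i at -4-4i(%ebp), and globals referenced by symbol.

```java
import java.util.List;

public class SimpleFrame {
    private final List<String> params; // in declaration order
    private final List<String> locals; // in declaration order

    public SimpleFrame(List<String> params, List<String> locals) {
        this.params = params;
        this.locals = locals;
    }

    // this at 8(%ebp); parameter i at 12+4*i(%ebp); local i at -4-4*i(%ebp).
    public String getLocation(String varName) {
        if (varName.equals("this")) return "8(%ebp)";
        int p = params.indexOf(varName);
        if (p >= 0) return (12 + 4 * p) + "(%ebp)";
        int l = locals.indexOf(varName);
        if (l >= 0) return (-4 - 4 * l) + "(%ebp)";
        return varName; // global: referenced via its assembly language symbol
    }

    // Reserve one 4-byte word per local variable.
    public String getLocalAllocInst() {
        return "subl $" + (4 * locals.size()) + ", %esp";
    }
}
```

With 5 locals this yields subl $20, %esp, matching the example above.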
Code Generation: Run-Time Storage Location III [71/133] |
Corin did not show
For example, consider this method:
clazz LinkedList {
  ...
  method sort () : LinkedList {
    var t : Tree;
    t := this.toTree ();
    return t.toLinkedList ();
  }
  ...
}
It is transformed to:
clazz LinkedList {
  ...
  method sort () : LinkedList {
    var tmpvar449 : Tree;
    var tmpvar485 : Tree;
    var tmpvar486 : LinkedList;
    (tmpvar485 := (this.toTree ()));
    (tmpvar449 := tmpvar485);
    (tmpvar486 := (tmpvar449.toLinkedList ()));
    return tmpvar486;
  }
  ...
}
The code generator produces:
/* method sort of clazz LinkedList */
	.align 4
	.global LinkedList$sort
LinkedList$sort:
	pushl %ebp
	movl %esp, %ebp
	subl $12, %esp
/* var tmpvar449 : Tree; */
/* stored in -4(%ebp) */
/* var tmpvar485 : Tree; */
/* stored in -8(%ebp) */
/* var tmpvar486 : LinkedList; */
/* stored in -12(%ebp) */
... some assembly language removed here ...
/* return tmpvar486; */
	movl -12(%ebp), %eax
	movl %ebp, %esp
	popl %ebp
	ret
Code Generation: Code Generation for Assignment Statements I [72/133] |
Corin did not show
Global variables, parameter variables, the this variable, and local variables each store a pointer to an object, not the object itself.
When we assign from one variable to another, we need only copy the pointer.
For example, from:
method sort () : LinkedList {
  var tmpvar449 : Tree;
  var tmpvar485 : Tree;
  var tmpvar486 : LinkedList;
  (tmpvar485 := (this.toTree ()));
  (tmpvar449 := tmpvar485);
  (tmpvar486 := (tmpvar449.toLinkedList ()));
  return tmpvar486;
}
The code generator produces:
/* method sort of clazz LinkedList */
	.align 4
	.global LinkedList$sort
LinkedList$sort:
	pushl %ebp
	movl %esp, %ebp
	subl $12, %esp
/* var tmpvar449 : Tree; */
/* stored in -4(%ebp) */
/* var tmpvar485 : Tree; */
/* stored in -8(%ebp) */
/* var tmpvar486 : LinkedList; */
/* stored in -12(%ebp) */
... some assembly language removed here ...
/* (tmpvar449 := tmpvar485); */
	movl -8(%ebp), %eax
	movl %eax, -4(%ebp)
... some assembly language removed here ...
We call getLocation ("tmpvar485") for the first assembly language line, and getLocation ("tmpvar449") for the second line.
The translation of each statement assumes that eax and ecx are scratch registers, which means that their values do not need to be preserved.
The line of COOL code in a comment above each group of assembly language lines is very helpful when debugging the code generator.
Code Generation: Code Generation for Assignment Statements II [73/133] |
Corin did not show
After the final source-to-source transformation, we know that every StatExp contains an ExpAssign.
In addition, every proper subexpression on the left and right hand side expressions of the assignment is either null or a variable.
The method codeGenExp generates code for an ExpAssign in a StatExp.
private void codeGenExp (Exp exp, Frame frame) {
  ExpAssign expA = (ExpAssign) exp;

  // Generate code to load the value of expA.right into register eax.
  // Have to perform case analysis on expA.right, but recursion is not
  // necessary because the right hand side has a simple form.

  // Generate code to load the value of register eax into expA.left.
  // Only have to deal with the cases where expA.left is an ExpFieldAccess
  // or an ExpVar.
}
Code Generation: Code Generation for Assignment Statements III [74/133] |
Corin did not show
Cases for the right hand side expression:
ExpConstructorCall: deferred.
ExpFieldAccess: deferred.
ExpInt: easy, call new$Integer with the integer on the stack; receive a pointer to the object back in eax.
ExpIsNull: easy, test whether the expression is zero.
ExpMethodCall: deferred.
ExpNull: easy.
ExpString: easy, call new$String with a pointer to the string on the stack; receive a pointer to the object back in eax.
ExpVar: easy.
Hunt for examples in:
Code Generation: Run-Time Storage for Objects I [75/133] |
Corin did not show
In order to generate code for ExpConstructorCall, ExpFieldAccess, and ExpMethodCall, we need to understand the run-time storage of objects.
Objects are stored on the heap. The stack only contains pointers to objects.
We are not concerned with memory deallocation or garbage collection. CSC548 covers garbage collection for OO languages.
If register eax contains the address of an object on the heap, then an object of a clazz with 3 fields has the following structure (using 4 words on the heap):
Location | Value
---------|-------------------------------
12(%eax) | field 3
8(%eax)  | field 2
4(%eax)  | field 1
0(%eax)  | vtable for this object's clazz
Code Generation: Run-Time Storage for Objects II [76/133] |
Corin did not show
For field access, find the type of the object whose field we are accessing (use tc.getType), then look up the field offset (use the lookupFieldOffset method).
For constructor calls, allocate room for the object on the heap by calling malloc (passing (the size of the object + 1) * slotSize). A pointer to the allocated memory is returned. It will eventually be a pointer to the object. Move the address of the vtable into the zero offset, then move all of the new field values into the memory allocated on the heap.
Hunt for examples in:
Code Generation: Run-Time Storage for Objects III [77/133] |
Corin did not show
Vtables are used to implement dynamic dispatch in object-oriented languages.
A vtable contains an array of pointers to methods that can be invoked for a clazz. There is one vtable per clazz.
Objects keep track of their own methods via a pointer to their vtable. When the compiler is generating code for a method call, it generates code to extract the address of the method to call from the vtable stored in the object.
If edx contains the address of the vtable for the clazz, then the layout of a vtable with 3 methods is:
Location | Value
---------|--------------------
8(%edx)  | address of method 3
4(%edx)  | address of method 2
0(%edx)  | address of method 1
If the contents of one of the vtable words (the address of a method) is loaded into eax, then the method can be called using call *%eax.
Remember to push the this pointer when compiling method calls.
Hunt for examples in:
Vtables are not necessary for this simple language. They become necessary when we introduce subtyping (via subclassing), in which case the compiler does not know the address of the appropriate method to invoke because it does not know the run-time type of an object.
Vtables can also be used to implement run-time type identification.
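Dynamic dispatch through a vtable can be simulated in plain Java. This is an illustration, not the compiler's code; the Obj and Method names are invented. The key shape matches the generated assembly: the call site indexes a per-class table instead of naming a method directly.

```java
import java.util.function.Function;

// Simulation (assumed names) of vtable dispatch: each object carries a
// pointer to its class's table of methods, and a call site indexes the
// table by slot number, mirroring `call *%eax`.
public class VtableDemo {
    interface Method extends Function<Obj, String> {}

    static class Obj {
        final Method[] vtable; // plays the role of slot 0 in the real layout
        Obj(Method[] vtable) { this.vtable = vtable; }
    }

    // Shared vtables, one per class; B overrides slot 0.
    static final Method[] A_VTABLE = { o -> "A.speak" };
    static final Method[] B_VTABLE = { o -> "B.speak" };

    // The call site only knows the slot index, not the method's address.
    static String dispatch(Obj o, int slot) {
        return o.vtable[slot].apply(o); // `o` plays the role of `this`
    }
}
```

The same call site, dispatch(o, 0), runs different code depending on which vtable the object was constructed with; this is exactly why the compiler cannot know the target address statically once subclassing is introduced.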
Code Generation: Code Generation for Other Statements [78/133] |
Corin did not show
We still have to generate code for:
StatGoto: easy, unconditional jump (jmp).
StatIfThenGoto: easy, extract the integer field from the Boolean and jump to the label if it is non-zero (jne).
StatLabel: easy.
StatReturn: easy, use getExpVarOrExpNull.
StatVar: easy, just emit a comment.
Hunt for examples in:
Exercise: reflect on the complexity of a code generator that would be needed if we did not perform source-to-source transformations.
Code Generation: Examples and Library Code [79/133] |
Corin did not show
Files:
Native library code (hand-written):
Code Generation: Building an Executable [80/133] |
Corin did not show
After running ant test, you can build a single executable with:
LINUX $ gcc -Wa,--gstabs asmlocal/*.s output/codegenlocal/test094.s -o output/exe/test094
WINDOWS $ gcc -Wa,--gstabs asmlocal\*.s output\codegenlocal\test094.s -o output\exe\test094.exe
If you are not on Windows 95/98/ME, you can build all executables at once:
LINUX $ for i in output/codegenlocal/*.s; do gcc -Wa,--gstabs asmlocal/*.s $i -o output/exe/`basename $i .s`; done
WINDOWS > for %i in (output\codegenlocal\*.s) do @gcc -Wa,--gstabs asmlocal\*.s %i -o output\exe\%~ni
Do not worry about the single example that fails to build; it fails only because no implementation is supplied for the native methods.
Be alert for missing return statements or missing variable initializations in your COOL examples. They usually cause a crash.
Interesting programs:
test085
: prints a linked list.
$ ./output/exe/test085 { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }
test086
: prints "Hello World" in a loop.
$ ./output/exe/test086 Hello World Hello World Hello World Hello World Hello World Hello World Hello World Hello World Hello World Hello World
test087
: prints triangular numbers backwards.
$ ./output/exe/test087 Value for 10 is 55 Value for 9 is 45 Value for 8 is 36 Value for 7 is 28 Value for 6 is 21 Value for 5 is 15 Value for 4 is 10 Value for 3 is 6 Value for 2 is 3 Value for 1 is 1
test088
: prints factorials backwards.
$ ./output/exe/test088 Factorial of 10 is 3628800 Factorial of 9 is 362880 Factorial of 8 is 40320 Factorial of 7 is 5040 Factorial of 6 is 720 Factorial of 5 is 120 Factorial of 4 is 24 Factorial of 3 is 6 Factorial of 2 is 2 Factorial of 1 is 1
test090
: generates and prints linked lists in a loop, iterative version.
$ ./output/exe/test090 { 0 } { 1, 2 } { 2, 3, 4 } { 3, 4, 5, 6 } { 4, 5, 6, 7, 8 } { 5, 6, 7, 8, 9, 10 } { 6, 7, 8, 9, 10, 11, 12 } { 7, 8, 9, 10, 11, 12, 13, 14 } { 8, 9, 10, 11, 12, 13, 14, 15, 16 } { 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 }
test091
: generates and prints linked lists in a loop, recursive version.
$ ./output/exe/test091 { 0 } { 1, 2 } { 2, 3, 4 } { 3, 4, 5, 6 } { 4, 5, 6, 7, 8 } { 5, 6, 7, 8, 9, 10 } { 6, 7, 8, 9, 10, 11, 12 } { 7, 8, 9, 10, 11, 12, 13, 14 } { 8, 9, 10, 11, 12, 13, 14, 15, 16 } { 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 }
test092
: generates and prints linked lists in
a loop, recursive version, with a cons method.
$ ./output/exe/test092 { 0 } { 1, 2 } { 2, 3, 4 } { 3, 4, 5, 6 } { 4, 5, 6, 7, 8 } { 5, 6, 7, 8, 9, 10 } { 6, 7, 8, 9, 10, 11, 12 } { 7, 8, 9, 10, 11, 12, 13, 14 } { 8, 9, 10, 11, 12, 13, 14, 15, 16 } { 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 }
test093
: silly program to test divisibility
by 3 (used to show need for variable renaming).
$ ./output/exe/test093 Not divisible by 3! 9 is divisible by 3 Not divisible by 3! Not divisible by 3! Not divisible by 3! 6 is divisible by 3 Not divisible by 3! Not divisible by 3! Not divisible by 3! 3 is divisible by 3 Not divisible by 3! Not divisible by 3! Not divisible by 3!
test094
: recursive factorial.
$ ./output/exe/test094 The factorial of 0 is 1 The factorial of 1 is 1 The factorial of 2 is 2 The factorial of 3 is 6 The factorial of 4 is 24 The factorial of 5 is 120 The factorial of 6 is 720 The factorial of 7 is 5040 The factorial of 8 is 40320 The factorial of 9 is 362880
test095
: prints a binary tree.
$ ./output/exe/test095 (. 7 .) 6 (((. 1 .) 3 (. 2 .)) 5 (. 4 .)) 6 7 . . 5 3 1 . . 2 . . 4 . . Sum of tree contents = 28
test096
: sorts a linked list.
$ ./output/exe/test096 Original list: { 11, 2, 1, 4, 10, 9, 4, 3, 4, 5 } Sorted list: { 1, 2, 3, 4, 5, 9, 10, 11 }
test097
: uses global variables.
$ ./output/exe/test097 john = John Doe is 50 years old. jane = Jane Doe is 52 years old. p = John Doe is 50 years old.
Code Generation: Naming and Differences Between Windows and Linux [81/133] |
Corin did not show
mystdout
is used to avoid clashes with
stdout
from the C standard library.
The code generator can generate the same code for both Windows and Linux.
There is one difference. On Windows, C function names must be prefixed by underscores, e.g., _main, _malloc, _printf. On Linux, C function names must not be prefixed by underscores, e.g., main, malloc, printf.
Your code generator should generate C function names with underscores (but not for COOL method names), as in the example. The Ant build file (file:build.xml) can remove the underscores.
If you are using Windows, remove or comment out the two replace elements in file:build.xml.
Code Generation: Homework [82/133] |
Corin did not show
Reading: chapter 9 (section 9.1, section 9.2, section 9.3 ) of the Dragon book.
Download hw-week-9.zip and extract it to c:\compiler (UNIX users: anywhere will do, but the ZIP file does not include a top-level directory, so unzip it in an empty directory).
Add an existing JavaCC directory as before.
Read through CodeGen.java, DualPrintWriter.java, and Frame.java in the src/codegen/ directory and implement the changes described there (look for TODO comments, only in CodeGen.java). Test your code using ant test as usual. Look in the TEST-* files for summaries of the tests and in the appropriate subdirectories created under the output/codegen directory for the results of code generation.
Your output need not be identical to that in the lecture slides, but it must be correct (meaning: the output program must have the same behaviour as the input program).
Debugging hint: you will not be able to run an executable until everything is completed, but you can make a small change to your code generator and then check that it is correct by comparing your test096.s against file:test096.s.
Write a text file named grader.txt that contains your full name and email address(es) and save it in the c:\compiler directory.
Add a description to say whether or not you completed the
assignment. If you did not complete the assignment,
describe how much you have completed.
To submit your solution, ensure that your grader.txt file is still in place, then run ant submit from the c:\compiler directory and upload the resulting ZIP file to the COL/DL system for "Week 9 Homework".
Due at the start of the next lecture.
Program Analysis I: Overview of Today's Class [83/133] |
Program analysis: control-flow graphs; program dependence graphs; calculating them; visualization.
Program Analysis I: Discussion [84/133] |
Any questions?
Program Analysis I: Control Flow Graphs I [85/133] |
Intra-procedural analysis usually requires knowledge of the flow of control in a method block or entry block.
Normally, control flows from one statement to the next statement.
Goto statements (conditional and unconditional) introduce non-sequential control flow.
Consider file:test027.hob.
A (directed) graph that represents the control flow between statements in the entry block: file:test027_entry.pdf
Normally the statements are grouped into basic blocks for efficiency.
What about while loops and if-then-else statements?
Program Analysis I: Control Flow Graphs II [86/133] |
The control flow graph does not tell us everything about a block (unless we use the statement labels!).
Consider file:test024.hob.
A (directed) graph that represents the control flow between statements in the entry block: file:test024_entry.pdf
Program Analysis I: Control Flow Graphs III [87/133] |
It is straightforward to build a control flow graph from a non-native block (especially if the high-level control structures have already been eliminated).
Code to build a naive control flow graph: file:controlflow/
In particular, a file:ControlFlowGraph.java instance is created for each non-native block by the code in file:BuildDiagrams.java.
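A minimal sketch of such a builder, assuming a toy statement encoding (plain strings with "label", "goto", and "ifgoto" prefixes) rather than the Hobbes statement classes:

```java
import java.util.*;

// Naive control-flow-graph builder over a toy statement list (assumed
// shapes, not the Hobbes classes): "label L" defines a label, "goto L"
// jumps unconditionally, "ifgoto L" may branch, and anything else is a
// plain statement that falls through to the next one.
public class NaiveCfg {
    static Map<Integer, List<Integer>> build(List<String> stmts) {
        // First pass: record the index of each label.
        Map<String, Integer> labels = new HashMap<>();
        for (int i = 0; i < stmts.size(); i++)
            if (stmts.get(i).startsWith("label "))
                labels.put(stmts.get(i).substring(6), i);

        // Second pass: compute each statement's successors.
        Map<Integer, List<Integer>> succ = new HashMap<>();
        for (int i = 0; i < stmts.size(); i++) {
            List<Integer> out = new ArrayList<>();
            String s = stmts.get(i);
            if (s.startsWith("goto ")) {
                out.add(labels.get(s.substring(5)));      // jump only
            } else {
                if (i + 1 < stmts.size()) out.add(i + 1); // fall through
                if (s.startsWith("ifgoto "))
                    out.add(labels.get(s.substring(7)));  // branch taken
            }
            succ.put(i, out);
        }
        return succ;
    }
}
```

A conditional goto thus gets two successors (fall-through and target), which is exactly what makes the graph interesting for analysis.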
Program Analysis I: Control Flow Graphs IV [88/133] |
The graphs are generated using the dot program from the Graphviz package.
If you want to try this yourself, call the buildDiagrams method in the BuildDiagrams class from the Compile class:
new BuildDiagrams ().buildDiagrams (basename, cu);
This produces a dot input file for each non-native block. For example: file:test024_entry.dot
dot converts this kind of file into a Postscript file (making layout decisions in the process):
$ (cd tests; for i in *.dot; do dot -Gsize="8,10" -Tps $i -o `basename $i .dot`.ps; done)
Alternatively, you can produce GIF output instead of Postscript:
$ (cd tests; for i in *.dot; do dot -Tgif $i -o `basename $i .dot`.gif; done)
Program Analysis I: Obtaining Graphviz, Ghostscript, and Ghostview [89/133] |
Graphviz:
http://www.research.att.com/sw/tools/graphviz/
Download and run the Windows package gv182.exe.
To view the Postscript output, you can use Ghostview:
http://www.cs.wisc.edu/~ghost/
Follow the link to "Obtaining AFPL Ghostscript 7.04", then download and run gs704w32.exe and gsv42w32.exe.
Postscript examples:
Program Analysis I: Does a Block Return Anything? I [90/133] |
We have seen an example of a program crashing because of a missing return statement.
How can we check that every path through a block ends with a return statement?
Program Analysis I: Does a Block Return Anything? II [91/133] |
A simple graph algorithm suffices.
Implementation: file:CheckReturns.java
file:test024.hob fails (even if we initialise b):
$ java hobbes.Compile tests/test024.hob PHASE: BuildDiagrams PHASE: CheckReturns Exception in thread "main" hobbes.checks.CheckException: entry may fail to return at hobbes.controlflow.CheckReturns.checkBlock(CheckReturns.java:57) at hobbes.controlflow.CheckReturns.checkReturns(CheckReturns.java:40) at hobbes.Compile.checkReturns(Compile.java:176) at hobbes.Compile.main(Compile.java:90)
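A sketch of the kind of graph algorithm involved (not CheckReturns.java itself; the CFG representation and names are assumptions): walk the control flow graph from the entry node, and reject the block if any reachable statement is a dead end that is not a return.

```java
import java.util.*;

// Sketch (assumed representation) of the check that every path through
// a block ends with a return: a depth-first walk over the successor map;
// any reachable node with no successors that is not a return statement
// witnesses a path that falls off the end of the block.
public class ReturnCheck {
    static boolean allPathsReturn(Map<Integer, List<Integer>> succ,
                                  Set<Integer> returns, int entry) {
        Deque<Integer> work = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        work.push(entry);
        while (!work.isEmpty()) {
            int n = work.pop();
            if (!seen.add(n)) continue; // already visited
            List<Integer> out = succ.getOrDefault(n, Collections.emptyList());
            if (out.isEmpty() && !returns.contains(n))
                return false; // reachable dead end with no return
            for (int m : out) work.push(m);
        }
        return true;
    }
}
```

Note the check is conservative: it rejects a block if any path may fail to return, even if that path is never taken at run time.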
Program Analysis I: Similar Problems [92/133] |
Redundant code elimination?
How can we check that variables are initialised before use?
Program Analysis II: Overview of Today's Class [93/133] |
Program analysis II
Program Analysis II: Discussion [94/133] |
Any questions?
Parsing Revisited: Overview of Today's Class [95/133] |
Parsing revisited: pushdown-stack machines, parsing tables.
Parsing Revisited: Discussion [96/133] |
Any questions?
Parsing Revisited: TO DO [97/133] |
Ambiguities and conflicts. See ch8 of Levine's book. Using left/right recursion to make it easier to group data as it is read, e.g., parsing right associativity using left recursion. Message: left/right recursion does not constrain the associativity. Levine's comments about relative efficiency of left/right recursion in yacc. Depends upon discussion of stack machines. error recovery?
Parsing Revisited: Derivations [98/133] |
Parsing Revisited: Stack Machines [99/133] |
Parsing Revisited: LALR Parsing [100/133] |
Parsing Revisited: Parsing Issues [101/133] |
precedence; associativity
if-then-else
conflicts: shift-reduce and reduce-reduce
Parsing Revisited: Derivations I [102/133] |
Derivations are an alternative to parse trees. They carry more information about how parse trees are built.
The first line consists of the start symbol.
In each successive line, one non-terminal symbol can be replaced with one of its productions.
exp => exp "*" exp => "(" exp ")" "*" exp => "(" exp "+" exp ")" "*" exp => "(" num "+" exp ")" "*" exp => "(" digit "+" exp ")" "*" exp => "(" digit "+" num ")" "*" exp => "(" digit "+" digit ")" "*" exp => "(" digit "+" digit ")" "*" num => "(" digit "+" digit ")" "*" digit
See section 4.2 of the Dragon book.
Parsing Revisited: Derivations II [103/133] |
It is possible to find alternate derivations of the same string of terminal symbols:
exp => exp "*" exp => exp "*" num => exp "*" digit => "(" exp ")" "*" digit => "(" exp "+" exp ")" "*" digit => "(" exp "+" num ")" "*" digit => "(" exp "+" digit ")" "*" digit => "(" num "+" digit ")" "*" digit => "(" digit "+" digit ")" "*" digit
These derivations determine the same parse tree.
Parsing Revisited: Derivations III [104/133] |
A derivation is leftmost if the chosen non-terminal symbol is always the leftmost non-terminal.
Leftmost derivations are interesting, because they correspond to the parsing strategy used by a JavaCC parser.
As a JavaCC parser sees the terminal symbols "(" digit "+" digit ")" "*" digit, it makes "local" decisions about which production to use. The decisions must be based upon the next terminal symbol.
exp => exp "*" exp => "(" exp ")" "*" exp => "(" exp "+" exp ")" "*" exp => "(" num "+" exp ")" "*" exp => "(" digit "+" exp ")" "*" exp => "(" digit "+" num ")" "*" exp => "(" digit "+" digit ")" "*" exp => "(" digit "+" digit ")" "*" num => "(" digit "+" digit ")" "*" digit
See page 169 and section 4.4 of the Dragon book.
Parsing Revisited: Parsing Tables [105/133] |
How can we build a parsing table for a predictive parser?
Nullable, FIRST, and FOLLOW sets. Calculated as fixed points.
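The fixed-point calculation can be sketched as follows. The grammar representation (a map from nonterminals to lists of right-hand sides, with unknown symbols treated as terminals) is an assumption, and FOLLOW is omitted for brevity:

```java
import java.util.*;

// Sketch of the NULLABLE/FIRST fixed-point computation. A grammar maps
// each nonterminal to a list of right-hand sides; any symbol that is not
// a key of the map is treated as a terminal. An empty right-hand side
// represents an epsilon production.
public class FirstSets {
    static boolean isTerminal(String s, Map<String, List<List<String>>> g) {
        return !g.containsKey(s);
    }

    static Map<String, Set<String>> first(Map<String, List<List<String>>> g) {
        Set<String> nullable = new HashSet<>();
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : g.keySet()) first.put(nt, new HashSet<String>());
        boolean changed = true;
        while (changed) {               // iterate until nothing grows
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> rhs : e.getValue()) {
                    boolean allNullable = true;
                    for (String sym : rhs) {
                        Set<String> add = isTerminal(sym, g)
                            ? Collections.singleton(sym) : first.get(sym);
                        if (first.get(e.getKey()).addAll(add)) changed = true;
                        if (isTerminal(sym, g) || !nullable.contains(sym)) {
                            allNullable = false;
                            break;      // later symbols contribute nothing
                        }
                    }
                    if (allNullable && nullable.add(e.getKey())) changed = true;
                }
            }
        }
        return first;
    }
}
```

Because the sets only ever grow and are bounded by the terminal alphabet, the loop terminates at a (least) fixed point.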
Parsing Revisited: Notes [106/133] |
Left-factoring; first and follow sets; LL(1) grammar definition; building abstract syntax trees; Hobbes grammar; dealing with tricky grammars.
Examples: abstract syntax tree classes and JavaCC files for arithmetic expressions and/or a simple while language.
Homework assignment: given a set of abstract syntax tree classes and token definitions, write the rest of the JavaCC file that works for the given set of examples. Should I give the grammar too?
Parsing Revisited: LL(1) [107/133] |
Definition of first sets for non-terminals.
Definition of follow sets for non-terminals.
Definition of LL(1) grammars/languages.
Parsing Revisited: Hand-Written Parsers [108/133] |
Top-down parsing.
Recursive-descent parsing is a technique for top-down parsing.
Predictive parsing: non-backtracking, efficient, top-down.
Predictive parsing is a special case of recursive-descent parsing where backtracking is not required.
Example: parse a URL.
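A minimal sketch of a hand-written predictive parser, for the toy grammar exp ::= term ("+" term)* with single-digit terms (an illustration only, not the course's parser):

```java
// Predictive (non-backtracking) recursive-descent parser for the toy
// grammar  exp ::= term ("+" term)*   term ::= digit.
// Each nonterminal becomes one method, and every decision looks only at
// the next input character, never backtracking.
public class Predictive {
    private final String in;
    private int pos;

    Predictive(String in) { this.in = in; }

    static int eval(String s) {
        Predictive p = new Predictive(s);
        int v = p.exp();
        if (p.pos != s.length())
            throw new IllegalStateException("trailing input");
        return v;
    }

    private int exp() {
        int v = term();
        while (pos < in.length() && in.charAt(pos) == '+') {
            pos++;          // consume "+"
            v += term();
        }
        return v;
    }

    private int term() {
        char c = in.charAt(pos);
        if (!Character.isDigit(c))
            throw new IllegalStateException("digit expected");
        pos++;
        return c - '0';
    }
}
```

The loop in exp() is the standard way to parse a left-associative operator without left recursion, which a top-down parser cannot handle directly.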
Type Checking Revisited: Overview of Today's Class [109/133] |
Type checking revisited: for functional and OO languages. Parametric polymorphism. Hindley-Milner. Subtyping.
Type Checking Revisited: Discussion [110/133] |
Any questions?
Type Checking Revisited: Generic Polymorphism I [111/133] |
Consider an ML function to reverse a list:
fun reverse [] = [] | reverse (x::xs) = (reverse xs) @ [x];
ML infers the type 'a list -> 'a list for reverse:
> val 'a reverse = fn : 'a list -> 'a list
And we can apply reverse to lists of different types:
- reverse [1,2,3]; > val it = [3, 2, 1] : int list - reverse ["hello", "world"]; > val it = ["world", "hello"] : string list
The type 'a list -> 'a list means that for any type T, we can use reverse with type T list -> T list. For example, we can use reverse with type int list -> int list or with type string list -> string list.
Type Checking Revisited: Generic Polymorphism II [112/133] |
This feature of a language is known as generic or parametric polymorphism, but it should be distinguished from the subtype polymorphism found in almost all object-oriented languages.
How does this fragment of Java code:
List reverse1 (List l) { List result = new ArrayList (); Iterator iter = l.iterator (); while (iter.hasNext ()) { Object o = iter.next (); result.add (o); } return result; }
differ from this ML code?
fun reverse [] = [] | reverse (x::xs) = (reverse xs) @ [x]; > val 'a reverse = fn : 'a list -> 'a list
Type Checking Revisited: Generic Polymorphism III [113/133] |
To confuse the issue, Java v1.5 also includes generic polymorphism, so we can write:
<A> List<A> reverse2 (List<A> l) { List<A> result = new ArrayList<A> (); Iterator<A> iter = l.iterator (); while (iter.hasNext ()) { A a = iter.next (); result.add (a); } return result; }
Code to invoke the two reverse methods:
void test () { List<String> list = new ArrayList<String> (); list.add ("the"); list.add ("rain"); list.add ("in"); list.add ("Spain"); list.add ("falls"); list.add ("mainly"); list.add ("on"); list.add ("the"); list.add ("plain"); List l1a = reverse1 (list); Object o1 = l1a.get (0); List<String> l1b = reverse1 (list); String s1b = l1b.get (0); // Next line fails to typecheck: // String s1c = reverse1 (list).get (0); List l2a = reverse2 (list); Object o2 = l2a.get (0); List<String> l2b = reverse2 (list); String s2b = l2b.get (0); String s2c = reverse2 (list).get (0); // Type inference means that we don't have to type: List<String> l2d = this.<String>reverse2 (list); }
Type Checking Revisited: Type Inference [114/133] |
Type inference can help developers using functions or methods that exhibit generic polymorphism.
We will look at type inference for a simple functional language. See section 6.6 of the Dragon book for more details.
The type inference process subsumes type checking: if the type inference mechanism returns a type, then the expression must be well-typed.
In addition, we should infer the most general type possible.
For example, we would be unhappy if the type inference process found the type int list -> int list for reverse and refused to allow its use on lists of strings.
Type Checking Revisited: Substitutions and Instances [115/133] |
Consider the following grammar for types:
t, u ::= int | string | 'a (type variable) | t list (list of t) | t -> u (functions from t to u)
A substitution is a map from type variables to other types. Applying a substitution replaces type variables whenever they are in the domain of the map.
If a type t can be obtained by applying a substitution to another type u, then we write t <= u.
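A sketch of applying a substitution, over an assumed minimal type representation (type variables and type constructors; the class names are invented for illustration):

```java
import java.util.*;

// Sketch of substitution application over an assumed type representation:
// Var is a type variable like 'a; Con is a constructor such as int (no
// arguments), list (one argument), or -> (two arguments).
public class Subst {
    static abstract class Type {}

    static class Var extends Type {
        final String name;
        Var(String name) { this.name = name; }
        public String toString() { return name; }
    }

    static class Con extends Type {
        final String op; final Type[] args;
        Con(String op, Type... args) { this.op = op; this.args = args; }
        public String toString() {
            if (args.length == 0) return op;                     // int
            if (args.length == 1) return args[0] + " " + op;     // t list
            return "(" + args[0] + " " + op + " " + args[1] + ")"; // t -> u
        }
    }

    // Replace every variable in the domain of the map, leaving the rest alone.
    static Type apply(Map<String, Type> s, Type t) {
        if (t instanceof Var)
            return s.getOrDefault(((Var) t).name, t);
        Con c = (Con) t;
        Type[] as = new Type[c.args.length];
        for (int i = 0; i < as.length; i++) as[i] = apply(s, c.args[i]);
        return new Con(c.op, as);
    }
}
```

Applying {'a := int} to 'a list yields int list, so int list <= 'a list in the ordering just defined.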
What is the relationship between the following types?
'a list 'b list int list string list ('b -> 'c) list ('c -> 'b) list (int -> 'c) list ('b -> string) list (int -> string) list (int -> int) list (string -> string) list ('d -> 'd) list
Now we can say that the type inference process should find the maximum element (assuming that there are any maximal elements and that there is only one maximal element). We have to work up to the equivalence induced by the preorder <=.
See p370 of the Dragon book.
Type Checking Revisited: Simple Functional Language [116/133] |
Consider a simple functional language:
M, N ::= n (n is a number) | s (s is a string) | x (x is a variable) | [] (the empty list) | M :: N (cons M onto the list N) | if M then N1 else N2 | fun f x => M (function that can call itself recursively using f inside M) | M N (application of a function to an argument) | let x = M in N (bind the result of evaluating M to x inside N)
We might assume some builtin functions:
head : 'a list -> 'a tail : 'a list -> 'a list add : int -> (int -> int) is_zero : int -> boolean length : 'a list -> int
Aside: how would we translate the higher-order type int -> (int -> int) to a Java type?
Type Checking Revisited: Example Program [117/133] |
An example program that uses take at two different types:
let take = fun f i => fun g xs => if (is_zero (i)) then [] else cons (head (xs)) (f (i - 1) (tail (xs))) in (length (take 2 ["a", "b", "c"])) + (length (take 1 [1, 2, 3]))
Type Checking Revisited: Hindley-Milner Type Inference [118/133] |
We differentiate between types of the form:
int 'a -> 'a list ('a -> 'b) -> ('a list) -> ('b list)
And type schemes that bind free type variables:
forall [] int forall ['a] 'a -> 'a list forall ['a, 'b] ('a -> 'b) -> ('a list) -> ('b list)
The type inference process assigns fresh type variables to bound variables in an expression, and then generates equality constraints between types. Those constraints must be satisfiable for the expression to be well-typed.
In addition, when we use a type scheme, we will instantiate the type scheme with fresh type variables.
For example, suppose we started with:
fun f x => head (head x)
We would generate fresh type variables 'a and 'b and say:
f : 'a -> 'b x : 'a head (head x) must have type 'b
head has type scheme forall ['a] ('a list) -> 'a, but the type variable 'a is bound and so is unrelated to the previous 'a.
We instantiate the innermost head with a new type variable 'c:
head : 'c list -> 'c
Now we have head x, so we know that:
'c list = 'a head x : 'c
Now we use the type scheme for head with another fresh type variable to get:
head : 'd list -> 'd
Applying that to head (head x), we have:
'd list = 'c head (head x) : 'd
Finally, we must add the constraint:
'b = 'd
So the overall constraints are:
'b = 'd 'd list = 'c 'c list = 'a
And the expression has type 'a -> 'b, which simplifies to ('d list) list -> 'd. Thus the expression has the type scheme forall ['d] ('d list) list -> 'd.
Type Checking Revisited: Hindley-Milner Type Inference: Implementation I [119/133] |
Not a full implementation because it generates constraints but does not solve them.
See the file:hm directory.
Expressions (file:Exp.java) and types (file:Type.java) are represented in the usual way.
A type scheme (file:TypeScheme.java) is a list of bound type variables and a type.
A constraint (file:Constraint.java) represents equality between a pair of types.
The work is performed by file:Infer.java.
The test file (file:Test.java) infers a type for the take function.
Compile and run with:
C:\l\csc448\examples\hm>javac -classpath . *.java
C:\l\csc448\examples\hm>java -classpath . Test
Output is:
(fun f i => (fun g xs => (if (is_zero i) then [] else ((head xs)::((f ((add i) -1)) (tail xs)))))) Inferred type is: ('A0 -> 'A1) Constraints: ('A2 -> 'A3) = 'A1 ('A6 list) = 'A3 ('A6 list) = 'A20 'A5 = boolean ('A9 list) = 'A20 'A18 = 'A19 'A15 = ('A19 -> 'A20) 'A2 = 'A17 ('A16 list) = 'A18 ('A16 list) = 'A17 'A13 = 'A14 'A1 = 'A15 'A0 = 'A14 'A12 = int 'A11 = ('A12 -> 'A13) 'A0 = 'A10 (int -> int) = 'A11 'A10 = int 'A2 = 'A8 'A7 = 'A9 ('A7 list) = 'A8 'A0 = 'A4 'A5 = boolean 'A4 = int
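The missing step is solving these equality constraints, which is first-order unification. A sketch, under an assumed encoding (a String starting with ' is a type variable, any other String is a base type, and a List is a constructor followed by its arguments); the occurs check is omitted for brevity:

```java
import java.util.*;

// Sketch of first-order unification over an assumed type encoding:
// Strings starting with ' are variables, other Strings are base types,
// and Lists are compound types (constructor followed by arguments).
// The occurs check is omitted, so cyclic types are not detected.
public class Unify {
    static boolean isVar(Object t) {
        return t instanceof String && ((String) t).startsWith("'");
    }

    // Chase a variable through the substitution built so far.
    static Object resolve(Map<String, Object> s, Object t) {
        while (isVar(t) && s.containsKey(t)) t = s.get(t);
        return t;
    }

    static void unify(Map<String, Object> s, Object a, Object b) {
        a = resolve(s, a);
        b = resolve(s, b);
        if (a.equals(b)) return;
        if (isVar(a)) { s.put((String) a, b); return; } // bind variable
        if (isVar(b)) { s.put((String) b, a); return; }
        if (a instanceof List && b instanceof List) {
            List<?> la = (List<?>) a, lb = (List<?>) b;
            if (la.size() != lb.size())
                throw new IllegalStateException("cannot unify");
            for (int i = 0; i < la.size(); i++)
                unify(s, la.get(i), lb.get(i)); // unify componentwise
            return;
        }
        throw new IllegalStateException("cannot unify " + a + " and " + b);
    }
}
```

If unification of all the generated constraints succeeds, the expression is well-typed and the accumulated substitution gives its inferred type.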
Type Checking Revisited: Hindley-Milner Type Inference: Implementation II [120/133] |
We can also see the inferred types of subexpressions and the constraints as they are created:
(fun f i => (fun g xs => (if (is_zero i) then [] else ((head xs)::((f ((add i) -1)) (tail xs)))))) Inferred is_zero : (int -> boolean) Inferred i : 'A0 Added constraint: 'A4 = int Added constraint: 'A5 = boolean Added constraint: 'A0 = 'A4 Inferred (is_zero i) : 'A5 Inferred [] : ('A6 list) Inferred head : (('A7 list) -> 'A7) Inferred xs : 'A2 Added constraint: ('A7 list) = 'A8 Added constraint: 'A7 = 'A9 Added constraint: 'A2 = 'A8 Inferred (head xs) : 'A9 Inferred f : ('A0 -> 'A1) Inferred add : (int -> (int -> int)) Inferred i : 'A0 Added constraint: 'A10 = int Added constraint: (int -> int) = 'A11 Added constraint: 'A0 = 'A10 Inferred (add i) : 'A11 Inferred -1 : int Added constraint: 'A11 = ('A12 -> 'A13) Added constraint: 'A12 = int Inferred ((add i) -1) : 'A13 Added constraint: 'A0 = 'A14 Added constraint: 'A1 = 'A15 Added constraint: 'A13 = 'A14 Inferred (f ((add i) -1)) : 'A15 Inferred tail : (('A16 list) -> ('A16 list)) Inferred xs : 'A2 Added constraint: ('A16 list) = 'A17 Added constraint: ('A16 list) = 'A18 Added constraint: 'A2 = 'A17 Inferred (tail xs) : 'A18 Added constraint: 'A15 = ('A19 -> 'A20) Added constraint: 'A18 = 'A19 Inferred ((f ((add i) -1)) (tail xs)) : 'A20 Added constraint: ('A9 list) = 'A20 Inferred ((head xs)::((f ((add i) -1)) (tail xs))) : 'A20 Added constraint: 'A5 = boolean Added constraint: ('A6 list) = 'A20 Inferred (if (is_zero i) then [] else ((head xs)::((f ((add i) -1)) (tail xs)))) : ('A6 list) Added constraint: ('A6 list) = 'A3 Inferred (fun g xs => (if (is_zero i) then [] else ((head xs)::((f ((add i) -1)) (tail xs))))) : ('A2 -> 'A3) Added constraint: ('A2 -> 'A3) = 'A1 Inferred (fun f i => (fun g xs => (if (is_zero i) then [] else ((head xs)::((f ((add i) -1)) (tail xs)))))) : ('A0 -> 'A1) Inferred type is: ('A0 -> 'A1) Constraints: ('A2 -> 'A3) = 'A1 ('A6 list) = 'A3 ('A6 list) = 'A20 'A5 = boolean ('A9 list) = 'A20 'A18 = 'A19 'A15 = ('A19 -> 'A20) 'A2 = 'A17 ('A16 list) = 'A18 ('A16 list) = 'A17 'A13 = 'A14 'A1 = 
'A15 'A0 = 'A14 'A12 = int 'A11 = ('A12 -> 'A13) 'A0 = 'A10 (int -> int) = 'A11 'A10 = int 'A2 = 'A8 'A7 = 'A9 ('A7 list) = 'A8 'A0 = 'A4 'A5 = boolean 'A4 = int
Type Checking Revisited: Let Polymorphism I [121/133] |
Why is let x = M in N more difficult?
let take = fun f i => fun g xs => if (is_zero (i)) then [] else cons (head (xs)) (f (i - 1) (tail (xs))) in (length (take 2 ["a", "b", "c"])) + (length (take 1 [1, 2, 3]))
We need take to be bound to a type scheme instead of a type.
Type Checking Revisited: Let Polymorphism II [122/133] |
However, we must be careful about binding type variables that are used elsewhere when we introduce type schemes.
fun f x => let y = x in y (head y)
First assign fresh type variables:
f : 'a -> 'b x : 'a
Then we see that y is bound to x : 'a, so we might naively say that y has type scheme forall ['a] 'a.
Unfortunately, this generalisation is not sound.
The type inference must check that type variables do not occur freely in the context before generalising them in this way. In this case, the type variable 'a appears freely in the context because x has that type.
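The generalisation rule can be sketched as: quantify exactly those type variables of the inferred type that do not occur free in the context. The representation below is an assumption (Strings starting with ' are variables, Lists are compound types):

```java
import java.util.*;

// Sketch of sound let-generalisation over an assumed type encoding:
// collect the free variables of the inferred type, then remove every
// variable that also occurs free in some type in the context.
public class Generalise {
    static void freeVars(Object t, Set<String> out) {
        if (t instanceof String) {
            if (((String) t).startsWith("'")) out.add((String) t);
        } else {
            for (Object u : (List<?>) t) freeVars(u, out);
        }
    }

    // Variables of `type` that are safe to bind in "forall [...]".
    static Set<String> generalisable(Object type, Collection<Object> contextTypes) {
        Set<String> tv = new HashSet<>();
        freeVars(type, tv);
        Set<String> ctx = new HashSet<>();
        for (Object c : contextTypes) freeVars(c, ctx);
        tv.removeAll(ctx); // never bind a variable free in the context
        return tv;
    }
}
```

On the slide's example, generalising y's type 'a against a context containing x : 'a yields the empty set, so y cannot be given the unsound scheme forall ['a] 'a.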
Code Generation Revisited: Overview of Today's Class [123/133] |
Code generation revisited: for functional and OO languages.
Code Generation Revisited: Discussion [124/133] |
Any questions?
Code Generation Revisited: Inheritance [125/133] |
Single inheritance vs multiple inheritance.
The vtables used here are sufficient for OO languages with single inheritance. We have to think about:
But what about multiple inheritance?
Why would you care about inheritance? For example, Microsoft's Component Object Model (COM).
Code Generation Revisited: Exceptions [126/133] |
The traditional solution to dealing with exceptional cases is to distinguish normal values from exceptional values, and to require every piece of code to check for exceptional values. However, this is neither robust nor manageable, even in modestly sized systems, nor is it appropriate for processing asynchronous errors.
Exceptions provide a control-flow mechanism that is quite different from other control-flow constructs found in common languages.
Exceptions provide a form of non-local jump where the destination is unknown: we must provide an exception name and the runtime system will jump to the most recent exception handler block that we are currently inside.
In addition, the runtime system may:
pass a value describing the error, e.g., a subclass of Throwable in Java.
run cleanup code during unwinding, e.g., finally blocks in Java.
Some examples:
There are many tricky details:
Java does not require unchecked exceptions (subclasses of RuntimeException or Error) to be declared, e.g., you need not declare ArrayStoreException or SecurityException everywhere.
What if, in f (e1, e2), e2 has a side-effect and e1 throws an exception? Is there a guaranteed order of evaluation?
Code Generation Revisited: Syntax and Operational Semantics [127/133] |
A simple functional language with exceptions, where we assume a fixed set of exception names.
M, N ::= ...usual arithmetic and relational operators... | x | fun x => M (no recursive function use) | M N | raise E (throw in Java) | M handle E1 => N_1 | ... | En => Nn (try-catch block in Java)
We can give an operational semantics that describes the runtime behaviour of such programs using two judgement forms:
Code Generation Revisited: Implementation Strategies I [128/133] |
Exception handling complicates the runtime system.
We must ensure that we can find the right exception handler block in the current function/method if there is one.
If not, we must remove one activation record and look for exception handler blocks in the preceding activation record.
Moreover, we must ensure execution of each dynamically enclosing finally block, and call destructors/finalizers in languages such as C++.
A simple strategy is to maintain a stack of exception handlers. Whenever we enter a new exception handler block, we push a new handler onto the stack. As with case statements, there are several implementation strategies: the handler might be a pointer to code, or it might be a pointer to a jump table with an entry for each exception handled.
In addition, handlers must be inserted to remove activation records when a function/method ends because of an exception.
This strategy has significant runtime costs, because we must maintain the stack of exception handlers as the normal flow of execution passes through them. Thus, we are constantly paying for the rare case when an exception is thrown.
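The handler-stack strategy can be sketched as follows (the names and the one-exception-per-handler shape are assumptions; real handlers may match several exception names via a jump table):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the stack-of-handlers strategy: entering a handler block
// pushes a handler; leaving it normally pops; raising an exception pops
// handlers until one claims the exception name, discarding the inner
// handlers in the process (unwinding).
public class HandlerStack {
    static class Handler {
        final String exn;    // exception name this handler catches
        final int codeAddr;  // jump target for the handler's code
        Handler(String exn, int codeAddr) {
            this.exn = exn; this.codeAddr = codeAddr;
        }
    }

    final Deque<Handler> stack = new ArrayDeque<>();

    void enterTry(String exn, int codeAddr) { stack.push(new Handler(exn, codeAddr)); }
    void leaveTry() { stack.pop(); }

    // Returns the code address to jump to, or -1 for an uncaught exception.
    int raise(String exn) {
        while (!stack.isEmpty()) {
            Handler h = stack.pop();
            if (h.exn.equals(exn)) return h.codeAddr;
        }
        return -1;
    }
}
```

The cost the slide mentions is visible here: enterTry and leaveTry run on every normal pass through a handler block, whether or not anything is ever raised.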
Code Generation Revisited: Implementation Strategies II [129/133] |
An alternative is to maintain a table for each function/method that contains the instruction pointer ranges for each exception handler block.
When an exception is encountered, the runtime system searches the table using the current instruction pointer as the key. It will find the address of a handler block to use.
If a handler block does not handle an exception, then the runtime system begins again with the address of the failed handler.
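The table-driven lookup can be sketched as follows (the field names and the linear search are assumptions; real runtimes typically keep the table sorted by address):

```java
import java.util.List;

// Sketch of the table-driven strategy: each entry maps an instruction
// pointer range to a handler address; raising an exception searches the
// table with the current instruction pointer as the key, and there is no
// per-block cost on the normal (no-exception) path.
public class HandlerTable {
    static class Entry {
        final int start, end, handler; // [start, end) covered by handler
        Entry(int start, int end, int handler) {
            this.start = start; this.end = end; this.handler = handler;
        }
    }

    // Return the handler covering `ip`, or -1 to mean "pop an activation
    // record and retry with the caller's saved instruction pointer".
    static int findHandler(List<Entry> table, int ip) {
        for (Entry e : table)
            if (e.start <= ip && ip < e.end) return e.handler;
        return -1;
    }
}
```

This moves the cost from normal execution to the moment an exception is raised, which is usually the right trade-off.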
Code Generation Revisited: Type and Effect Analysis [130/133] |
Java compilers check that we have caught or declared all possible checked exceptions and that we have not declared any exceptions that can't be thrown.
How is this analysis done?
We can write down a type and effect system that checks that an expression's declared exceptions are correct.
Then we can try to turn the checking system on its head to create an inference system that tells us which exceptions we should declare.
Oddly enough, we end up generating constraints and solving them again.
Code Generation Revisited: Better Exceptions [131/133] |
Corin did not show
Exceptional Syntax by Nick Benton and Andrew Kennedy
Code Generation Revisited: Compilation of Functional Languages [132/133] |
Compilation of functional languages is not as far removed from compilation of object-oriented languages as you might have thought:
Code Generation Revisited: Finally Blocks [133/133] |
Operational semantics for finally blocks?
Note that the Java issue with return statements does not appear here.
Revised: 2008/01/14 12:13