CSC373/406: Lecture 11 (Final Exam)

Contents [0/13]

1 [1/13]

2 [2/13]

3 [3/13]

4 [4/13]

5 [5/13]

6 [6/13]

7 [7/13]

8 [8/13]

9 [9/13]

10 [10/13]

11 [11/13]

12 [12/13]

13 [13/13]

1 [1/13]

Figure 1.3: The compilation system.

Figure 1.4: Hardware organization of a typical system.

Figure 1.5: Reading the hello command from the keyboard.

Figure 1.6: Loading the executable from disk into main memory.

Figure 1.7: Writing the output string from memory to the display.

Figure 1.8: Cache memories.

Figure 1.9: An example of a memory hierarchy.

Figure 1.10: Layered view of a computer system.

Figure 1.11: Abstractions provided by an operating system.

Figure 1.12: Process context switching.

Figure 1.13: Process virtual address space.

Figure 1.14: A network is another I/O device.

Figure 1.15: Using telnet to run hello remotely over a network.

2 [2/13]

Figure 2.11: Conversion from two's complement to unsigned.

Figure 2.12: Conversion from unsigned to two's complement.

Figure 2.14: Integer addition.

Figure 2.15: Relation between integer addition and unsigned addition.

Figure 2.16: Unsigned addition.

Figure 2.17: Relation between integer and two's-complement addition.

Figure 2.19: two's-complement addition.

Figure 2.22: Representable values for six-bit floating-point format.

Figure 2.22: Representable values for six-bit floating-point format.

3 [3/13]

Figure 3.5: Illustration of stack operation.

Figure 3.17: Stack frame structure.

Figure 3.19: Stack frames for caller and swap_add.

Figure 3.22: Stack frame for recursive Fibonacci function.

Figure 3.28: Stack organization for echo function.

4 [4/13]

Figure 4.1: Y86 programmer-visible state.

Figure 4.2: Y86 instruction set.

Figure 4.3: Function codes for Y86 instruction set.

Figure 4.8: Logic gate types.

Figure 4.9: Combinational circuit to test for bit equality.

Figure 4.10: Single-bit multiplexor circuit.

Figure 4.11: Word-level equality test circuit.

Figure 4.12: Word-level multiplexor circuit.

Figure 4.13: Arithmetic/logic unit (ALU).

Figure 4.14: Register operation.

Figure 4.20: Abstract view of SEQ, a sequential implementation.

Figure 4.21: Hardware structure of SEQ, a sequential implementation.

Figure 4.23: Tracing two cycles of execution by SEQ.

Figure 4.25: SEQ fetch stage.

Figure 4.26: SEQ decode and write-back stage.

Figure 4.27: SEQ execute stage.

Figure 4.28: SEQ memory stage.

Figure 4.29: SEQ PC update stage.

Figure 4.30: SEQ+ abstract view.

Figure 4.31: SEQ+ hardware structure.

Figure 4.32: Unpipelined computation hardware.

Figure 4.33: Three-stage pipelined computation hardware.

Figure 4.34: Three-stage pipeline timing.

Figure 4.35: One clock cycle of pipeline operation.

Figure 4.36: Limitations of pipelining due to nonuniform stage delays.

Figure 4.37: Limitations of pipelining due to overhead.

Figure 4.38: Limitations of pipelining due to logical dependencies.

Figure 4.39: Abstract view of PIPE-, an initial pipelined implementation.

Figure 4.40: Example of instruction flow through pipeline.

Figure 4.41: Hardware structure of PIPE-, an initial pipelined implementation.

Figure 4.42: Pipelined execution of prog1 without special pipeline control.

Figure 4.43: Pipelined execution of prog2 without special pipeline control.

Figure 4.44: Pipelined execution of prog3 without special pipeline control.

Figure 4.45: Pipelined execution of prog4 without special pipeline control.

Figure 4.46: Pipelined execution of prog2 using stalls.

Figure 4.47: Pipelined execution of prog3 using stalls.

Figure 4.48: Pipelined execution of prog4 using stalls.

Figure 4.49: Pipelined execution of prog2 using forwarding.

Figure 4.50: Pipelined execution of prog3 using forwarding.

Figure 4.51: Pipelined execution of prog4 using forwarding.

Figure 4.52: Abstract view of PIPE, our final pipelined implementation.

Figure 4.53: Hardware structure of PIPE, our final pipelined implementation.

Figure 4.54: Example of load/use data hazard.

Figure 4.55: Handling a load/use hazard by stalling.

Figure 4.56: PIPE PC selection and fetch logic.

Figure 4.57: PIPE decode and write-back stage logic.

Figure 4.58: Demonstration of forwarding priority.

Figure 4.59: PIPE execute stage logic.

Figure 4.60: PIPE memory stage logic.

Figure 4.61: Simplified view of ret instruction processing.

Figure 4.62: Actual processing of the ret instruction.

Figure 4.63: Processing mispredicted branch instructions.

Figure 4.65: Additional pipeline register operations.

Figure 4.67: Pipeline states for special control conditions.

Figure 4.68: PIPE pipeline control logic.

Figure 4.69: Execute and memory stages capable of load forwarding.

5 [5/13]

Figure 5.2: Performance of vector sum functions.

Figure 5.3: Vector abstract data type.

Figure 5.8: Comparative performance of lower-case conversion routines.

Figure 5.11: Block diagram of a modern processor.

Figure 5.13: Operations for first iteration of inner loop of combine4 for integer multiplication.

Figure 5.14: Operations for first iteration of inner loop of combine4 for integer addition.

Figure 5.15: Scheduling of operations for integer multiplication with unlimited number of execution units.

Figure 5.16: Scheduling of operations for integer addition with unbounded resource constraints.

Figure 5.17: Scheduling of operations for integer multiplication with actual resource constraints.

Figure 5.18: Scheduling of operations for integer addition with actual resource constraints.

Figure 5.20: Operations for first iteration of inner loop of three-way unrolled integer addition.

Figure 5.21: Scheduling of operations for three-way unrolled integer sum with bounded resource constraints.

Figure 5.25: Operations for first iteration of inner loop of two-way unrolled, two-way parallel integer multiplication.

Figure 5.26: Scheduling of operations for two-way unrolled, two-way parallel integer multiplication with unlimited resources.

Figure 5.31: Scheduling of operations for list length function.

Figure 5.33: Code to write and read memory locations, along with illustrative executions.

Figure 5.34: Detail of load and store units.

Figure 5.35: Timing of write_read for example A.

Figure 5.36: Timing of write_read for example B.

Figure 5.37: Profile results for different version of word frequency counting program.

Figure 5.37: Profile results for different version of word frequency counting program.

6 [6/13]

Figure 6.1: Inverted pendulum.

Figure 6.3: High level view of a 128-bit 16 x 8 DRAM chip.

Figure 6.4: Reading the contents of a DRAM supercell.

Figure 6.4: Reading the contents of a DRAM supercell.

Figure 6.5: Reading the contents of a memory module.

Figure 6.6: Typical bus structure that connects the CPU and main memory.

Figure 6.7: Memory read transaction for a load operation: movl A,%eax.

Figure 6.7: Memory read transaction for a load operation: movl A,%eax.

Figure 6.7: Memory read transaction for a load operation: movl A,%eax.

Figure 6.8: Memory write transaction for a store operation: movl %eax,A.

Figure 6.8: Memory write transaction for a store operation: movl %eax,A.

Figure 6.8: Memory write transaction for a store operation: movl %eax,A.

Figure 6.9: Disk geometry.

Figure 6.9: Disk geometry.

Figure 6.10: Disk dynamics.

Figure 6.10: Disk dynamics.

Figure 6.11: Typical bus structure that connects the CPU, main memory, and I/O devices.

Figure 6.12: Reading a disk sector.

Figure 6.12: Reading a disk sector.

Figure 6.12: Reading a disk sector.

Figure 6.16: The increasing gap between DRAM, disk, and CPU speeds.

Figure 6.21: The memory hierarchy.

Figure 6.22: The basic principle of caching in a memory hierarchy.

Figure 6.24: Typical bus structure for L1 and L2 caches.

Figure 6.25: General organization of cache (S, E, B, m).

Figure 6.25: General organization of cache (S, E, B, m).

Figure 6.27: Direct-mapped cache (E = 1).

Figure 6.28: Set selection in a direct-mapped cache.

Figure 6.29: Line matching and word selection in a direct-mapped cache.

Figure 6.31: Why caches index with the middle bits.

Figure 6.32: Set associative cache (1 < E < C/B).

Figure 6.33: Set selection in a set associative cache.

Figure 6.34: Line matching and word selection in a set associative cache.

Figure 6.35: Fully set associative cache (E = C/B).

Figure 6.36: Set selection in a fully associative cache.

Figure 6.37: Line matching and word selection in a fully associative cache.

Figure 6.38: A typical multi-level cache organization.

Figure 6.42: The memory mountain.

Figure 6.43: Ridges of temporal locality in the memory mountain.

Figure 6.44: A slope of spatial locality.

Figure 6.47: Pentium III Xeon matrix multiply performance.

Figure 6.49: Graphical interpretation of blocked matrix multiply

Figure 6.50: Pentium III Xeon blocked matrix multiply performance.

7 [7/13]

Figure 7.2: Static linking.

Figure 7.3: Typical ELF relocatable object file.

Figure 7.7: Linking with static libraries.

Figure 7.11: Typical ELF executable object file

Figure 7.13: Linux run-time memory image

Figure 7.15: Dynamic linking with shared libraries.

8 [8/13]

Figure 8.1: Anatomy of an exception.

Figure 8.2: Exception table.

Figure 8.3: Generating the address of an exception handler.

Figure 8.5: Interrupt handling.

Figure 8.6: Trap handling.

Figure 8.7: Fault handling.

Figure 8.8: Abort handling.

Figure 8.10: Logical control flows.

Figure 8.11: Process address space.

Figure 8.12: Anatomy of a process context switch.

Figure 8.14: Examples of fork programs.

Figure 8.14: Examples of fork programs.

Figure 8.14: Examples of fork programs.

Figure 8.17: Organization of an argument list.

Figure 8.18: Organization of an environment variable list.

Figure 8.19: Typical organization of the user stack when a new program starts.

Figure 8.24: Foreground and background process groups.

9 [9/13]

Figure 9.1: Time scale of computer system events.

Figure 9.2: System's vs.~applications view of time.

Figure 9.4: Graphical representation of trace in Figure ....

Figure 9.6: Graphical representation of activity periods for trace in Figure ....

Figure 9.7: Process timing by interval counting.

Figure 9.8: Measuring interval counting accuracy.

Figure 9.11: Measurements of long duration procedure under different loading conditions.

Figure 9.12: measurements of short duration procedure under different loading conditions.

Figure 9.14: Experimental validation of K-best measurement scheme on Linux system

Figure 9.15: Effectiveness of K-best scheme for different values of K.

Figure 9.15: Effectiveness of K-best scheme for different values of K.

Figure 9.15: Effectiveness of K-best scheme for different values of K.

Figure 9.15: Effectiveness of K-best scheme for different values of K.

Figure 9.16: Measurements with compensation for timer interrupt overhead.

Figure 9.17: Experimental validation of K-best measurement scheme on IA32/Linux system with older version of the kernel.

Figure 9.18: Experimental validation of K-best measurement scheme on Windows-NT system.

Figure 9.19: Experimental validation of K-best measurement scheme on Compaq Alpha system.

Figure 9.22: Experimental validation of K-best measurement scheme using gettimeofday function.

10 [10/13]

Figure 10.1: A system that uses physical addressing.

Figure 10.2: A system that uses virtual addressing.

Figure 10.3: How a VM system uses main memory as a cache.

Figure 10.4: Page table.

Figure 10.5: VM page hit.

Figure 10.6: VM page fault (before).

Figure 10.7: VM page fault (after).

Figure 10.8: Allocating a new virtual page.

Figure 10.9: How VM provides processes with separate address spaces.

Figure 10.10: The memory image of a Linux process.

Figure 10.11: Using VM to provide page-level memory protection.

Figure 10.13: Address translation with a page table.

Figure 10.14: Operational view of page hits and page faults.

Figure 10.14: Operational view of page hits and page faults.

Figure 10.15: Integrating VM with a physically addressed cache.

Figure 10.16: Components of a virtual address that are used to access the TLB.

Figure 10.17: Operational view of a TLB hit and miss.

Figure 10.17: Operational view of a TLB hit and miss.

Figure 10.18: A two-level page table hierarchy.

Figure 10.19: Address translation with a k-level page table.

Figure 10.20: Addressing for small memory system.

Figure 10.21: TLB, page table, and cache for small memory system.

Figure 10.21: TLB, page table, and cache for small memory system.

Figure 10.21: TLB, page table, and cache for small memory system.

Figure 10.22: The Pentium memory system.

Figure 10.23: Summary of Pentium address translation.

Figure 10.24: Pentium multi level page table.

Figure 10.25: Formats of Pentium page directory entry (PDE) and page table entry (PTE).

Figure 10.25: Formats of Pentium page directory entry (PDE) and page table entry (PTE).

Figure 10.26: Pentium page table translation.

Figure 10.27: Pentium TLB translation.

Figure 10.28: The virtual memory of a Linux process.

Figure 10.29: How Linux organizes virtual memory.

Figure 10.30: Linux page fault handling.

Figure 10.31: A shared object.

Figure 10.31: A shared object.

Figure 10.32: A private copy-on-write object.

Figure 10.32: A private copy-on-write object.

Figure 10.33: How the loader maps the areas of the user address space.

Figure 10.34: Visual interpretation of mmap arguments.