Cover
Parallel Computer Organization and Design
Title
Copyright
Contents
Preface
Book outline
Acknowledgments
Michel Dubois
Murali Annavaram
Per Stenström
1 Introduction
1.1 WHAT IS COMPUTER ARCHITECTURE?
1.2 COMPONENTS OF A PARALLEL ARCHITECTURE
1.2.1 Processors
1.2.2 Memory
1.2.3 Interconnects
1.3 PARALLELISM IN ARCHITECTURES
1.3.1 Instruction-level parallelism (ILP)
1.3.2 Thread-level parallelism (TLP)
1.3.3 Vector and array processors
1.4 PERFORMANCE
1.4.1 Benchmarking
1.4.2 Amdahl's law
1.5 TECHNOLOGICAL CHALLENGES
1.5.1 Power and energy
1.5.2 Reliability
1.5.3 Wire delays
1.5.4 Design complexity
1.5.5 Limits of miniaturization and the CMOS endpoint
2 Impact of technology
2.1 CHAPTER OVERVIEW
2.2 BASIC LAWS OF ELECTRICITY
2.2.1 Ohm's law
2.2.2 Resistors
2.2.3 Capacitors
2.3 THE MOSFET TRANSISTOR AND CMOS INVERTER
2.4 TECHNOLOGY SCALING
2.5 POWER AND ENERGY
2.5.1 Dynamic power
2.5.2 Static power
2.5.3 Power and energy metrics
2.6 RELIABILITY
2.6.1 Faults versus errors
2.6.2 Reliability metrics
2.6.3 Failure rate and burn-in
2.6.4 Transient faults
2.6.5 Intermittent faults
2.6.6 Permanent faults
2.6.7 Process variations and their impact on faults
3 Processor architecture
3.1 CHAPTER OVERVIEW
3.2 INSTRUCTION SET ARCHITECTURE
3.2.1 Instruction types and opcodes
3.2.2 Instruction mixes
3.2.3 Instruction operands
3.2.4 Exceptions, traps, and interrupts
3.2.5 Memory-consistency model
3.2.6 Core ISA used in this book
3.2.7 CISC vs. RISC
3.3 STATICALLY SCHEDULED PIPELINES
3.3.1 The classic 5-stage pipeline
3.3.2 Out-of-order instruction completion
3.3.3 Superpipelined and superscalar CPUs
3.3.4 Branch prediction
3.3.5 Static instruction scheduling
3.3.6 Strengths and weaknesses of static pipelines
3.4 DYNAMICALLY SCHEDULED PIPELINES
3.4.1 Enforcing data dependencies: Tomasulo algorithm
3.4.2 Speculative execution: execution beyond unresolved branches
3.4.3 Dynamic branch prediction
3.4.4 Adding speculation to the Tomasulo algorithm
3.4.5 Dynamic memory disambiguation
3.4.6 Explicit register renaming
3.4.7 Register fetch after instruction issue
3.4.8 Speculative instruction scheduling
3.4.9 Beating the data-flow limit: value prediction
3.4.10 Multiple instructions per clock
3.4.11 Dealing with complex ISAs
3.5 VLIW MICROARCHITECTURES
3.5.1 Duality of dynamic and static techniques
3.5.2 VLIW architecture
3.5.3 Loop unrolling
3.5.4 Software pipelining
3.5.5 Non-cyclic VLIW scheduling
3.5.6 Predicated instructions
3.5.7 Speculative memory disambiguation
3.5.8 Exceptions
3.6 EPIC MICROARCHITECTURES
3.7 VECTOR MICROARCHITECTURES
3.7.1 Arithmetic/logic vector instructions
3.7.2 Memory vector instructions
3.7.3 Vector strip mining and chaining
3.7.4 Conditional statements
3.7.5 Scatter and gather
4 Memory hierarchies
4.1 CHAPTER OVERVIEW
4.2 THE PYRAMID OF MEMORY LEVELS
4.2.1 Memory-access locality
4.2.2 Memory hierarchy coherence
4.2.3 Memory inclusion
4.3 CACHE HIERARCHY
4.3.1 Cache mapping and organization
4.3.2 Replacement policies
4.3.3 Write policies
4.3.4 Cache hierarchy performance
4.3.5 Classification of cache misses
4.3.6 Non-blocking (lockup-free) caches
4.3.7 Cache prefetching and preloading
4.4 VIRTUAL MEMORY
4.4.1 Motivations for virtual memory
4.4.2 Operating system's view of virtual memory
4.4.3 Virtual address translation
4.4.4 Memory-access control
4.4.5 Hierarchical page tables
4.4.6 Inverted page table
4.4.7 Translation lookaside buffer
4.4.8 Virtual-address caches with physical tags
4.4.9 Virtual-address caches with virtual tags
5 Multiprocessor systems
5.1 CHAPTER OVERVIEW
5.2 PARALLEL-PROGRAMMING MODEL ABSTRACTIONS
5.2.1 Shared-memory systems
5.2.2 Message-passing systems
5.3 MESSAGE-PASSING MULTIPROCESSOR SYSTEMS
5.3.1 Message-passing primitives
5.3.2 Message-passing protocols
5.3.3 Hardware support for message-passing protocols
5.4 BUS-BASED SHARED-MEMORY SYSTEMS
5.4.1 Multiprocessor cache organizations
5.4.2 A simple snoopy cache protocol
5.4.3 Design space of snoopy cache protocols
5.4.4 Protocol variations
5.4.5 Design issues for multi-phase snoopy cache protocols
5.4.6 Classification of communication events
5.4.7 Translation lookaside buffer (TLB) consistency
5.5 SCALABLE SHARED-MEMORY SYSTEMS
5.5.1 Directory protocols: concepts and terminology
5.5.2 Implementation of a directory protocol
5.5.3 Scalability of directory protocols
5.5.4 Hierarchical systems
5.5.5 Page migration and replication
5.6 CACHE-ONLY SHARED-MEMORY SYSTEMS
5.6.1 Basic concepts, hardware structures, and protocols
5.6.2 Flat COMA
6 Interconnection networks
6.1 CHAPTER OVERVIEW
6.2 DESIGN SPACE OF INTERCONNECTION NETWORKS
6.2.1 Overview of design concepts
6.2.2 Latency and bandwidth models
6.3 SWITCHING STRATEGIES
6.4 TOPOLOGIES
6.4.1 Indirect networks
6.4.2 Direct networks
6.5 ROUTING TECHNIQUES
6.5.1 Routing algorithms
6.5.2 Deadlock avoidance and deterministic routing
6.5.3 Relaxing routing restrictions: virtual channels and the turn model
6.5.4 Relaxing routing further: adaptive routing
6.6 SWITCH ARCHITECTURE
7 Coherence, synchronization, and memory consistency
7.1 CHAPTER OVERVIEW
7.2 BACKGROUND
7.2.1 Shared-memory communication model
7.2.2 Hardware components
7.3 COHERENCE AND STORE ATOMICITY
7.3.1 Why is coherence in multiprocessors so hard?
7.3.2 Cache protocols
7.3.3 Store atomicity
7.3.4 Plain coherence
7.3.5 Store atomicity and memory interleaving
7.4 SEQUENTIAL CONSISTENCY
7.4.1 Formal model for sequential consistency
7.4.2 Access ordering rules for sequential consistency
7.4.3 Inbound message management
7.4.4 Store synchronization
7.5 SYNCHRONIZATION
7.5.1 Basic synchronization primitives
7.5.2 Hardware-based synchronization
7.5.3 Software-based synchronization
7.6 RELAXED MEMORY-CONSISTENCY MODELS
7.6.1 Relaxed models not relying on synchronization
7.6.2 Relaxed models relying on synchronization
7.7 SPECULATIVE VIOLATIONS OF MEMORY ORDERS
7.7.1 Conservative memory model enforcement in OoO processors
7.7.2 Speculative violations of memory orders
8 Chip multiprocessors
8.1 CHAPTER OVERVIEW
8.2 RATIONALE BEHIND CMPS
8.2.1 Technological trends
8.2.2 Opportunities
8.3 CORE MULTI-THREADING
8.3.1 Software-supported multi-threading
8.3.2 Hardware-supported multi-threading
8.3.3 Block (coarse-grain) multi-threading
8.3.4 Interleaved (fine-grain) multi-threading
8.3.5 Simultaneous multi-threading in OoO processors
8.4 CHIP MULTIPROCESSOR ARCHITECTURES
8.4.1 Homogeneous CMP architectures
8.4.2 CMPs with heterogeneous cores
8.4.3 Conjoined cores
8.5 PROGRAMMING MODELS
8.5.1 Independent processes
8.5.2 Explicit thread parallelization
8.5.3 Transactional memory
8.5.4 Thread-level speculation
8.5.5 Helper threads
8.5.6 Redundant execution to improve reliability
9 Quantitative evaluations
9.1 CHAPTER OVERVIEW
9.2 TAXONOMY OF SIMULATORS
9.2.1 User-level versus full-system simulators
9.2.2 Functional versus cycle-accurate simulators
9.2.3 Trace-driven, execution-driven, and direct-execution simulators
9.3 INTEGRATING SIMULATORS
9.3.1 Functional-first simulator integration
9.3.2 Timing-first simulator integration
9.4 MULTIPROCESSOR SIMULATORS
9.4.1 Sequential multiprocessor simulators
9.4.2 Parallel multiprocessor simulators
9.5 POWER AND THERMAL SIMULATIONS
9.6 WORKLOAD SAMPLING
9.6.1 Sampling microarchitecture simulation
9.6.2 SimPoint
9.7 WORKLOAD CHARACTERIZATION
9.7.1 Understanding performance bottlenecks
9.7.2 Synthetic benchmarks
9.7.3 Projecting workload behavior
Index