Computer Architecture A Quantitative Approach 6th Edition.pdf
Front Cover
Inside Front Cover
In Praise of Computer Architecture: A Quantitative Approach, Sixth Edition
Computer Architecture: A Quantitative Approach
Copyright
Foreword
Contents
Preface
Why We Wrote This Book
This Edition
Topic Selection and Organization
An Overview of the Content
Navigating the Text
Chapter Structure
Case Studies With Exercises
Supplemental Materials
Helping Improve This Book
Concluding Remarks
Acknowledgments
Contributors to the Sixth Edition
Reviewers
Appendices
Case Studies With Exercises
Additional Material
Contributors to Previous Editions
Reviewers
Appendices
Exercises
Case Studies With Exercises
Special Thanks
Chapter 1: Fundamentals of Quantitative Design and Analysis
1.1. Introduction
1.2. Classes of Computers
Internet of Things/Embedded Computers
Personal Mobile Device
Desktop Computing
Servers
Clusters/Warehouse-Scale Computers
Classes of Parallelism and Parallel Architectures
1.3. Defining Computer Architecture
Instruction Set Architecture: The Myopic View of Computer Architecture
Genuine Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements
1.4. Trends in Technology
Performance Trends: Bandwidth Over Latency
Scaling of Transistor Performance and Wires
1.5. Trends in Power and Energy in Integrated Circuits
Power and Energy: A Systems Perspective
Energy and Power Within a Microprocessor
The Shift in Computer Architecture Because of Limits of Energy
1.6. Trends in Cost
The Impact of Time, Volume, and Commoditization
Cost of an Integrated Circuit
Cost Versus Price
Cost of Manufacturing Versus Cost of Operation
1.7. Dependability
1.8. Measuring, Reporting, and Summarizing Performance
Benchmarks
Desktop Benchmarks
Server Benchmarks
Reporting Performance Results
Summarizing Performance Results
1.9. Quantitative Principles of Computer Design
Take Advantage of Parallelism
Principle of Locality
Focus on the Common Case
Amdahl's Law
The Processor Performance Equation
1.10. Putting It All Together: Performance, Price, and Power
1.11. Fallacies and Pitfalls
1.12. Concluding Remarks
1.13. Historical Perspectives and References
Case Studies and Exercises by Diana Franklin
Case Study 1: Chip Fabrication Cost
Concepts illustrated by this case study
Case Study 2: Power Consumption in Computer Systems
Concepts illustrated by this case study
Exercises
Chapter 2: Memory Hierarchy Design
2.1. Introduction
Basics of Memory Hierarchies: A Quick Review
2.2. Memory Technology and Optimizations
SRAM Technology
DRAM Technology
Improving Memory Performance Inside a DRAM Chip: SDRAMs
Reducing Power Consumption in SDRAMs
Graphics Data RAMs
Packaging Innovation: Stacked or Embedded DRAMs
Flash Memory
Phase-Change Memory Technology
Enhancing Dependability in Memory Systems
2.3. Ten Advanced Optimizations of Cache Performance
First Optimization: Small and Simple First-Level Caches to Reduce Hit Time and Power
Second Optimization: Way Prediction to Reduce Hit Time
Third Optimization: Pipelined Access and Multibanked Caches to Increase Bandwidth
Fourth Optimization: Nonblocking Caches to Increase Cache Bandwidth
Implementing a Nonblocking Cache
Fifth Optimization: Critical Word First and Early Restart to Reduce Miss Penalty
Sixth Optimization: Merging Write Buffer to Reduce Miss Penalty
Seventh Optimization: Compiler Optimizations to Reduce Miss Rate
Loop Interchange
Blocking
Eighth Optimization: Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate
Ninth Optimization: Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
Tenth Optimization: Using HBM to Extend the Memory Hierarchy
Cache Optimization Summary
2.4. Virtual Memory and Virtual Machines
Protection via Virtual Memory
Protection via Virtual Machines
Requirements of a Virtual Machine Monitor
Instruction Set Architecture Support for Virtual Machines
Impact of Virtual Machines on Virtual Memory and I/O
Extending the Instruction Set for Efficient Virtualization and Better Security
An Example VMM: The Xen Virtual Machine
2.5. Cross-Cutting Issues: The Design of Memory Hierarchies
Protection, Virtualization, and Instruction Set Architecture
Autonomous Instruction Fetch Units
Speculation and Memory Access
Special Instruction Caches
Coherency of Cached Data
2.6. Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700
The ARM Cortex-A53
Performance of the Cortex-A53 Memory Hierarchy
The Intel Core i7 6700
Performance of the i7 memory system
2.7. Fallacies and Pitfalls
2.8. Concluding Remarks: Looking Ahead
2.9. Historical Perspectives and References
Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li
Case Study 1: Optimizing Cache Performance via Advanced Techniques
Concepts illustrated by this case study
Case Study 2: Putting It All Together: Highly Parallel Memory Systems
Concept illustrated by this case study
Case Study 3: Studying the Impact of Various Memory System Organizations
Concepts illustrated by this case study
Exercises
Chapter 3: Instruction-Level Parallelism and Its Exploitation
3.1. Instruction-Level Parallelism: Concepts and Challenges
What Is Instruction-Level Parallelism?
Data Dependences and Hazards
Data Dependences
Name Dependences
Data Hazards
Control Dependences
3.2. Basic Compiler Techniques for Exposing ILP
Basic Pipeline Scheduling and Loop Unrolling
Summary of the Loop Unrolling and Scheduling
3.3. Reducing Branch Costs With Advanced Branch Prediction
Correlating Branch Predictors
Tournament Predictors: Adaptively Combining Local and Global Predictors
Tagged Hybrid Predictors
The Evolution of the Intel Core i7 Branch Predictor
3.4. Overcoming Data Hazards With Dynamic Scheduling
Dynamic Scheduling: The Idea
Dynamic Scheduling Using Tomasulo's Approach
3.5. Dynamic Scheduling: Examples and the Algorithm
Tomasulo's Algorithm: The Details
Tomasulo's Algorithm: A Loop-Based Example
3.6. Hardware-Based Speculation
3.7. Exploiting ILP Using Multiple Issue and Static Scheduling
The Basic VLIW Approach
3.8. Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
3.9. Advanced Techniques for Instruction Delivery and Speculation
Increasing Instruction Fetch Bandwidth
Branch-Target Buffers
Specialized Branch Predictors: Predicting Procedure Returns, Indirect Jumps, and Loop Branches
Integrated Instruction Fetch Units
Speculation: Implementation Issues and Extensions
Speculation Support: Register Renaming Versus Reorder Buffers
The Challenge of More Issues per Clock
How Much to Speculate
Speculating Through Multiple Branches
Speculation and the Challenge of Energy Efficiency
Address Aliasing Prediction
3.10. Cross-Cutting Issues
Hardware Versus Software Speculation
Speculative Execution and the Memory System
3.11. Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
Effectiveness of Simultaneous Multithreading on Superscalar Processors
3.12. Putting It All Together: The Intel Core i7 6700 and ARM Cortex-A53
The ARM Cortex-A53
Performance of the A53 Pipeline
The Intel Core i7
Performance of the i7
3.13. Fallacies and Pitfalls
3.14. Concluding Remarks: What's Ahead?
3.15. Historical Perspective and References
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell
Case Study: Exploring the Impact of Microarchitectural Techniques
Concepts illustrated by this case study
Exercises
Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.1. Introduction
4.2. Vector Architecture
RV64V Extension
How Vector Processors Work: An Example
Vector Execution Time
Multiple Lanes: Beyond One Element per Clock Cycle
Vector-Length Registers: Handling Loops Not Equal to 32
Predicate Registers: Handling IF Statements in Vector Loops
Memory Banks: Supplying Bandwidth for Vector Load/Store Units
Stride: Handling Multidimensional Arrays in Vector Architectures
Gather-Scatter: Handling Sparse Matrices in Vector Architectures
Programming Vector Architectures
4.3. SIMD Instruction Set Extensions for Multimedia
Programming Multimedia SIMD Architectures
The Roofline Visual Performance Model
4.4. Graphics Processing Units
Programming the GPU
NVIDIA GPU Computational Structures
NVIDIA GPU Instruction Set Architecture
Conditional Branching in GPUs
NVIDIA GPU Memory Structures
Innovations in the Pascal GPU Architecture
Similarities and Differences Between Vector Architectures and GPUs
Similarities and Differences Between Multimedia SIMD Computers and GPUs
Summary
4.5. Detecting and Enhancing Loop-Level Parallelism
Finding Dependences
Eliminating Dependent Computations
4.6. Cross-Cutting Issues
Energy and DLP: Slow and Wide Versus Fast and Narrow
Banked Memory and Graphics Memory
Strided Accesses and TLB Misses
4.7. Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7
Comparison of a GPU and a MIMD With Multimedia SIMD
Comparison Update
4.8. Fallacies and Pitfalls
4.9. Concluding Remarks
4.10. Historical Perspective and References
Case Study and Exercises by Jason D. Bakos
Case Study: Implementing a Vector Kernel on a Vector Processor and GPU
Concepts illustrated by this case study
Exercises
Chapter 5: Thread-Level Parallelism
5.1. Introduction
Multiprocessor Architecture: Issues and Approach
Challenges of Parallel Processing
5.2. Centralized Shared-Memory Architectures
What Is Multiprocessor Cache Coherence?
Basic Schemes for Enforcing Coherence
Snooping Coherence Protocols
Basic Implementation Techniques
An Example Protocol
Extensions to the Basic Coherence Protocol
Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
Implementing Snooping Cache Coherence
5.3. Performance of Symmetric Shared-Memory Multiprocessors
A Commercial Workload
A Multiprogramming and OS Workload
Performance of the Multiprogramming and OS Workload
5.4. Distributed Shared-Memory and Directory-Based Coherence
Directory-Based Cache Coherence Protocols: The Basics
An Example Directory Protocol
5.5. Synchronization: The Basics
Basic Hardware Primitives
Implementing Locks Using Coherence
5.6. Models of Memory Consistency: An Introduction
The Programmer's View
Relaxed Consistency Models: The Basics and Release Consistency
5.7. Cross-Cutting Issues
Compiler Optimization and the Consistency Model
Using Speculation to Hide Latency in Strict Consistency Models
Inclusion and Its Implementation
Performance Gains From Multiprocessing and Multithreading
5.8. Putting It All Together: Multicore Processors and Their Performance
Performance of Multicore-Based Multiprocessors on a Multiprogrammed Workload
Scalability in an Xeon MP With Different Workloads
Performance and Energy Efficiency of the Intel i7 920 Multicore
Putting Multicore and SMT Together
5.9. Fallacies and Pitfalls
5.10. The Future of Multicore Scaling
5.11. Concluding Remarks
5.12. Historical Perspectives and References
Case Studies and Exercises by Amr Zaky and David A. Wood
Case Study 1: Single Chip Multicore Multiprocessor
Concepts illustrated by this case study
Case Study 2: Simple Directory-Based Coherence
Concepts illustrated by this case study
Read/Write Notation
Messages
Case Study 3: Memory Consistency
Concepts Illustrated by This Case Study
Exercises
Chapter 6: Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.1. Introduction
6.2. Programming Models and Workloads for Warehouse-Scale Computers
6.3. Computer Architecture of Warehouse-Scale Computers
Storage
WSC Memory Hierarchy
6.4. The Efficiency and Cost of Warehouse-Scale Computers
Measuring Efficiency of a WSC
Cost of a WSC
6.5. Cloud Computing: The Return of Utility Computing
Amazon Web Services
How Big Is the AWS Cloud?
6.6. Cross-Cutting Issues
Preventing the WSC Network From Being a Bottleneck
Using Energy Efficiently Inside the Server
6.7. Putting It All Together: A Google Warehouse-Scale Computer
Power Distribution in a Google WSC
Cooling in a Google WSC
Racks of a Google WSC
Networking in a Google WSC
Servers in a Google WSC
Conclusion
6.8. Fallacies and Pitfalls
6.9. Concluding Remarks
6.10. Historical Perspectives and References
Case Studies and Exercises by Parthasarathy Ranganathan
Case Study 1: Total Cost of Ownership Influencing Warehouse-Scale Computer Design Decisions
Concepts illustrated by this case study
Case Study 2: Resource Allocation in WSCs and TCO
Concepts illustrated by this case study
Exercises
Chapter 7: Domain-Specific Architectures
7.1. Introduction
7.2. Guidelines for DSAs
7.3. Example Domain: Deep Neural Networks
The Neurons of DNNs
Training Versus Inference
Multilayer Perceptron
Convolutional Neural Network
Recurrent Neural Network
Batches
Quantization
Summary of DNNs
7.4. Google's Tensor Processing Unit, an Inference Data Center Accelerator
TPU Origin
TPU Architecture
TPU Instruction Set Architecture
TPU Microarchitecture
TPU Implementation
TPU Software
Improving the TPU
Summary: How TPU Follows the Guidelines
7.5. Microsoft Catapult, a Flexible Data Center Accelerator
Catapult Implementation and Architecture
Catapult Software
CNNs on Catapult
Search Acceleration on Catapult
Catapult Version 1 Deployment
Catapult Version 2
Summary: How Catapult Follows the Guidelines
7.6. Intel Crest, a Data Center Accelerator for Training
7.7. Pixel Visual Core, a Personal Mobile Device Image Processing Unit
ISPs, the Hardwired Predecessors of IPUs
Pixel Visual Core Software
Pixel Visual Core Architecture Philosophy
The Pixel Visual Core Halo
A Processor of the Pixel Visual Core
Pixel Visual Core Instruction Set Architecture
Pixel Visual Core Example
Pixel Visual Core Processing Element
Two-Dimensional Line Buffers and Their Controller
Pixel Visual Core Implementation
Summary: How Pixel Visual Core Follows the Guidelines
7.8. Cross-Cutting Issues
Heterogeneity and System on a Chip (SOC)
An Open Instruction Set
7.9. Putting It All Together: CPUs Versus GPUs Versus DNN Accelerators
Performance: Rooflines, Response Time, and Throughput
Cost-Performance, TCO, and Performance/Watt
Evaluating Catapult and Pixel Visual Core
7.10. Fallacies and Pitfalls
7.11. Concluding Remarks
An Architecture Renaissance
7.12. Historical Perspectives and References
Case Studies and Exercises by Cliff Young
Case Study: Google's Tensor Processing Unit and Acceleration of Deep Neural Networks
Concepts illustrated by this case study
Exercises
Appendix A: Instruction Set Principles
A.1. Introduction
A.2. Classifying Instruction Set Architectures
Summary: Classifying Instruction Set Architectures
A.3. Memory Addressing
Interpreting Memory Addresses
Addressing Modes
Displacement Addressing Mode
Immediate or Literal Addressing Mode
Summary: Memory Addressing
A.4. Type and Size of Operands
A.5. Operations in the Instruction Set
A.6. Instructions for Control Flow
Addressing Modes for Control Flow Instructions
Conditional Branch Options
Procedure Invocation Options
Summary: Instructions for Control Flow
A.7. Encoding an Instruction Set
Reduced Code Size in RISCs
Summary: Encoding an Instruction Set
A.8. Cross-Cutting Issues: The Role of Compilers
The Structure of Recent Compilers
Register Allocation
Impact of Optimizations on Performance
The Impact of Compiler Technology on the Architect's Decisions
How the Architect Can Help the Compiler Writer
Compiler Support (or Lack Thereof) for Multimedia Instructions
Summary: The Role of Compilers
A.9. Putting It All Together: The RISC-V Architecture
RISC-V Instruction Set Organization
Registers for RISC-V
Data Types for RISC-V
Addressing Modes for RISC-V Data Transfers
RISC-V Instruction Format
RISC-V Operations
RISC-V Control Flow Instructions
RISC-V Floating-Point Operations
RISC-V Instruction Set Usage
A.10. Fallacies and Pitfalls
A.11. Concluding Remarks
A.12. Historical Perspective and References
Exercises by Gregory D. Peterson
Appendix B: Review of Memory Hierarchy
B.1. Introduction
Cache Performance Review
Four Memory Hierarchy Questions
Q1: Where Can a Block be Placed in a Cache?
Q2: How Is a Block Found If It Is in the Cache?
Q3: Which Block Should be Replaced on a Cache Miss?
Q4: What Happens on a Write?
An Example: The Opteron Data Cache
B.2. Cache Performance
Average Memory Access Time and Processor Performance
Miss Penalty and Out-of-Order Execution Processors
B.3. Six Basic Cache Optimizations
First Optimization: Larger Block Size to Reduce Miss Rate
Second Optimization: Larger Caches to Reduce Miss Rate
Third Optimization: Higher Associativity to Reduce Miss Rate
Fourth Optimization: Multilevel Caches to Reduce Miss Penalty
Fifth Optimization: Giving Priority to Read Misses over Writes to Reduce Miss Penalty
Sixth Optimization: Avoiding Address Translation During Indexing of the Cache to Reduce Hit Time
Summary of Basic Cache Optimization
B.4. Virtual Memory
Four Memory Hierarchy Questions Revisited
Q1: Where Can a Block be Placed in Main Memory?
Q2: How Is a Block Found If It Is in Main Memory?
Q3: Which Block Should be Replaced on a Virtual Memory Miss?
Q4: What Happens on a Write?
Techniques for Fast Address Translation
Selecting a Page Size
Summary of Virtual Memory and Caches
B.5. Protection and Examples of Virtual Memory
Protecting Processes
A Segmented Virtual Memory Example: Protection in the Intel Pentium
Adding Bounds Checking and Memory Mapping
Adding Sharing and Protection
Adding Safe Calls from User to OS Gates and Inheriting Protection Level for Parameters
A Paged Virtual Memory Example: The 64-Bit Opteron Memory Management
Summary: Protection on the 32-Bit Intel Pentium Versus the 64-Bit AMD Opteron
B.6. Fallacies and Pitfalls
B.7. Concluding Remarks
B.8. Historical Perspective and References
Exercises by Amr Zaky
Appendix C: Pipelining: Basic and Intermediate Concepts
C.1. Introduction
What Is Pipelining?
The Basics of the RISC V Instruction Set
A Simple Implementation of a RISC Instruction Set
The Classic Five-Stage Pipeline for a RISC Processor
Basic Performance Issues in Pipelining
C.2. The Major Hurdle of Pipelining-Pipeline Hazards
Performance of Pipelines With Stalls
Data Hazards
Minimizing Data Hazard Stalls by Forwarding
Data Hazards Requiring Stalls
Branch Hazards
Reducing Pipeline Branch Penalties
Performance of Branch Schemes
Reducing the Cost of Branches Through Prediction
Static Branch Prediction
Dynamic Branch Prediction and Branch-Prediction Buffers
C.3. How Is Pipelining Implemented?
A Simple Implementation of RISC V
A Basic Pipeline for RISC V
Implementing the Control for the RISC V Pipeline
Dealing With Branches in the Pipeline
C.4. What Makes Pipelining Hard to Implement?
Dealing With Exceptions
Types of Exceptions and Requirements
Stopping and Restarting Execution
Exceptions in RISC V
Instruction Set Complications
C.5. Extending the RISC V Integer Pipeline to Handle Multicycle Operations
Hazards and Forwarding in Longer Latency Pipelines
Maintaining Precise Exceptions
Performance of a Simple RISC V FP Pipeline
C.6. Putting It All Together: The MIPS R4000 Pipeline
The Floating-Point Pipeline
Performance of the R4000 Pipeline
C.7. Cross-Cutting Issues
RISC Instruction Sets and Efficiency of Pipelining
Dynamically Scheduled Pipelines
Dynamic Scheduling With a Scoreboard
C.8. Fallacies and Pitfalls
C.9. Concluding Remarks
C.10. Historical Perspective and References
Updated Exercises by Diana Franklin
Appendix D: Storage Systems
D.1. Introduction
D.2. Advanced Topics in Disk Storage
Disk Power
Advanced Topics in Disk Arrays
RAID 10 versus 01 (or 1+0 versus RAID 0+1)
RAID 6: Beyond a Single Disk Failure
D.3. Definition and Examples of Real Faults and Failures
Berkeley's Tertiary Disk
Tandem
Other Studies of the Role of Operators in Dependability
D.4. I/O Performance, Reliability Measures, and Benchmarks
Throughput versus Response Time
Transaction-Processing Benchmarks
SPEC System-Level File Server, Mail, and Web Benchmarks
Examples of Benchmarks of Dependability
D.5. A Little Queuing Theory
Poisson Distribution of Random Variables
D.6. Crosscutting Issues
Point-to-Point Links and Switches Replacing Buses
Block Servers versus Filers
Asynchronous I/O and Operating Systems
D.7. Designing and Evaluating an I/O System-The Internet Archive Cluster
The Internet Archive Cluster
Estimating Performance, Dependability, and Cost of the Internet Archive Cluster
Calculating MTTF of the TB-80 Cluster
D.8. Putting It All Together: NetApp FAS6000 Filer
D.9. Fallacies and Pitfalls
D.10. Concluding Remarks
D.11. Historical Perspective and References
Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau
Case Study 1: Deconstructing a Disk
Concepts illustrated by this case study
Case Study 2: Deconstructing a Disk Array
Concepts illustrated by this case study
Case Study 3: RAID Reconstruction
Concepts illustrated by this case study
Case Study 4: Performance Prediction for RAIDs
Concepts illustrated by this case study
Case Study 5: I/O Subsystem Design
Concepts illustrated by this case study
Case Study 6: Dirty Rotten Bits
Concepts illustrated by this case study
Case Study 7: Sorting Things Out
Concepts illustrated by this case study
Appendix E: Embedded Systems
E.1. Introduction
Real-Time Processing
E.2. Signal Processing and Embedded Applications: The Digital Signal Processor
The TI 320C55
The TI 320C6x
Media Extensions
E.3. Embedded Benchmarks
Power Consumption and Efficiency as the Metric
E.4. Embedded Multiprocessors
E.5. Case Study: The Emotion Engine of the Sony PlayStation 2
E.6. Case Study: Sanyo VPC-SX500 Digital Camera
E.7. Case Study: Inside a Cell Phone
Background on Wireless Networks
The Cell Phone
Cell Phone Standards and Evolution
E.8. Concluding Remarks
Appendix F: Interconnection Networks
F.1. Introduction
Interconnection Network Domains
Approach and Organization of This Appendix
F.2. Interconnecting Two Devices
Network Interface Functions: Composing and Processing Messages
Basic Network Structure and Functions: Media and Form Factor, Packet Transport, Flow Control, and Error Handling
Characterizing Performance: Latency and Effective Bandwidth
F.3. Connecting More than Two Devices
Additional Network Structure and Functions: Topology, Routing, Arbitration, and Switching
Shared-Media Networks
Switched-Media Networks
Comparison of Shared- and Switched-Media Networks
Characterizing Performance: Latency and Effective Bandwidth
F.4. Network Topology
Centralized Switched Networks
Distributed Switched Networks
Effects of Topology on Network Performance
F.5. Network Routing, Arbitration, and Switching
Routing
Arbitration
Switching
Impact on Network Performance
F.6. Switch Microarchitecture
Basic Switch Microarchitecture
Buffer Organizations
Routing Algorithm Implementation
Pipelining the Switch Microarchitecture
Other Switch Microarchitecture Enhancements
F.7. Practical Issues for Commercial Interconnection Networks
Connectivity
Standardization: Cross-Company Interoperability
Congestion Management
Fault Tolerance
F.8. Examples of Interconnection Networks
On-Chip Network: Intel Single-Chip Cloud Computer
System Area Network: IBM Blue Gene/L 3D Torus Network
System/Storage Area Network: InfiniBand
Ethernet: The Local Area Network
Wide Area Network: ATM
F.9. Internetworking
F.10. Crosscutting Issues for Interconnection Networks
Density-Optimized Processors versus SPEC-Optimized Processors
Smart Switches versus Smart Interface Cards
Protection and User Access to the Network
Efficient Interface to the Memory Hierarchy versus the Network
Compute-Optimized Processors versus Receiver Overhead
F.11. Fallacies and Pitfalls
F.12. Concluding Remarks
Acknowledgments
F.13. Historical Perspective and References
Wide Area Networks
Local Area Networks
System Area Networks
Storage Area Networks
On-Chip Networks
References
Exercises
Appendix G: Vector Processors in More Depth
G.1. Introduction
G.2. Vector Performance in More Depth
Pipelined Instruction Start-Up and Multiple Lanes
G.3. Vector Memory Systems in More Depth
G.4. Enhancing Vector Performance
Chaining in More Depth
Sparse Matrices in More Depth
G.5. Effectiveness of Compiler Vectorization
G.6. Putting It All Together: Performance of Vector Processors
Measures of Vector Performance
The Peak Performance of VMIPS on DAXPY
Sustained Performance of VMIPS on the Linpack Benchmark
DAXPY Performance on an Enhanced VMIPS
G.7. A Modern Vector Supercomputer: The Cray X1
Multi-Streaming Processors
Cray X1E
G.8. Concluding Remarks
G.9. Historical Perspective and References
References
Exercises
Appendix H: Hardware and Software for VLIW and EPIC
H.1. Introduction: Exploiting Instruction-Level Parallelism Statically
H.2. Detecting and Enhancing Loop-Level Parallelism
Finding Dependences
Eliminating Dependent Computations
H.3. Scheduling and Structuring Code for Parallelism
Software Pipelining: Symbolic Loop Unrolling
Global Code Scheduling
Trace Scheduling: Focusing on the Critical Path
Superblocks
H.4. Hardware Support for Exposing Parallelism: Predicated Instructions
H.5. Hardware Support for Compiler Speculation
Hardware Support for Preserving Exception Behavior
Hardware Support for Memory Reference Speculation
H.6. The Intel IA-64 Architecture and Itanium Processor
The Intel IA-64 Instruction Set Architecture
The IA-64 Register Model
Instruction Format and Support for Explicit Parallelism
Instruction Set Basics
Predication and Speculation Support
The Itanium 2 Processor
Functional Units and Instruction Issue
Itanium 2 Performance
H.7. Concluding Remarks
Reference
Appendix I: Large-Scale Multiprocessors and Scientific Applications
I.1. Introduction
I.2. Interprocessor Communication: The Critical Performance Issue
Advantages of Different Communication Mechanisms
I.3. Characteristics of Scientific Applications
Characteristics of Scientific Applications
The FFT Kernel
The LU Kernel
The Barnes Application
The Ocean Application
Computation/Communication for the Parallel Programs
I.4. Synchronization: Scaling Up
Synchronization Performance Challenges
Barrier Synchronization
Synchronization Mechanisms for Larger-Scale Multiprocessors
Software Implementations
Hardware Primitives
I.5. Performance of Scientific Applications on Shared-Memory Multiprocessors
Performance of a Scientific Workload on a Symmetric Shared-Memory Multiprocessor
Performance of a Scientific Workload on a Distributed-Memory Multiprocessor
I.6. Performance Measurement of Parallel Processors with Scientific Applications
I.7. Implementing Cache Coherence
Implementing Cache Coherence in a DSM Multiprocessor
Avoiding Deadlock from Limited Buffering
Implementing the Directory Controller
I.8. The Custom Cluster Approach: Blue Gene/L
The Blue Gene/L Computing Node
I.9. Concluding Remarks
Appendix J: Computer Arithmetic
J.1. Introduction
J.2. Basic Techniques of Integer Arithmetic
Ripple-Carry Addition
Radix-2 Multiplication and Division
Signed Numbers
Systems Issues
J.3. Floating Point
Special Values and Denormals
Representation of Floating-Point Numbers
J.4. Floating-Point Multiplication
Denormals
Precision of Multiplication
J.5. Floating-Point Addition
Speeding Up Addition
Denormalized Numbers
J.6. Division and Remainder
Iterative Division
Floating-Point Remainder
J.7. More on Floating-Point Arithmetic
Fused Multiply-Add
Precisions
Exceptions
Underflow
J.8. Speeding Up Integer Addition
Carry-Lookahead
Carry-Skip Adders
Carry-Select Adder
J.9. Speeding Up Integer Multiplication and Division
Shifting over Zeros
SRT Division
Speeding Up Multiplication with a Single Adder
Faster Multiplication with Many Adders
Faster Division with One Adder
J.10. Putting It All Together
J.11. Fallacies and Pitfalls
J.12. Historical Perspective and References
References
Exercises
Appendix K: Survey of Instruction Set Architectures
K.1. Introduction
K.2. A Survey of RISC Architectures for Desktop, Server, and Embedded Computers
Introduction
Addressing Modes and Instruction Formats
Instructions
RV64G Core Instructions
Compare and Conditional Branch
RV64GC Core 16-bit Instructions
Instructions: Common Extensions beyond RV64G
Instructions Unique to MIPS64 R6
Instructions Unique to SPARC v.9
Register Windows
Fast Traps
Support for LISP and Smalltalk
Instructions Unique to ARM
Instructions Unique to Power3
Branch Registers: Link and Counter
Instructions: Multimedia Extensions of the Desktop/Server RISCs
Instructions: Digital Signal-Processing Extensions of the Embedded RISCs
Concluding Remarks
K.3. The Intel 80x86
Introduction
80x86 Registers and Data Addressing Modes
80x86 Integer Operations
80x86 Floating-Point Operations
80x86 Instruction Encoding
Putting It All Together: Measurements of Instruction Set Usage
Measurements of 80x86 Operand Addressing
Comparative Operation Measurements
Concluding Remarks
K.4. The VAX Architecture
Introduction
VAX Operands and Addressing Modes
Encoding VAX Instructions
VAX Operations
Number of Operations
Branches, Jumps, and Procedure Calls
An Example to Put It All Together: swap
Register Allocation for swap
Code for the Body of the Procedure swap
Preserving Registers across Procedure Invocation of swap
The Full Procedure swap
A Longer Example: sort
Register Allocation for sort
Code for the Body of the sort Procedure
The Outer Loop
The Inner Loop
The Procedure Call
Passing Parameters
Preserving Registers across Procedure Invocation of sort
The Full Procedure sort
Fallacies and Pitfalls
Concluding Remarks
Exercises
K.5. The IBM 360/370 Architecture for Mainframe Computers
Introduction
System/360 Instruction Set
Integer/Logical and Floating-Point R-R Instructions
Branches and Status Setting R-R Instructions
Branches/Logical and Floating-Point Instructions-RX Format
Branches and Special Loads and Stores-RX Format
RS and SI Format Instructions
SS Format Instructions
360 Detailed Measurements
K.6. Historical Perspective and References
Acknowledgments
Appendix L: Advanced Concepts on Address Translation
Appendix M: Historical Perspectives and References
M.1. Introduction
M.2. The Early Development of Computers (Chapter 1)
The First General-Purpose Electronic Computers
Important Special-Purpose Machines
Commercial Developments
Development of Quantitative Performance Measures: Successes and Failures
M.3. The Development of Memory Hierarchy and Protection (Chapter 2 and Appendix B)
M.4. The Evolution of Instruction Sets (Appendices A, J, and K)
Stack Architectures
Computer Architecture Defined
High-Level Language Computer Architecture
Reduced Instruction Set Computers
M.5. The Development of Pipelining and Instruction-Level Parallelism (Chapter 3 and Appendices C and H)
Early Pipelined CPUs
The Introduction of Dynamic Scheduling
The IBM 360 Model 91: A Landmark Computer
Branch-Prediction Schemes
The Development of Multiple-Issue Processors
Compiler Technology and Hardware Support for Scheduling
EPIC and the IA-64 Development
Studies of ILP and Ideas to Increase ILP
Going Beyond the Data Flow Limit
Recent Advanced Microprocessors
Multithreading and Simultaneous Multithreading
M.6. The Development of SIMD Supercomputers, Vector Computers, Multimedia SIMD Instruction Extensions, and Graphical Processor Units
SIMD Supercomputers
Vector Computers
Multimedia SIMD Instruction Extensions
Graphical Processor Units
Scalable GPUs
Graphics Pipelines
GPGPU: An Intermediate Step
GPU Computing
References
SIMD Supercomputers
Vector Architecture
Multimedia SIMD
GPU
M.7. The History of Multiprocessors and Parallel Processing (Chapter 5 and Appendices F, G, and I)
SIMD Computers: Attractive Idea, Many Attempts, No Lasting Successes
Other Early Experiments
Great Debates in Parallel Processing
More Recent Advances and Developments
The Development of Bus-Based Coherent Multiprocessors
Toward Large-Scale Multiprocessors
Clusters
Recent Trends in Large-Scale Multiprocessors
Developments in Synchronization and Consistency Models
Other References
M.8. The Development of Clusters (Chapter 6)
Clusters, the Forerunner of WSCs
Utility Computing, the Forerunner of Cloud Computing
Containers
M.9. Historical Perspectives and References
M.10. The History of Magnetic Storage, RAID, and I/O Buses (Appendix D)
Magnetic Storage
RAID
I/O Buses and Controllers
References
References
Index
Back End Sheet
Inside Back Cover
Back Cover
Computer Architecture Formulas

1. CPU time = Instruction count × Clock cycles per instruction × Clock cycle time
2. X is n times faster than Y: n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
3. Amdahl's Law: Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
4. Energy_dynamic ∝ 1/2 × Capacitive load × Voltage²
5. Power_dynamic ∝ 1/2 × Capacitive load × Voltage² × Frequency switched
6. Power_static ∝ Current_static × Voltage
7. Availability = Mean time to fail / (Mean time to fail + Mean time to repair)
8. Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N, where Wafer yield accounts for wafers that are so bad they need not be tested, and N is a parameter called the process-complexity factor, a measure of manufacturing difficulty. N ranges from 11.5 to 15.5 in 2011.
9. Means: arithmetic (AM), weighted arithmetic (WAM), and geometric (GM):
   AM = (1/n) × Σ_{i=1}^{n} Time_i,  WAM = Σ_{i=1}^{n} Weight_i × Time_i,  GM = (Π_{i=1}^{n} Time_i)^{1/n},
   where Time_i is the execution time for the ith program of a total of n in the workload and Weight_i is the weighting of the ith program in the workload.
10. Average memory access time = Hit time + Miss rate × Miss penalty
11. Misses per instruction = Miss rate × Memory accesses per instruction
12. Cache index size: 2^index = Cache size / (Block size × Set associativity)
13. Power Utilization Effectiveness (PUE) of a Warehouse-Scale Computer = Total Facility Power / IT Equipment Power

Rules of Thumb

1. Amdahl/Case Rule: A balanced computer system needs about 1 MB of main memory capacity and 1 megabit per second of I/O bandwidth per MIPS of CPU performance.
2. 90/10 Locality Rule: A program executes about 90% of its instructions in 10% of its code.
3. Bandwidth Rule: Bandwidth grows by at least the square of the improvement in latency.
4. 2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as that of a two-way set-associative cache of size N/2.
5. Dependability Rule: Design with no single point of failure.
6. Watt-Year Rule: The fully burdened cost of a Watt per year in a Warehouse-Scale Computer in North America in 2011, including the cost of amortizing the power and cooling infrastructure, is about $2.
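As a quick illustration of how a few of the formulas above are applied, the following sketch evaluates Amdahl's Law (formula 3), the CPU time equation (formula 1), and average memory access time (formula 10). All numeric inputs are hypothetical values chosen only for the example, not figures from the book.

```python
# Minimal sketch of three formulas from the table above.
# Every input value below is hypothetical and serves only to show the arithmetic.

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Formula 3: overall speedup when only part of the task is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def cpu_time(instruction_count: float, cpi: float, clock_cycle_time_s: float) -> float:
    """Formula 1: CPU time = instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

def avg_memory_access_time(hit_time_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Formula 10: AMAT = hit time + miss rate x miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

if __name__ == "__main__":
    # Hypothetical case: 40% of execution is enhanced by a factor of 10.
    print(f"Amdahl speedup: {amdahl_speedup(0.4, 10.0):.2f}x")          # ~1.56x
    # Hypothetical case: 1e9 instructions, CPI of 1.5, 0.5 ns cycle (2 GHz clock).
    print(f"CPU time: {cpu_time(1e9, 1.5, 0.5e-9):.3f} s")              # 0.750 s
    # Hypothetical case: 1 ns hit time, 2% miss rate, 100 ns miss penalty.
    print(f"AMAT: {avg_memory_access_time(1.0, 0.02, 100.0):.2f} ns")   # 3.00 ns
```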
In Praise of Computer Architecture: A Quantitative Approach, Sixth Edition

“Although important concepts of architecture are timeless, this edition has been thoroughly updated with the latest technology developments, costs, examples, and references. Keeping pace with recent developments in open-sourced architecture, the instruction set architecture used in the book has been updated to use the RISC-V ISA.”
—from the foreword by Norman P. Jouppi, Google

“Computer Architecture: A Quantitative Approach is a classic that, like fine wine, just keeps getting better. I bought my first copy as I finished up my undergraduate degree and it remains one of my most frequently referenced texts today.”
—James Hamilton, Amazon Web Services

“Hennessy and Patterson wrote the first edition of this book when graduate students built computers with 50,000 transistors. Today, warehouse-size computers contain that many servers, each consisting of dozens of independent processors and billions of transistors. The evolution of computer architecture has been rapid and relentless, but Computer Architecture: A Quantitative Approach has kept pace, with each edition accurately explaining and analyzing the important emerging ideas that make this field so exciting.”
—James Larus, Microsoft Research

“Another timely and relevant update to a classic, once again also serving as a window into the relentless and exciting evolution of computer architecture! The new discussions in this edition on the slowing of Moore's law and implications for future systems are must-reads for both computer architects and practitioners working on broader systems.”
—Parthasarathy (Partha) Ranganathan, Google

“I love the ‘Quantitative Approach’ books because they are written by engineers, for engineers. John Hennessy and Dave Patterson show the limits imposed by mathematics and the possibilities enabled by materials science. Then they teach through real-world examples how architects analyze, measure, and compromise to build working systems. This sixth edition comes at a critical time: Moore’s Law is fading just as deep learning demands unprecedented compute cycles. The new chapter on domain-specific architectures documents a number of promising approaches and prophesies a rebirth in computer architecture. Like the scholars of the European Renaissance, computer architects must understand our own history, and then combine the lessons of that history with new techniques to remake the world.”
—Cliff Young, Google
John L. Hennessy is a Professor of Electrical Engineering and Computer Science at Stanford University, where he has been a member of the faculty since 1977 and was, from 2000 to 2016, its 10th President. He currently serves as the Director of the Knight-Hennessy Fellowship, which provides graduate fellowships to potential future leaders. Hennessy is a Fellow of the IEEE and ACM, a member of the National Academy of Engineering, the National Academy of Science, and the American Philosophical Society, and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received 10 honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors. As of 2017, over 5 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups, both as an early-stage advisor and an investor.

David A. Patterson became a Distinguished Engineer at Google in 2016 after 40 years as a UC Berkeley professor. He joined UC Berkeley immediately after graduating from UCLA. He still spends a day a week in Berkeley as an Emeritus Professor of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the President of the United States, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM, CRA, and SIGARCH. He is currently Vice-Chair of the Board of Directors of the RISC-V Foundation.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. His current interests are in designing domain-specific architectures for machine learning, spreading the word on the open RISC-V instruction set architecture, and in helping the UC Berkeley RISELab (Real-time Intelligent Secure Execution).
Computer Architecture: A Quantitative Approach, Sixth Edition

John L. Hennessy, Stanford University
David A. Patterson, University of California, Berkeley

With Contributions by
Krste Asanovic, University of California, Berkeley
Jason D. Bakos, University of South Carolina
Robert P. Colwell, R&E Colwell & Assoc. Inc.
Abhishek Bhattacharjee, Rutgers University
Thomas M. Conte, Georgia Tech
Jose Duato, Proemisa
Diana Franklin, University of Chicago
David Goldberg, eBay
Norman P. Jouppi, Google
Sheng Li, Intel Labs
Naveen Muralimanohar, HP Labs
Gregory D. Peterson, University of Tennessee
Timothy M. Pinkston, University of Southern California
Parthasarathy Ranganathan, Google
David A. Wood, University of Wisconsin–Madison
Cliff Young, Google
Amr Zaky, University of Santa Clara
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

© 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-811905-1

For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher
Acquisition Editor: Stephen Merken
Developmental Editor: Nate McFadden
Production Project Manager: Stalin Viswanathan
Cover Designer: Christian J. Bilbow
Typeset by SPi Global, India