Computer Architecture A Quantitative Approach 6th Edition.pdf
Front Cover
Inside Front Cover
In Praise of Computer Architecture: A Quantitative Approach, Sixth Edition
Computer Architecture: A Quantitative Approach
Copyright
Foreword
Contents
Preface
Why We Wrote This Book
This Edition
Topic Selection and Organization
An Overview of the Content
Navigating the Text
Chapter Structure
Case Studies With Exercises
Supplemental Materials
Helping Improve This Book
Concluding Remarks
Acknowledgments
Contributors to the Sixth Edition
Reviewers
Appendices
Case Studies With Exercises
Additional Material
Contributors to Previous Editions
Reviewers
Appendices
Exercises
Case Studies With Exercises
Special Thanks
Chapter 1: Fundamentals of Quantitative Design and Analysis
1.1. Introduction
1.2. Classes of Computers
Internet of Things/Embedded Computers
Personal Mobile Device
Desktop Computing
Servers
Clusters/Warehouse-Scale Computers
Classes of Parallelism and Parallel Architectures
1.3. Defining Computer Architecture
Instruction Set Architecture: The Myopic View of Computer Architecture
Genuine Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements
1.4. Trends in Technology
Performance Trends: Bandwidth Over Latency
Scaling of Transistor Performance and Wires
1.5. Trends in Power and Energy in Integrated Circuits
Power and Energy: A Systems Perspective
Energy and Power Within a Microprocessor
The Shift in Computer Architecture Because of Limits of Energy
1.6. Trends in Cost
The Impact of Time, Volume, and Commoditization
Cost of an Integrated Circuit
Cost Versus Price
Cost of Manufacturing Versus Cost of Operation
1.7. Dependability
1.8. Measuring, Reporting, and Summarizing Performance
Benchmarks
Desktop Benchmarks
Server Benchmarks
Reporting Performance Results
Summarizing Performance Results
1.9. Quantitative Principles of Computer Design
Take Advantage of Parallelism
Principle of Locality
Focus on the Common Case
Amdahl's Law
The Processor Performance Equation
1.10. Putting It All Together: Performance, Price, and Power
1.11. Fallacies and Pitfalls
1.12. Concluding Remarks
1.13. Historical Perspectives and References
Case Studies and Exercises by Diana Franklin
Case Study 1: Chip Fabrication Cost
Concepts illustrated by this case study
Case Study 2: Power Consumption in Computer Systems
Concepts illustrated by this case study
Exercises
Chapter 2: Memory Hierarchy Design
2.1. Introduction
Basics of Memory Hierarchies: A Quick Review
2.2. Memory Technology and Optimizations
SRAM Technology
DRAM Technology
Improving Memory Performance Inside a DRAM Chip: SDRAMs
Reducing Power Consumption in SDRAMs
Graphics Data RAMs
Packaging Innovation: Stacked or Embedded DRAMs
Flash Memory
Phase-Change Memory Technology
Enhancing Dependability in Memory Systems
2.3. Ten Advanced Optimizations of Cache Performance
First Optimization: Small and Simple First-Level Caches to Reduce Hit Time and Power
Second Optimization: Way Prediction to Reduce Hit Time
Third Optimization: Pipelined Access and Multibanked Caches to Increase Bandwidth
Fourth Optimization: Nonblocking Caches to Increase Cache Bandwidth
Implementing a Nonblocking Cache
Fifth Optimization: Critical Word First and Early Restart to Reduce Miss Penalty
Sixth Optimization: Merging Write Buffer to Reduce Miss Penalty
Seventh Optimization: Compiler Optimizations to Reduce Miss Rate
Loop Interchange
Blocking
Eighth Optimization: Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate
Ninth Optimization: Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
Tenth Optimization: Using HBM to Extend the Memory Hierarchy
Cache Optimization Summary
2.4. Virtual Memory and Virtual Machines
Protection via Virtual Memory
Protection via Virtual Machines
Requirements of a Virtual Machine Monitor
Instruction Set Architecture Support for Virtual Machines
Impact of Virtual Machines on Virtual Memory and I/O
Extending the Instruction Set for Efficient Virtualization and Better Security
An Example VMM: The Xen Virtual Machine
2.5. Cross-Cutting Issues: The Design of Memory Hierarchies
Protection, Virtualization, and Instruction Set Architecture
Autonomous Instruction Fetch Units
Speculation and Memory Access
Special Instruction Caches
Coherency of Cached Data
2.6. Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700
The ARM Cortex-A53
Performance of the Cortex-A53 Memory Hierarchy
The Intel Core i7 6700
Performance of the i7 memory system
2.7. Fallacies and Pitfalls
2.8. Concluding Remarks: Looking Ahead
2.9. Historical Perspectives and References
Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li
Case Study 1: Optimizing Cache Performance via Advanced Techniques
Concepts illustrated by this case study
Case Study 2: Putting It All Together: Highly Parallel Memory Systems
Concept illustrated by this case study
Case Study 3: Studying the Impact of Various Memory System Organizations
Concepts illustrated by this case study
Exercises
Chapter 3: Instruction-Level Parallelism and Its Exploitation
3.1. Instruction-Level Parallelism: Concepts and Challenges
What Is Instruction-Level Parallelism?
Data Dependences and Hazards
Data Dependences
Name Dependences
Data Hazards
Control Dependences
3.2. Basic Compiler Techniques for Exposing ILP
Basic Pipeline Scheduling and Loop Unrolling
Summary of the Loop Unrolling and Scheduling
3.3. Reducing Branch Costs With Advanced Branch Prediction
Correlating Branch Predictors
Tournament Predictors: Adaptively Combining Local and Global Predictors
Tagged Hybrid Predictors
The Evolution of the Intel Core i7 Branch Predictor
3.4. Overcoming Data Hazards With Dynamic Scheduling
Dynamic Scheduling: The Idea
Dynamic Scheduling Using Tomasulo's Approach
3.5. Dynamic Scheduling: Examples and the Algorithm
Tomasulo's Algorithm: The Details
Tomasulo's Algorithm: A Loop-Based Example
3.6. Hardware-Based Speculation
3.7. Exploiting ILP Using Multiple Issue and Static Scheduling
The Basic VLIW Approach
3.8. Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
3.9. Advanced Techniques for Instruction Delivery and Speculation
Increasing Instruction Fetch Bandwidth
Branch-Target Buffers
Specialized Branch Predictors: Predicting Procedure Returns, Indirect Jumps, and Loop Branches
Integrated Instruction Fetch Units
Speculation: Implementation Issues and Extensions
Speculation Support: Register Renaming Versus Reorder Buffers
The Challenge of More Issues per Clock
How Much to Speculate
Speculating Through Multiple Branches
Speculation and the Challenge of Energy Efficiency
Address Aliasing Prediction
3.10. Cross-Cutting Issues
Hardware Versus Software Speculation
Speculative Execution and the Memory System
3.11. Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
Effectiveness of Simultaneous Multithreading on Superscalar Processors
3.12. Putting It All Together: The Intel Core i7 6700 and ARM Cortex-A53
The ARM Cortex-A53
Performance of the A53 Pipeline
The Intel Core i7
Performance of the i7
3.13. Fallacies and Pitfalls
3.14. Concluding Remarks: What's Ahead?
3.15. Historical Perspective and References
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell
Case Study: Exploring the Impact of Microarchitectural Techniques
Concepts illustrated by this case study
Exercises
Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.1. Introduction
4.2. Vector Architecture
RV64V Extension
How Vector Processors Work: An Example
Vector Execution Time
Multiple Lanes: Beyond One Element per Clock Cycle
Vector-Length Registers: Handling Loops Not Equal to 32
Predicate Registers: Handling IF Statements in Vector Loops
Memory Banks: Supplying Bandwidth for Vector Load/Store Units
Stride: Handling Multidimensional Arrays in Vector Architectures
Gather-Scatter: Handling Sparse Matrices in Vector Architectures
Programming Vector Architectures
4.3. SIMD Instruction Set Extensions for Multimedia
Programming Multimedia SIMD Architectures
The Roofline Visual Performance Model
4.4. Graphics Processing Units
Programming the GPU
NVIDIA GPU Computational Structures
NVIDIA GPU Instruction Set Architecture
Conditional Branching in GPUs
NVIDIA GPU Memory Structures
Innovations in the Pascal GPU Architecture
Similarities and Differences Between Vector Architectures and GPUs
Similarities and Differences Between Multimedia SIMD Computers and GPUs
Summary
4.5. Detecting and Enhancing Loop-Level Parallelism
Finding Dependences
Eliminating Dependent Computations
4.6. Cross-Cutting Issues
Energy and DLP: Slow and Wide Versus Fast and Narrow
Banked Memory and Graphics Memory
Strided Accesses and TLB Misses
4.7. Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7
Comparison of a GPU and a MIMD With Multimedia SIMD
Comparison Update
4.8. Fallacies and Pitfalls
4.9. Concluding Remarks
4.10. Historical Perspective and References
Case Study and Exercises by Jason D. Bakos
Case Study: Implementing a Vector Kernel on a Vector Processor and GPU
Concepts illustrated by this case study
Exercises
Chapter 5: Thread-Level Parallelism
5.1. Introduction
Multiprocessor Architecture: Issues and Approach
Challenges of Parallel Processing
5.2. Centralized Shared-Memory Architectures
What Is Multiprocessor Cache Coherence?
Basic Schemes for Enforcing Coherence
Snooping Coherence Protocols
Basic Implementation Techniques
An Example Protocol
Extensions to the Basic Coherence Protocol
Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
Implementing Snooping Cache Coherence
5.3. Performance of Symmetric Shared-Memory Multiprocessors
A Commercial Workload
A Multiprogramming and OS Workload
Performance of the Multiprogramming and OS Workload
5.4. Distributed Shared-Memory and Directory-Based Coherence
Directory-Based Cache Coherence Protocols: The Basics
An Example Directory Protocol
5.5. Synchronization: The Basics
Basic Hardware Primitives
Implementing Locks Using Coherence
5.6. Models of Memory Consistency: An Introduction
The Programmer's View
Relaxed Consistency Models: The Basics and Release Consistency
5.7. Cross-Cutting Issues
Compiler Optimization and the Consistency Model
Using Speculation to Hide Latency in Strict Consistency Models
Inclusion and Its Implementation
Performance Gains From Multiprocessing and Multithreading
5.8. Putting It All Together: Multicore Processors and Their Performance
Performance of Multicore-Based Multiprocessors on a Multiprogrammed Workload
Scalability in an Xeon MP With Different Workloads
Performance and Energy Efficiency of the Intel i7 920 Multicore
Putting Multicore and SMT Together
5.9. Fallacies and Pitfalls
5.10. The Future of Multicore Scaling
5.11. Concluding Remarks
5.12. Historical Perspectives and References
Case Studies and Exercises by Amr Zaky and David A. Wood
Case Study 1: Single Chip Multicore Multiprocessor
Concepts illustrated by this case study
Case Study 2: Simple Directory-Based Coherence
Concepts illustrated by this case study
Read/Write Notation
Messages
Case Study 3: Memory Consistency
Concepts Illustrated by This Case Study
Exercises
Chapter 6: Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.1. Introduction
6.2. Programming Models and Workloads for Warehouse-Scale Computers
6.3. Computer Architecture of Warehouse-Scale Computers
Storage
WSC Memory Hierarchy
6.4. The Efficiency and Cost of Warehouse-Scale Computers
Measuring Efficiency of a WSC
Cost of a WSC
6.5. Cloud Computing: The Return of Utility Computing
Amazon Web Services
How Big Is the AWS Cloud?
6.6. Cross-Cutting Issues
Preventing the WSC Network From Being a Bottleneck
Using Energy Efficiently Inside the Server
6.7. Putting It All Together: A Google Warehouse-Scale Computer
Power Distribution in a Google WSC
Cooling in a Google WSC
Racks of a Google WSC
Networking in a Google WSC
Servers in a Google WSC
Conclusion
6.8. Fallacies and Pitfalls
6.9. Concluding Remarks
6.10. Historical Perspectives and References
Case Studies and Exercises by Parthasarathy Ranganathan
Case Study 1: Total Cost of Ownership Influencing Warehouse-Scale Computer Design Decisions
Concepts illustrated by this case study
Case Study 2: Resource Allocation in WSCs and TCO
Concepts illustrated by this case study
Exercises
Chapter 7: Domain-Specific Architectures
7.1. Introduction
7.2. Guidelines for DSAs
7.3. Example Domain: Deep Neural Networks
The Neurons of DNNs
Training Versus Inference
Multilayer Perceptron
Convolutional Neural Network
Recurrent Neural Network
Batches
Quantization
Summary of DNNs
7.4. Google's Tensor Processing Unit, an Inference Data Center Accelerator
TPU Origin
TPU Architecture
TPU Instruction Set Architecture
TPU Microarchitecture
TPU Implementation
TPU Software
Improving the TPU
Summary: How TPU Follows the Guidelines
7.5. Microsoft Catapult, a Flexible Data Center Accelerator
Catapult Implementation and Architecture
Catapult Software
CNNs on Catapult
Search Acceleration on Catapult
Catapult Version 1 Deployment
Catapult Version 2
Summary: How Catapult Follows the Guidelines
7.6. Intel Crest, a Data Center Accelerator for Training
7.7. Pixel Visual Core, a Personal Mobile Device Image Processing Unit
ISPs, the Hardwired Predecessors of IPUs
Pixel Visual Core Software
Pixel Visual Core Architecture Philosophy
The Pixel Visual Core Halo
A Processor of the Pixel Visual Core
Pixel Visual Core Instruction Set Architecture
Pixel Visual Core Example
Pixel Visual Core Processing Element
Two-Dimensional Line Buffers and Their Controller
Pixel Visual Core Implementation
Summary: How Pixel Visual Core Follows the Guidelines
7.8. Cross-Cutting Issues
Heterogeneity and System on a Chip (SOC)
An Open Instruction Set
7.9. Putting It All Together: CPUs Versus GPUs Versus DNN Accelerators
Performance: Rooflines, Response Time, and Throughput
Cost-Performance, TCO, and Performance/Watt
Evaluating Catapult and Pixel Visual Core
7.10. Fallacies and Pitfalls
7.11. Concluding Remarks
An Architecture Renaissance
7.12. Historical Perspectives and References
Case Studies and Exercises by Cliff Young
Case Study: Google's Tensor Processing Unit and Acceleration of Deep Neural Networks
Concepts illustrated by this case study
Exercises
Appendix A: Instruction Set Principles
A.1. Introduction
A.2. Classifying Instruction Set Architectures
Summary: Classifying Instruction Set Architectures
A.3. Memory Addressing
Interpreting Memory Addresses
Addressing Modes
Displacement Addressing Mode
Immediate or Literal Addressing Mode
Summary: Memory Addressing
A.4. Type and Size of Operands
A.5. Operations in the Instruction Set
A.6. Instructions for Control Flow
Addressing Modes for Control Flow Instructions
Conditional Branch Options
Procedure Invocation Options
Summary: Instructions for Control Flow
A.7. Encoding an Instruction Set
Reduced Code Size in RISCs
Summary: Encoding an Instruction Set
A.8. Cross-Cutting Issues: The Role of Compilers
The Structure of Recent Compilers
Register Allocation
Impact of Optimizations on Performance
The Impact of Compiler Technology on the Architect's Decisions
How the Architect Can Help the Compiler Writer
Compiler Support (or Lack Thereof) for Multimedia Instructions
Summary: The Role of Compilers
A.9. Putting It All Together: The RISC-V Architecture
RISC-V Instruction Set Organization
Registers for RISC-V
Data Types for RISC-V
Addressing Modes for RISC-V Data Transfers
RISC-V Instruction Format
RISC-V Operations
RISC-V Control Flow Instructions
RISC-V Floating-Point Operations
RISC-V Instruction Set Usage
A.10. Fallacies and Pitfalls
A.11. Concluding Remarks
A.12. Historical Perspective and References
Exercises by Gregory D. Peterson
Appendix B: Review of Memory Hierarchy
B.1. Introduction
Cache Performance Review
Four Memory Hierarchy Questions
Q1: Where Can a Block be Placed in a Cache?
Q2: How Is a Block Found If It Is in the Cache?
Q3: Which Block Should be Replaced on a Cache Miss?
Q4: What Happens on a Write?
An Example: The Opteron Data Cache
B.2. Cache Performance
Average Memory Access Time and Processor Performance
Miss Penalty and Out-of-Order Execution Processors
B.3. Six Basic Cache Optimizations
First Optimization: Larger Block Size to Reduce Miss Rate
Second Optimization: Larger Caches to Reduce Miss Rate
Third Optimization: Higher Associativity to Reduce Miss Rate
Fourth Optimization: Multilevel Caches to Reduce Miss Penalty
Fifth Optimization: Giving Priority to Read Misses over Writes to Reduce Miss Penalty
Sixth Optimization: Avoiding Address Translation During Indexing of the Cache to Reduce Hit Time
Summary of Basic Cache Optimization
B.4. Virtual Memory
Four Memory Hierarchy Questions Revisited
Q1: Where Can a Block be Placed in Main Memory?
Q2: How Is a Block Found If It Is in Main Memory?
Q3: Which Block Should be Replaced on a Virtual Memory Miss?
Q4: What Happens on a Write?
Techniques for Fast Address Translation
Selecting a Page Size
Summary of Virtual Memory and Caches
B.5. Protection and Examples of Virtual Memory
Protecting Processes
A Segmented Virtual Memory Example: Protection in the Intel Pentium
Adding Bounds Checking and Memory Mapping
Adding Sharing and Protection
Adding Safe Calls from User to OS Gates and Inheriting Protection Level for Parameters
A Paged Virtual Memory Example: The 64-Bit Opteron Memory Management
Summary: Protection on the 32-Bit Intel Pentium Versus the 64-Bit AMD Opteron
B.6. Fallacies and Pitfalls
B.7. Concluding Remarks
B.8. Historical Perspective and References
Exercises by Amr Zaky
Appendix C: Pipelining: Basic and Intermediate Concepts
C.1. Introduction
What Is Pipelining?
The Basics of the RISC V Instruction Set
A Simple Implementation of a RISC Instruction Set
The Classic Five-Stage Pipeline for a RISC Processor
Basic Performance Issues in Pipelining
C.2. The Major Hurdle of Pipelining-Pipeline Hazards
Performance of Pipelines With Stalls
Data Hazards
Minimizing Data Hazard Stalls by Forwarding
Data Hazards Requiring Stalls
Branch Hazards
Reducing Pipeline Branch Penalties
Performance of Branch Schemes
Reducing the Cost of Branches Through Prediction
Static Branch Prediction
Dynamic Branch Prediction and Branch-Prediction Buffers
C.3. How Is Pipelining Implemented?
A Simple Implementation of RISC V
A Basic Pipeline for RISC V
Implementing the Control for the RISC V Pipeline
Dealing With Branches in the Pipeline
C.4. What Makes Pipelining Hard to Implement?
Dealing With Exceptions
Types of Exceptions and Requirements
Stopping and Restarting Execution
Exceptions in RISC V
Instruction Set Complications
C.5. Extending the RISC V Integer Pipeline to Handle Multicycle Operations
Hazards and Forwarding in Longer Latency Pipelines
Maintaining Precise Exceptions
Performance of a Simple RISC V FP Pipeline
C.6. Putting It All Together: The MIPS R4000 Pipeline
The Floating-Point Pipeline
Performance of the R4000 Pipeline
C.7. Cross-Cutting Issues
RISC Instruction Sets and Efficiency of Pipelining
Dynamically Scheduled Pipelines
Dynamic Scheduling With a Scoreboard
C.8. Fallacies and Pitfalls
C.9. Concluding Remarks
C.10. Historical Perspective and References
Updated Exercises by Diana Franklin
Appendix D: Storage Systems
D.1. Introduction
D.2. Advanced Topics in Disk Storage
Disk Power
Advanced Topics in Disk Arrays
RAID 10 versus 01 (or 1+0 versus RAID 0+1)
RAID 6: Beyond a Single Disk Failure
D.3. Definition and Examples of Real Faults and Failures
Berkeley's Tertiary Disk
Tandem
Other Studies of the Role of Operators in Dependability
D.4. I/O Performance, Reliability Measures, and Benchmarks
Throughput versus Response Time
Transaction-Processing Benchmarks
SPEC System-Level File Server, Mail, and Web Benchmarks
Examples of Benchmarks of Dependability
D.5. A Little Queuing Theory
Poisson Distribution of Random Variables
D.6. Crosscutting Issues
Point-to-Point Links and Switches Replacing Buses
Block Servers versus Filers
Asynchronous I/O and Operating Systems
D.7. Designing and Evaluating an I/O System-The Internet Archive Cluster
The Internet Archive Cluster
Estimating Performance, Dependability, and Cost of the Internet Archive Cluster
Calculating MTTF of the TB-80 Cluster
D.8. Putting It All Together: NetApp FAS6000 Filer
D.9. Fallacies and Pitfalls
D.10. Concluding Remarks
D.11. Historical Perspective and References
Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau
Case Study 1: Deconstructing a Disk
Concepts illustrated by this case study
Case Study 2: Deconstructing a Disk Array
Concepts illustrated by this case study
Case Study 3: RAID Reconstruction
Concepts illustrated by this case study
Case Study 4: Performance Prediction for RAIDs
Concepts illustrated by this case study
Case Study 5: I/O Subsystem Design
Concepts illustrated by this case study
Case Study 6: Dirty Rotten Bits
Concepts illustrated by this case study
Case Study 7: Sorting Things Out
Concepts illustrated by this case study
Appendix E: Embedded Systems
E.1. Introduction
Real-Time Processing
E.2. Signal Processing and Embedded Applications: The Digital Signal Processor
The TI 320C55
The TI 320C6x
Media Extensions
E.3. Embedded Benchmarks
Power Consumption and Efficiency as the Metric
E.4. Embedded Multiprocessors
E.5. Case Study: The Emotion Engine of the Sony PlayStation 2
E.6. Case Study: Sanyo VPC-SX500 Digital Camera
E.7. Case Study: Inside a Cell Phone
Background on Wireless Networks
The Cell Phone
Cell Phone Standards and Evolution
E.8. Concluding Remarks
Appendix F: Interconnection Networks
F.1. Introduction
Interconnection Network Domains
Approach and Organization of This Appendix
F.2. Interconnecting Two Devices
Network Interface Functions: Composing and Processing Messages
Basic Network Structure and Functions: Media and Form Factor, Packet Transport, Flow Control, and Error Handling
Characterizing Performance: Latency and Effective Bandwidth
F.3. Connecting More than Two Devices
Additional Network Structure and Functions: Topology, Routing, Arbitration, and Switching
Shared-Media Networks
Switched-Media Networks
Comparison of Shared- and Switched-Media Networks
Characterizing Performance: Latency and Effective Bandwidth
F.4. Network Topology
Centralized Switched Networks
Distributed Switched Networks
Effects of Topology on Network Performance
F.5. Network Routing, Arbitration, and Switching
Routing
Arbitration
Switching
Impact on Network Performance
F.6. Switch Microarchitecture
Basic Switch Microarchitecture
Buffer Organizations
Routing Algorithm Implementation
Pipelining the Switch Microarchitecture
Other Switch Microarchitecture Enhancements
F.7. Practical Issues for Commercial Interconnection Networks
Connectivity
Standardization: Cross-Company Interoperability
Congestion Management
Fault Tolerance
F.8. Examples of Interconnection Networks
On-Chip Network: Intel Single-Chip Cloud Computer
System Area Network: IBM Blue Gene/L 3D Torus Network
System/Storage Area Network: InfiniBand
Ethernet: The Local Area Network
Wide Area Network: ATM
F.9. Internetworking
F.10. Crosscutting Issues for Interconnection Networks
Density-Optimized Processors versus SPEC-Optimized Processors
Smart Switches versus Smart Interface Cards
Protection and User Access to the Network
Efficient Interface to the Memory Hierarchy versus the Network
Compute-Optimized Processors versus Receiver Overhead
F.11. Fallacies and Pitfalls
F.12. Concluding Remarks
Acknowledgments
F.13. Historical Perspective and References
Wide Area Networks
Local Area Networks
System Area Networks
Storage Area Networks
On-Chip Networks
References
Exercises
Appendix G: Vector Processors in More Depth
G.1. Introduction
G.2. Vector Performance in More Depth
Pipelined Instruction Start-Up and Multiple Lanes
G.3. Vector Memory Systems in More Depth
G.4. Enhancing Vector Performance
Chaining in More Depth
Sparse Matrices in More Depth
G.5. Effectiveness of Compiler Vectorization
G.6. Putting It All Together: Performance of Vector Processors
Measures of Vector Performance
The Peak Performance of VMIPS on DAXPY
Sustained Performance of VMIPS on the Linpack Benchmark
DAXPY Performance on an Enhanced VMIPS
G.7. A Modern Vector Supercomputer: The Cray X1
Multi-Streaming Processors
Cray X1E
G.8. Concluding Remarks
G.9. Historical Perspective and References
References
Exercises
Appendix H: Hardware and Software for VLIW and EPIC
H.1. Introduction: Exploiting Instruction-Level Parallelism Statically
H.2. Detecting and Enhancing Loop-Level Parallelism
Finding Dependences
Eliminating Dependent Computations
H.3. Scheduling and Structuring Code for Parallelism
Software Pipelining: Symbolic Loop Unrolling
Global Code Scheduling
Trace Scheduling: Focusing on the Critical Path
Superblocks
H.4. Hardware Support for Exposing Parallelism: Predicated Instructions
H.5. Hardware Support for Compiler Speculation
Hardware Support for Preserving Exception Behavior
Hardware Support for Memory Reference Speculation
H.6. The Intel IA-64 Architecture and Itanium Processor
The Intel IA-64 Instruction Set Architecture
The IA-64 Register Model
Instruction Format and Support for Explicit Parallelism
Instruction Set Basics
Predication and Speculation Support
The Itanium 2 Processor
Functional Units and Instruction Issue
Itanium 2 Performance
H.7. Concluding Remarks
Reference
Appendix I: Large-Scale Multiprocessors and Scientific Applications
I.1. Introduction
I.2. Interprocessor Communication: The Critical Performance Issue
Advantages of Different Communication Mechanisms
I.3. Characteristics of Scientific Applications
Characteristics of Scientific Applications
The FFT Kernel
The LU Kernel
The Barnes Application
The Ocean Application
Computation/Communication for the Parallel Programs
I.4. Synchronization: Scaling Up
Synchronization Performance Challenges
Barrier Synchronization
Synchronization Mechanisms for Larger-Scale Multiprocessors
Software Implementations
Hardware Primitives
I.5. Performance of Scientific Applications on Shared-Memory Multiprocessors
Performance of a Scientific Workload on a Symmetric Shared-Memory Multiprocessor
Performance of a Scientific Workload on a Distributed-Memory Multiprocessor
I.6. Performance Measurement of Parallel Processors with Scientific Applications
I.7. Implementing Cache Coherence
Implementing Cache Coherence in a DSM Multiprocessor
Avoiding Deadlock from Limited Buffering
Implementing the Directory Controller
I.8. The Custom Cluster Approach: Blue Gene/L
The Blue Gene/L Computing Node
I.9. Concluding Remarks
Appendix J: Computer Arithmetic
J.1. Introduction
J.2. Basic Techniques of Integer Arithmetic
Ripple-Carry Addition
Radix-2 Multiplication and Division
Signed Numbers
Systems Issues
J.3. Floating Point
Special Values and Denormals
Representation of Floating-Point Numbers
J.4. Floating-Point Multiplication
Denormals
Precision of Multiplication
J.5. Floating-Point Addition
Speeding Up Addition
Denormalized Numbers
J.6. Division and Remainder
Iterative Division
Floating-Point Remainder
J.7. More on Floating-Point Arithmetic
Fused Multiply-Add
Precisions
Exceptions
Underflow
J.8. Speeding Up Integer Addition
Carry-Lookahead
Carry-Skip Adders
Carry-Select Adder
J.9. Speeding Up Integer Multiplication and Division
Shifting over Zeros
SRT Division
Speeding Up Multiplication with a Single Adder
Faster Multiplication with Many Adders
Faster Division with One Adder
J.10. Putting It All Together
J.11. Fallacies and Pitfalls
J.12. Historical Perspective and References
References
Exercises
Appendix K: Survey of Instruction Set Architectures
K.1. Introduction
K.2. A Survey of RISC Architectures for Desktop, Server, and Embedded Computers
Introduction
Addressing Modes and Instruction Formats
Instructions
RV64G Core Instructions
Compare and Conditional Branch
RV64GC Core 16-bit Instructions
Instructions: Common Extensions beyond RV64G
Instructions Unique to MIPS64 R6
Instructions Unique to SPARC v.9
Register Windows
Fast Traps
Support for LISP and Smalltalk
Instructions Unique to ARM
Instructions Unique to Power3
Branch Registers: Link and Counter
Instructions: Multimedia Extensions of the Desktop/Server RISCs
Instructions: Digital Signal-Processing Extensions of the Embedded RISCs
Concluding Remarks
K.3. The Intel 80x86
Introduction
80x86 Registers and Data Addressing Modes
80x86 Integer Operations
80x86 Floating-Point Operations
80x86 Instruction Encoding
Putting It All Together: Measurements of Instruction Set Usage
Measurements of 80x86 Operand Addressing
Comparative Operation Measurements
Concluding Remarks
K.4. The VAX Architecture
Introduction
VAX Operands and Addressing Modes
Encoding VAX Instructions
VAX Operations
Number of Operations
Branches, Jumps, and Procedure Calls
An Example to Put It All Together: swap
Register Allocation for swap
Code for the Body of the Procedure swap
Preserving Registers across Procedure Invocation of swap
The Full Procedure swap
A Longer Example: sort
Register Allocation for sort
Code for the Body of the sort Procedure
The Outer Loop
The Inner Loop
The Procedure Call
Passing Parameters
Preserving Registers across Procedure Invocation of sort
The Full Procedure sort
Fallacies and Pitfalls
Concluding Remarks
Exercises
K.5. The IBM 360/370 Architecture for Mainframe Computers
Introduction
System/360 Instruction Set
Integer/Logical and Floating-Point R-R Instructions
Branches and Status Setting R-R Instructions
Branches/Logical and Floating-Point Instructions-RX Format
Branches and Special Loads and Stores-RX Format
RS and SI Format Instructions
SS Format Instructions
360 Detailed Measurements
K.6. Historical Perspective and References
Acknowledgments
Appendix L: Advanced Concepts on Address Translation
Appendix M: Historical Perspectives and References
M.1. Introduction
M.2. The Early Development of Computers (Chapter 1)
The First General-Purpose Electronic Computers
Important Special-Purpose Machines
Commercial Developments
Development of Quantitative Performance Measures: Successes and Failures
M.3. The Development of Memory Hierarchy and Protection (Chapter 2 and Appendix B)
M.4. The Evolution of Instruction Sets (Appendices A, J, and K)
Stack Architectures
Computer Architecture Defined
High-Level Language Computer Architecture
Reduced Instruction Set Computers
M.5. The Development of Pipelining and Instruction-Level Parallelism (Chapter 3 and Appendices C and H)
Early Pipelined CPUs
The Introduction of Dynamic Scheduling
The IBM 360 Model 91: A Landmark Computer
Branch-Prediction Schemes
The Development of Multiple-Issue Processors
Compiler Technology and Hardware Support for Scheduling
EPIC and the IA-64 Development
Studies of ILP and Ideas to Increase ILP
Going Beyond the Data Flow Limit
Recent Advanced Microprocessors
Multithreading and Simultaneous Multithreading
M.6. The Development of SIMD Supercomputers, Vector Computers, Multimedia SIMD Instruction Extensions, and Graphical Processor Units
SIMD Supercomputers
Vector Computers
Multimedia SIMD Instruction Extensions
Graphical Processor Units
Scalable GPUs
Graphics Pipelines
GPGPU: An Intermediate Step
GPU Computing
References
SIMD Supercomputers
Vector Architecture
Multimedia SIMD
GPU
M.7. The History of Multiprocessors and Parallel Processing (Chapter 5 and Appendices F, G, and I)
SIMD Computers: Attractive Idea, Many Attempts, No Lasting Successes
Other Early Experiments
Great Debates in Parallel Processing
More Recent Advances and Developments
The Development of Bus-Based Coherent Multiprocessors
Toward Large-Scale Multiprocessors
Clusters
Recent Trends in Large-Scale Multiprocessors
Developments in Synchronization and Consistency Models
Other References
M.8. The Development of Clusters (Chapter 6)
Clusters, the Forerunner of WSCs
Utility Computing, the Forerunner of Cloud Computing
Containers
M.9. Historical Perspectives and References
M.10. The History of Magnetic Storage, RAID, and I/O Buses (Appendix D)
Magnetic Storage
RAID
I/O Buses and Controllers
References
References
Index
Back End Sheet
Inside Back Cover
Back Cover
Computer Architecture Formulas

1. CPU time = Instruction count × Clock cycles per instruction × Clock cycle time
2. X is n times faster than Y: n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
3. Amdahl's Law: Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
4. Energy_dynamic ∝ 1/2 × Capacitive load × Voltage²
5. Power_dynamic ∝ 1/2 × Capacitive load × Voltage² × Frequency switched
6. Power_static ∝ Current_static × Voltage
7. Availability = Mean time to fail / (Mean time to fail + Mean time to repair)
8. Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N, where Wafer yield accounts for wafers that are so bad they need not be tested, and N is a parameter called the process-complexity factor, a measure of manufacturing difficulty. N ranges from 11.5 to 15.5 in 2011.
9. Means: arithmetic (AM), weighted arithmetic (WAM), and geometric (GM):
   AM = (1/n) × Σ_{i=1}^{n} Time_i,  WAM = Σ_{i=1}^{n} Weight_i × Time_i,  GM = (Π_{i=1}^{n} Time_i)^{1/n},
   where Time_i is the execution time for the ith program of a total of n in the workload and Weight_i is the weighting of the ith program in the workload.
10. Average memory access time = Hit time + Miss rate × Miss penalty
11. Misses per instruction = Miss rate × Memory accesses per instruction
12. Cache index size: 2^index = Cache size / (Block size × Set associativity)
13. Power Utilization Effectiveness (PUE) of a Warehouse-Scale Computer = Total Facility Power / IT Equipment Power

Rules of Thumb

1. Amdahl/Case Rule: A balanced computer system needs about 1 MB of main memory capacity and 1 megabit per second of I/O bandwidth per MIPS of CPU performance.
2. 90/10 Locality Rule: A program executes about 90% of its instructions in 10% of its code.
3. Bandwidth Rule: Bandwidth grows by at least the square of the improvement in latency.
4. 2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as that of a two-way set-associative cache of size N/2.
5. Dependability Rule: Design with no single point of failure.
6. Watt-Year Rule: The fully burdened cost of a Watt per year in a Warehouse-Scale Computer in North America in 2011, including the cost of amortizing the power and cooling infrastructure, is about $2.
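As a quick illustration of how a few of the formulas above are applied, the following sketch evaluates Amdahl's Law (formula 3), the CPU time equation (formula 1), and average memory access time (formula 10). All numeric inputs are hypothetical values chosen only for the example, not figures from the book.

```python
# Minimal sketch of three formulas from the table above.
# Every input value below is hypothetical and serves only to show the arithmetic.

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Formula 3: overall speedup when only part of the task is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def cpu_time(instruction_count: float, cpi: float, clock_cycle_time_s: float) -> float:
    """Formula 1: CPU time = instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

def avg_memory_access_time(hit_time_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Formula 10: AMAT = hit time + miss rate x miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

if __name__ == "__main__":
    # Hypothetical case: 40% of execution is enhanced by a factor of 10.
    print(f"Amdahl speedup: {amdahl_speedup(0.4, 10.0):.2f}x")          # ~1.56x
    # Hypothetical case: 1e9 instructions, CPI of 1.5, 0.5 ns cycle (2 GHz clock).
    print(f"CPU time: {cpu_time(1e9, 1.5, 0.5e-9):.3f} s")              # 0.750 s
    # Hypothetical case: 1 ns hit time, 2% miss rate, 100 ns miss penalty.
    print(f"AMAT: {avg_memory_access_time(1.0, 0.02, 100.0):.2f} ns")   # 3.00 ns
```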
In Praise of Computer Architecture: A Quantitative Approach, Sixth Edition

“Although important concepts of architecture are timeless, this edition has been thoroughly updated with the latest technology developments, costs, examples, and references. Keeping pace with recent developments in open-sourced architecture, the instruction set architecture used in the book has been updated to use the RISC-V ISA.”
—from the foreword by Norman P. Jouppi, Google

“Computer Architecture: A Quantitative Approach is a classic that, like fine wine, just keeps getting better. I bought my first copy as I finished up my undergraduate degree and it remains one of my most frequently referenced texts today.”
—James Hamilton, Amazon Web Services

“Hennessy and Patterson wrote the first edition of this book when graduate students built computers with 50,000 transistors. Today, warehouse-size computers contain that many servers, each consisting of dozens of independent processors and billions of transistors. The evolution of computer architecture has been rapid and relentless, but Computer Architecture: A Quantitative Approach has kept pace, with each edition accurately explaining and analyzing the important emerging ideas that make this field so exciting.”
—James Larus, Microsoft Research

“Another timely and relevant update to a classic, once again also serving as a window into the relentless and exciting evolution of computer architecture! The new discussions in this edition on the slowing of Moore's law and implications for future systems are must-reads for both computer architects and practitioners working on broader systems.”
—Parthasarathy (Partha) Ranganathan, Google

“I love the ‘Quantitative Approach’ books because they are written by engineers, for engineers. John Hennessy and Dave Patterson show the limits imposed by mathematics and the possibilities enabled by materials science. Then they teach through real-world examples how architects analyze, measure, and compromise to build working systems. This sixth edition comes at a critical time: Moore’s Law is fading just as deep learning demands unprecedented compute cycles. The new chapter on domain-specific architectures documents a number of promising approaches and prophesies a rebirth in computer architecture. Like the scholars of the European Renaissance, computer architects must understand our own history, and then combine the lessons of that history with new techniques to remake the world.”
—Cliff Young, Google
John L. Hennessy is a Professor of Electrical Engineering and Computer Science at Stanford University, where he has been a member of the faculty since 1977 and was, from 2000 to 2016, its 10th President. He currently serves as the Director of the Knight-Hennessy Fellowship, which provides graduate fellowships to potential future leaders. Hennessy is a Fellow of the IEEE and ACM, a member of the National Academy of Engineering, the National Academy of Science, and the American Philosophical Society, and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received 10 honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors. As of 2017, over 5 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups, both as an early-stage advisor and an investor.

David A. Patterson became a Distinguished Engineer at Google in 2016 after 40 years as a UC Berkeley professor. He joined UC Berkeley immediately after graduating from UCLA. He still spends a day a week in Berkeley as an Emeritus Professor of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the President of the United States, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM, CRA, and SIGARCH. He is currently Vice-Chair of the Board of Directors of the RISC-V Foundation.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. His current interests are in designing domain-specific architectures for machine learning, spreading the word on the open RISC-V instruction set architecture, and in helping the UC Berkeley RISELab (Real-time Intelligent Secure Execution).
Computer Architecture: A Quantitative Approach, Sixth Edition

John L. Hennessy, Stanford University
David A. Patterson, University of California, Berkeley

With Contributions by
Krste Asanovic, University of California, Berkeley
Jason D. Bakos, University of South Carolina
Robert P. Colwell, R&E Colwell & Assoc. Inc.
Abhishek Bhattacharjee, Rutgers University
Thomas M. Conte, Georgia Tech
Jose Duato, Proemisa
Diana Franklin, University of Chicago
David Goldberg, eBay
Norman P. Jouppi, Google
Sheng Li, Intel Labs
Naveen Muralimanohar, HP Labs
Gregory D. Peterson, University of Tennessee
Timothy M. Pinkston, University of Southern California
Parthasarathy Ranganathan, Google
David A. Wood, University of Wisconsin–Madison
Cliff Young, Google
Amr Zaky, University of Santa Clara
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

© 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-811905-1

For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher
Acquisition Editor: Stephen Merken
Developmental Editor: Nate McFadden
Production Project Manager: Stalin Viswanathan
Cover Designer: Christian J. Bilbow
Typeset by SPi Global, India