Parallel Computer Architecture: A Hardware/Software Approach

Title
Preface
Contents
CHAPTER 1 Introduction
  1.1 Introduction
  1.2 Why Parallel Architecture
    Figure 1-1 Performance trends over time of micro, ...
    1.2.1 Application Trends
      Equation
      Equation
      Scientific and Engineering Computing
        Figure 1-2 Grand Challenge Application Requirement...
        Figure 1-3 Speedup on Three Versions of a Parallel...
      Commercial Computing
        Figure 1-4 TPC-C throughput versus number of proce...
        Example 1-1 The tpmC for the Tandem Himalaya and I...
    1.2.2 Technology Trends
      Figure 1-5 Improvement in logic density and clock ...
    1.2.3 Architectural Trends
      Figure 1-6 Number of transistors per processor chi...
      Figure 1-7 Distribution of potential instruction-l...
      Figure 1-8 Number of processors in fully configure...
      Figure 1-9 Bandwidth of the shared memory bus in c...
    1.2.4 Supercomputers
      Figure 1-10 Uniprocessor performance of supercompu...
      Figure 1-11 Performance of supercomputers and MPPs...
      Figure 1-12 Type of systems used in 500 fastest co...
    1.2.5 Summary
  1.3 Convergence of Parallel Architectures
    1.3.1 Communication Architecture
      Figure 1-13 Layers of Abstraction in Parallel Comp...
    1.3.2 Shared Memory
      Figure 1-14 Typical memory model for shared-memory...
      Figure 1-15 Extending a system into a shared-memor...
      Figure 1-16 Typical Shared-Memory Multiprocessor I...
      Figure 1-17 Physical and logical organization of t...
      Figure 1-18 Physical and logical organization of t...
      Figure 1-19 Non-uniform Memory Access (NUMA) Scala...
      Figure 1-20 Cray T3E Scalable Shared Address Space...
    1.3.3 Message-Passing
      Figure 1-21 User level send/receive message passin...
      Figure 1-22 Typical structure of an early message ...
      Figure 1-23 IBM SP-2 Message Passing Machine
      Figure 1-24 Intel Paragon
    1.3.4 Convergence
    1.3.5 Data Parallel Processing
      Figure 1-25 Typical organization of a data paralle...
    1.3.6 Other Parallel Architectures
      Dataflow Architecture
        Figure 1-26 Dataflow graph and basic execution pip...
      Systolic Architectures
        Figure 1-27 Systolic Array computation of an inner...
    1.3.7 A Generic Parallel Architecture
      Figure 1-28 Generic scalable multiprocessor organi...
  1.4 Fundamental Design Issues
    1.4.1 Communication Abstraction
    1.4.2 Programming Model Requirements
    1.4.3 Naming
      Operations
    1.4.4 Ordering
    1.4.5 Communication and Replication
    1.4.6 Performance
      Example 1-2 Suppose a component can perform a spec...
      Data Transfer Time
        Equation
        Equation
      Overhead and Occupancy
        Equation
      Communication Cost
        Equation
      Summary
  1.5 Concluding Remarks
  1.6 References
  1.7 Exercises
    Exercise 1.1 Compute the annual growth rate in number of tr...
    Exercise 1.2 Compute the annual performance growth rates fo...
      Table 1-1 Performance of leading workstations
    Exercise 1.3 Suppose you are given a program which does a f...
    Exercise 1.4 Given a histogram of available parallelism suc...
    Exercise 1.5 Locate the current TPC performance data on the...
    Exercise 1.6 In message passing models each process is prov...
    Exercise 1.7 To move a message along links in an unloaded ...
    Exercise 1.8 Consider a simple 2D finite difference scheme ...
      All the values are 64-bit floating point numbers. ...
    Exercise 1.9 Consider the simple pipelined component descri...
    Exercise 1.10 Show that Equation 1.4 follows from Equation...
    Exercise 1.11 What is the x-intercept of the line in Equati...
    Exercise 1.12 Suppose we have a machine with a 64-bit wide ...
    Exercise 1.13 Suppose this 32-byte line is transferred to a...
    Exercise 1.14 Suppose we have a machine with a message star...
    Exercise 1.15 Derive a general formula for the “half-power ...
    Exercise 1.16 Assuming that before transmitting a message t...
    Exercise 1.17 Consider a machine running at 100 MIPS on som...
    Exercise 1.18 Assume the instruction mix and miss rate as i...
    Exercise 1.19 Of course, no matter how many processors you ...
CHAPTER 2 Parallel Programs
Parallel Computer Architecture: A Hardware/Software Approach

David Culler, University of California, Berkeley
Jaswinder Pal Singh, Princeton University
with Anoop Gupta, Stanford University

Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1997 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of a forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition.
Preface

Motivation for the Book

Parallel computing is a critical component of the computing technology of the 1990s, and it is likely to have as much impact over the next twenty years as microprocessors have had over the past twenty. Indeed, the two technologies are closely linked, as the evolution of highly integrated microprocessors and memory chips is making multiprocessor systems increasingly attractive. Multiprocessors already represent the high-performance end of almost every segment of the computing market, from the fastest supercomputers, to departmental compute servers, to the individual desktop. In the past, computer vendors employed a range of technologies to provide increasing performance across their product lines. Today, the same state-of-the-art microprocessor is used throughout. To obtain a significant range of performance, the simplest approach is to increase the number of processors, and economies of scale make this extremely attractive. Very soon, several processors will fit on a single chip.

Although parallel computing has a long and rich academic history, the close coupling with commodity technology has fundamentally changed the discipline. The emphasis on radical architectures and exotic technology has given way to quantitative analysis and careful engineering trade-offs. Our goal in writing this book is to equip designers of the emerging class of multiprocessor systems, from modestly parallel personal computers to massively parallel supercomputers, with an understanding of the fundamental architectural issues and the available techniques for addressing design trade-offs. At the same time, we hope to provide designers of software systems for these machines with an understanding of the likely directions of architectural evolution and the forces that will determine the specific path that hardware designs will follow.

The most exciting recent development in parallel computer architecture is the convergence of traditionally disparate approaches, namely shared memory, message passing, SIMD, and dataflow, on a common machine structure. This is driven partly by common technological and economic forces, and partly by a better understanding of parallel software. This convergence allows us to focus on the overriding architectural issues and to develop a common framework in which to understand and evaluate architectural trade-offs. Moreover, parallel software has matured to the point where the popular parallel programming models are available on a wide range of machines and meaningful benchmarks exist. This maturing of the field makes it possible to undertake a quantitative, as well as qualitative, study of hardware/software interactions; in fact, it demands such an approach. The book follows a set of issues that are critical to all parallel architectures – communication latency, communication bandwidth, and coordination of cooperative work – across the full range of modern designs.
It describes the set of techniques available in hardware and in software to address each issue and explores how the various techniques interact. Case studies provide a concrete illustration of the general principles and demonstrate specific interactions between mechanisms.

Our final motivation comes from the current lack of an adequate textbook for our own courses at Stanford, Berkeley, and Princeton. Many existing textbooks cover the material in a cursory fashion, summarizing various architectures and research results, but not analyzing them in depth. Others focus on specific projects, but fail to recognize the principles that carry over to alternative approaches. The research reports in the area provide a sizable body of empirical data, but it has not yet been distilled into a coherent picture. By focusing on the salient issues in the context of the technological convergence, rather than the rich and varied history that brought us to this point, we hope to provide a deeper and more coherent understanding of the field.

Intended Audience

We believe the subject matter of this book is core material and should be relevant to graduate students and practicing engineers in the fields of computer architecture, systems software, and applications. The relevance for computer architects is obvious, given the growing importance of multiprocessors. Chip designers must understand what constitutes a viable building block for multiprocessor systems, while computer system designers must understand how best to utilize modern microprocessor and memory technology in building multiprocessors.

Systems software, including operating systems, compilers, programming languages, run-time systems, and performance debugging tools, will need to address new issues and will provide new opportunities in parallel computers. Thus, an understanding of the evolution and the forces guiding that evolution is critical. Researchers in compilers and programming languages have
addressed aspects of parallel computing for some time. However, the new convergence with commodity technology suggests that these aspects may need to be reexamined and perhaps addressed in very general terms. The traditional boundaries between hardware, operating system, and user program are also shifting in the context of parallel computing, where communication, scheduling, sharing, and resource management are intrinsic to the program.

Applications areas, such as computer graphics and multimedia, scientific computing, computer-aided design, decision support, and transaction processing, are all likely to see a tremendous transformation as a result of the vast computing power available at low cost through parallel computing. However, developing parallel applications that are robust and provide good speedup across current and future multiprocessors is a challenging task, and it requires a deep understanding of the forces driving parallel computers. The book seeks to provide this understanding, but also to stimulate the exchange between the applications fields and computer architecture, so that better architectures can be designed: architectures that make the programming task easier and performance more robust.

Organization of the Book

The book is organized into twelve chapters. Chapter 1 begins with the motivation for why parallel architectures are inevitable based on technology, architecture, and application trends. It then briefly introduces the diverse multiprocessor architectures we find today (shared memory, message passing, data parallel, dataflow, and systolic), and it shows how the technology and architectural trends tell a strong story of convergence in the field. The convergence does not mean the end of innovation; on the contrary, it implies that we will now see a time of rapid progress in the field, as designers start talking to each other rather than past each other. Given this convergence, the last portion of the chapter introduces the fundamental design issues for multiprocessors: naming, synchronization, latency, and bandwidth. These four issues form an underlying theme throughout the rest of this book. The chapter ends with a historical perspective, providing a glimpse into the diverse and rich history of the field.

Chapter 2 provides a brief introduction to the process of parallel programming and the basic components of popular programming models. It is intended to ensure that the reader has a clear understanding of hardware/software trade-offs, as well as what aspects of performance can be addressed through architectural means and what aspects must be addressed either by the compiler or the programmer in providing to the hardware a well-designed parallel program. The analogy in sequential computing is that architecture cannot transform an O(n²) algorithm into an O(n log n) algorithm, but it can improve the average access time for common memory reference patterns. The brief discussion of programming issues is not likely to turn you into an expert parallel programmer, if you are not already, but it will familiarize you with the issues and what programs look like in various programming models. Chapter 3 outlines the basic techniques used in programming for performance and presents a collection of application case studies that serve as a basis for quantitative evaluation of design trade-offs throughout the book.

Chapter 4 takes up the challenging task of performing a solid empirical evaluation of design trade-offs.
Architectural evaluation is difficult even for modern uniprocessors, where we typically look at variations in pipeline design or memory system design against a fixed set of programs. In parallel architecture we have many more degrees of freedom to explore, the interactions between aspects of the design are more profound, and the interactions between hardware and software are more significant and of a wider scope. In general, we are looking at performance as the machine and the program scale; there is no way to scale one without the other. Chapter 4 discusses how scaling interacts with various architectural parameters and presents a set of benchmarks that are used throughout the later chapters.

Chapters 5 and 6 provide a complete understanding of bus-based multiprocessors, the SMPs that form the bread and butter of modern commercial machines beyond the desktop, and even to some extent on the desktop. Chapter 5 presents the logical design of “snooping” bus protocols, which ensure that automatically replicated data is coherent across multiple caches. This chapter provides an important discussion of memory consistency models, which allows us to come to terms with what shared memory really means to algorithm designers. It discusses the spectrum of design options and how machines are optimized against typical reference patterns occurring in user programs and in the operating system. Given this conceptual understanding of SMPs, it reflects on the implications for parallel programming.

Chapter 6 examines the physical design of bus-based multiprocessors. It digs down into the engineering issues that arise in supporting modern microprocessors with multilevel caches on modern busses, which are highly pipelined. Although some of this material is contained in more casual treatments of multiprocessor architecture, the presentation here provides a very complete understanding of the design issues in this regime. It is especially important because these small-scale designs form a building block for large-scale designs and because many of the concepts will reappear later in the book on a larger scale with a broader set of concerns.

Chapter 7 presents the hardware organization and architecture of a range of machines that are scalable to large or very large configurations. The key organizational concept is that of a network transaction, analogous to the bus transaction, which is the fundamental primitive for the designs in Chapters 5 and 6. However, in large-scale machines the global information and global arbitration of small-scale designs are lost, and a large number of transactions can be outstanding. We show how conventional programming models are realized in terms of network transactions and then study a spectrum of important design points, organized according to the level of direct hardware interpretation of the network transaction, including detailed case studies of important commercial machines.

Chapter 8 puts the results of the previous chapters together to demonstrate how to realize a global shared physical address space with automatic replication on a large scale. It provides a complete treatment of directory-based cache coherence protocols and hardware design alternatives.

Chapter 9 examines a spectrum of alternatives that push the boundaries of possible hardware/software trade-offs to obtain higher performance, to reduce hardware complexity, or both. It looks at relaxed memory consistency models, cache-only memory architectures, and software-based cache coherence, material that at the time of writing was in transition from academic research to commercial product.
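Several of the chapters just described (Chapters 5, 8, and 9) revolve around cache coherence, so a minimal sketch may help fix ideas before the detailed treatment. The C fragment below shows the state transitions of a basic 3-state (MSI) invalidation protocol, the kind introduced in Chapter 5, as one cache controller might apply them to a single line. The event names, and the reduction of a controller to pure state-transition functions, are our simplification for illustration; a real design must also handle data movement, write-backs, bus arbitration, and protocol races.

```c
/* A minimal sketch of a 3-state (MSI) snooping invalidation protocol.
 * Illustrative only: a real controller must also manage data movement,
 * write-backs, arbitration, and races on a split-transaction bus. */

typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Processor-side events for one cache line. */
line_state on_processor_read(line_state s) {
    if (s == INVALID)
        return SHARED;    /* miss: issue a bus read (BusRd), load the line */
    return s;             /* SHARED or MODIFIED: hit, no bus traffic */
}

line_state on_processor_write(line_state s) {
    if (s != MODIFIED)    /* issue a read-exclusive (BusRdX) so that any */
        return MODIFIED;  /* other copies are invalidated before writing */
    return MODIFIED;      /* already exclusive: the write hits locally */
}

/* Snoop-side events: every cache watches transactions by other caches. */
line_state on_bus_read(line_state s) {
    if (s == MODIFIED)
        return SHARED;    /* supply the dirty data, keep a shared copy */
    return s;
}

line_state on_bus_read_exclusive(line_state s) {
    return INVALID;       /* another cache will write: flush if dirty,
                             then drop our copy */
}
```

Written this way, the coherence invariant is easy to see: at most one cache can hold a line in MODIFIED, while any number may hold it in SHARED. Chapter 8 revisits the same invariant for directory-based protocols, where the bus broadcast is replaced by explicit messages.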
Chapter 10 addresses the design of scalable high-performance communication networks, which underlie all the large-scale machines discussed in the previous chapters but whose treatment was deferred in order to complete our understanding of the processor, memory system, and network interface designs that drive these networks. The chapter builds a general framework for understanding where hardware costs, transfer delays, and bandwidth restrictions arise in networks. We then look at a variety of trade-offs in routing techniques, switch design, and interconnection topology with respect to these cost-performance metrics. These trade-offs are made concrete through case studies of recent designs.

Given the foundation established by the first ten chapters, Chapter 11 examines a set of cross-cutting issues involved in tolerating significant communication delays without impeding performance. The techniques essentially exploit two basic capabilities: overlapping communication with useful computation, and pipelining the communication of a volume of data. The simplest of these are bulk transfers, which pipeline the movement of a large regular sequence of data items and can often be off-loaded from the processor. The other techniques attempt to hide the latency incurred in collections of individual loads and stores. Write latencies are hidden by exploiting weak consistency models, which recognize that ordering is conveyed by only a small set of the accesses to shared memory in a program. Read latencies are hidden by implicit or explicit prefetching of data, or by look-ahead techniques in modern dynamically scheduled processors. The chapter provides a thorough examination of these alternatives, their impact on compilation techniques, and a quantitative evaluation of their effectiveness. Finally, Chapter 12 examines the trends in technology, architecture, software systems, and applications that are likely to shape the evolution of the field.
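The overlap of communication with useful computation that Chapter 11 studies can be previewed with a small sketch. The example below is ours rather than the book's: it uses MPI's nonblocking primitives (MPI_Isend, MPI_Irecv, MPI_Waitall) purely as a convenient notation, and the strip size and neighbor ranks are assumed for illustration. The pattern is the standard one: start the boundary exchange, update the interior points that need no remote data while the messages are in flight, and block only when the remote values are actually required.

```c
/* Sketch: hiding message latency by overlapping communication with
 * computation. MPI is used only as notation; N and the neighbor ranks
 * are illustrative. Each process holds a strip u[0..N-1] and trades
 * one boundary value with each neighbor per relaxation step. */
#include <mpi.h>

#define N 1024

void relax_step(double u[N], int left, int right, MPI_Comm comm) {
    double from_left, from_right;
    MPI_Request reqs[4];

    /* 1. Start the halo exchange; these calls return immediately. */
    MPI_Irecv(&from_left,  1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&from_right, 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[0],       1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[N - 1],   1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* 2. Update interior points, which depend only on local data;
     *    this computation proceeds while the messages are in flight.
     *    (The sent elements u[0] and u[N-1] are left untouched.) */
    for (int i = 2; i < N - 2; i++)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* 3. Block only now, when the remote values are actually needed,
     *    then finish the two boundary points. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    u[0]     = 0.5 * (from_left + u[1]);
    u[N - 1] = 0.5 * (u[N - 2] + from_right);
}
```

The same split-phase discipline, issue the long-latency operation early, find independent work, and synchronize as late as possible, underlies the bulk transfers, weak consistency models, and prefetching examined in the chapter.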
We believe parallel computer architecture is an exciting core field whose importance is growing; it has reached a point of maturity at which a textbook makes sense. From a rich diversity of ideas and approaches, there is now a dramatic convergence happening in the field. It is time to go beyond surveying the machine landscape to an understanding of fundamental design principles. We have intimately participated in the convergence of the field; this text arises from that experience, and we hope it conveys some of the excitement that we feel for this dynamic and growing area.
Contents

CHAPTER 1  Introduction
  1.1 Introduction
  1.2 Why Parallel Architecture
    1.2.1 Application Trends
    1.2.2 Technology Trends
    1.2.3 Architectural Trends
    1.2.4 Supercomputers
    1.2.5 Summary
  1.3 Convergence of Parallel Architectures
    1.3.1 Communication Architecture
    1.3.2 Shared Memory
    1.3.3 Message-Passing
    1.3.4 Convergence
    1.3.5 Data Parallel Processing
    1.3.6 Other Parallel Architectures
    1.3.7 A Generic Parallel Architecture
  1.4 Fundamental Design Issues
    1.4.1 Communication Abstraction
    1.4.2 Programming Model Requirements
    1.4.3 Naming
    1.4.4 Ordering
    1.4.5 Communication and Replication
    1.4.6 Performance
  1.5 Concluding Remarks
  1.6 References
  1.7 Exercises

CHAPTER 2  Parallel Programs
  2.1 Introduction
  2.2 Parallel Application Case Studies
    2.2.1 Simulating Ocean Currents
    2.2.2 Simulating the Evolution of Galaxies
    2.2.3 Visualizing Complex Scenes using Ray Tracing
    2.2.4 Mining Data for Associations
  2.3 The Parallelization Process
    2.3.1 Steps in the Process
    2.3.2 Parallelizing Computation versus Data
    2.3.3 Goals of the Parallelization Process
  2.4 Parallelization of an Example Program
    2.4.1 A Simple Example: The Equation Solver Kernel
    2.4.2 Decomposition
    2.4.3 Assignment
    2.4.4 Orchestration under the Data Parallel Model
    2.4.5 Orchestration under the Shared Address Space Model
    2.4.6 Orchestration under the Message Passing Model
  2.5 Concluding Remarks
  2.6 References
  2.7 Exercises

CHAPTER 3  Programming for Performance
  3.1 Introduction
  3.2 Partitioning for Performance
    3.2.1 Load Balance and Synchronization Wait Time
    3.2.2 Reducing Inherent Communication
    3.2.3 Reducing the Extra Work
    3.2.4 Summary
  3.3 Data Access and Communication in a Multi-Memory System
    3.3.1 A Multiprocessor as an Extended Memory Hierarchy
    3.3.2 Artifactual Communication in the Extended Memory Hierarchy
  3.4 Orchestration for Performance
    3.4.1 Reducing Artifactual Communication
    3.4.2 Structuring Communication to Reduce Cost
  3.5 Performance Factors from the Processors’ Perspective
  3.6 The Parallel Application Case Studies: An In-Depth Look
    3.6.1 Ocean
    3.6.2 Barnes-Hut
    3.6.3 Raytrace
    3.6.4 Data Mining
  3.7 Implications for Programming Models
  3.8 Concluding Remarks
  3.9 References
  3.10 Exercises

CHAPTER 4  Workload-Driven Evaluation
  4.1 Introduction
  4.2 Scaling Workloads and Machines
    4.2.1 Why Worry about Scaling?
    4.2.2 Key Issues in Scaling
    4.2.3 Scaling Models
    4.2.4 Scaling Workload Parameters
  4.3 Evaluating a Real Machine
    4.3.1 Performance Isolation using Microbenchmarks
    4.3.2 Choosing Workloads
    4.3.3 Evaluating a Fixed-Size Machine
    4.3.4 Varying Machine Size
    4.3.5 Choosing Performance Metrics
    4.3.6 Presenting Results
  4.4 Evaluating an Architectural Idea or Tradeoff
    4.4.1 Multiprocessor Simulation
    4.4.2 Scaling Down Problem and Machine Parameters for Simulation
    4.4.3 Dealing with the Parameter Space: An Example Evaluation
    4.4.4 Summary
  4.5 Illustrating Workload Characterization
    4.5.1 Workload Case Studies
    4.5.2 Workload Characteristics
  4.6 Concluding Remarks
  4.7 References
  4.8 Exercises

CHAPTER 5  Shared Memory Multiprocessors
  5.1 Introduction
  5.2 Cache Coherence
    5.2.1 The Cache Coherence Problem
    5.2.2 Cache Coherence Through Bus Snooping
  5.3 Memory Consistency
    5.3.1 Sequential Consistency
    5.3.2 Sufficient Conditions for Preserving Sequential Consistency
  5.4 Design Space for Snooping Protocols
    5.4.1 A 3-state (MSI) Write-back Invalidation Protocol
    5.4.2 A 4-state (MESI) Write-Back Invalidation Protocol
    5.4.3 A 4-state (Dragon) Write-back Update Protocol
  5.5 Assessing Protocol Design Tradeoffs
    5.5.1 Workloads
    5.5.2 Impact of Protocol Optimizations
    5.5.3 Tradeoffs in Cache Block Size
    5.5.4 Update-based vs. Invalidation-based Protocols
  5.6 Synchronization
    5.6.1 Components of a Synchronization Event
    5.6.2 Role of User, System Software and Hardware
    5.6.3 Mutual Exclusion
    5.6.4 Point-to-point Event Synchronization
    5.6.5 Global (Barrier) Event Synchronization
    5.6.6 Synchronization Summary
  5.7 Implications for Software
  5.8 Concluding Remarks
  5.9 References
  5.10 Exercises