
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency

Copyright © 2007 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, and James Laudon
www.morganclaypool.com

ISBN: 159829122X (paperback)
ISBN: 9781598291223 (paperback)
ISBN: 1598291238 (ebook)
ISBN: 9781598291230 (ebook)

DOI: 10.2200/S00093ED1V01Y200707CAC003

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Lecture #3
Series Editor: Mark D. Hill, University of Wisconsin

Library of Congress Cataloging-in-Publication Data
Series ISSN: 1935-3235 (print)
Series ISSN: 1935-3243 (electronic)

First Edition
10 9 8 7 6 5 4 3 2 1
Synthesis Lectures on Computer Architecture
Editor: Mark D. Hill, University of Wisconsin, Madison

Synthesis Lectures on Computer Architecture publishes 50- to 150-page publications on topics pertaining to the science and art of designing, analyzing, selecting, and interconnecting hardware components to create computers that meet functional, performance, and cost goals.

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, James Laudon
2007

Transactional Memory
James R. Larus, Ravi Rajwar
2007

Quantum Computing for Computer Architects
Tzvetan S. Metodi, Frederic T. Chong
2006
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency

Kunle Olukotun, Stanford University
Lance Hammond, Stanford University
James Laudon, Sun Microsystems

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #3

Morgan & Claypool Publishers
ABSTRACT
Chip multiprocessors — also called multi-core microprocessors or CMPs for short — are now the only way to build high-performance microprocessors, for a variety of reasons. Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream using conventional superscalar instruction issue techniques. In addition, one cannot simply ratchet up the clock speed on today's processors, or the power dissipation will become prohibitive in all but water-cooled systems. Compounding these problems is the simple fact that with the immense numbers of transistors available on today's microprocessor chips, it is too costly to design and debug ever-larger processors every year or two. CMPs avoid these problems by filling up a processor die with multiple, relatively simpler processor cores instead of just one huge core. The exact size of a CMP's cores can vary from very simple pipelines to moderately complex superscalar processors, but once a core has been selected the CMP's performance can easily scale across silicon process generations simply by stamping down more copies of the hard-to-design, high-speed processor core in each successive chip generation. In addition, parallel code execution, obtained by spreading multiple threads of execution across the various cores, can achieve significantly higher performance than would be possible using only a single core. While parallel threads are already common in many useful workloads, there are still important workloads that are hard to divide into parallel threads.
The low inter-processor communication latency between the cores in a CMP helps make a much wider range of applications viable candidates for parallel execution than was possible with conventional, multi-chip multiprocessors; nevertheless, limited parallelism in key applications is the main factor limiting acceptance of CMPs in some types of systems. After a discussion of the basic pros and cons of CMPs when they are compared with conventional uniprocessors, this book examines how CMPs can best be designed to handle two radically different kinds of workloads that are likely to be used with a CMP: highly parallel, throughput-sensitive applications at one end of the spectrum, and less parallel, latency-sensitive applications at the other. Throughput-sensitive applications, such as server workloads that handle many independent transactions at once, require careful balancing of all parts of a CMP that can limit throughput, such as the individual cores, on-chip cache memory, and off-chip memory interfaces. Several studies and example systems, such as the Sun Niagara, that examine the necessary tradeoffs are presented here. In contrast, latency-sensitive applications — many desktop applications fall into this category — require a focus on reducing inter-core communication latency and applying techniques to help programmers divide their programs into multiple threads as easily as possible. This book discusses many techniques that can be used in CMPs to simplify parallel programming, with an emphasis on research directions proposed at Stanford University. To illustrate the advantages possible with a CMP using a couple of solid examples, extra focus is given to thread-level speculation (TLS), a way to automatically break up nominally sequential applications into parallel threads on a CMP, and transactional memory. This model can greatly simplify manual parallel programming by using hardware — instead of conventional software locks — to enforce atomic code execution of blocks of instructions, a technique that makes parallel coding much less error-prone.

KEYWORDS
Basic Terms: chip multiprocessors (CMPs), multi-core microprocessors, microprocessor power, parallel processing, threaded execution
Application Classes: throughput-sensitive applications, server applications, latency-sensitive applications, desktop applications, SPEC benchmarks, Java applications
Technologies: thread-level speculation (TLS), JRPM virtual machine, tracer for extracting speculative threads (TEST), transactional memory, transactional coherence and consistency (TCC), transactional lock removal (TLR)
System Names: DEC Piranha, Sun Niagara, Sun Niagara 2, Stanford Hydra
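The abstract's contrast between software locks and hardware-enforced atomic blocks can be illustrated with a minimal software analogue. The book describes hardware support; the sketch below only mimics the core idea of optimistic transactional execution in Python: read a versioned snapshot, compute, and commit only if no conflicting commit intervened, retrying on conflict. The `VersionedCell` class and `atomic_add` helper are illustrative names invented here, not part of any system described in the book.

```python
import threading

class VersionedCell:
    """A shared value guarded by a version counter.

    Writers commit optimistically: if another thread committed first,
    the version check fails and the caller retries, loosely mimicking
    the conflict-detect-and-rollback behavior of transactional memory.
    """
    def __init__(self, value=0):
        self._lock = threading.Lock()  # protects only the commit step
        self._version = 0
        self._value = value

    def read(self):
        # Snapshot the value together with the version it was read at.
        with self._lock:
            return self._value, self._version

    def try_commit(self, new_value, seen_version):
        # Commit succeeds only if no one else committed since our read.
        with self._lock:
            if self._version != seen_version:
                return False  # conflict: another transaction won
            self._value = new_value
            self._version += 1
            return True

def atomic_add(cell, delta):
    """Run 'value += delta' as a transaction: read, compute, commit-or-retry."""
    while True:
        value, version = cell.read()
        if cell.try_commit(value + delta, version):
            return

cell = VersionedCell(0)
threads = [threading.Thread(
               target=lambda: [atomic_add(cell, 1) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.read()[0])  # 4000: every increment commits exactly once
```

Unlike a coarse lock held across the whole read-modify-write, the "transaction" here holds no lock while computing; conflicts are detected at commit time and resolved by re-execution, which is the same optimistic discipline the book's TLS and TCC hardware applies to whole blocks of speculative instructions.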
Contents

1. The Case for CMPs
   1.1 A New Approach: The Chip Multiprocessor (CMP)
   1.2 The Application Parallelism Landscape
   1.3 A Simple Example: Superscalar vs. CMP
       1.3.1 Simulation Results
   1.4 This Book: Beyond Basic CMPs
2. Improving Throughput
   2.1 Simple Cores and Server Applications
       2.1.1 The Need for Multithreading within Processors
       2.1.2 Maximizing the Number of Cores on the Die
       2.1.3 Providing Sufficient Cache and Memory Bandwidth
   2.2 Case Studies of Throughput-Oriented CMPs
       2.2.1 Example 1: The Piranha Server CMP
       2.2.2 Example 2: The Niagara Server CMP
       2.2.3 Example 3: The Niagara 2 Server CMP
       2.2.4 Simple Core Limitations
   2.3 General Server CMP Analysis
       2.3.1 Simulating a Large Design Space
       2.3.2 Choosing Design Datapoints
       2.3.3 Results
       2.3.4 Discussion
3. Improving Latency Automatically
   3.1 Pseudo-parallelization: "Helper" Threads
   3.2 Automated Parallelization Using Thread-Level Speculation (TLS)
   3.3 An Example TLS System: Hydra
       3.3.1 The Base Hydra Design
       3.3.2 Adding TLS to Hydra
       3.3.3 Using Feedback from Violation Statistics
       3.3.4 Performance Analysis
       3.3.5 Completely Automated TLS Support: The JRPM System
   3.4 Concluding Thoughts on Automated Parallelization
4. Improving Latency Using Manual Parallel Programming
   4.1 Using TLS Support as Transactional Memory
       4.1.1 An Example: Parallelizing Heapsort Using TLS
       4.1.2 Parallelizing SPEC2000 with TLS
   4.2 Transactional Coherence and Consistency (TCC): More Generalized Transactional Memory
       4.2.1 TCC Hardware
       4.2.2 TCC Software
       4.2.3 TCC Performance
   4.3 Mixing Transactional Memory and Conventional Shared Memory
5. A Multicore World: The Future of CMPs
Author Biography