
Parallel Programming Concepts and Practice

Cover
Copyright
Preface
1 Introduction
1.1 Motivational Example and Its Analysis
The General Case and the Computation-to-Communication Ratio
1.2 Parallelism Basics
Distributed Memory Systems
Shared Memory Systems
Considerations When Designing Parallel Programs
1.3 HPC Trends and Rankings
1.4 Additional Exercises
2 Theoretical Background
2.1 PRAM
PRAM Variants
Parallel Prefix Computation on a PRAM
Sparse Array Compaction on a PRAM
2.2 Network Topologies
2.3 Amdahl's and Gustafson's Laws
2.4 Foster's Parallel Algorithm Design Methodology
2.5 Additional Exercises
References
3 Modern Architectures
3.1 Memory Hierarchy
von Neumann Bottleneck
Cache Memory
Cache Algorithms
Optimizing Cache Accesses
Cache Coherence
False Sharing
Simultaneous Multi-threading and Prefetching
Outlook
3.2 Levels of Parallelism
Flynn's Taxonomy
SIMD Concept
Vectorization on Common Microprocessors
AoS and SoA
Outlook
3.3 Additional Exercises
References
4 C++11 Multithreading
4.1 Introduction to Multithreading (Hello World)
Distinction Between Multithreading and Multiprocessing
Spawning and Joining Threads
Our First Multithreaded Program
4.2 Handling Return Values (Fibonacci Sequence)
The Traditional Way
The Modern Way Using Promises and Futures
The Asynchronous Way
4.3 Scheduling Based on Static Distributions (Matrix Vector Multiplication)
The Sequential Program
Block Distribution of Threads
Cyclic Distribution of Threads
False Sharing
Block-Cyclic Distribution of Threads
4.4 Handling Load Imbalance (All-Pairs Distance Matrix)
Static Schedules
Dynamic Block-Cyclic Distributions
4.5 Signaling Threads with Condition Variables (Ping Pong)
Modeling a Sleeping Student
Playing Ping Pong With Condition Variables
One-Shot Synchronization Using Futures and Promises
4.6 Parallelizing Over Implicitly Enumerable Sets (Thread Pool)
Implicitly Enumerable Sets
Use Cases for a Thread Pool
A Simple Thread Pool Implementation
4.7 Additional Exercises
References
5 Advanced C++11 Multithreading
5.1 Lock-Free Programming (Atomics, Compare-and-Swap)
Atomic Counting
Non-fundamental Atomic Data Types
Atomic Parallel Max-Reduction Using Compare-and-Swap
Arbitrary Atomic Operations
The ABA Problem
5.2 Work-Sharing Thread Pool (Tree Traversal)
Use Case for a Work-Sharing Thread Pool
Implementation of Work-Sharing
5.3 Parallel Graph Search (Binary Knapsack Problem)
The Binary Knapsack Problem
Sequential Implementation
Parallel Implementation
5.4 Outlook
5.5 Additional Exercises
References
6 OpenMP
6.1 Introduction to OpenMP (Hello World)
A Brief History of OpenMP
Basics
6.2 The parallel for Directive (Basic Linear Algebra)
Vector Addition
Implicit Synchronization
Variable Sharing and Privatization
Declaring Private Variables
Initialization of Privatized Variables
Capturing Private Variables
Final Remarks on Variable Privatization/Sharing
Matrix Vector Multiplication
6.3 Basic Parallel Reductions (Nearest-Neighbor Classifier)
One-Nearest-Neighbor Classification
The MNIST Dataset of Handwritten Digits
Theoretical Aspects of All-Pairs Distance Computation
Implementation of All-Pairs Computation
Problems Emerging During Fine-Grained Parallelization
Parallel Block-Wise Reduction
Parallel Label Prediction
Performance Evaluation
6.4 Scheduling of Imbalanced Loops (Inner Products)
Load Imbalance Caused by Symmetry
Implementation of Inner Product Computation
Performance Evaluation
6.5 Advanced Reductions (Softmax Regression/AVX Reductions)
Softmax Regression Classifier Over MNIST
Implementation of the Prediction Step
Gradient Descent-Based Parameter Optimization
Implementation of the Training Step
Performance Evaluation
Custom Reduction Operator
Theoretical Background of Parallel Reduction
Declaring Custom Parallel Reductions
OpenMP Reductions Under the Hood
6.6 Task Parallelism (Tree Traversal)
Tree Traversal
Generating Tasks in a Loop
6.7 SIMD Vectorization (Vector Addition)
Data Dependencies
Vectorization-Aware Functions
6.8 Outlook
6.9 Additional Exercises
References
7 Compute Unified Device Architecture
7.1 Introduction to CUDA (Hello World)
7.2 Hardware Architecture of CUDA-Enabled GPUs
Interconnection Between Host and Device
Video Memory and Peak Bandwidth
Organization of Computational Resources
Mapping Thread Blocks Onto SMs
Mapping Threads Into Warps
7.3 Memory Access Patterns (Eigenfaces)
Computation of the Mean Celebrity Face
Computing the Centered Data Matrix
Computation of the Covariance Matrix
Computation of Eigenfaces
7.4 Memory Hierarchy (Dynamic Time Warping)
Introduction
A Linear Memory Algorithm for Sequential DTW
A Naive CUDA Port of Linear Memory DTW
Wavefront Relaxation in Shared Memory
Concurrent Scheduling and Bank Conflicts
Texture Memory and Constant Memory
7.5 Optimization Guidelines
7.6 Additional Exercises
References
8 Advanced CUDA Programming
8.1 Warp Intrinsics and Atomic Operations (Parallel Reduction)
Segmented Parallel Reduction
Global Parallel Reduction
Arbitrary Atomic Operations
Outlook
8.2 Utilizing Multiple GPUs and Streams (Newton Iteration)
Newton's Iteration
Harnessing Multiple GPUs
Interleaving Communication and Computation
Streamed Computation on Multiple GPUs
8.3 Outlook
Unified Memory
Dynamic Parallelism
Cooperative Groups
Tensor Cores
Distributed Computation on GPU Clusters
8.4 Additional Exercises
References
9 Message Passing Interface
9.1 Introduction to MPI
9.2 Basic Concepts (Hello World)
9.3 Point-to-Point Communication (Ping-Pong)
9.4 Nonblocking Communication (Ping-Pong in a Ring of Processes)
9.5 Collectives (Counting Primes)
9.6 Overlapping Computation and Communication (Jacobi Iteration)
9.7 Derived Datatypes (Matrix Multiplication With Submatrix Scattering)
9.8 Complex Communicators (Matrix Multiplication Using SUMMA)
9.9 Outlook
9.10 Additional Exercises
References
10 Unified Parallel C++
10.1 Introduction to PGAS and UPC++
10.2 Basic Concepts (Hello World)
10.3 Memory Affinity and Privatization (Vector Update)
10.4 Global Pointers and Collectives (Letter Count)
10.5 Locks (Image Histogramming)
10.6 Remote Function Invocation (Mandelbrot Sets)
10.7 Additional Exercises
References
Index
Back Cover
Parallel Programming
Parallel Programming Concepts and Practice

Bertil Schmidt, Institut für Informatik, Staudingerweg 9, 55128 Mainz, Germany
Jorge González-Domínguez, Computer Architecture Group, University of A Coruña, Edificio área científica (Office 3.08), Campus de Elviña, 15071 A Coruña, Spain
Christian Hundt, Institut für Informatik, Staudingerweg 9, 55128 Mainz, Germany
Moritz Schlarb, Data Center, Johannes Gutenberg-University Mainz, Anselm-Franz-von-Bentzel-Weg 12, 55128 Mainz, Germany
Copyright

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2018 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-849890-3

For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher
Acquisition Editor: Steve Merken
Developmental Editor: Nate McFadden
Production Project Manager: Sreejith Viswanathan
Designer: Christian J. Bilbow
Typeset by VTeX
Preface

Parallelism abounds. Nowadays, any modern CPU contains at least two cores, whereas some CPUs feature more than 50 processing units. An even higher degree of parallelism is available on larger systems containing multiple CPUs such as server nodes, clusters, and supercomputers. Thus, the ability to program these types of systems efficiently and effectively is an essential aspiration for scientists, engineers, and programmers. The subject of this book is a comprehensive introduction to the area of parallel programming that addresses this need. Our book teaches practical parallel programming for shared memory and distributed memory architectures based on the C++11 threading API, Open Multiprocessing (OpenMP), Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and Unified Parallel C++ (UPC++), as well as the necessary theoretical background. We have included a large number of programming examples based on the recent C++11 and C++14 dialects of the C++ programming language.

This book targets participants of "Parallel Programming" or "High Performance Computing" courses, which are taught at most universities at senior undergraduate level or graduate level in computer science or computer engineering. Moreover, it serves as suitable literature for undergraduates in other disciplines with a computer science minor or professionals from related fields such as research scientists, data analysts, or R&D engineers. Prerequisites for understanding the contents of our book include some experience with writing sequential code in C/C++ and basic mathematical knowledge.

In good tradition with the historic symbiosis of High Performance Computing and natural science, we introduce parallel concepts based on real-life applications ranging from basic linear algebra routines over machine learning algorithms and physical simulations to traditional algorithms from computer science. The writing of correct yet efficient code is a key skill for every programmer. Hence, we focus on the actual implementation and performance evaluation of algorithms. Nevertheless, the theoretical properties of algorithms are discussed in depth, too. Each chapter features a collection of additional programming exercises that can be solved within a web framework that is distributed with this book. The System for Automated Code Evaluation (SAUCE) provides a web-based testing environment for the submission of solutions and their subsequent evaluation in a classroom setting: the only prerequisite is an HTML5-compatible web browser allowing for the embedding of interactive programming exercises in lectures. SAUCE is distributed as a Docker image and can be downloaded at https://parallelprogrammingbook.org

This website serves as a hub for related content such as installation instructions, a list of errata, and supplementary material (such as lecture slides and solutions to selected exercises for instructors).

If you are a student or professional who aims to learn a certain programming technique, we advise you to initially read the first three chapters on the fundamentals of parallel programming, theoretical models, and hardware architectures. Subsequently, you can dive into one of the introductory chapters on C++11 Multithreading, OpenMP, CUDA, or MPI, which are mostly self-contained. The chapters on Advanced C++11 Multithreading, Advanced CUDA, and UPC++ build upon the techniques of their preceding chapter and thus should not be read in isolation.
If you are a lecturer, we propose a curriculum consisting of 14 lectures mainly covering applications from the introductory chapters. You could start with a lecture discussing the fundamentals from the first chapter, including parallel summation using a hypercube and its analysis, the definition of basic measures such as speedup, parallelization efficiency, and cost, and a discussion of ranking metrics. The second lecture could cover an introduction to PRAM, network topologies, and weak and strong scaling. You can spend more time on PRAM if you aim to later discuss CUDA in more detail, or emphasize hardware architectures if you focus on CPUs. Two to three lectures could be spent on teaching the basics of the C++11 threading API, CUDA, and MPI, respectively. OpenMP can be discussed within a span of one to two lectures. The remaining lectures can be used to discuss the content of the advanced chapters on multithreading, CUDA, or the PGAS-based UPC++ language.

An alternative approach is splitting the content into two courses with a focus on pair-programming within the lecture. You could start with a course on CPU-based parallel programming covering selected topics from the first three chapters. C++11 threads, OpenMP, and MPI could then be taught in full detail. The second course would focus on advanced parallel approaches covering extensive CUDA programming in combination with (CUDA-aware) MPI and/or the PGAS-based UPC++.

We wish you a great time with the book. Be creative and investigate the code! Finally, we would be happy to hear any feedback from you so that we can improve any of our provided material.
Acknowledgments

This book would not have been possible without the contributions of many people.

Initially, we would like to thank the anonymous and few non-anonymous reviewers who commented on our book proposal and the final draft: Eduardo Cesar Galobardes, Ahmad Al-Khasawneh, and Mohammad Olaimat.

Moreover, we would like to thank our colleagues who thoroughly peer-reviewed the chapters and provided essential feedback: André Müller for his valuable advice on C++ programming, Robin Kobus for being a tough code reviewer, Felix Kallenborn for his steady proofreading sessions, Daniel Jünger for constantly complaining about the CUDA chapter, as well as Stefan Endler and Elmar Schömer for their suggestions.

Additionally, we would like to thank the staff of Morgan Kaufmann and Elsevier who coordinated the making of this book. In particular, we would like to mention Nate McFadden.

Finally, we would like to thank our spouses and children for their ongoing support and patience during the countless hours we could not spend with them.