Copyright
Preface
1 Introduction
1.1 Motivational Example and Its Analysis
The General Case and the Computation-to-Communication Ratio
1.2 Parallelism Basics
Distributed Memory Systems
Shared Memory Systems
Considerations When Designing Parallel Programs
1.3 HPC Trends and Rankings
1.4 Additional Exercises
2 Theoretical Background
2.1 PRAM
PRAM Variants
Parallel Prefix Computation on a PRAM
Sparse Array Compaction on a PRAM
2.2 Network Topologies
2.3 Amdahl's and Gustafson's Laws
2.4 Foster's Parallel Algorithm Design Methodology
2.5 Additional Exercises
References
3 Modern Architectures
3.1 Memory Hierarchy
von Neumann Bottleneck
Cache Memory
Cache Algorithms
Optimizing Cache Accesses
Cache Coherence
False Sharing
Simultaneous Multi-threading and Prefetching
Outlook
3.2 Levels of Parallelism
Flynn's Taxonomy
SIMD Concept
Vectorization on Common Microprocessors
AoS and SoA
Outlook
3.3 Additional Exercises
References
4 C++11 Multithreading
4.1 Introduction to Multithreading (Hello World)
Distinction Between Multithreading and Multiprocessing
Spawning and Joining Threads
Our First Multithreaded Program
4.2 Handling Return Values (Fibonacci Sequence)
The Traditional Way
The Modern Way Using Promises and Futures
The Asynchronous Way
4.3 Scheduling Based on Static Distributions (Matrix Vector Multiplication)
The Sequential Program
Block Distribution of Threads
Cyclic Distribution of Threads
False Sharing
Block-Cyclic Distribution of Threads
4.4 Handling Load Imbalance (All-Pairs Distance Matrix)
Static Schedules
Dynamic Block-Cyclic Distributions
4.5 Signaling Threads with Condition Variables (Ping Pong)
Modeling a Sleeping Student
Playing Ping Pong With Condition Variables
One-Shot Synchronization Using Futures and Promises
4.6 Parallelizing Over Implicitly Enumerable Sets (Thread Pool)
Implicitly Enumerable Sets
Use Cases for a Thread Pool
A Simple Thread Pool Implementation
4.7 Additional Exercises
References
5 Advanced C++11 Multithreading
5.1 Lock-Free Programming (Atomics, Compare-and-Swap)
Atomic Counting
Non-fundamental Atomic Data Types
Atomic Parallel Max-Reduction Using Compare-and-Swap
Arbitrary Atomic Operations
The ABA Problem
5.2 Work-Sharing Thread Pool (Tree Traversal)
Use Case for a Work-Sharing Thread Pool
Implementation of Work-Sharing
5.3 Parallel Graph Search (Binary Knapsack Problem)
The Binary Knapsack Problem
Sequential Implementation
Parallel Implementation
5.4 Outlook
5.5 Additional Exercises
References
6 OpenMP
6.1 Introduction to OpenMP (Hello World)
A Brief History of OpenMP
Basics
6.2 The parallel for Directive (Basic Linear Algebra)
Vector Addition
Implicit Synchronization
Variable Sharing and Privatization
Declaring Private Variables
Initialization of Privatized Variables
Capturing Private Variables
Final Remarks on Variable Privatization/Sharing
Matrix Vector Multiplication
6.3 Basic Parallel Reductions (Nearest-Neighbor Classifier)
One-Nearest-Neighbor Classification
The MNIST Dataset of Handwritten Digits
Theoretical Aspects of All-Pairs Distance Computation
Implementation of All-Pairs Computation
Problems Emerging During Fine-Grained Parallelization
Parallel Block-Wise Reduction
Parallel Label Prediction
Performance Evaluation
6.4 Scheduling of Imbalanced Loops (Inner Products)
Load Imbalance Caused by Symmetry
Implementation of Inner Product Computation
Performance Evaluation
6.5 Advanced Reductions (Softmax Regression/AVX Reductions)
Softmax Regression Classifier Over MNIST
Implementation of the Prediction Step
Gradient Descent-Based Parameter Optimization
Implementation of the Training Step
Performance Evaluation
Custom Reduction Operator
Theoretical Background of Parallel Reduction
Declaring Custom Parallel Reductions
OpenMP Reductions Under the Hood
6.6 Task Parallelism (Tree Traversal)
Tree Traversal
Generating Tasks in a Loop
6.7 SIMD Vectorization (Vector Addition)
Data Dependencies
Vectorization-Aware Functions
6.8 Outlook
6.9 Additional Exercises
References
7 Compute Unified Device Architecture
7.1 Introduction to CUDA (Hello World)
7.2 Hardware Architecture of CUDA-Enabled GPUs
Interconnection Between Host and Device
Video Memory and Peak Bandwidth
Organization of Computational Resources
Mapping Thread Blocks Onto SMs
Mapping Threads Into Warps
7.3 Memory Access Patterns (Eigenfaces)
Computation of the Mean Celebrity Face
Computing the Centered Data Matrix
Computation of the Covariance Matrix
Computation of Eigenfaces
7.4 Memory Hierarchy (Dynamic Time Warping)
Introduction
A Linear Memory Algorithm for Sequential DTW
A Naive CUDA Port of Linear Memory DTW
Wavefront Relaxation in Shared Memory
Concurrent Scheduling and Bank Conflicts
Texture Memory and Constant Memory
7.5 Optimization Guidelines
7.6 Additional Exercises
References
8 Advanced CUDA Programming
8.1 Warp Intrinsics and Atomic Operations (Parallel Reduction)
Segmented Parallel Reduction
Global Parallel Reduction
Arbitrary Atomic Operations
Outlook
8.2 Utilizing Multiple GPUs and Streams (Newton Iteration)
Newton's Iteration
Harnessing Multiple GPUs
Interleaving Communication and Computation
Streamed Computation on Multiple GPUs
8.3 Outlook
Unified Memory
Dynamic Parallelism
Cooperative Groups
Tensor Cores
Distributed Computation on GPU Clusters
8.4 Additional Exercises
References
9 Message Passing Interface
9.1 Introduction to MPI
9.2 Basic Concepts (Hello World)
9.3 Point-to-Point Communication (Ping-Pong)
9.4 Nonblocking Communication (Ping-Pong in a Ring of Processes)
9.5 Collectives (Counting Primes)
9.6 Overlapping Computation and Communication (Jacobi Iteration)
9.7 Derived Datatypes (Matrix Multiplication With Submatrix Scattering)
9.8 Complex Communicators (Matrix Multiplication Using SUMMA)
9.9 Outlook
9.10 Additional Exercises
References
10 Unified Parallel C++
10.1 Introduction to PGAS and UPC++
10.2 Basic Concepts (Hello World)
10.3 Memory Affinity and Privatization (Vector Update)
10.4 Global Pointers and Collectives (Letter Count)
10.5 Locks (Image Histogramming)
10.6 Remote Function Invocation (Mandelbrot Sets)
10.7 Additional Exercises
References
Index