Copyright
Preface
1 Introduction
1.1 Motivational Example and Its Analysis
The General Case and the Computation-to-Communication Ratio
1.2 Parallelism Basics
Distributed Memory Systems
Shared Memory Systems
Considerations When Designing Parallel Programs
1.3 HPC Trends and Rankings
1.4 Additional Exercises
2 Theoretical Background
2.1 PRAM
PRAM Variants
Parallel Prefix Computation on a PRAM
Sparse Array Compaction on a PRAM
2.2 Network Topologies
2.3 Amdahl's and Gustafson's Laws
2.4 Foster's Parallel Algorithm Design Methodology
2.5 Additional Exercises
References
3 Modern Architectures
3.1 Memory Hierarchy
von Neumann Bottleneck
Cache Memory
Cache Algorithms
Optimizing Cache Accesses
Cache Coherence
False Sharing
Simultaneous Multi-threading and Prefetching
Outlook
3.2 Levels of Parallelism
Flynn's Taxonomy
SIMD Concept
Vectorization on Common Microprocessors
AoS and SoA
Outlook
3.3 Additional Exercises
References
4 C++11 Multithreading
4.1 Introduction to Multithreading (Hello World)
Distinction Between Multithreading and Multiprocessing
Spawning and Joining Threads
Our First Multithreaded Program
4.2 Handling Return Values (Fibonacci Sequence)
The Traditional Way
The Modern Way Using Promises and Futures
The Asynchronous Way
4.3 Scheduling Based on Static Distributions (Matrix Vector Multiplication)
The Sequential Program
Block Distribution of Threads
Cyclic Distribution of Threads
False Sharing
Block-Cyclic Distribution of Threads
4.4 Handling Load Imbalance (All-Pairs Distance Matrix)
Static Schedules
Dynamic Block-Cyclic Distributions
4.5 Signaling Threads with Condition Variables (Ping Pong)
Modeling a Sleeping Student
Playing Ping Pong With Condition Variables
One-Shot Synchronization Using Futures and Promises
4.6 Parallelizing Over Implicitly Enumerable Sets (Thread Pool)
Implicitly Enumerable Sets
Use Cases for a Thread Pool
A Simple Thread Pool Implementation
4.7 Additional Exercises
References
5 Advanced C++11 Multithreading
5.1 Lock-Free Programming (Atomics, Compare-and-Swap)
Atomic Counting
Non-fundamental Atomic Data Types
Atomic Parallel Max-Reduction Using Compare-and-Swap
Arbitrary Atomic Operations
The ABA Problem
5.2 Work-Sharing Thread Pool (Tree Traversal)
Use Case for a Work-Sharing Thread Pool
Implementation of Work-Sharing
5.3 Parallel Graph Search (Binary Knapsack Problem)
The Binary Knapsack Problem
Sequential Implementation
Parallel Implementation
5.4 Outlook
5.5 Additional Exercises
References
6 OpenMP
6.1 Introduction to OpenMP (Hello World)
A Brief History of OpenMP
Basics
6.2 The parallel for Directive (Basic Linear Algebra)
Vector Addition
Implicit Synchronization
Variable Sharing and Privatization
Declaring Private Variables
Initialization of Privatized Variables
Capturing Private Variables
Final Remarks on Variable Privatization/Sharing
Matrix Vector Multiplication
6.3 Basic Parallel Reductions (Nearest-Neighbor Classifier)
One-Nearest-Neighbor Classification
The MNIST Dataset of Handwritten Digits
Theoretical Aspects of All-Pairs Distance Computation
Implementation of All-Pairs Computation
Problems Emerging During Fine-Grained Parallelization
Parallel Block-Wise Reduction
Parallel Label Prediction
Performance Evaluation
6.4 Scheduling of Imbalanced Loops (Inner Products)
Load Imbalance Caused by Symmetry
Implementation of Inner Product Computation
Performance Evaluation
6.5 Advanced Reductions (Softmax Regression/AVX Reductions)
Softmax Regression Classifier Over MNIST
Implementation of the Prediction Step
Gradient Descent-Based Parameter Optimization
Implementation of the Training Step
Performance Evaluation
Custom Reduction Operator
Theoretical Background of Parallel Reduction
Declaring Custom Parallel Reductions
OpenMP Reductions Under the Hood
6.6 Task Parallelism (Tree Traversal)
Tree Traversal
Generating Tasks in a Loop
6.7 SIMD Vectorization (Vector Addition)
Data Dependencies
Vectorization-Aware Functions
6.8 Outlook
6.9 Additional Exercises
References
7 Compute Unified Device Architecture
7.1 Introduction to CUDA (Hello World)
7.2 Hardware Architecture of CUDA-Enabled GPUs
Interconnection Between Host and Device
Video Memory and Peak Bandwidth
Organization of Computational Resources
Mapping Thread Blocks Onto SMs
Mapping Threads Into Warps
7.3 Memory Access Patterns (Eigenfaces)
Computation of the Mean Celebrity Face
Computing the Centered Data Matrix
Computation of the Covariance Matrix
Computation of Eigenfaces
7.4 Memory Hierarchy (Dynamic Time Warping)
Introduction
A Linear Memory Algorithm for Sequential DTW
A Naive CUDA Port of Linear Memory DTW
Wavefront Relaxation in Shared Memory
Concurrent Scheduling and Bank Conflicts
Texture Memory and Constant Memory
7.5 Optimization Guidelines
7.6 Additional Exercises
References
8 Advanced CUDA Programming
8.1 Warp Intrinsics and Atomic Operations (Parallel Reduction)
Segmented Parallel Reduction
Global Parallel Reduction
Arbitrary Atomic Operations
Outlook
8.2 Utilizing Multiple GPUs and Streams (Newton Iteration)
Newton's Iteration
Harnessing Multiple GPUs
Interleaving Communication and Computation
Streamed Computation on Multiple GPUs
8.3 Outlook
Unified Memory
Dynamic Parallelism
Cooperative Groups
Tensor Cores
Distributed Computation on GPU Clusters
8.4 Additional Exercises
References
9 Message Passing Interface
9.1 Introduction to MPI
9.2 Basic Concepts (Hello World)
9.3 Point-to-Point Communication (Ping-Pong)
9.4 Nonblocking Communication (Ping-Pong in a Ring of Processes)
9.5 Collectives (Counting Primes)
9.6 Overlapping Computation and Communication (Jacobi Iteration)
9.7 Derived Datatypes (Matrix Multiplication With Submatrix Scattering)
9.8 Complex Communicators (Matrix Multiplication Using SUMMA)
9.9 Outlook
9.10 Additional Exercises
References
10 Unified Parallel C++
10.1 Introduction to PGAS and UPC++
10.2 Basic Concepts (Hello World)
10.3 Memory Affinity and Privatization (Vector Update)
10.4 Global Pointers and Collectives (Letter Count)
10.5 Locks (Image Histogramming)
10.6 Remote Function Invocation (Mandelbrot Sets)
10.7 Additional Exercises
References
Index