Intel Threading Building Blocks
Table of Contents
Foreword
Note from the Lead Developer of Intel Threading Building Blocks
Preface
Assumptions This Book Makes
Contents of This Book
Conventions Used in This Book
Informal Class Declarations
Using Code Examples
How to Contact Us
Acknowledgments
Why Threading Building Blocks?
Overview
Benefits
Comparison with Raw Threads and MPI
Comparison with OpenMP
Recursive Splitting, Task Stealing, and Algorithms
Thinking Parallel
Elements of Thinking Parallel
Decomposition
Data Parallelism
Task Parallelism
Pipelining (Task and Data Parallelism Together)
Mixed Solutions
Achieving Parallelism
Scaling and Speedup
How Much Parallelism Is There in an Application?
Amdahl’s Law
Gustafson’s observations regarding Amdahl’s Law
What did they really say?
Serial versus parallel algorithms
What Is a Thread?
Programming Threads
Safety in the Presence of Concurrency
Mutual Exclusion and Locks
Correctness
Abstraction
Patterns
Intuition
Basic Algorithms
Initializing and Terminating the Library
Loop Parallelization
parallel_for
Grain size
Automatic grain size
Notes on automatic grain size
parallel_for with partitioner
parallel_reduce
Advanced example
Parallel_reduce with partitioner
Advanced Topic: Other Kinds of Iteration Spaces
Notes on blocked_range2d
parallel_scan
Parallel_scan with partitioner
Recursive Range Specifications
Splittable Concept
Model Types: Splittable Ranges
split Class
Range Concept
Model Types
blocked_range Template Class
blocked_range2d Template Class
Partitioner Concept
Model Types: Partitioners
simple_partitioner Class
auto_partitioner Class
parallel_for Template Function
parallel_reduce Template Function
parallel_scan Template Function
pre_scan_tag and final_scan_tag Classes
Summary of Loops
Advanced Algorithms
Parallel Algorithms for Streams
Cook Until Done: parallel_while
Notes on parallel_while scaling
parallel_while Template Class
Working on the Assembly Line: Pipeline
Throughput of pipeline
Nonlinear pipelines
pipeline Class
filter Class
parallel_sort
parallel_sort Template Function
Containers
concurrent_queue
Iterating over a concurrent_queue for Debugging
When Not to Use Queues
concurrent_queue Template Class
concurrent_vector
concurrent_vector Template Class
Whole Vector Operations
Concurrent Operations
Parallel Iteration
Capacity
Iterators
concurrent_hash_map
More on HashCompare
concurrent_hash_map Template Class
Whole-Table Operations
Concurrent Access
const_accessor
accessor class
Concurrent Operations: find, insert, erase
Parallel Iteration
Capacity
Iterators
Scalable Memory Allocation
Limitations
Problems in Memory Allocation
Memory Allocators
Which Library to Link into Your Application
Using the Allocator Argument to C++ STL Template Classes
Replacing malloc, new, and delete
Replace malloc, free, realloc, and calloc
Replace new and delete
Allocator Concept
Model Types
scalable_allocator Template Class
cache_aligned_allocator Template Class
aligned_space Template Class
Mutual Exclusion
When to Use Mutual Exclusion
Mutexes
Mutex Flavors
Reader-Writer Mutexes
Upgrade/Downgrade
Lock Pathologies
Deadlock
Convoying and priority inversion
Mutexes
Mutex Concept
mutex Class
spin_mutex Class
queuing_mutex Class
ReaderWriterMutex Concept
Model Types
spin_rw_mutex Class
queuing_rw_mutex Class
Atomic Operations
Why atomic Has No Constructors
Memory Consistency and Fences
atomic Template Class
Timing
tick_count Class
tick_count::interval_t Class
Task Scheduler
When Task-Based Programming Is Inappropriate
Much Better Than Raw Native Threads
Oversubscription
Fair Scheduling
High Coding Overhead
Load Imbalance
Portability
Initializing the Library Is Your Job
Example Program for Fibonacci Numbers
Task Scheduling Overview
How Task Scheduling Works
Recommended Task Recurrence Patterns
Blocking Style with Children
Continuation-Passing Style with Children
Recycling the parent as the continuation
Recycling the parent as a child
Making Best Use of the Scheduler
Recursive Chain Reaction
Continuation Passing
Scheduler bypass
Recycling
Empty tasks
Lazy copying
Task Scheduler Interfaces
task_scheduler_init Class
task Class
Task derivation
Processing of execute()
Task allocation
Explicit task destruction
Recycling tasks
Task depth
Synchronization
Task context
Task debugging
empty_task Class
task_list Class
Task Scheduler Summary
Keys to Success
Key Steps to Success
Relaxed Sequential Execution
Safe Concurrency for Methods and Libraries
Debug Versus Release
For Efficiency’s Sake
Enabling Debugging Features
The TBB_DO_ASSERT Macro
Do Not Ship Your Program Built with TBB_DO_ASSERT
The TBB_DO_THREADING_TOOLS Macro
Debug Versus Release Libraries
Mixing with Other Threading Packages
Naming Conventions
The tbb Namespace
The tbb::internal Namespace
The __TBB Prefix
Examples
The Aha! Factor
A Few Other Key Points
parallel_for Examples
ParallelAverage
Seismic
Matrix Multiply
ParallelMerge
SubstringFinder
The Game of Life
Implementation
Automaton
Automata: Implementation
Extending the Application
Futher Reading
Parallel_reduce Examples
ParallelSum
ParallelSum without Having to Specify a Grain Size
ParallelPrime
CountStrings: Using concurrent_hash_map
Switching from an STL map
Quicksort: Visualizing Task Stealing
A Better Matrix Multiply (Strassen)
Advanced Task Programming
Start a Large Task in Parallel with the Main Program
Two Mouths: Feeding Two from the Same Task in a Pipeline
Packet Processing Pipeline
Parallel Programming for an Internet Device
Local Network Router Example
Pipelined Components for the Local Network Router
Network address translation (NAT)
Application Layer Gateway (ALG)
Packet forwarding
Our example
Implementation
The Threading Building Blocks pipeline
Synchronization with the pipeline item and concurrent hash maps
Filter Classes
Class get_next_packet : public tbb::filter
Class output_packet : public tbb::filter
Class translator : public tbb::filter
Class gateway : public tbb::filter
Class forwarding : public tbb::filter
Additional reading
Memory Allocation
Replacing new and delete
Replacing malloc, calloc, realloc, and free
Game Threading Example
Threading Architecture: Physics + Rendering
Overview of Keys to Scalability
A Frame Loop
Domain Decomposition Data Structure Needs
Think Tasks, Not Threads
Load Balancing Versus Task Stealing
Synchronization Between Physics Threads
Integrating the Example into a Real Game
How to Measure Performance
Physics Interaction and Update Code
Open Dynamics Engine
Look for Hotspots
Improving on the First Solution
The Code
History and Related Projects
Libraries
Languages
Pragmas
Generic Programming
Concepts in Generic Programming
Pseudosignatures in Generic Programming
Models in Generic Programming
Caches
Costs of Time Slicing
Quick Introduction to Lambda Functions
Further Reading
Index