Chapter 1. A Pattern Language for Parallel Programming
1.1. INTRODUCTION
1.2. PARALLEL PROGRAMMING
1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Figure 1.1. Overview of the pattern language
Chapter 2. Background and Jargon of Parallel Computing
2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.2.1. Flynn's Taxonomy
Figure 2.1. The Single Instruction, Single Data (SISD) architecture
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
2.2.2. A Further Breakdown of MIMD
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Figure 2.6. The distributed-memory architecture
2.2.3. Summary
2.3. PARALLEL PROGRAMMING ENVIRONMENTS
Table 2.1. Some Parallel Programming Environments from the Mid-1990s
2.4. THE JARGON OF PARALLEL COMPUTING
2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth
2.6.2. Overlapping Communication and Computation and Latency Hiding
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
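The caption above describes latency hiding by overlapping communication with computation. The following minimal C/MPI sketch (not from the book) illustrates the idea: post a nonblocking receive, do independent work while the reply is in flight, then wait. The helpers do_independent_work() and use_reply() are hypothetical placeholders.

```c
#include <mpi.h>

void do_independent_work(void);            /* hypothetical: work not needing the reply */
void use_reply(double *reply, int n);      /* hypothetical: work that needs the reply  */

void exchange_with_overlap(double *request_buf, double *reply_buf, int n,
                           int partner, MPI_Comm comm)
{
    MPI_Request recv_req;

    /* Post the receive early so the reply can arrive at any time. */
    MPI_Irecv(reply_buf, n, MPI_DOUBLE, partner, 0, comm, &recv_req);

    /* Send the request to the partner UE. */
    MPI_Send(request_buf, n, MPI_DOUBLE, partner, 0, comm);

    /* Overlap: compute on data that does not depend on the reply. */
    do_independent_work();

    /* Only now block until the reply has actually arrived. */
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    use_reply(reply_buf, n);
}
```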
2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
3.1. ABOUT THE DESIGN SPACE
Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language
3.1.1. Overview
3.1.2. Using the Decomposition Patterns
3.1.3. Background for Examples
Medical imaging
Linear algebra
Molecular dynamics
Figure 3.2. Pseudocode for the molecular dynamics example
3.2. THE TASK DECOMPOSITION PATTERN
Problem
Context
Forces
Solution
Examples
Medical imaging
Matrix multiplication
Molecular dynamics
Figure 3.3. Pseudocode for the molecular dynamics example
Known uses
3.3. THE DATA DECOMPOSITION PATTERN
Problem
Context
Forces
Solution
Examples
Medical imaging
Matrix multiplication
Molecular dynamics
Known uses
3.4. THE GROUP TASKS PATTERN
Problem
Context
Solution
Examples
Molecular dynamics
Matrix multiplication
3.5. THE ORDER TASKS PATTERN
Problem
Context
Solution
Examples
Molecular dynamics
Figure 3.4. Ordering of tasks in molecular dynamics problem
3.6. THE DATA SHARING PATTERN
Problem
Context
Forces
Solution
Examples
Molecular dynamics
Figure 3.5. Data sharing in molecular dynamics. We distinguish between sharing for reads, read-writes, and accumulations.
3.7. THE DESIGN EVALUATION PATTERN
Problem
Context
Forces
Solution
Suitability for target platform
Design quality
Preparation for next phase
3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
4.1. INTRODUCTION
Figure 4.1. Overview of the Algorithm Structure design space and its place in the pattern language
4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
4.2.1. Target Platform
4.2.2. Major Organizing Principle
4.2.3. The Algorithm Structure Decision Tree
Figure 4.2. Decision tree for the Algorithm Structure design space
Organize By Tasks
Organize By Data Decomposition
Organize By Flow of Data
4.2.4. Re-evaluation
4.3. EXAMPLES
4.3.1. Medical Imaging
4.3.2. Molecular Dynamics
4.4. THE TASK PARALLELISM PATTERN
Problem
Context
Forces
Solution
Tasks
Dependencies
Schedule
Figure 4.3. Good versus poor load balance
Program structure
Common idioms
Examples
Image construction
Molecular dynamics
Figure 4.4. Pseudocode for the nonbonded computation in a typical molecular dynamics code
Known uses
4.5. THE DIVIDE AND CONQUER PATTERN
Problem
Context
Figure 4.5. The divide-and-conquer strategy
Forces
Figure 4.6. Sequential pseudocode for the divide-and-conquer algorithm
Solution
Figure 4.7. Parallelizing the divide-and-conquer strategy. Each dashed-line box represents a task.
Mapping tasks to UEs and PEs
Communication costs
Dealing with dependencies
Other optimizations
Examples
Mergesort
Matrix diagonalization
Known uses
Related Patterns
4.6. THE GEOMETRIC DECOMPOSITION PATTERN
Problem
Context
Example: mesh-computation program
Figure 4.8. Data dependencies in the heat-equation problem. Solid boxes indicate the element being updated; shaded boxes indicate the elements containing needed data.
Example: matrix-multiplication program
Figure 4.9. Data dependencies in the matrix-multiplication problem. Solid boxes indicate the "chunk" being updated (C); shaded boxes indicate the chunks of A (row) and B (column) required to update C at each of the two steps.
Solution
Data decomposition
Figure 4.10. A data distribution with ghost boundaries. Shaded cells are ghost copies; arrows point from primary copies to corresponding secondary copies.
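Figure 4.10's caption describes ghost (shadow) copies along chunk boundaries. As a minimal illustration, and not the book's code, the following C/MPI sketch exchanges one-cell-wide ghost boundaries for a 1D decomposition; the layout u[0..n+1], with u[0] and u[n+1] as ghost cells, is an assumption.

```c
#include <mpi.h>

/* u[1..n] are cells owned by this UE; u[0] and u[n+1] are ghost copies of the
   neighbors' boundary cells. Passing MPI_PROC_NULL as a neighbor makes the
   ends of the global domain fall out naturally. */
void exchange_ghosts(double *u, int n, int left, int right, MPI_Comm comm)
{
    /* Send rightmost owned cell right; receive left ghost from the left. */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send leftmost owned cell left; receive right ghost from the right. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
}
```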
The exchange operation
The update operation
Data distribution and task scheduling
Program structure
Examples
Mesh computation
Figure 4.11. Sequential heat-diffusion program
OpenMP solution
Figure 4.12. Parallel heat-diffusion program using OpenMP
MPI solution
Figure 4.13. Parallel heat-diffusion program using OpenMP. This version has less thread-management overhead.
Figure 4.14. Parallel heat-diffusion program using MPI (continued in Fig. 4.15)
Figure 4.15. Parallel heat-diffusion program using MPI (continued from Fig. 4.14)
Figure 4.16. Parallel heat-diffusion program using MPI with overlapping communication/computation (continued from Fig. 4.14)
Figure 4.17. Sequential matrix multiplication
Matrix multiplication
Figure 4.18. Sequential matrix multiplication, revised. We do not show the parts of the program that are not changed from the program in Fig. 4.17.
OpenMP solution
MPI solution
Figure 4.19. Parallel matrix multiplication with message passing (continued in Fig. 4.20)
Figure 4.20. Parallel matrix multiplication with message-passing (continued from Fig. 4.19)
Known uses
Related Patterns
4.7. THE RECURSIVE DATA PATTERN
Problem
Context
Figure 4.21. Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.
Forces
Solution
Data decomposition
Structure
Synchronization
Examples
Partial sums of a linked list
Figure 4.23. Steps in finding partial sums of a list. Straight arrows represent links between elements; curved arrows indicate additions.
Known uses
Figure 4.22. Pseudocode for finding partial sums of a list
Related Patterns
4.8. THE PIPELINE PATTERN
Problem
Context
Forces
Solution
Figure 4.24. Operation of a pipeline. Each pipeline stage i computes the i-th step of the computation.
Figure 4.25. Example pipelines
Defining the stages of the pipeline
Figure 4.26. Basic structure of a pipeline stage
Structuring the computation
Representing the dataflow among pipeline elements
Handling errors
Processor allocation and task scheduling
Throughput and latency
Examples
Fourier-transform computations
Java pipeline framework
Known uses
Figure 4.27. Base class for pipeline stages
Figure 4.28. Base class for linear pipeline
Related Patterns
Figure 4.29. Pipelined sort (main class)
Figure 4.30. Pipelined sort (sorting stage)
4.9. THE EVENT-BASED COORDINATION PATTERN
Problem
Context
Figure 4.31. Discrete-event simulation of a car-wash facility. Arrows indicate the flow of events.
Forces
Solution
Defining the tasks
Figure 4.32. Basic structure of a task in the Event-Based Coordination pattern
Representing event flow
Enforcing event ordering
Figure 4.33. Event-based communication among three tasks. Task 2 generates its event in response to the event received from task 1. The two events sent to task 3 can arrive in either order.
Avoiding deadlocks
Scheduling and processor allocation
Efficient communication of events
Examples
Known uses
Related Patterns
Chapter 5. The Supporting Structures Design Space
5.1. INTRODUCTION
Figure 5.1. Overview of the Supporting Structures design space and its place in the pattern language
5.1.1. Program Structuring Patterns
5.1.2. Patterns Representing Data Structures
5.2. FORCES
5.3. CHOOSING THE PATTERNS
Table 5.1. Relationship between Supporting Structures patterns and Algorithm Structure patterns. The number of stars (ranging from zero to four) is an indication of the likelihood that the given Supporting Structures pattern is useful in the implementation of the Algorithm Structure pattern.
Table 5.2. Relationship between Supporting Structures patterns and programming environments. The number of stars (ranging from zero to four) is an indication of the likelihood that the given Supporting Structures pattern is useful in the programming environment.
5.4. THE SPMD PATTERN
Problem
Context
Forces
Solution
Discussion
Examples
Numerical integration
Figure 5.2. Sequential program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx = π
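For orientation, a generic sequential trapezoid-rule sketch in C follows; it assumes the integrand 4/(1 + x²) on [0, 1] (whose exact value is π) and is not the program of Fig. 5.2.

```c
#include <stdio.h>

static double f(double x) { return 4.0 / (1.0 + x * x); }   /* assumed integrand */

int main(void)
{
    const int n = 1000000;                    /* number of trapezoids */
    const double a = 0.0, b = 1.0;
    const double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));         /* endpoints get half weight */

    for (int i = 1; i < n; i++)
        sum += f(a + i * h);

    printf("integral ~= %.12f\n", h * sum);   /* should be close to pi */
    return 0;
}
```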
Figure 5.3. MPI program to carry out a trapezoid rule integration in parallel by assigning one block of loop iterations to each UE and performing a reduction
Figure 5.4. Index calculation that more evenly distributes the work when the number of steps is not evenly divisible by the number of UEs. The idea is to split up the remaining tasks (rem) among the first rem UEs.
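The index calculation described in Fig. 5.4 can be sketched as follows (a minimal C version with illustrative names, not the book's code): the rem = N % num_UEs leftover iterations go, one each, to the first rem UEs.

```c
/* Compute the half-open iteration range [ifirst, ilast) owned by UE my_id. */
void block_range(int N, int num_UEs, int my_id, int *ifirst, int *ilast)
{
    int base = N / num_UEs;                 /* every UE gets at least this many */
    int rem  = N % num_UEs;                 /* leftovers for the first rem UEs  */

    if (my_id < rem) {
        *ifirst = my_id * (base + 1);
        *ilast  = *ifirst + base + 1;
    } else {
        *ifirst = rem * (base + 1) + (my_id - rem) * base;
        *ilast  = *ifirst + base;
    }
}
```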
Figure 5.5. MPI program to carry out a trapezoid rule integration in parallel using a simple loop-splitting algorithm with cyclic distribution of iterations and a reduction
Figure 5.6. OpenMP program to carry out a trapezoid rule integration in parallel using the same SPMD algorithm used in Fig. 5.5
Molecular dynamics
Figure 5.7. Pseudocode for molecular dynamics example. This code is very similar to the version discussed earlier, but a few extra details have been included. To support more detailed pseudocode examples, the call to the function that initializes the force arrays has been made explicit. Also, the fact that the neighbor list is only occasionally updated is made explicit.
Figure 5.8. Pseudocode for an SPMD molecular dynamics program using MPI
Figure 5.9. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown in Fig. 4.4. The only major change is a new array of integers holding the indices for the atoms assigned to this UE, local_atoms. We've also assumed that the neighbor list has been generated to hold only those atoms assigned to this UE. For the sake of allocating space for these arrays, we have added a parameter LN which is the largest number of atoms that can be assigned to a single UE.
Figure 5.10. Pseudocode for the neighbor list computation. For each atom i, the indices for atoms within a sphere of radius cutoff are added to the neighbor list for atom i. Notice that the second loop (over j) only considers atoms with indices greater than i. This accounts for the symmetry in the force computation due to Newton's third law of motion, that is, that the force between atom i and atom j is just the negative of the force between atom j and atom i.
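The neighbor-list idea in the caption of Fig. 5.10 can be sketched in C as follows (data layout and names are assumptions, not the book's code): each pair is stored once, for j > i, because Newton's third law supplies the opposite force.

```c
/* Record every pair (i, j), j > i, whose separation is within cutoff. */
void build_neighbor_list(int N, const double x[][3], double cutoff,
                         int *npairs, int (*pairs)[2])
{
    double cut2 = cutoff * cutoff;
    *npairs = 0;
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {      /* j > i: each pair appears once */
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            if (dx * dx + dy * dy + dz * dz < cut2) {
                pairs[*npairs][0] = i;
                pairs[*npairs][1] = j;
                (*npairs)++;
            }
        }
    }
}
```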
Figure 5.11. Pseudocode for a parallel molecular dynamics program using OpenMP
Mandelbrot set computation
Figure 5.12. Pseudocode for a sequential version of the Mandelbrot set generation program
Known uses
Figure 5.13. Pseudocode for a parallel MPI version of the Mandelbrot set generation program
Related Patterns
5.5. THE MASTER/WORKER PATTERN
Problem
Context
Forces
Solution
Figure 5.14. The two elements of the Master/Worker pattern are the master and the worker. There is only one master, but there can be one or more workers. Logically, the master sets up the calculation and then manages a bag of tasks. Each worker grabs a task from the bag, carries out the work, and then goes back to the bag, repeating until the termination condition is met.
Discussion
Detecting completion
Variations
Examples
Generic solutions
Figure 5.15. Master process for a master/worker program. This assumes a shared address space so the task and results queues are visible to all UEs. In this simple version, the master initializes the queue, launches the workers, and then waits for the workers to finish (that is, the ForkJoin command launches the workers and then waits for them to finish before returning). At that point, results are consumed and the computation completes.
Figure 5.16. Worker process for a master/worker program. We assume a shared address space thereby making task_queue and global_results available to the master and all workers. A worker loops over the task_queue and exits when the end of the queue is encountered.
Figure 5.17. Instantiating and initializing a pooled executor
Mandelbrot set generation
Figure 5.18. Pseudocode for a sequential version of the Mandelbrot set generation program
Figure 5.19. Master process for a master/worker parallel version of the Mandelbrot set generation program
Figure 5.20. Worker process for a master/worker parallel version of the Mandelbrot set generation program. We assume a shared address space thereby making task_queue, global_results, and ranges available to the master and the workers.
Known uses
Related Patterns
5.6. THE LOOP PARALLELISM PATTERN
Problem
Context
Forces
Solution
Figure 5.21. Program fragment showing merging loops to increase the amount of work per iteration
Figure 5.22. Program fragment showing coalescing nested loops to produce a single loop with a larger number of iterations
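As a minimal sketch of the coalescing idea in Fig. 5.22 (not the book's fragment), the two indices of a nested loop are recovered from one flattened index so that a single parallel loop has N*M iterations to distribute; the kernel work() is a hypothetical placeholder.

```c
#include <omp.h>

extern double work(int i, int j);    /* hypothetical per-element computation */

void coalesced(int N, int M, double **A)
{
    #pragma omp parallel for
    for (int k = 0; k < N * M; k++) {
        int i = k / M;               /* recover the original row index    */
        int j = k % M;               /* recover the original column index */
        A[i][j] = work(i, j);
    }
}
```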
Performance considerations
Figure 5.23. Program fragment showing an example of false sharing. The small array A is held in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take ownership of the cache line back from the other UEs. This back-and-forth movement of the cache lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
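The false-sharing problem and fix described in the caption of Fig. 5.23 can be sketched in C/OpenMP as follows (sizes and names are illustrative, not the book's): in the first version every update of A[id] from a different UE fights over the same cache line; the second accumulates into a private temporary and writes A[id] once.

```c
#include <omp.h>

#define NUM_THREADS 4
extern double term(int i);                   /* hypothetical per-iteration work */

void sum_bad(double A[NUM_THREADS], int n)
{
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        A[id] = 0.0;
        for (int i = id; i < n; i += NUM_THREADS)
            A[id] += term(i);                /* repeated writes: cache line ping-pongs */
    }
}

void sum_good(double A[NUM_THREADS], int n)
{
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        double tmp = 0.0;                    /* private temporary, no false sharing */
        for (int i = id; i < n; i += NUM_THREADS)
            tmp += term(i);
        A[id] = tmp;                         /* one write per UE at the end */
    }
}
```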
Examples
Numerical integration
Figure 5.24. Sequential program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx = π
Molecular dynamics
Figure 5.25. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown previously in Fig. 4.4.
Mandelbrot set computation
Figure 5.26. Pseudocode for a sequential version of the Mandelbrot set generation program
Mesh computation
Figure 5.27. Parallel heat-diffusion program using OpenMP. This program is described in the Examples section of the Geometric Decomposition pattern.
Figure 5.28. Parallel heat-diffusion program using OpenMP, with reduced thread management overhead and memory management more appropriate for NUMA computers
Known uses
Related Patterns
5.7. THE FORK/JOIN PATTERN
Problem
Context
Forces
Solution
Direct task/UE mapping
Indirect task/UE mapping
Examples
Mergesort using direct mapping
Figure 5.29. Parallel mergesort where each task corresponds to a thread
Figure 5.30. Instantiating FJTaskRunnerGroup and invoking the master task
Mergesort using indirect mapping
Known uses
Figure 5.31. Mergesort using the FJTask framework
Related Patterns
5.8. THE SHARED DATA PATTERN
Problem
Context
Forces
Solution
Be sure this pattern is needed
Define an abstract data type
Implement an appropriate concurrency-control protocol
Figure 5.32. Typical use of read/write locks. These locks are defined in the java.util.concurrent.locks package. Putting the unlock in the finally block ensures that the lock will be unlocked regardless of how the try block is exited (normally or with an exception) and is a standard idiom in Java programs that use locks rather than synchronized blocks.
Figure 5.33. Example of nested locking using synchronized blocks with dummy objects lockA and lockB
Review other considerations
Examples
Shared queues
Genetic algorithm for nonlinear optimization
Figure 5.34. Pseudocode for the population shuffle loop from the genetic algorithm program GAFORT
Figure 5.35. Pseudocode for an ineffective approach to parallelizing the population shuffle in the genetic algorithm program GAFORT
Known uses
Figure 5.36. Pseudocode for a parallelized loop to carry out the population shuffle in the genetic algorithm program GAFORT. This version of the loop uses a separate lock for each chromosome and runs effectively in parallel.
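The one-lock-per-chromosome idea described in Fig. 5.36 can be sketched with OpenMP locks as follows (a simplified stand-in, not the GAFORT code; the partner array and a single fitness value per chromosome are assumptions). Acquiring the two locks in ascending index order avoids deadlock.

```c
#include <omp.h>

void shuffle(int npop, double *fitness, omp_lock_t *lock, const int *partner)
{
    #pragma omp parallel for
    for (int i = 0; i < npop; i++) {
        int j = partner[i];                  /* chromosome to swap with */
        if (i == j) continue;
        int lo = i < j ? i : j;
        int hi = i < j ? j : i;

        omp_set_lock(&lock[lo]);             /* fixed order prevents deadlock */
        omp_set_lock(&lock[hi]);

        double tmp = fitness[i];             /* swap the (simplified) payload */
        fitness[i] = fitness[j];
        fitness[j] = tmp;

        omp_unset_lock(&lock[hi]);
        omp_unset_lock(&lock[lo]);
    }
}
```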
Related Patterns
5.9. THE SHARED QUEUE PATTERN
Problem
Context
Forces
Solution
The abstract data type (ADT)
Queue with "one at a time" execution
Figure 5.37. Queue that ensures that at most one thread can access the data structure at one time. If the queue is empty, null is immediately returned.
Figure 5.38. Queue that ensures at most one thread can access the data structure at one time. Unlike the first shared queue example, if the queue is empty, the thread waits. When used in a master/worker algorithm, a poison pill would be required to signal termination to a thread.
Concurrency-control protocols for noninterfering operations
Figure 5.39. Shared queue that takes advantage of the fact that put and take are noninterfering and uses separate locks so they can proceed concurrently
Concurrency-control protocols using nested locks
Figure 5.40. Blocking queue with multiple locks to allow concurrent put and take on a nonempty queue
Distributed shared queues
Figure 5.41. Nonblocking shared queue with takeLast operation
Figure 5.42. Abstract base class for tasks
Figure 5.43. Class defining behavior of threads in the thread pool (continued in Fig. 5.44 and Fig. 5.45)
Figure 5.44. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and continued in Fig. 5.45)
Examples
Computing Fibonacci numbers
Figure 5.45. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and Fig. 5.44)
Figure 5.46. The TaskRunnerGroup class. This class initializes and manages the threads in the thread pool.
Related Patterns
Figure 5.47. Program to compute Fibonacci numbers (continued in Fig. 5.48)
Figure 5.48. Program to compute Fibonacci numbers (continued from Fig. 5.47)
5.10. THE DISTRIBUTED ARRAY PATTERN
Problem
Context
Forces
Solution
Overview
Array distributions
Figure 5.49. Original square matrix A
Figure 5.50. 1D distribution of A onto four UEs
Figure 5.51. 2D distribution of A onto four UEs
Figure 5.52. 1D block-cyclic distribution of A onto four UEs
Figure 5.53. 2D block-cyclic distribution of A onto four UEs, part 1: Decomposing A
Figure 5.54. 2D block-cyclic distribution of A onto four UEs, part 2: Assigning submatrices to UEs
Figure 5.55. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). LA_{l,m} is the block with block indices (l, m). Each element is labeled both with its original global indices (a_{i,j}) and its indices within block LA_{l,m} (l_{x,y}).
Figure 5.56. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). Each element is labeled both with its original global indices a_{i,j} and its local indices [x', y']. Local indices are with respect to the contiguous matrix used to store all blocks assigned to this UE.
Choosing a distribution
Mapping indices
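As a minimal sketch of global-to-local index mapping (conventions are assumptions, not the book's), the owner UE and local offset of global index i for two common 1D distributions of N elements over P UEs are:

```c
/* block:  i lives on UE i / chunk at local offset i % chunk, chunk = ceil(N/P)
   cyclic: i lives on UE i % P     at local offset i / P                       */
typedef struct { int ue; int local; } owner_t;

owner_t block_owner(int i, int N, int P)
{
    int chunk = (N + P - 1) / P;             /* ceiling(N / P) elements per UE */
    owner_t o = { i / chunk, i % chunk };
    return o;
}

owner_t cyclic_owner(int i, int P)
{
    owner_t o = { i % P, i / P };
    return o;
}
```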
Aligning computation with locality
Examples
Transposing a matrix stored as column blocks
Figure 5.57. Matrix A and its transpose, in terms of submatrices, distributed among four UEs
Figure 5.58. Code to transpose a matrix (continued in Fig. 5.59)
Figure 5.59. Code to transpose a matrix (continued from Fig. 5.58)
Known uses
Related Patterns
5.11. OTHER SUPPORTING STRUCTURES
5.11.1. SIMD
5.11.2. MPMD
5.11.3. Client-Server Computing
5.11.4. Concurrent Programming with Declarative Languages
5.11.5. Problem-Solving Environments
Chapter 6. The Implementation Mechanisms Design Space
Figure 6.1. Overview of the Implementation Mechanisms design space and its place in the pattern language
6.1. OVERVIEW
6.2. UE MANAGEMENT
6.2.1. Thread Creation/Destruction
OpenMP: thread creation/destruction
Java: thread creation/destruction
MPI: thread creation/destruction
6.2.2. Process Creation/Destruction
MPI: process creation/destruction
Java: process creation/destruction
OpenMP: process creation/destruction
6.3. SYNCHRONIZATION
6.3.1. Memory Synchronization and Fences
OpenMP: fences
Figure 6.2. Program showing one way to implement pairwise synchronization in OpenMP. The flush construct is vital. It forces the memory to be consistent, thereby making the updates to the flag array visible. For more details about the syntax of OpenMP, see the OpenMP appendix, Appendix A.
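The flag-plus-flush idiom described in the caption of Fig. 6.2 can be sketched as follows (a minimal two-thread version, not the book's program): thread 0 writes a value, flushes it, then publishes a flag; thread 1 spins on the flag and flushes before reading the value.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int flag = 0;
    double value = 0.0;

    #pragma omp parallel num_threads(2) shared(flag, value)
    {
        if (omp_get_thread_num() == 0) {
            value = 3.14;                 /* produce the data            */
            #pragma omp flush(value)
            flag = 1;                     /* then publish the flag       */
            #pragma omp flush(flag)
        } else {
            int done = 0;
            while (!done) {               /* spin until the flag is seen */
                #pragma omp flush(flag)
                done = flag;
            }
            #pragma omp flush(value)      /* make the data visible       */
            printf("received %f\n", value);
        }
    }
    return 0;
}
```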
Java: fences
MPI: fences
6.3.2. Barriers
MPI: barriers
Figure 6.3. MPI program containing a barrier. This program is used to time the execution of function runit().
OpenMP: barriers
Figure 6.4. OpenMP program containing a barrier. This program is used to time the execution of function runit().
Java: barriers
6.3.3. Mutual Exclusion
Figure 6.5. Java program containing a CyclicBarrier. This program is used to time the execution of function runit().
Figure 6.6. Example of an OpenMP program that includes a critical section
OpenMP: mutual exclusion
Figure 6.7. Example of using locks in OpenMP
Java: mutual exclusion
Figure 6.8. Java version of the OpenMP program in Fig. 6.6
Figure 6.9. Java program showing how to implement mutual exclusion with a synchronized method
MPI: mutual exclusion
Figure 6.10. Example of an MPI program with an update that requires mutual exclusion. A single process is dedicated to the update of this data structure.
6.4. COMMUNICATION
6.4.1. Message Passing
MPI: message passing
Figure 6.11. MPI program that uses a ring of processors and a communication pattern where information is shifted to the right. The functions to do the computation do not affect the communication itself so they are not shown. (Continued in Fig. 6.12.)
OpenMP: message passing
Figure 6.12. MPI program that uses a ring of processors and a communication pattern where information is shifted to the right (continued from Fig. 6.11)
Figure 6.13. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued in Fig. 6.14)
Java: message passing
Figure 6.14. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued from Fig. 6.13)
Figure 6.15. The message-passing block from Fig. 6.13 and Fig. 6.14, but with more careful synchronization management (pairwise synchronization)
6.4.2. Collective Communication
Reduction
Figure 6.16. MPI program to time the execution of a function called runit(). We use MPI_Reduce to find minimum, maximum, and average runtimes.
Implementing reduction operations
Figure 6.17. OpenMP program to time the execution of a function called runit(). We use a reduction clause to find the sum of the runtimes.
Serial computation
Figure 6.18. Serial reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
Tree-based reduction
Figure 6.19. Tree-based reduction to compute the sum of a(0) through a(3) on a system with 4 UEs. sum(a(i:j)) denotes the sum of elements i through j of array a.
Recursive doubling
Figure 6.20. Recursive-doubling reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
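The recursive-doubling pattern in Fig. 6.20 can be sketched with MPI as follows (assuming the number of UEs is a power of two; not the book's code): at step k each UE exchanges its partial sum with the partner whose ID differs in bit k, so after log2(P) steps every UE holds the total.

```c
#include <mpi.h>

double recursive_doubling_sum(double my_value, MPI_Comm comm)
{
    int id, num_UEs;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &num_UEs);

    double sum = my_value;
    for (int mask = 1; mask < num_UEs; mask <<= 1) {
        int partner = id ^ mask;                 /* flip bit k to find the partner */
        double received;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &received, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += received;                         /* both partners add the same pair */
    }
    return sum;                                  /* every UE ends with the full sum */
}
```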
6.4.3. Other Communication Constructs
Endnotes
Appendix A. A Brief Introduction to OpenMP
Figure A.1. Fortran and C programs that print a simple string to standard output
A.1. CORE CONCEPTS
Figure A.2. Fortran and C programs that print a simple string to standard output
Figure A.3. Fortran and C programs that print a simple string to standard output
Figure A.4. Simple program to show the difference between shared and local (or private) data
A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
A.3. WORKSHARING
Figure A.5. Fortran and C examples of a typical loop-oriented program
Figure A.6. Fortran and C examples of a typical loop-oriented program. In this version of the program, the computationally intensive loop has been isolated and modified so the iterations are independent.
Figure A.7. Fortran and C examples of a typical loop-oriented program parallelized with OpenMP
A.4. DATA ENVIRONMENT CLAUSES
Figure A.8. C program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx = π
Figure A.9. C program showing use of the private, firstprivate, and lastprivate clauses. This program is incorrect in that the variables h and j do not have well-defined values when the printf is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.
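To complement the incorrect program of Fig. A.9, here is a minimal correct sketch (not from the appendix) of firstprivate and lastprivate: each thread starts with its own copy of h initialized from the master's value, and the value of j from the sequentially last iteration is copied back out after the loop.

```c
#include <stdio.h>

int main(void)
{
    int h = 10;
    int j = 0;

    #pragma omp parallel for firstprivate(h) lastprivate(j)
    for (int i = 0; i < 100; i++) {
        j = i + h;                 /* private j is assigned every iteration */
    }

    printf("j = %d\n", j);         /* value from the last iteration: 99 + 10 */
    return 0;
}
```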
A.5. THE OpenMP RUNTIME LIBRARY
Figure A.10. C program showing use of the most common runtime library functions
A.6. SYNCHRONIZATION
Figure A.11. Parallel version of the program in Fig. A.5. In this case, however, we assume that the calls to combine() can occur in any order as long as only one thread at a time executes the function. This is enforced with the critical construct.
Figure A.12. Example showing how the lock functions in OpenMP are used
A.7. THE SCHEDULE CLAUSE
Figure A.13. Parallel version of the program in Fig. A.11, modified to show the use of the schedule clause
A.8. THE REST OF THE LANGUAGE
Appendix B. A Brief Introduction to MPI
B.1. CONCEPTS
B.2. GETTING STARTED
Figure B.1. Program to print a simple string to standard output
Figure B.2. Parallel program in which each process prints a simple string to the standard output
B.3. BASIC POINT-TO-POINT MESSAGE PASSING
Figure B.3. The standard blocking point-to-point communication routines in the C binding for MPI 1.1
Figure B.4. MPI program to "bounce" a message between two processes using the standard blocking point-to-point communication routines in the C binding to MPI 1.1
B.4. COLLECTIVE OPERATIONS
Figure B.6. Program to time the ring function as it passes messages around a ring of processes (continued in Fig. B.7). The program returns the time from the process that takes the longest elapsed time to complete the communication. The code for the ring function is not relevant for this example, but it is included in Fig. B.8.
Figure B.7. Program to time the ring function as it passes messages around a ring of processes (continued from Fig. B.6)
Figure B.5. The major collective communication routines in the C binding to MPI 1.1 (MPI_Barrier, MPI_Bcast, and MPI_Reduce)
Figure B.8. Function to pass a message around a ring of processes. It is deadlock-free because the sends and receives are split between the even and odd processes.
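The even/odd trick in Fig. B.8's caption can be sketched as follows (not the book's function; assumes at least two processes): even-ranked processes send and then receive, odd-ranked processes receive and then send, so the blocking calls around the ring can never all wait on each other.

```c
#include <mpi.h>

void ring_shift_right(double *outgoing, double *incoming, int n, MPI_Comm comm)
{
    int id, num_procs;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &num_procs);

    int right = (id + 1) % num_procs;               /* destination of my message */
    int left  = (id + num_procs - 1) % num_procs;   /* source of the incoming message */

    if (id % 2 == 0) {
        MPI_Send(outgoing, n, MPI_DOUBLE, right, 0, comm);
        MPI_Recv(incoming, n, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(incoming, n, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(outgoing, n, MPI_DOUBLE, right, 0, comm);
    }
}
```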
Figure B.9. The nonblocking or asynchronous communication functions
B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
Figure B.10. Program using nonblocking communication to iteratively update a field using an algorithm that requires only communication around a ring (shifting messages to the right)
Figure B.11. Function to pass a message around a ring of processes using persistent communication
B.6. MPI AND FORTRAN
Figure B.12. Comparison of the C and Fortran language bindings for the reduction routine in MPI 1.1
Figure B.13. Simple Fortran MPI program where each process prints its ID and the number of processes in the computation
B.7. CONCLUSION
Appendix C. A Brief Introduction to Concurrent Programming in Java
Figure C.1. A class holding pairs of objects of an arbitrary type. Without generic types, this would have been done by declaring x and y to be of type Object, requiring casting the returned values of getX and getY. In addition to less-verbose programs, this allows type errors to be found by the compiler rather than throwing a ClassCastException at runtime.
C.1. CREATING THREADS
Figure C.2. Program to create four threads, passing a Runnable in the Thread constructor. Thread-specific data is held in a field of the Runnable object.
C.1.1. Anonymous Inner Classes
C.1.2. Executors and Factories
Figure C.3. Program similar to the one in Fig. C.2, but using an anonymous class to define the Runnable object
Figure C.4. Program using a ThreadPoolExecutor instead of creating threads directly
Figure C.5. Code fragment illustrating use of Callable and Future
C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
C.3. SYNCHRONIZED BLOCKS
C.4. WAIT AND NOTIFY
Figure C.6. Basic idiom for using wait. Because wait throws an InterruptedException, it should be enclosed in a try-catch block, which is omitted here.
C.5. LOCKS
Figure C.7. A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and Condition instead of synchronized blocks with wait and notify
C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
Figure C.8. Simple sequential loop-based program similar to the one in Fig. A.5
Figure C.9. Program showing a parallel version of the sequential program in Fig. C.8 where each iteration of the big_comp loop is a separate task. A thread pool containing ten threads is used to execute the tasks. A CountDownLatch is used to ensure that all of the tasks have completed before executing the (still sequential) loop that combines the results.
C.7. INTERRUPTS
Glossary
About the Authors