Patterns for Parallel Programming (high-resolution PDF, English original edition)

Chapter 1. A Pattern Language for Parallel Programming
1.1. INTRODUCTION
1.2. PARALLEL PROGRAMMING
1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Figure 1.1. Overview of the pattern language
Chapter 2. Background and Jargon of Parallel Computing
2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.2.1. Flynn's Taxonomy
Figure 2.1. The Single Instruction, Single Data (SISD) architecture
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
2.2.2. A Further Breakdown of MIMD
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Figure 2.6. The distributed-memory architecture
2.2.3. Summary
2.3. PARALLEL PROGRAMMING ENVIRONMENTS
Table 2.1. Some Parallel Programming Environments from the Mid-1990s
2.4. THE JARGON OF PARALLEL COMPUTING
2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth
2.6.2. Overlapping Communication and Computation and Latency Hiding
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
3.1. ABOUT THE DESIGN SPACE
Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language
3.1.1. Overview
3.1.2. Using the Decomposition Patterns
3.1.3. Background for Examples
Medical imaging
Linear algebra
Molecular dynamics
Figure 3.2. Pseudocode for the molecular dynamics example
3.2. THE TASK DECOMPOSITION PATTERN
Problem
Context
Forces
Solution
Examples
Medical imaging
Matrix multiplication
Molecular dynamics
Figure 3.3. Pseudocode for the molecular dynamics example
Known uses
3.3. THE DATA DECOMPOSITION PATTERN
Problem
Context
Forces
Solution
Examples
Medical imaging
Matrix multiplication
Molecular dynamics
Known uses
3.4. THE GROUP TASKS PATTERN
Problem
Context
Solution
Examples
Molecular dynamics
Matrix multiplication
3.5. THE ORDER TASKS PATTERN
Problem
Context
Solution
Examples
Molecular dynamics
Figure 3.4. Ordering of tasks in molecular dynamics problem
3.6. THE DATA SHARING PATTERN
Problem
Context
Forces
Solution
Examples
Molecular dynamics
Figure 3.5. Data sharing in molecular dynamics. We distinguish between sharing for reads, read-writes, and accumulations.
3.7. THE DESIGN EVALUATION PATTERN
Problem
Context
Forces
Solution
Suitability for target platform
Design quality
Preparation for next phase
3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
4.1. INTRODUCTION
Figure 4.1. Overview of the Algorithm Structure design space and its place in the pattern language
4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
4.2.1. Target Platform
4.2.2. Major Organizing Principle
4.2.3. The Algorithm Structure Decision Tree
Figure 4.2. Decision tree for the Algorithm Structure design space
Organize By Tasks
Organize By Data Decomposition
Organize By Flow of Data
4.2.4. Re-evaluation
4.3. EXAMPLES
4.3.1. Medical Imaging
4.3.2. Molecular Dynamics
4.4. THE TASK PARALLELISM PATTERN
Problem
Context
Forces
Solution
Tasks
Dependencies
Schedule
Figure 4.3. Good versus poor load balance
Program structure
Common idioms
Examples
Image construction
Molecular dynamics
Figure 4.4. Pseudocode for the nonbonded computation in a typical molecular dynamics code
Known uses
4.5. THE DIVIDE AND CONQUER PATTERN
Problem
Context
Figure 4.5. The divide-and-conquer strategy
Forces
Figure 4.6. Sequential pseudocode for the divide-and-conquer algorithm
Solution
Figure 4.7. Parallelizing the divide-and-conquer strategy. Each dashed-line box represents a task.
Mapping tasks to UEs and PEs
Communication costs
Dealing with dependencies
Other optimizations
Examples
Mergesort
Matrix diagonalization
Known uses
Related Patterns
4.6. THE GEOMETRIC DECOMPOSITION PATTERN
Problem
Context
Example: mesh-computation program
Figure 4.8. Data dependencies in the heat-equation problem. Solid boxes indicate the element being updated; shaded boxes the elements containing needed data.
Example: matrix-multiplication program
Figure 4.9. Data dependencies in the matrix-multiplication problem. Solid boxes indicate the "chunk" being updated (C); shaded boxes indicate the chunks of A (row) and B (column) required to update C at each of the two steps.
Solution
Data decomposition
Figure 4.10. A data distribution with ghost boundaries. Shaded cells are ghost copies; arrows point from primary copies to corresponding secondary copies.
The exchange operation
The update operation
Data distribution and task scheduling
Program structure
Examples
Mesh computation
Figure 4.11. Sequential heat-diffusion program
OpenMP solution
Figure 4.12. Parallel heat-diffusion program using OpenMP
MPI solution
Figure 4.13. Parallel heat-diffusion program using OpenMP. This version has less thread-management overhead.
Figure 4.14. Parallel heat-diffusion program using MPI (continued in Fig. 4.15)
Figure 4.15. Parallel heat-diffusion program using MPI (continued from Fig. 4.14)
Figure 4.16. Parallel heat-diffusion program using MPI with overlapping communication/ computation (continued from Fig. 4.14)
Figure 4.17. Sequential matrix multiplication
Matrix multiplication
Figure 4.18. Sequential matrix multiplication, revised. We do not show the parts of the program that are not changed from the program in Fig. 4.17.
OpenMP solution
MPI solution
Figure 4.19. Parallel matrix multiplication with message passing (continued in Fig. 4.20)
Figure 4.20. Parallel matrix multiplication with message-passing (continued from Fig. 4.19)
Known uses
Related Patterns
4.7. THE RECURSIVE DATA PATTERN
Problem
Context
Figure 4.21. Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.
Forces
Solution
Data decomposition
Structure
Synchronization
Examples
Partial sums of a linked list
Figure 4.23. Steps in finding partial sums of a list. Straight arrows represent links between elements; curved arrows indicate additions.
Known uses
Figure 4.22. Pseudocode for finding partial sums of a list
Related Patterns
4.8. THE PIPELINE PATTERN
Problem
Context
Forces
Solution
Figure 4.24. Operation of a pipeline. Each pipeline stage i computes the i-th step of the computation.
Figure 4.25. Example pipelines
Defining the stages of the pipeline
Figure 4.26. Basic structure of a pipeline stage
Structuring the computation
Representing the dataflow among pipeline elements
Handling errors
Processor allocation and task scheduling
Throughput and latency
Examples
Fourier-transform computations
Java pipeline framework
Known uses
Figure 4.27. Base class for pipeline stages
Figure 4.28. Base class for linear pipeline
Related Patterns
Figure 4.29. Pipelined sort (main class)
Figure 4.30. Pipelined sort (sorting stage)
4.9. THE EVENT-BASED COORDINATION PATTERN
Problem
Context
Figure 4.31. Discrete-event simulation of a car-wash facility. Arrows indicate the flow of events.
Forces
Solution
Defining the tasks
Figure 4.32. Basic structure of a task in the Event-Based Coordination pattern
Representing event flow
Enforcing event ordering
Figure 4.33. Event-based communication among three tasks. Task 2 generates its event in response to the event received from task 1. The two events sent to task 3 can arrive in either order.
Avoiding deadlocks
Scheduling and processor allocation
Efficient communication of events
Examples
Known uses
Related Patterns
Chapter 5. The Supporting Structures Design Space
5.1. INTRODUCTION
Figure 5.1. Overview of the Supporting Structures design space and its place in the pattern language
5.1.1. Program Structuring Patterns
5.1.2. Patterns Representing Data Structures
5.2. FORCES
5.3. CHOOSING THE PATTERNS
Table 5.1. Relationship between Supporting Structures patterns and Algorithm Structure patterns. The number of stars (ranging from zero to four) is an indication of the likelihood that the given Supporting Structures pattern is useful in the implementation of the Algorithm Structure pattern.
Table 5.2. Relationship between Supporting Structures patterns and programming environments. The number of stars (ranging from zero to four) is an indication of the likelihood that the given Supporting Structures pattern is useful in the programming environment.
5.4. THE SPMD PATTERN
Problem
Context
Forces
Solution
Discussion
Examples
Numerical integration
Figure 5.2. Sequential program to carry out a trapezoid rule integration to compute
Figure 5.3. MPI program to carry out a trapezoid rule integration in parallel by assigning one block of loop iterations to each UE and performing a reduction
Figure 5.4. Index calculation that more evenly distributes the work when the number of steps is not evenly divisible by the number of UEs. The idea is to split up the remaining tasks (rem) among the first rem UEs.
Figure 5.5. MPI program to carry out a trapezoid rule integration in parallel using a simple loop-splitting algorithm with cyclic distribution of iterations and a reduction
Figure 5.6. OpenMP program to carry out a trapezoid rule integration in parallel using the same SPMD algorithm used in Fig. 5.5
Molecular dynamics
Figure 5.7. Pseudocode for molecular dynamics example. This code is very similar to the version discussed earlier, but a few extra details have been included. To support more detailed pseudocode examples, the call to the function that initializes the force arrays has been made explicit. Also, the fact that the neighbor list is only occasionally updated is made explicit.
Figure 5.8. Pseudocode for an SPMD molecular dynamics program using MPI
Figure 5.9. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown in Fig. 4.4. The only major change is a new array of integers holding the indices for the atoms assigned to this UE, local_atoms. We've also assumed that the neighbor list has been generated to hold only those atoms assigned to this UE. For the sake of allocating space for these arrays, we have added a parameter LN which is the largest number of atoms that can be assigned to a single UE.
Figure 5.10. Pseudocode for the neighbor list computation. For each atom i, the indices for atoms within a sphere of radius cutoff are added to the neighbor list for atom i. Notice that the second loop (over j) only considers atoms with indices greater than i. This accounts for the symmetry in the force computation due to Newton's third law of motion, that is, that the force between atom i and atom j is just the negative of the force between atom j and atom i.
Figure 5.11. Pseudocode for a parallel molecular dynamics program using OpenMP
Mandelbrot set computation
Figure 5.12. Pseudocode for a sequential version of the Mandelbrot set generation program
Known uses
Figure 5.13. Pseudocode for a parallel MPI version of the Mandelbrot set generation program
Related Patterns
5.5. THE MASTER/WORKER PATTERN
Problem
Context
Forces
Solution
Figure 5.14. The two elements of the Master/Worker pattern are the master and the worker. There is only one master, but there can be one or more workers. Logically, the master sets up the calculation and then manages a bag of tasks. Each worker grabs a task from the bag, carries out the work, and then goes back to the bag, repeating until the termination condition is met.
Discussion
Detecting completion
Variations
Examples
Generic solutions
Figure 5.15. Master process for a master/worker program. This assumes a shared address space so the task and results queues are visible to all UEs. In this simple version, the master initializes the queue, launches the workers, and then waits for the workers to finish (that is, the ForkJoin command launches the workers and then waits for them to finish before returning). At that point, results are consumed and the computation completes.
Figure 5.16. Worker process for a master/worker program. We assume a shared address space thereby making task_queue and global_results available to the master and all workers. A worker loops over the task_queue and exits when the end of the queue is encountered.
Figure 5.17. Instantiating and initializing a pooled executor
Mandelbrot set generation
Figure 5.18. Pseudocode for a sequential version of the Mandelbrot set generation program
Figure 5.19. Master process for a master/worker parallel version of the Mandelbrot set generation program
Figure 5.20. Worker process for a master/worker parallel version of the Mandelbrot set generation program. We assume a shared address space thereby making task_queue, global_results, and ranges available to the master and the workers.
Known uses
Related Patterns
5.6. THE LOOP PARALLELISM PATTERN
Problem
Context
Forces
Solution
Figure 5.21. Program fragment showing merging loops to increase the amount of work per iteration
Figure 5.22. Program fragment showing coalescing nested loops to produce a single loop with a larger number of iterations
Performance considerations
Figure 5.23. Program fragment showing an example of false sharing. The small array A is held in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take ownership of the cache line back from the other UEs. This back-and-forth movement of the cache lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
Examples
Numerical integration
Figure 5.24. Sequential program to carry out a trapezoid rule integration to compute
Molecular dynamics
Figure 5.25. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown previously in Fig. 4.4.
Mandelbrot set computation
Figure 5.26. Pseudocode for a sequential version of the Mandelbrot set generation program
Mesh computation
Figure 5.27. Parallel heat-diffusion program using OpenMP. This program is described in the Examples section of the Geometric Decomposition pattern.
Figure 5.28. Parallel heat-diffusion program using OpenMP, with reduced thread management overhead and memory management more appropriate for NUMA computers
Known uses
Related Patterns
5.7. THE FORK/JOIN PATTERN
Problem
Context
Forces
Solution
Direct task/UE mapping
Indirect task/UE mapping
Examples
Mergesort using direct mapping
Figure 5.29. Parallel mergesort where each task corresponds to a thread
Figure 5.30. Instantiating FJTaskRunnerGroup and invoking the master task
Mergesort using indirect mapping
Known uses
Figure 5.31. Mergesort using the FJTask framework
Related Patterns
5.8. THE SHARED DATA PATTERN
Problem
Context
Forces
Solution
Be sure this pattern is needed
Define an abstract data type
Implement an appropriate concurrency-control protocol
Figure 5.32. Typical use of read/write locks. These locks are defined in the java.util.concurrent.locks package. Putting the unlock in the finally block ensures that the lock will be unlocked regardless of how the try block is exited (normally or with an exception) and is a standard idiom in Java programs that use locks rather than synchronized blocks.
Figure 5.33. Example of nested locking using synchronized blocks with dummy objects lockA and lockB
Review other considerations
Examples
Shared queues
Genetic algorithm for nonlinear optimization
Figure 5.34. Pseudocode for the population shuffle loop from the genetic algorithm program GAFORT
Figure 5.35. Pseudocode for an ineffective approach to parallelizing the population shuffle in the genetic algorithm program GAFORT
Known uses
Figure 5.36. Pseudocode for a parallelized loop to carry out the population shuffle in the genetic algorithm program GAFORT. This version of the loop uses a separate lock for each chromosome and runs effectively in parallel.
Related Patterns
5.9. THE SHARED QUEUE PATTERN
Problem
Context
Forces
Solution
The abstract data type (ADT)
Queue with "one at a time" execution
Figure 5.37. Queue that ensures that at most one thread can access the data structure at one time. If the queue is empty, null is immediately returned.
Figure 5.38. Queue that ensures at most one thread can access the data structure at one time. Unlike the first shared queue example, if the queue is empty, the thread waits. When used in a master/worker algorithm, a poison pill would be required to signal termination to a thread.
Concurrency-control protocols for noninterfering operations
Figure 5.39. Shared queue that takes advantage of the fact that put and take are noninterfering and uses separate locks so they can proceed concurrently
Concurrency-control protocols using nested locks
Figure 5.40. Blocking queue with multiple locks to allow concurrent put and take on a nonempty queue
Distributed shared queues
Figure 5.41. Nonblocking shared queue with takeLast operation
Figure 5.42. Abstract base class for tasks
Figure 5.43. Class defining behavior of threads in the thread pool (continued in Fig. 5.44 and Fig. 5.45)
Figure 5.44. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and continued in Fig. 5.45)
Examples
Computing Fibonacci numbers
Figure 5.45. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and Fig. 5.44)
Figure 5.46. The TaskRunnerGroup class. This class initializes and manages the threads in the thread pool.
Related Patterns
Figure 5.47. Program to compute Fibonacci numbers (continued in Fig. 5.48)
Figure 5.48. Program to compute Fibonacci numbers (continued from Fig. 5.47)
5.10. THE DISTRIBUTED ARRAY PATTERN
Problem
Context
Forces
Solution
Overview
Array distributions
Figure 5.49. Original square matrix A
Figure 5.50. 1D distribution of A onto four UEs
Figure 5.51. 2D distribution of A onto four UEs
Figure 5.52. 1D block-cyclic distribution of A onto four UEs
Figure 5.53. 2D block-cyclic distribution of A onto four UEs, part 1: Decomposing A
Figure 5.54. 2D block-cyclic distribution of A onto four UEs, part 2: Assigning submatrices to UEs
Figure 5.55. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). LA_(l,m) is the block with block indices (l, m). Each element is labeled both with its original global indices (a_(i,j)) and with its indices within block LA_(l,m) (l_(x,y)).
Figure 5.56. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). Each element is labeled both with its original global indices a_(i,j) and with its local indices [x', y']. Local indices are with respect to the contiguous matrix used to store all blocks assigned to this UE.
Choosing a distribution
Mapping indices
Aligning computation with locality
Examples
Transposing a matrix stored as column blocks
Figure 5.57. Matrix A and its transpose, in terms of submatrices, distributed among four UEs
Figure 5.58. Code to transpose a matrix (continued in Fig. 5.59)
Figure 5.59. Code to transpose a matrix (continued from Fig. 5.58)
Known uses
Related Patterns
5.11. OTHER SUPPORTING STRUCTURES
5.11.1. SIMD
5.11.2. MPMD
5.11.3. Client-Server Computing
5.11.4. Concurrent Programming with Declarative Languages
5.11.5. Problem-Solving Environments
Chapter 6. The Implementation Mechanisms Design Space
Figure 6.1. Overview of the Implementation Mechanisms design space and its place in the pattern language
6.1. OVERVIEW
6.2. UE MANAGEMENT
6.2.1. Thread Creation/Destruction
OpenMP: thread creation/destruction
Java: thread creation/destruction
MPI: thread creation/destruction
6.2.2. Process Creation/Destruction
MPI: process creation/destruction
Java: process creation/destruction
OpenMP: process creation/destruction
6.3. SYNCHRONIZATION
6.3.1. Memory Synchronization and Fences
OpenMP: fences
Figure 6.2. Program showing one way to implement pairwise synchronization in OpenMP. The flush construct is vital. It forces the memory to be consistent, thereby making the updates to the flag array visible. For more details about the syntax of OpenMP, see the OpenMP appendix, Appendix A.
Java: fences
MPI: fences
6.3.2. Barriers
MPI: barriers
Figure 6.3. MPI program containing a barrier. This program is used to time the execution of function runit().
OpenMP: barriers
Figure 6.4. OpenMP program containing a barrier. This program is used to time the execution of function runit().
Java: barriers
6.3.3. Mutual Exclusion
Figure 6.5. Java program containing a CyclicBarrier. This program is used to time the execution of function runit().
Figure 6.6. Example of an OpenMP program that includes a critical section
OpenMP: mutual exclusion
Figure 6.7. Example of using locks in OpenMP
Java: mutual exclusion
Figure 6.8. Java version of the OpenMP program in Fig. 6.6
Figure 6.9. Java program showing how to implement mutual exclusion with a synchronized method
MPI: mutual exclusion
Figure 6.10. Example of an MPI program with an update that requires mutual exclusion. A single process is dedicated to the update of this data structure.
6.4. COMMUNICATION
6.4.1. Message Passing
MPI: message passing
Figure 6.11. MPI program that uses a ring of processors and a communication pattern where information is shifted to the right. The functions to do the computation do not affect the communication itself so they are not shown. (Continued in Fig. 6.12.)
OpenMP: message passing
Figure 6.12. MPI program that uses a ring of processors and a communication pattern where information is shifted to the right (continued from Fig. 6.11)
Figure 6.13. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued in Fig. 6.14)
Java: message passing
Figure 6.14. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued from Fig. 6.13)
Figure 6.15. The message-passing block from Fig. 6.13 and Fig. 6.14, but with more careful synchronization management (pairwise synchronization)
6.4.2. Collective Communication
Reduction
Figure 6.16. MPI program to time the execution of a function called runit(). We use MPI_Reduce to find minimum, maximum, and average runtimes.
Implementing reduction operations
Figure 6.17. OpenMP program to time the execution of a function called runit(). We use a reduction clause to find the sum of the runtimes.
Serial computation
Figure 6.18. Serial reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
Tree-based reduction
Figure 6.19. Tree-based reduction to compute the sum of a(0) through a(3) on a system with 4 UEs. sum(a(i:j)) denotes the sum of elements i through j of array a.
Recursive doubling
Figure 6.20. Recursive-doubling reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
6.4.3. Other Communication Constructs
Endnotes
Appendix A. A Brief Introduction to OpenMP
Figure A.1. Fortran and C programs that print a simple string to standard output
A.1. CORE CONCEPTS
Figure A.2. Fortran and C programs that print a simple string to standard output
Figure A.3. Fortran and C programs that print a simple string to standard output
Figure A.4. Simple program to show the difference between shared and local (or private) data
A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
A.3. WORKSHARING
Figure A.5. Fortran and C examples of a typical loop-oriented program
Figure A.6. Fortran and C examples of a typical loop-oriented program. In this version of the program, the computationally intensive loop has been isolated and modified so the iterations are independent.
Figure A.7. Fortran and C examples of a typical loop-oriented program parallelized with OpenMP
A.4. DATA ENVIRONMENT CLAUSES
Figure A.8. C program to carry out a trapezoid rule integration
Figure A.9. C program showing use of the private, firstprivate, and lastprivate clauses. This program is incorrect in that the variables h and j do not have well-defined values when the printf is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.
A.5. THE OpenMP RUNTIME LIBRARY
Figure A.10. C program showing use of the most common runtime library functions
A.6. SYNCHRONIZATION
Figure A.11. Parallel version of the program in Fig. A.5. In this case, however, we assume that the calls to combine() can occur in any order as long as only one thread at a time executes the function. This is enforced with the critical construct.
Figure A.12. Example showing how the lock functions in OpenMP are used
A.7. THE SCHEDULE CLAUSE
Figure A.13. Parallel version of the program in Fig. A.11, modified to show the use of the schedule clause
A.8. THE REST OF THE LANGUAGE
Appendix B. A Brief Introduction to MPI
B.1. CONCEPTS
B.2. GETTING STARTED
Figure B.1. Program to print a simple string to standard output
Figure B.2. Parallel program in which each process prints a simple string to the standard output
B.3. BASIC POINT-TO-POINT MESSAGE PASSING
Figure B.3. The standard blocking point-to-point communication routines in the C binding for MPI 1.1
Figure B.4. MPI program to "bounce" a message between two processes using the standard blocking point-to-point communication routines in the C binding to MPI 1.1
B.4. COLLECTIVE OPERATIONS
Figure B.6. Program to time the ring function as it passes messages around a ring of processes (continued in Fig. B.7). The program returns the time from the process that takes the longest elapsed time to complete the communication. The code for the ring function is not relevant for this example, but it is included in Fig. B.8.
Figure B.7. Program to time the ring function as it passes messages around a ring of processes (continued from Fig. B.6)
Figure B.5. The major collective communication routines in the C binding to MPI 1.1 (MPI_Barrier, MPI_Bcast, and MPI_Reduce)
Figure B.8. Function to pass a message around a ring of processes. It is deadlock-free because the sends and receives are split between the even and odd processes.
Figure B.9. The nonblocking or asynchronous communication functions
B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
Figure B.10. Program using nonblocking communication to iteratively update a field using an algorithm that requires only communication around a ring (shifting messages to the right)
Figure B.11. Function to pass a message around a ring of processes using persistent communication
B.6. MPI AND FORTRAN
Figure B.12. Comparison of the C and Fortran language bindings for the reduction routine in MPI 1.1
Figure B.13. Simple Fortran MPI program where each process prints its ID and the number of processes in the computation
B.7. CONCLUSION
Appendix C. A Brief Introduction to Concurrent Programming in Java
Figure C.1. A class holding pairs of objects of an arbitrary type. Without generic types, this would have been done by declaring x and y to be of type Object, requiring casting the returned values of getX and getY. In addition to less-verbose programs, this allows type errors to be found by the compiler rather than throwing a ClassCastException at runtime.
C.1. CREATING THREADS
Figure C.2. Program to create four threads, passing a Runnable in the Thread constructor. Thread-specific data is held in a field of the Runnable object.
C.1.1. Anonymous Inner Classes
C.1.2. Executors and Factories
Figure C.3. Program similar to the one in Fig. C.2, but using an anonymous class to define the Runnable object
Figure C.4. Program using a ThreadPoolExecutor instead of creating threads directly
Figure C.5. Code fragment illustrating use of Callable and Future
C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
C.3. SYNCHRONIZED BLOCKS
C.4. WAIT AND NOTIFY
Figure C.6. Basic idiom for using wait. Because wait throws an InterruptedException, it should somehow be enclosed in a try-catch block, omitted here.
C.5. LOCKS
Figure C.7. A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and Condition instead of synchronized blocks with wait and notify
C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
Figure C.8. Simple sequential loop-based program similar to the one in Fig. A.5
Figure C.9. Program showing a parallel version of the sequential program in Fig. C.8 where each iteration of the big_comp loop is a separate task. A thread pool containing ten threads is used to execute the tasks. A CountDownLatch is used to ensure that all of the tasks have completed before executing the (still sequential) loop that combines the results.
C.7. INTERRUPTS
Glossary
About the Authors
    "If you build it, they will come."  And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in  every department ... and they haven't come. Programmers haven't come to program these wonderful  machines. Oh, a few programmers in love with the challenge have shown that most types of problems  can be force­fit onto parallel computers, but general programmers, especially professional  programmers who "have lives", ignore parallel computers.  And they do so at their own peril. Parallel computers are going mainstream. Multithreaded  microprocessors, multicore CPUs, multiprocessor PCs, clusters, parallel game consoles ... parallel  computers are taking over the world of computing. The computer industry is ready to flood the market  with hardware that will only run at full speed with parallel programs. But who will write these  programs?  This is an old problem. Even in the early 1980s, when the "killer micros" started their assault on  traditional vector supercomputers, we worried endlessly about how to attract normal programmers.  We tried everything we could think of: high­level hardware abstractions, implicitly parallel  programming languages, parallel language extensions, and portable message­passing libraries. But  after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming  majority of programmers will not invest the effort to write parallel software.  A common view is that you can't teach old programmers new tricks, so the problem will not be solved  until the old programmers fade away and a new generation takes over.  But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt  new software technologies over the years. Look at how many old Fortran programmers are now  writing elegant Java programs with sophisticated object­oriented designs. The problem isn't with old  programmers. The problem is with old parallel computing experts and the way they've tried to create a  pool of capable parallel programmers.  And that's where this book comes in. We want to capture the essence of how expert parallel  programmers think about parallel algorithms and communicate that essential understanding in a way  professional programmers can readily master. The technology we've adopted to accomplish this task is  a pattern language. We made this choice not because we started the project as devotees of design  patterns looking for a new field to conquer, but because patterns have been shown to work in ways that  would be applicable in parallel programming. For example, patterns have been very effective in the  field of object­oriented design. They have provided a common language experts can use to talk about  the elements of design and have been extremely effective at helping programmers master object­ oriented design. 
This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:

* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.
* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.
* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.
* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.

The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments:
* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers (see the short sketch after this list).
* MPI. A message-passing library used on clusters and other distributed-memory computers.
* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.
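To give a flavor of the first of these, here is a minimal sketch of an OpenMP "hello world" in C. It is not a figure from the book (the appendixes cover OpenMP properly), and it assumes a compiler with OpenMP support, such as gcc invoked with -fopenmp.

    /* A minimal sketch (not from the book) illustrating OpenMP as a language
       extension to C. Build with an OpenMP-aware compiler, e.g. gcc -fopenmp. */
    #include <stdio.h>
    #include <omp.h>   /* declarations for the OpenMP runtime library */

    int main(void)
    {
        /* The parallel construct creates a team of threads; each thread
           executes the structured block once. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }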
Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.

ACKNOWLEDGMENTS

We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.

Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.

— Tim Mattson, Olympia, Washington, April 2004
— Beverly Sanders, Gainesville, Florida, April 2004
— Berna Massingill, San Antonio, Texas, April 2004
Chapter 1. A Pattern Language for Parallel Programming

1.1 INTRODUCTION
1.2 PARALLEL PROGRAMMING
1.3 DESIGN PATTERNS AND PATTERN LANGUAGES
1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

1.1. INTRODUCTION

Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.

For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "renderfarm" consisting of 100 dual-
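The frame-rate figure above implies some sobering arithmetic. The sketch below shows why a single machine cannot render a feature film in any reasonable time and why studios turn to render farms; the film length, per-frame render cost, and farm size are illustrative assumptions, not numbers from the text.

    /* Back-of-the-envelope sketch: film length, per-frame render cost, and
       farm size are assumed values chosen only to illustrate the scale. */
    #include <stdio.h>

    int main(void)
    {
        const double fps             = 24.0;   /* frames per second, as stated above */
        const double film_minutes    = 90.0;   /* assumed feature-film length        */
        const double hours_per_frame = 2.0;    /* assumed average render cost        */
        const int    processors      = 100;    /* hypothetical render-farm size      */

        double frames      = film_minutes * 60.0 * fps;        /* ~130,000 frames    */
        double serial_days = frames * hours_per_frame / 24.0;  /* one processor      */
        double farm_days   = serial_days / processors;         /* ideal speedup      */

        printf("frames to render:    %.0f\n", frames);
        printf("serial render time:  %.0f days (about %.0f years)\n",
               serial_days, serial_days / 365.0);
        printf("with %d processors:  %.0f days (assuming ideal speedup)\n",
               processors, farm_days);
        return 0;
    }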