Chapter 1. A Pattern Language for Parallel Programming
1.1. INTRODUCTION
1.2. PARALLEL PROGRAMMING
1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Figure 1.1. Overview of the pattern language
Chapter 2. Background and Jargon of Parallel Computing
2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.2.1. Flynn's Taxonomy
Figure 2.1. The Single Instruction, Single Data (SISD) architecture
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
2.2.2. A Further Breakdown of MIMD
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Figure 2.6. The distributed-memory architecture
2.2.3. Summary
2.3. PARALLEL PROGRAMMING ENVIRONMENTS
Table 2.1. Some Parallel Programming Environments from the Mid-1990s
2.4. THE JARGON OF PARALLEL COMPUTING
2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth
2.6.2. Overlapping Communication and Computation and Latency Hiding
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
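The caption above describes latency hiding by overlapping communication with computation. The following minimal C/MPI sketch (not from the book) illustrates the idea: post a nonblocking receive, do independent work while the reply is in flight, then wait. The helpers do_independent_work() and use_reply() are hypothetical placeholders.

```c
#include <mpi.h>

void do_independent_work(void);            /* hypothetical: work not needing the reply */
void use_reply(double *reply, int n);      /* hypothetical: work that needs the reply  */

void exchange_with_overlap(double *request_buf, double *reply_buf, int n,
                           int partner, MPI_Comm comm)
{
    MPI_Request recv_req;

    /* Post the receive early so the reply can arrive at any time. */
    MPI_Irecv(reply_buf, n, MPI_DOUBLE, partner, 0, comm, &recv_req);

    /* Send the request to the partner UE. */
    MPI_Send(request_buf, n, MPI_DOUBLE, partner, 0, comm);

    /* Overlap: compute on data that does not depend on the reply. */
    do_independent_work();

    /* Only now block until the reply has actually arrived. */
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    use_reply(reply_buf, n);
}
```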
2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
3.1. ABOUT THE DESIGN SPACE
Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language
3.1.1. Overview
3.1.2. Using the Decomposition Patterns
3.1.3. Background for Examples
Medical imaging
Linear algebra
Molecular dynamics
Figure 3.2. Pseudocode for the molecular dynamics example
3.2. THE TASK DECOMPOSITION PATTERN
Problem
Context
Forces
Solution
Examples
Medical imaging
Matrix multiplication
Molecular dynamics
Figure 3.3. Pseudocode for the molecular dynamics example
Known uses
3.3. THE DATA DECOMPOSITION PATTERN
Problem
Context
Forces
Solution
Examples
Medical imaging
Matrix multiplication
Molecular dynamics
Known uses
3.4. THE GROUP TASKS PATTERN
Problem
Context
Solution
Examples
Molecular dynamics
Matrix multiplication
3.5. THE ORDER TASKS PATTERN
Problem
Context
Solution
Examples
Molecular dynamics
Figure 3.4. Ordering of tasks in molecular dynamics problem
3.6. THE DATA SHARING PATTERN
Problem
Context
Forces
Solution
Examples
Molecular dynamics
Figure 3.5. Data sharing in molecular dynamics. We distinguish between sharing for reads, read-writes, and accumulations.
3.7. THE DESIGN EVALUATION PATTERN
Problem
Context
Forces
Solution
Suitability for target platform
Design quality
Preparation for next phase
3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
4.1. INTRODUCTION
Figure 4.1. Overview of the Algorithm Structure design space and its place in the pattern language
4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
4.2.1. Target Platform
4.2.2. Major Organizing Principle
4.2.3. The Algorithm Structure Decision Tree
Figure 4.2. Decision tree for the Algorithm Structure design space
Organize By Tasks
Organize By Data Decomposition
Organize By Flow of Data
4.2.4. Re-evaluation
4.3. EXAMPLES
4.3.1. Medical Imaging
4.3.2. Molecular Dynamics
4.4. THE TASK PARALLELISM PATTERN
Problem
Context
Forces
Solution
Tasks
Dependencies
Schedule
Figure 4.3. Good versus poor load balance
Program structure
Common idioms
Examples
Image construction
Molecular dynamics
Figure 4.4. Pseudocode for the nonbonded computation in a typical molecular dynamics code
Known uses
4.5. THE DIVIDE AND CONQUER PATTERN
Problem
Context
Figure 4.5. The divide-and-conquer strategy
Forces
Figure 4.6. Sequential pseudocode for the divide-and-conquer algorithm
Solution
Figure 4.7. Parallelizing the divide-and-conquer strategy. Each dashed-line box represents a task.
Mapping tasks to UEs and PEs
Communication costs
Dealing with dependencies
Other optimizations
Examples
Mergesort
Matrix diagonalization
Known uses
Related Patterns
4.6. THE GEOMETRIC DECOMPOSITION PATTERN
Problem
Context
Example: mesh-computation program
Figure 4.8. Data dependencies in the heat-equation problem. Solid boxes indicate the element being updated; shaded boxes indicate the elements containing needed data.
Example: matrix-multiplication program
Figure 4.9. Data dependencies in the matrix-multiplication problem. Solid boxes indicate the "chunk" being updated (C); shaded boxes indicate the chunks of A (row) and B (column) required to update C at each of the two steps.
Solution
Data decomposition
Figure 4.10. A data distribution with ghost boundaries. Shaded cells are ghost copies; arrows point from primary copies to corresponding secondary copies.
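Figure 4.10's caption describes ghost (shadow) copies along chunk boundaries. As a minimal illustration, and not the book's code, the following C/MPI sketch exchanges one-cell-wide ghost boundaries for a 1D decomposition; the layout u[0..n+1], with u[0] and u[n+1] as ghost cells, is an assumption.

```c
#include <mpi.h>

/* u[1..n] are cells owned by this UE; u[0] and u[n+1] are ghost copies of the
   neighbors' boundary cells. Passing MPI_PROC_NULL as a neighbor makes the
   ends of the global domain fall out naturally. */
void exchange_ghosts(double *u, int n, int left, int right, MPI_Comm comm)
{
    /* Send rightmost owned cell right; receive left ghost from the left. */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send leftmost owned cell left; receive right ghost from the right. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
}
```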
The exchange operation
The update operation
Data distribution and task scheduling
Program structure
Examples
Mesh computation
Figure 4.11. Sequential heat-diffusion program
OpenMP solution
Figure 4.12. Parallel heat-diffusion program using OpenMP
MPI solution
Figure 4.13. Parallel heat-diffusion program using OpenMP. This version has less thread-management overhead.
Figure 4.14. Parallel heat-diffusion program using MPI (continued in Fig. 4.15)
Figure 4.15. Parallel heat-diffusion program using MPI (continued from Fig. 4.14)
Figure 4.16. Parallel heat-diffusion program using MPI with overlapping communication/computation (continued from Fig. 4.14)
Figure 4.17. Sequential matrix multiplication
Matrix multiplication
Figure 4.18. Sequential matrix multiplication, revised. We do not show the parts of the program that are not changed from the program in Fig. 4.17.
OpenMP solution
MPI solution
Figure 4.19. Parallel matrix multiplication with message passing (continued in Fig. 4.20)
Figure 4.20. Parallel matrix multiplication with message-passing (continued from Fig. 4.19)
Known uses
Related Patterns
4.7. THE RECURSIVE DATA PATTERN
Problem
Context
Figure 4.21. Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.
Forces
Solution
Data decomposition
Structure
Synchronization
Examples
Partial sums of a linked list
Figure 4.23. Steps in finding partial sums of a list. Straight arrows represent links between elements; curved arrows indicate additions.
Known uses
Figure 4.22. Pseudocode for finding partial sums of a list
Related Patterns
4.8. THE PIPELINE PATTERN
Problem
Context
Forces
Solution
Figure 4.24. Operation of a pipeline. Each pipeline stage i computes the i-th step of the computation.
Figure 4.25. Example pipelines
Defining the stages of the pipeline
Figure 4.26. Basic structure of a pipeline stage
Structuring the computation
Representing the dataflow among pipeline elements
Handling errors
Processor allocation and task scheduling
Throughput and latency
Examples
Fourier-transform computations
Java pipeline framework
Known uses
Figure 4.27. Base class for pipeline stages
Figure 4.28. Base class for linear pipeline
Related Patterns
Figure 4.29. Pipelined sort (main class)
Figure 4.30. Pipelined sort (sorting stage)
4.9. THE EVENT-BASED COORDINATION PATTERN
Problem
Context
Figure 4.31. Discrete-event simulation of a car-wash facility. Arrows indicate the flow of events.
Forces
Solution
Defining the tasks
Figure 4.32. Basic structure of a task in the Event-Based Coordination pattern
Representing event flow
Enforcing event ordering
Figure 4.33. Event-based communication among three tasks. Task 2 generates its event in response to the event received from task 1. The two events sent to task 3 can arrive in either order.
Avoiding deadlocks
Scheduling and processor allocation
Efficient communication of events
Examples
Known uses
Related Patterns
Chapter 5. The Supporting Structures Design Space
5.1. INTRODUCTION
Figure 5.1. Overview of the Supporting Structures design space and its place in the pattern language
5.1.1. Program Structuring Patterns
5.1.2. Patterns Representing Data Structures
5.2. FORCES
5.3. CHOOSING THE PATTERNS
Table 5.1. Relationship between Supporting Structures patterns and Algorithm Structure patterns. The number of stars (ranging from zero to four) is an indication of the likelihood that the given Supporting Structures pattern is useful in the implementation of the Algorithm Structure pattern.
Table 5.2. Relationship between Supporting Structures patterns and programming environments. The number of stars (ranging from zero to four) is an indication of the likelihood that the given Supporting Structures pattern is useful in the programming environment.
5.4. THE SPMD PATTERN
Problem
Context
Forces
Solution
Discussion
Examples
Numerical integration
Figure 5.2. Sequential program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx = π
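For orientation, a generic sequential trapezoid-rule sketch in C follows; it assumes the integrand 4/(1 + x²) on [0, 1] (whose exact value is π) and is not the program of Fig. 5.2.

```c
#include <stdio.h>

static double f(double x) { return 4.0 / (1.0 + x * x); }   /* assumed integrand */

int main(void)
{
    const int n = 1000000;                    /* number of trapezoids */
    const double a = 0.0, b = 1.0;
    const double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));         /* endpoints get half weight */

    for (int i = 1; i < n; i++)
        sum += f(a + i * h);

    printf("integral ~= %.12f\n", h * sum);   /* should be close to pi */
    return 0;
}
```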
Figure 5.3. MPI program to carry out a trapezoid rule integration in parallel by assigning one block of loop iterations to each UE and performing a reduction
Figure 5.4. Index calculation that more evenly distributes the work when the number of steps is not evenly divisible by the number of UEs. The idea is to split up the remaining tasks (rem) among the first rem UEs.
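The index calculation described in Fig. 5.4 can be sketched as follows (a minimal C version with illustrative names, not the book's code): the rem = N % num_UEs leftover iterations go, one each, to the first rem UEs.

```c
/* Compute the half-open iteration range [ifirst, ilast) owned by UE my_id. */
void block_range(int N, int num_UEs, int my_id, int *ifirst, int *ilast)
{
    int base = N / num_UEs;                 /* every UE gets at least this many */
    int rem  = N % num_UEs;                 /* leftovers for the first rem UEs  */

    if (my_id < rem) {
        *ifirst = my_id * (base + 1);
        *ilast  = *ifirst + base + 1;
    } else {
        *ifirst = rem * (base + 1) + (my_id - rem) * base;
        *ilast  = *ifirst + base;
    }
}
```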
Figure 5.5. MPI program to carry out a trapezoid rule integration in parallel using a simple loop-splitting algorithm with cyclic distribution of iterations and a reduction
Figure 5.6. OpenMP program to carry out a trapezoid rule integration in parallel using the same SPMD algorithm used in Fig. 5.5
Molecular dynamics
Figure 5.7. Pseudocode for molecular dynamics example. This code is very similar to the version discussed earlier, but a few extra details have been included. To support more detailed pseudocode examples, the call to the function that initializes the force arrays has been made explicit. Also, the fact that the neighbor list is only occasionally updated is made explicit.
Figure 5.8. Pseudocode for an SPMD molecular dynamics program using MPI
Figure 5.9. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown in Fig. 4.4. The only major change is a new array of integers holding the indices for the atoms assigned to this UE, local_atoms. We've also assumed that the neighbor list has been generated to hold only those atoms assigned to this UE. For the sake of allocating space for these arrays, we have added a parameter LN which is the largest number of atoms that can be assigned to a single UE.
Figure 5.10. Pseudocode for the neighbor list computation. For each atom i, the indices for atoms within a sphere of radius cutoff are added to the neighbor list for atom i. Notice that the second loop (over j) only considers atoms with indices greater than i. This accounts for the symmetry in the force computation due to Newton's third law of motion, that is, that the force between atom i and atom j is just the negative of the force between atom j and atom i.
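The neighbor-list idea in the caption of Fig. 5.10 can be sketched in C as follows (data layout and names are assumptions, not the book's code): each pair is stored once, for j > i, because Newton's third law supplies the opposite force.

```c
/* Record every pair (i, j), j > i, whose separation is within cutoff. */
void build_neighbor_list(int N, const double x[][3], double cutoff,
                         int *npairs, int (*pairs)[2])
{
    double cut2 = cutoff * cutoff;
    *npairs = 0;
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {      /* j > i: each pair appears once */
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            if (dx * dx + dy * dy + dz * dz < cut2) {
                pairs[*npairs][0] = i;
                pairs[*npairs][1] = j;
                (*npairs)++;
            }
        }
    }
}
```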
Figure 5.11. Pseudocode for a parallel molecular dynamics program using OpenMP
Mandelbrot set computation
Figure 5.12. Pseudocode for a sequential version of the Mandelbrot set generation program
Known uses
Figure 5.13. Pseudocode for a parallel MPI version of the Mandelbrot set generation program
Related Patterns
5.5. THE MASTER/WORKER PATTERN
Problem
Context
Forces
Solution
Figure 5.14. The two elements of the Master/Worker pattern are the master and the worker. There is only one master, but there can be one or more workers. Logically, the master sets up the calculation and then manages a bag of tasks. Each worker grabs a task from the bag, carries out the work, and then goes back to the bag, repeating until the termination condition is met.
Discussion
Detecting completion
Variations
Examples
Generic solutions
Figure 5.15. Master process for a master/worker program. This assumes a shared address space so the task and results queues are visible to all UEs. In this simple version, the master initializes the queue, launches the workers, and then waits for the workers to finish (that is, the ForkJoin command launches the workers and then waits for them to finish before returning). At that point, results are consumed and the computation completes.
Figure 5.16. Worker process for a master/worker program. We assume a shared address space thereby making task_queue and global_results available to the master and all workers. A worker loops over the task_queue and exits when the end of the queue is encountered.
Figure 5.17. Instantiating and initializing a pooled executor
Mandelbrot set generation
Figure 5.18. Pseudocode for a sequential version of the Mandelbrot set generation program
Figure 5.19. Master process for a master/worker parallel version of the Mandelbrot set generation program
Figure 5.20. Worker process for a master/worker parallel version of the Mandelbrot set generation program. We assume a shared address space thereby making task_queue, global_results, and ranges available to the master and the workers.
Known uses
Related Patterns
5.6. THE LOOP PARALLELISM PATTERN
Problem
Context
Forces
Solution
Figure 5.21. Program fragment showing merging loops to increase the amount of work per iteration
Figure 5.22. Program fragment showing coalescing nested loops to produce a single loop with a larger number of iterations
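As a minimal sketch of the coalescing idea in Fig. 5.22 (not the book's fragment), the two indices of a nested loop are recovered from one flattened index so that a single parallel loop has N*M iterations to distribute; the kernel work() is a hypothetical placeholder.

```c
#include <omp.h>

extern double work(int i, int j);    /* hypothetical per-element computation */

void coalesced(int N, int M, double **A)
{
    #pragma omp parallel for
    for (int k = 0; k < N * M; k++) {
        int i = k / M;               /* recover the original row index    */
        int j = k % M;               /* recover the original column index */
        A[i][j] = work(i, j);
    }
}
```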
Performance considerations
Figure 5.23. Program fragment showing an example of false sharing. The small array A is held in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take ownership of the cache line back from the other UEs. This back-and-forth movement of the cache lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
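The false-sharing problem and fix described in the caption of Fig. 5.23 can be sketched in C/OpenMP as follows (sizes and names are illustrative, not the book's): in the first version every update of A[id] from a different UE fights over the same cache line; the second accumulates into a private temporary and writes A[id] once.

```c
#include <omp.h>

#define NUM_THREADS 4
extern double term(int i);                   /* hypothetical per-iteration work */

void sum_bad(double A[NUM_THREADS], int n)
{
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        A[id] = 0.0;
        for (int i = id; i < n; i += NUM_THREADS)
            A[id] += term(i);                /* repeated writes: cache line ping-pongs */
    }
}

void sum_good(double A[NUM_THREADS], int n)
{
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        double tmp = 0.0;                    /* private temporary, no false sharing */
        for (int i = id; i < n; i += NUM_THREADS)
            tmp += term(i);
        A[id] = tmp;                         /* one write per UE at the end */
    }
}
```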
Examples
Numerical integration
Figure 5.24. Sequential program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx = π
Molecular dynamics
Figure 5.25. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown previously in Fig. 4.4.
Mandelbrot set computation
Figure 5.26. Pseudocode for a sequential version of the Mandelbrot set generation program
Mesh computation
Figure 5.27. Parallel heat-diffusion program using OpenMP. This program is described in the Examples section of the Geometric Decomposition pattern.
Figure 5.28. Parallel heat-diffusion program using OpenMP, with reduced thread management overhead and memory management more appropriate for NUMA computers
Known uses
Related Patterns
5.7. THE FORK/JOIN PATTERN
Problem
Context
Forces
Solution
Direct task/UE mapping
Indirect task/UE mapping
Examples
Mergesort using direct mapping
Figure 5.29. Parallel mergesort where each task corresponds to a thread
Figure 5.30. Instantiating FJTaskRunnerGroup and invoking the master task
Mergesort using indirect mapping
Known uses
Figure 5.31. Mergesort using the FJTask framework
Related Patterns
5.8. THE SHARED DATA PATTERN
Problem
Context
Forces
Solution
Be sure this pattern is needed
Define an abstract data type
Implement an appropriate concurrency-control protocol
Figure 5.32. Typical use of read/write locks. These locks are defined in the java.util.concurrent.locks package. Putting the unlock in the finally block ensures that the lock will be unlocked regardless of how the try block is exited (normally or with an exception) and is a standard idiom in Java programs that use locks rather than synchronized blocks.
Figure 5.33. Example of nested locking using synchronized blocks with dummy objects lockA and lockB
Review other considerations
Examples
Shared queues
Genetic algorithm for nonlinear optimization
Figure 5.34. Pseudocode for the population shuffle loop from the genetic algorithm program GAFORT
Figure 5.35. Pseudocode for an ineffective approach to parallelizing the population shuffle in the genetic algorithm program GAFORT
Known uses
Figure 5.36. Pseudocode for a parallelized loop to carry out the population shuffle in the genetic algorithm program GAFORT. This version of the loop uses a separate lock for each chromosome and runs effectively in parallel.
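The one-lock-per-chromosome idea described in Fig. 5.36 can be sketched with OpenMP locks as follows (a simplified stand-in, not the GAFORT code; the partner array and a single fitness value per chromosome are assumptions). Acquiring the two locks in ascending index order avoids deadlock.

```c
#include <omp.h>

void shuffle(int npop, double *fitness, omp_lock_t *lock, const int *partner)
{
    #pragma omp parallel for
    for (int i = 0; i < npop; i++) {
        int j = partner[i];                  /* chromosome to swap with */
        if (i == j) continue;
        int lo = i < j ? i : j;
        int hi = i < j ? j : i;

        omp_set_lock(&lock[lo]);             /* fixed order prevents deadlock */
        omp_set_lock(&lock[hi]);

        double tmp = fitness[i];             /* swap the (simplified) payload */
        fitness[i] = fitness[j];
        fitness[j] = tmp;

        omp_unset_lock(&lock[hi]);
        omp_unset_lock(&lock[lo]);
    }
}
```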
Related Patterns
5.9. THE SHARED QUEUE PATTERN
Problem
Context
Forces
Solution
The abstract data type (ADT)
Queue with "one at a time" execution
Figure 5.37. Queue that ensures that at most one thread can access the data structure at one time. If the queue is empty, null is immediately returned.
Figure 5.38. Queue that ensures at most one thread can access the data structure at one time. Unlike the first shared queue example, if the queue is empty, the thread waits. When used in a master/worker algorithm, a poison pill would be required to signal termination to a thread.
Concurrency-control protocols for noninterfering operations
Figure 5.39. Shared queue that takes advantage of the fact that put and take are noninterfering and uses separate locks so they can proceed concurrently
Concurrency-control protocols using nested locks
Figure 5.40. Blocking queue with multiple locks to allow concurrent put and take on a nonempty queue
Distributed shared queues
Figure 5.41. Nonblocking shared queue with takeLast operation
Figure 5.42. Abstract base class for tasks
Figure 5.43. Class defining behavior of threads in the thread pool (continued in Fig. 5.44 and Fig. 5.45)
Figure 5.44. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and continued in Fig. 5.45)
Examples
Computing Fibonacci numbers
Figure 5.45. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and Fig. 5.44)
Figure 5.46. The TaskRunnerGroup class. This class initializes and manages the threads in the thread pool.
Related Patterns
Figure 5.47. Program to compute Fibonacci numbers (continued in Fig. 5.48)
Figure 5.48. Program to compute Fibonacci numbers (continued from Fig. 5.47)
5.10. THE DISTRIBUTED ARRAY PATTERN
Problem
Context
Forces
Solution
Overview
Array distributions
Figure 5.49. Original square matrix A
Figure 5.50. 1D distribution of A onto four UEs
Figure 5.51. 2D distribution of A onto four UEs
Figure 5.52. 1D block-cyclic distribution of A onto four UEs
Figure 5.53. 2D block-cyclic distribution of A onto four UEs, part 1: Decomposing A
Figure 5.54. 2D block-cyclic distribution of A onto four UEs, part 2: Assigning submatrices to UEs
Figure 5.55. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). LA_{l,m} is the block with block indices (l, m). Each element is labeled both with its original global indices (a_{i,j}) and its indices within block LA_{l,m} (l_{x,y}).
Figure 5.56. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). Each element is labeled both with its original global indices a_{i,j} and its local indices [x', y']. Local indices are with respect to the contiguous matrix used to store all blocks assigned to this UE.
Choosing a distribution
Mapping indices
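As a minimal sketch of global-to-local index mapping (conventions are assumptions, not the book's), the owner UE and local offset of global index i for two common 1D distributions of N elements over P UEs are:

```c
/* block:  i lives on UE i / chunk at local offset i % chunk, chunk = ceil(N/P)
   cyclic: i lives on UE i % P     at local offset i / P                       */
typedef struct { int ue; int local; } owner_t;

owner_t block_owner(int i, int N, int P)
{
    int chunk = (N + P - 1) / P;             /* ceiling(N / P) elements per UE */
    owner_t o = { i / chunk, i % chunk };
    return o;
}

owner_t cyclic_owner(int i, int P)
{
    owner_t o = { i % P, i / P };
    return o;
}
```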
Aligning computation with locality
Examples
Transposing a matrix stored as column blocks
Figure 5.57. Matrix A and its transpose, in terms of submatrices, distributed among four UEs
Figure 5.58. Code to transpose a matrix (continued in Fig. 5.59)
Figure 5.59. Code to transpose a matrix (continued from Fig. 5.58)
Known uses
Related Patterns
5.11. OTHER SUPPORTING STRUCTURES
5.11.1. SIMD
5.11.2. MPMD
5.11.3. Client-Server Computing
5.11.4. Concurrent Programming with Declarative Languages
5.11.5. Problem-Solving Environments
Chapter 6. The Implementation Mechanisms Design Space
Figure 6.1. Overview of the Implementation Mechanisms design space and its place in the pattern language
6.1. OVERVIEW
6.2. UE MANAGEMENT
6.2.1. Thread Creation/Destruction
OpenMP: thread creation/destruction
Java: thread creation/destruction
MPI: thread creation/destruction
6.2.2. Process Creation/Destruction
MPI: process creation/destruction
Java: process creation/destruction
OpenMP: process creation/destruction
6.3. SYNCHRONIZATION
6.3.1. Memory Synchronization and Fences
OpenMP: fences
Figure 6.2. Program showing one way to implement pairwise synchronization in OpenMP. The flush construct is vital. It forces the memory to be consistent, thereby making the updates to the flag array visible. For more details about the syntax of OpenMP, see the OpenMP appendix, Appendix A.
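The flag-plus-flush idiom described in the caption of Fig. 6.2 can be sketched as follows (a minimal two-thread version, not the book's program): thread 0 writes a value, flushes it, then publishes a flag; thread 1 spins on the flag and flushes before reading the value.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int flag = 0;
    double value = 0.0;

    #pragma omp parallel num_threads(2) shared(flag, value)
    {
        if (omp_get_thread_num() == 0) {
            value = 3.14;                 /* produce the data            */
            #pragma omp flush(value)
            flag = 1;                     /* then publish the flag       */
            #pragma omp flush(flag)
        } else {
            int done = 0;
            while (!done) {               /* spin until the flag is seen */
                #pragma omp flush(flag)
                done = flag;
            }
            #pragma omp flush(value)      /* make the data visible       */
            printf("received %f\n", value);
        }
    }
    return 0;
}
```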
Java: fences
MPI: fences
6.3.2. Barriers
MPI: barriers
Figure 6.3. MPI program containing a barrier. This program is used to time the execution of function runit().
OpenMP: barriers
Figure 6.4. OpenMP program containing a barrier. This program is used to time the execution of function runit().
Java: barriers
6.3.3. Mutual Exclusion
Figure 6.5. Java program containing a CyclicBarrier. This program is used to time the execution of function runit().
Figure 6.6. Example of an OpenMP program that includes a critical section
OpenMP: mutual exclusion
Figure 6.7. Example of using locks in OpenMP
Java: mutual exclusion
Figure 6.8. Java version of the OpenMP program in Fig. 6.6
Figure 6.9. Java program showing how to implement mutual exclusion with a synchronized method
MPI: mutual exclusion
Figure 6.10. Example of an MPI program with an update that requires mutual exclusion. A single process is dedicated to the update of this data structure.
6.4. COMMUNICATION
6.4.1. Message Passing
MPI: message passing
Figure 6.11. MPI program that uses a ring of processors and a communication pattern where information is shifted to the right. The functions to do the computation do not affect the communication itself so they are not shown. (Continued in Fig. 6.12.)
OpenMP: message passing
Figure 6.12. MPI program that uses a ring of processors and a communication pattern where information is shifted to the right (continued from Fig. 6.11)
Figure 6.13. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued in Fig. 6.14)
Java: message passing
Figure 6.14. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued from Fig. 6.13)
Figure 6.15. The message-passing block from Fig. 6.13 and Fig. 6.14, but with more careful synchronization management (pairwise synchronization)
6.4.2. Collective Communication
Reduction
Figure 6.16. MPI program to time the execution of a function called runit(). We use MPI_Reduce to find minimum, maximum, and average runtimes.
Implementing reduction operations
Figure 6.17. OpenMP program to time the execution of a function called runit(). We use a reduction clause to find the sum of the runtimes.
Serial computation
Figure 6.18. Serial reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
Tree-based reduction
Figure 6.19. Tree-based reduction to compute the sum of a(0) through a(3) on a system with 4 UEs. sum(a(i:j)) denotes the sum of elements i through j of array a.
Recursive doubling
Figure 6.20. Recursive-doubling reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
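The recursive-doubling pattern in Fig. 6.20 can be sketched with MPI as follows (assuming the number of UEs is a power of two; not the book's code): at step k each UE exchanges its partial sum with the partner whose ID differs in bit k, so after log2(P) steps every UE holds the total.

```c
#include <mpi.h>

double recursive_doubling_sum(double my_value, MPI_Comm comm)
{
    int id, num_UEs;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &num_UEs);

    double sum = my_value;
    for (int mask = 1; mask < num_UEs; mask <<= 1) {
        int partner = id ^ mask;                 /* flip bit k to find the partner */
        double received;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &received, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += received;                         /* both partners add the same pair */
    }
    return sum;                                  /* every UE ends with the full sum */
}
```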
6.4.3. Other Communication Constructs
Endnotes
Appendix A. A Brief Introduction to OpenMP
Figure A.1. Fortran and C programs that print a simple string to standard output
A.1. CORE CONCEPTS
Figure A.2. Fortran and C programs that print a simple string to standard output
Figure A.3. Fortran and C programs that print a simple string to standard output
Figure A.4. Simple program to show the difference between shared and local (or private) data
A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
A.3. WORKSHARING
Figure A.5. Fortran and C examples of a typical loop-oriented program
Figure A.6. Fortran and C examples of a typical loop-oriented program. In this version of the program, the computationally intensive loop has been isolated and modified so the iterations are independent.
Figure A.7. Fortran and C examples of a typical loop-oriented program parallelized with OpenMP
A.4. DATA ENVIRONMENT CLAUSES
Figure A.8. C program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx = π
Figure A.9. C program showing use of the private, firstprivate, and lastprivate clauses. This program is incorrect in that the variables h and j do not have well-defined values when the printf is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.
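To complement the incorrect program of Fig. A.9, here is a minimal correct sketch (not from the appendix) of firstprivate and lastprivate: each thread starts with its own copy of h initialized from the master's value, and the value of j from the sequentially last iteration is copied back out after the loop.

```c
#include <stdio.h>

int main(void)
{
    int h = 10;
    int j = 0;

    #pragma omp parallel for firstprivate(h) lastprivate(j)
    for (int i = 0; i < 100; i++) {
        j = i + h;                 /* private j is assigned every iteration */
    }

    printf("j = %d\n", j);         /* value from the last iteration: 99 + 10 */
    return 0;
}
```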
A.5. THE OpenMP RUNTIME LIBRARY
Figure A.10. C program showing use of the most common runtime library functions
A.6. SYNCHRONIZATION
Figure A.11. Parallel version of the program in Fig. A.5. In this case, however, we assume that the calls to combine() can occur in any order as long as only one thread at a time executes the function. This is enforced with the critical construct.
Figure A.12. Example showing how the lock functions in OpenMP are used
A.7. THE SCHEDULE CLAUSE
Figure A.13. Parallel version of the program in Fig. A.11, modified to show the use of the schedule clause
A.8. THE REST OF THE LANGUAGE
Appendix B. A Brief Introduction to MPI
B.1. CONCEPTS
B.2. GETTING STARTED
Figure B.1. Program to print a simple string to standard output
Figure B.2. Parallel program in which each process prints a simple string to the standard output
B.3. BASIC POINT-TO-POINT MESSAGE PASSING
Figure B.3. The standard blocking point-to-point communication routines in the C binding for MPI 1.1
Figure B.4. MPI program to "bounce" a message between two processes using the standard blocking point-to-point communication routines in the C binding to MPI 1.1
B.4. COLLECTIVE OPERATIONS
Figure B.6. Program to time the ring function as it passes messages around a ring of processes (continued in Fig. B.7). The program returns the time from the process that takes the longest elapsed time to complete the communication. The code for the ring function is not relevant for this example, but it is included in Fig. B.8.
Figure B.7. Program to time the ring function as it passes messages around a ring of processes (continued from Fig. B.6)
Figure B.5. The major collective communication routines in the C binding to MPI 1.1 (MPI_Barrier, MPI_Bcast, and MPI_Reduce)
Figure B.8. Function to pass a message around a ring of processes. It is deadlock-free because the sends and receives are split between the even and odd processes.
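The even/odd trick in Fig. B.8's caption can be sketched as follows (not the book's function; assumes at least two processes): even-ranked processes send and then receive, odd-ranked processes receive and then send, so the blocking calls around the ring can never all wait on each other.

```c
#include <mpi.h>

void ring_shift_right(double *outgoing, double *incoming, int n, MPI_Comm comm)
{
    int id, num_procs;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &num_procs);

    int right = (id + 1) % num_procs;               /* destination of my message */
    int left  = (id + num_procs - 1) % num_procs;   /* source of the incoming message */

    if (id % 2 == 0) {
        MPI_Send(outgoing, n, MPI_DOUBLE, right, 0, comm);
        MPI_Recv(incoming, n, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(incoming, n, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(outgoing, n, MPI_DOUBLE, right, 0, comm);
    }
}
```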
Figure B.9. The nonblocking or asynchronous communication functions
B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
Figure B.10. Program using nonblocking communication to iteratively update a field using an algorithm that requires only communication around a ring (shifting messages to the right)
Figure B.11. Function to pass a message around a ring of processes using persistent communication
B.6. MPI AND FORTRAN
Figure B.12. Comparison of the C and Fortran language bindings for the reduction routine in MPI 1.1
Figure B.13. Simple Fortran MPI program where each process prints its ID and the number of processes in the computation
B.7. CONCLUSION
Appendix C. A Brief Introduction to Concurrent Programming in Java
Figure C.1. A class holding pairs of objects of an arbitrary type. Without generic types, this would have been done by declaring x and y to be of type Object, requiring casting the returned values of getX and getY. In addition to less-verbose programs, this allows type errors to be found by the compiler rather than throwing a ClassCastException at runtime.
C.1. CREATING THREADS
Figure C.2. Program to create four threads, passing a Runnable in the Thread constructor. Thread-specific data is held in a field of the Runnable object.
C.1.1. Anonymous Inner Classes
C.1.2. Executors and Factories
Figure C.3. Program similar to the one in Fig. C.2, but using an anonymous class to define the Runnable object
Figure C.4. Program using a ThreadPoolExecutor instead of creating threads directly
Figure C.5. Code fragment illustrating use of Callable and Future
C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
C.3. SYNCHRONIZED BLOCKS
C.4. WAIT AND NOTIFY
Figure C.6. Basic idiom for using wait. Because wait throws an InterruptedException, it should be enclosed in a try-catch block, which is omitted here.
C.5. LOCKS
Figure C.7. A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and Condition instead of synchronized blocks with wait and notify
C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
Figure C.8. Simple sequential loop-based program similar to the one in Fig. A.5
Figure C.9. Program showing a parallel version of the sequential program in Fig. C.8 where each iteration of the big_comp loop is a separate task. A thread pool containing ten threads is used to execute the tasks. A CountDownLatch is used to ensure that all of the tasks have completed before executing the (still sequential) loop that combines the results.
C.7. INTERRUPTS
Glossary
About the Authors