Structured Parallel Programming: Patterns for Efficient Computation
Listings
Preface
Preliminaries
1 Introduction
1.1 Think Parallel
1.2 Performance
1.3 Motivation: Pervasive Parallelism
1.3.1 Hardware Trends Encouraging Parallelism
1.3.2 Observed Historical Trends in Parallelism
1.3.3 Need for Explicit Parallel Programming
1.4 Structured Pattern-Based Programming
1.5 Parallel Programming Models
1.5.1 Desired Properties
1.5.2 Abstractions Instead of Mechanisms
1.5.3 Expression of Regular Data Parallelism
1.5.4 Composability
1.5.5 Portability of Functionality
1.5.6 Performance Portability
1.5.7 Safety, Determinism, and Maintainability
1.5.8 Overview of Programming Models Used
Cilk Plus
Threading Building Blocks (TBB)
OpenMP
Array Building Blocks (ArBB)
OpenCL
1.5.9 When to Use Which Model?
1.6 Organization of this Book
1.7 Summary
2 Background
2.1 Vocabulary and Notation
2.2 Strategies
2.3 Mechanisms
2.4 Machine Models
2.4.1 Machine Model
Instruction Parallelism
Memory Hierarchy
Virtual Memory
Multiprocessor Systems
Attached Devices
2.4.2 Key Features for Performance
Data Locality
Parallel Slack
2.4.3 Flynn's Characterization
2.4.4 Evolution
2.5 Performance Theory
2.5.1 Latency and Throughput
2.5.2 Speedup, Efficiency, and Scalability
2.5.3 Power
2.5.4 Amdahl's Law
2.5.5 Gustafson-Barsis' Law
2.5.6 Work-Span Model
2.5.7 Asymptotic Complexity
2.5.8 Asymptotic Speedup and Efficiency
2.5.9 Little's Formula
2.6 Pitfalls
2.6.1 Race Conditions
2.6.2 Mutual Exclusion and Locks
2.6.3 Deadlock
2.6.4 Strangled Scaling
2.6.5 Lack of Locality
2.6.6 Load Imbalance
2.6.7 Overhead
2.7 Summary
I Patterns
3 Patterns
3.1 Nesting Pattern
3.2 Structured Serial Control Flow Patterns
3.2.1 Sequence
3.2.2 Selection
3.2.3 Iteration
3.2.4 Recursion
3.3 Parallel Control Patterns
3.3.1 Fork–Join
3.3.2 Map
3.3.3 Stencil
3.3.4 Reduction
3.3.5 Scan
3.3.6 Recurrence
3.4 Serial Data Management Patterns
3.4.1 Random Read and Write
3.4.2 Stack Allocation
3.4.3 Heap Allocation
3.4.4 Closures
3.4.5 Objects
3.5 Parallel Data Management Patterns
3.5.1 Pack
3.5.2 Pipeline
3.5.3 Geometric Decomposition
3.5.4 Gather
3.5.5 Scatter
3.6 Other Parallel Patterns
3.6.1 Superscalar Sequences
3.6.2 Futures
3.6.3 Speculative Selection
3.6.4 Workpile
3.6.5 Search
3.6.6 Segmentation
3.6.7 Expand
3.6.8 Category Reduction
3.6.9 Term Graph Rewriting
3.7 Non-Deterministic Patterns
3.7.1 Branch and Bound
3.7.2 Transactions
3.8 Programming Model Support for Patterns
3.8.1 Cilk Plus
Nesting, Recursion, Fork–Join
Reduction
Map, Workpile
Scatter, Gather
3.8.2 Threading Building Blocks
Nesting, Recursion, Fork–Join
Map
Workpile
Reduction
Scan
Pipeline
Speculative Selection, Branch and Bound
3.8.3 OpenMP
Map, Workpile
Reduction
Fork–Join
Stencil, Geometric Decomposition, Gather, Scatter
3.8.4 Array Building Blocks
Map
Reduction, Scan
Scatter, Gather
Pack
3.8.5 OpenCL
Map
Gather
Scatter
Reduction, Scan, Pack, Expand
Stencil
Workpile
Superscalar Sequences
Geometric Decomposition
Closures
3.9 Summary
4 Map
4.1 Map
4.2 Scaled Vector Addition (SAXPY)
4.2.1 Description of the Problem
4.2.2 Serial Implementation
4.2.3 TBB
4.2.4 Cilk Plus
4.2.5 Cilk Plus with Array Notation
4.2.6 OpenMP
4.2.7 ArBB Using Vector Operations
4.2.8 ArBB Using Elemental Functions
4.2.9 OpenCL
4.3 Mandelbrot
4.3.1 Description of the Problem
4.3.2 Serial Implementation
4.3.3 TBB
4.3.4 Cilk Plus
4.3.5 Cilk Plus with Array Notation
4.3.6 OpenMP
4.3.7 ArBB
4.3.8 OpenCL
4.4 Sequence of Maps versus Map of Sequence
4.5 Comparison of Parallel Models
4.6 Related Patterns
4.6.1 Stencil
4.6.2 Workpile
4.6.3 Divide-and-Conquer
4.7 Summary
5 Collectives
5.1 Reduce
5.1.1 Reordering Computations
5.1.2 Vectorization
5.1.3 Tiling
5.1.4 Precision
5.1.5 Implementation
OpenCL
TBB
Cilk Plus
ArBB
OpenMP
5.2 Fusing Map and Reduce
5.2.1 Explicit Fusion in TBB
5.2.2 Explicit Fusion in Cilk Plus
5.2.3 Automatic Fusion in ArBB
5.3 Dot Product
5.3.1 Description of the Problem
5.3.2 Serial Implementation
5.3.3 SSE Intrinsics
5.3.4 TBB
5.3.5 Cilk Plus
5.3.6 OpenMP
5.3.7 ArBB
5.4 Scan
5.4.1 Cilk Plus
5.4.2 TBB
5.4.3 ArBB
5.4.4 OpenMP
5.5 Fusing Map and Scan
5.6 Integration
5.6.1 Description of the Problem
5.6.2 Serial Implementation
5.6.3 Cilk Plus
5.6.4 OpenMP
5.6.5 TBB
5.6.6 ArBB
5.7 Summary
6 Data Reorganization
6.1 Gather
6.1.1 General Gather
6.1.2 Shift
6.1.3 Zip
6.2 Scatter
6.2.1 Atomic Scatter
6.2.2 Permutation Scatter
6.2.3 Merge Scatter
6.2.4 Priority Scatter
6.3 Converting Scatter to Gather
6.4 Pack
6.5 Fusing Map and Pack
6.6 Geometric Decomposition and Partition
6.7 Array of Structures vs. Structures of Arrays
6.8 Summary
7 Stencil and Recurrence
7.1 Stencil
7.2 Implementing Stencil with Shift
7.3 Tiling Stencils for Cache
7.4 Optimizing Stencils for Communication
7.5 Recurrence
7.6 Summary
8 Fork–Join
8.1 Definition
8.2 Programming Model Support for Fork–Join
8.2.1 Cilk Plus Support for Fork–Join
8.2.2 TBB Support for Fork–Join
8.2.3 OpenMP Support for Fork–Join
8.3 Recursive Implementation of Map
8.4 Choosing Base Cases
8.5 Load Balancing
8.6 Complexity of Parallel Divide-and-Conquer
8.7 Karatsuba Multiplication of Polynomials
8.7.1 Note on Allocating Scratch Space
8.8 Cache Locality and Cache-Oblivious Algorithms
8.9 Quicksort
8.9.1 Cilk Quicksort
8.9.2 TBB Quicksort
8.9.3 Work and Span for Quicksort
8.10 Reductions and Hyperobjects
8.11 Implementing Scan with Fork–Join
8.12 Applying Fork–Join to Recurrences
8.12.1 Analysis
8.12.2 Flat Fork–Join
8.13 Summary
9 Pipeline
9.1 Basic Pipeline
9.2 Pipeline with Parallel Stages
9.3 Implementation of a Pipeline
9.4 Programming Model Support for Pipelines
9.4.1 Pipeline in TBB
9.4.2 Pipeline in Cilk Plus
9.5 More General Topologies
9.6 Mandatory versus Optional Parallelism
9.7 Summary
II Examples
10 Forward Seismic Simulation
10.1 Background
10.2 Stencil Computation
10.3 Impact of Caches on Arithmetic Intensity
10.4 Raising Arithmetic Intensity with Space–Time Tiling
10.5 Cilk Plus Code
10.6 ArBB Implementation
10.7 Summary
11 K-Means Clustering
11.1 Algorithm
11.2 K-Means with Cilk Plus
11.2.1 Hyperobjects
11.3 K-Means with TBB
11.4 Summary
12 Bzip2 Data Compression
12.1 The Bzip2 Algorithm
12.2 Three-Stage Pipeline Using TBB
12.3 Four-Stage Pipeline Using TBB
12.4 Three-Stage Pipeline Using Cilk Plus
12.5 Summary
13 Merge Sort
13.1 Parallel Merge
13.1.1 TBB Parallel Merge
13.1.2 Work and Span of Parallel Merge
13.2 Parallel Merge Sort
13.2.1 Work and Span of Merge Sort
13.3 Summary
14 Sample Sort
14.1 Overall Structure
14.2 Choosing the Number of Bins
14.3 Binning
14.3.1 TBB Implementation
14.4 Repacking and Subsorting
14.5 Performance Analysis of Sample Sort
14.6 For C++ Experts
14.7 Summary
15 Cholesky Factorization
15.1 Fortran Rules!
15.2 Recursive Cholesky Decomposition
15.3 Triangular Solve
15.4 Symmetric Rank Update
15.5 Where Is the Time Spent?
15.6 Summary
Appendices
Appendix A: Further Reading
A.1 Parallel Algorithms and Patterns
A.2 Computer Architecture Including Parallel Systems
A.3 Parallel Programming Models
Appendix B: Cilk Plus
B.1 Shared Design Principles with TBB
B.2 Unique Characteristics
B.3 Borrowing Components from TBB
B.4 Keyword Spelling
B.5 cilk_for
B.6 cilk_spawn and cilk_sync
B.7 Reducers (Hyperobjects)
B.7.1 C++ Syntax
B.7.2 C Syntax
B.8 Array Notation
B.8.1 Specifying Array Sections
B.8.2 Operations on Array Sections
B.8.3 Reductions on Array Sections
B.8.4 Implicit Index
B.8.5 Avoid Partial Overlap of Array Sections
B.9 #pragma simd
B.10 Elemental Functions
B.10.1 Attribute Syntax
B.11 Note on C++11
B.12 Notes on Setup
B.13 History
B.14 Summary
Appendix C: TBB
C.1 Unique Characteristics
C.2 Using TBB
C.3 parallel_for
C.3.1 blocked_range
C.3.2 Partitioners
C.4 parallel_reduce
C.5 parallel_deterministic_reduce
C.6 parallel_pipeline
C.7 parallel_invoke
C.8 task_group
C.9 task
C.9.1 empty_task
C.10 atomic
C.11 enumerable_thread_specific
C.12 Notes on C++11
C.13 History
C.14 Summary
Appendix D: C++11
D.1 Declaring with auto
D.2 Lambda Expressions
D.3 std::move
Appendix E: Glossary
Bibliography
Index