CUDA Programming A Developer's Guide to Parallel Computing with....pdf

Published: 2022-06-08 | Uploaded by: admin | Category: Manuals | File size: 16.58M | Format: pdf


Front Cover

CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs

Contents

Preface

Chapter 1 - A Short History of Supercomputing

INTRODUCTION

VON NEUMANN ARCHITECTURE

CRAY

CONNECTION MACHINE

CELL PROCESSOR

MULTINODE COMPUTING

THE EARLY DAYS OF GPGPU CODING

THE DEATH OF THE SINGLE-CORE SOLUTION

NVIDIA AND CUDA

GPU HARDWARE

ALTERNATIVES TO CUDA

CONCLUSION

Chapter 2 - Understanding Parallelism with GPUs

INTRODUCTION

TRADITIONAL SERIAL CODE

SERIAL/PARALLEL PROBLEMS

CONCURRENCY

TYPES OF PARALLELISM

FLYNN’S TAXONOMY

SOME COMMON PARALLEL PATTERNS

CONCLUSION

Chapter 3 - CUDA Hardware Overview

PC ARCHITECTURE

GPU HARDWARE

CPUS AND GPUS

COMPUTE LEVELS

Chapter 4 - Setting Up CUDA

INTRODUCTION

INSTALLING THE SDK UNDER WINDOWS

VISUAL STUDIO

LINUX

MAC

INSTALLING A DEBUGGER

COMPILATION MODEL

ERROR HANDLING

CONCLUSION

Chapter 5 - Grids, Blocks, and Threads

WHAT IT ALL MEANS

THREADS

BLOCKS

GRIDS

WARPS

BLOCK SCHEDULING

A PRACTICAL EXAMPLE—HISTOGRAMS

CONCLUSION

Chapter 6 - Memory Handling with CUDA

INTRODUCTION

CACHES

SHARED MEMORY

CONSTANT MEMORY

GLOBAL MEMORY

TEXTURE MEMORY

CONCLUSION

Chapter 7 - Using CUDA in Practice

INTRODUCTION

SERIAL AND PARALLEL CODE

PROCESSING DATASETS

PROFILING

AN EXAMPLE USING AES

CONCLUSION

References

Chapter 8 - Multi-CPU and Multi-GPU Solutions

INTRODUCTION

LOCALITY

MULTI-CPU SYSTEMS

MULTI-GPU SYSTEMS

ALGORITHMS ON MULTIPLE GPUS

WHICH GPU?

SINGLE-NODE SYSTEMS

STREAMS

MULTIPLE-NODE SYSTEMS

CONCLUSION

Chapter 9 - Optimizing Your Application

STRATEGY 1: PARALLEL/SERIAL GPU/CPU PROBLEM BREAKDOWN

STRATEGY 2: MEMORY CONSIDERATIONS

STRATEGY 3: TRANSFERS

STRATEGY 4: THREAD USAGE, CALCULATIONS, AND DIVERGENCE

STRATEGY 5: ALGORITHMS

STRATEGY 6: RESOURCE CONTENTIONS

STRATEGY 7: SELF-TUNING APPLICATIONS

CONCLUSION

Chapter 10 - Libraries and SDK

INTRODUCTION

LIBRARIES

CUDA COMPUTING SDK

DIRECTIVE-BASED PROGRAMMING

WRITING YOUR OWN KERNELS

CONCLUSION

Chapter 11 - Designing GPU-Based Systems

INTRODUCTION

CPU PROCESSOR

GPU DEVICE

PCI-E BUS

GEFORCE CARDS

CPU MEMORY

AIR COOLING

LIQUID COOLING

DESKTOP CASES AND MOTHERBOARDS

MASS STORAGE

POWER CONSIDERATIONS

OPERATING SYSTEMS

CONCLUSION

Chapter 12 - Common Problems, Causes, and Solutions

INTRODUCTION

ERRORS WITH CUDA DIRECTIVES

PARALLEL PROGRAMMING ISSUES

ALGORITHMIC ISSUES

FINDING AND AVOIDING ERRORS

DEVELOPING FOR FUTURE GPUS

FURTHER RESOURCES

CONCLUSION

References

Index


CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs

Shane Cook

AMSTERDAM, BOSTON, HEIDELBERG, LONDON, NEW YORK, OXFORD, PARIS, SAN DIEGO, SAN FRANCISCO, SINGAPORE, SYDNEY, TOKYO

Morgan Kaufmann is an imprint of Elsevier

Acquiring Editor: Todd Green
Development Editor: Robyn Day
Project Manager: Andre Cuello
Designer: Kristen Davis

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

© 2013 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices, may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Application submitted

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-415933-4

For information on all MK publications visit our website at http://store.elsevier.com

Printed in the United States of America
13 14 10 9 8 7 6 5 4 3 2 1

Contents

Preface

Chapter 1 - A Short History of Supercomputing
Introduction
Von Neumann Architecture
Cray
Connection Machine
Cell Processor
Multinode Computing
The Early Days of GPGPU Coding
The Death of the Single-Core Solution
NVIDIA and CUDA
GPU Hardware
Alternatives to CUDA
  OpenCL
  DirectCompute
  CPU alternatives
  Directives and libraries
Conclusion

Chapter 2 - Understanding Parallelism with GPUs
Introduction
Traditional Serial Code
Serial/Parallel Problems
Concurrency
  Locality
Types of Parallelism
  Task-based parallelism
  Data-based parallelism
Flynn’s Taxonomy
Some Common Parallel Patterns
  Loop-based patterns
  Fork/join pattern
  Tiling/grids
  Divide and conquer
Conclusion

Chapter 3 - CUDA Hardware Overview
PC Architecture
GPU Hardware
CPUs and GPUs
Compute Levels
  Compute 1.0
  Compute 1.1
  Compute 1.2
  Compute 1.3
  Compute 2.0
  Compute 2.1

Chapter 4 - Setting Up CUDA
Introduction
Installing the SDK under Windows
Visual Studio
  Projects
  64-bit users
  Creating projects
Linux
  Kernel base driver installation (CentOS, Ubuntu 10.4)
Mac
Installing a Debugger
Compilation Model
Error Handling
Conclusion

Chapter 5 - Grids, Blocks, and Threads
What It All Means
Threads
  Problem decomposition
  How CPUs and GPUs are different
  Task execution model
  Threading on GPUs
  A peek at hardware
  CUDA kernels
Blocks
  Block arrangement
Grids
  Stride and offset
  X and Y thread indexes
Warps
  Branching
  GPU utilization
Block Scheduling
A Practical Example: Histograms
Conclusion
  Questions
  Answers

Chapter 6 - Memory Handling with CUDA
Introduction
Caches
  Types of data storage
Register Usage
Shared Memory
  Sorting using shared memory
  Radix sort
  Merging lists
  Parallel merging
  Parallel reduction
  A hybrid approach
  Shared memory on different GPUs
  Shared memory summary
  Questions on shared memory
  Answers for shared memory
Constant Memory
  Constant memory caching
  Constant memory broadcast
  Constant memory updates at runtime
  Constant question
  Constant answer
Global Memory
  Score boarding
  Global memory sorting
  Sample sort
  Questions on global memory
  Answers on global memory
Texture Memory
  Texture caching
  Hardware manipulation of memory fetches
  Restrictions using textures
Conclusion

Chapter 7 - Using CUDA in Practice
Introduction
Serial and Parallel Code
  Design goals of CPUs and GPUs
