(Updated 17.12) Qualcomm Snapdragon Mobile Platform OpenCL General Programming and Optimization Guide.pdf

Qualcomm® Snapdragon™ Mobile Platform OpenCL General Programming and Optimization
Revision history
Contents
Figures
Tables
1 Introduction
1.1 Purpose
1.2 Conventions
1.3 Technical assistance
2 Introduction to OpenCL
2.1 OpenCL background and overview
2.2 OpenCL on mobile
2.3 OpenCL standard
2.3.1 OpenCL API functions
2.3.2 OpenCL C language
2.3.3 OpenCL versions and profiles
2.4 OpenCL portability and backward compatibility
2.4.1 Program portability
2.4.2 Performance portability
2.4.3 Backward compatibility
3 OpenCL on Snapdragon
3.1 OpenCL on Snapdragon
3.2 Adreno GPU architecture
3.2.1 Adreno high-level architecture for OpenCL
3.2.2 Waves and fibers
3.2.3 Latency hiding
3.2.4 Workgroup assignment
3.3 Adreno A3x, A4x, and A5x differences on OpenCL
3.3.1 L2 cache
3.3.2 Local memory
3.4 Context switching between graphics and compute workload
3.4.1 Context switch
3.4.2 Limit kernel/workgroup execution time on GPU
3.5 OpenCL standard related improvement
3.6 OpenCL extensions
4 Adreno OpenCL application development
4.1 OpenCL application development on Android
4.2 Debugging tools
4.3 Snapdragon Profiler
4.4 Performance profiling
4.4.1 CPU timer
4.4.2 GPU timer
4.4.3 GPU timer vs. CPU timer
4.4.4 Performance mode
4.4.5 GPU frequency controls
5 Overview of performance optimizations
5.1 Performance portability
5.2 High-level view of optimization
5.3 Initial evaluation for OpenCL porting
5.4 Port CPU code to OpenCL GPU
5.5 Parallelize GPU and CPU workloads
5.6 Bottleneck analysis
5.6.1 Identify bottlenecks
5.6.2 Resolve bottlenecks
5.7 API level performance optimization
5.7.1 Proper arrangement of API function calls
5.7.2 Use event-driven pipeline
5.7.3 Kernel loading and building
5.7.4 Use in-order command queues
6 Workgroup size performance optimization
6.1 Obtain the maximum workgroup size
6.2 Required and preferred workgroup size
6.3 Factors affecting the maximum workgroup size
6.4 Kernels without barrier
6.5 Workgroup size tuning
6.5.1 Avoid using default workgroup size
6.5.2 Large workgroup size, better performance?
6.5.3 Fixed vs. dynamic workgroup size
6.5.4 One vs. two vs. three-dimensional (1D/2D/3D) workgroup
6.6 Other topics on workgroup size
6.6.1 Global work size and padding
6.6.2 Brute force search
6.6.3 Avoid uneven workload across workgroups
6.6.4 Workgroup synchronization
7 Memory performance optimization
7.1 OpenCL memories in Adreno GPUs
7.1.1 Local memory
7.1.2 Constant memory
7.1.3 Private memory
7.1.4 Global memory
7.1.4.1 Buffer
7.1.4.2 Image
7.1.4.3 Using image object vs. buffer object
7.1.4.4 Use of both image and buffer objects
7.1.4.5 Global memory vs. local memory
7.2 Optimal memory load/store
7.2.1 Coalesced memory load/store
7.2.2 Vectorized load/store
7.2.3 Optimal data type
7.2.4 16-bit floating (half) vs. 32-bit floating
7.3 Atomic functions
7.4 Zero copy
7.4.1 Use map over copy
7.4.2 Avoid memory copy for objects allocated not by OpenCL
7.4.2.1 ION memory extensions
7.4.2.2 QTI Android native buffer (ANB) extension
7.4.2.3 Using standard EGL extensions
7.5 Improve cache usage
7.6 CPU cache operations
7.7 Use of SVM
7.8 Best practices to reduce power/energy consumption
8 Kernel performance optimization
8.1 Kernel fusion or splitting
8.2 Compiler options
8.3 Conformant vs. fast vs. native math functions
8.4 Loop unrolling
8.5 Avoid branch divergence
8.6 Handle image boundaries
8.7 32-bit vs. 64-bit GPU memory access
8.8 Avoid use of size_t
8.9 Generic memory address space
8.10 Miscellaneous
9 OpenCL optimization case studies
9.1 Application sample code
9.1.1 Improve algorithm
9.1.2 Vectorized load/store
9.1.3 Use image instead of buffer
9.2 Epsilon filter
9.2.1 Initial implementation
9.2.2 Data pack optimization
9.2.3 Vectorized load/store optimization
9.2.4 Further increase workload per work item
9.2.5 Use local memory optimization
9.2.6 Branch operations optimization
9.2.7 Summary
9.3 Sobel filter
9.3.1 Algorithm optimization
9.3.2 Data pack optimization
9.3.3 Vectorized load/store optimization
9.3.4 Performance and summary
9.4 Summary
10 Summary
A How to enable performance mode
A.1 Adreno A3x GPU
A.1.1 CPU settings
A.1.2 GPU settings
A.2 Adreno A4x GPU and Adreno A5x GPU
B References
B.1 Related documents
B.2 Acronyms and terms
Qualcomm Technologies, Inc.
Qualcomm® Snapdragon™ Mobile Platform OpenCL General Programming and Optimization
80-NB295-11 A
November 3, 2017

Qualcomm Snapdragon and Adreno are products of Qualcomm Technologies, Inc. Other Qualcomm products referenced herein are products of Qualcomm Technologies, Inc. or its subsidiaries.

Qualcomm, Snapdragon, and Adreno are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other product and brand names may be trademarks or registered trademarks of their respective owners.

This technical data may be subject to U.S. and international export, re-export, or transfer (“export”) laws. Diversion contrary to U.S. and international law is strictly prohibited.

Qualcomm Technologies, Inc.
5775 Morehouse Drive
San Diego, CA 92121
U.S.A.

© 2017 Qualcomm Technologies, Inc. All rights reserved.
Revision history
Revision    Date             Description
A           November 2017    Initial release

80-NB295-11 A MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION
Figures
Figure 2-1 Heterogeneous system using OpenCL
Figure 3-1 High-level architecture of the Adreno A5x GPUs for OpenCL
Figure 3-2 An example of workgroup layout and dispatch in Adreno GPUs
Figure 3-3 An example of workgroup allocation to SPs
Figure 3-4 Illustration of coalesced vs. non-coalesced data load
Figure 4-1 Profiling flags for the clEnqueueNDRangeKernel call in Adreno GPUs
Figure 7-1 OpenCL conceptual memory hierarchy
Figure 8-1 Pictorial representation of divergence across two waves
Figure 9-1 Epsilon filter algorithm
Figure 9-2 Data pack using 16-bit half (fp16) data type
Figure 9-3 Filtering more pixels per work item
Figure 9-4 Process 8 pixels per work item
Figure 9-5 Process 16 pixels per work item
Figure 9-6 Using local memory for Epsilon filtering
Figure 9-7 Two directional operations in Sobel filter
Figure 9-8 Sobel filter separability
Figure 9-9 Process one pixel per work item: load 3x3 pixels per kernel
Figure 9-10 Process 16x1 pixels: load 18x3 pixels
Figure 9-11 Process 16x2 pixels: load 18x4 pixels
Figure 9-12 Performance boost by using data pack and vectorized load/store
Tables
Table 2-1 OpenCL platform layer functionality
Table 2-2 OpenCL run time layer functionality
Table 3-1 Adreno GPUs with OpenCL support
Table 3-2 Local memory performance summary
Table 3-3 Standard OpenCL features supported in Adreno GPUs
Table 4-1 Requirements of OpenCL development with Adreno GPUs
Table 7-1 OpenCL memory model in Adreno GPUs
Table 7-2 Buffer vs. image in Adreno GPUs
Table 7-3 Coalesced access support in Adreno GPUs
Table 8-1 Performance of OpenCL math functions (IEEE 754 conformant)
Table 8-2 Math function options based on precision/performance
Table 9-1 Performance from using local memory
Table 9-2 Summary of optimizations and performance
Table 9-3 Performance profiled for images with different resolutions
Table 9-4 Amount of data load/store for the three cases
Table 9-5 Number of loads and stores by using vectorized load/store