Qualcomm® Snapdragon™ Mobile Platform OpenCL General Programming and Optimization
Revision history
Contents
Figures
Tables
1 Introduction
1.1 Purpose
1.2 Conventions
1.3 Technical assistance
2 Introduction to OpenCL
2.1 OpenCL background and overview
2.2 OpenCL on mobile
2.3 OpenCL standard
2.3.1 OpenCL API functions
2.3.2 OpenCL C language
2.3.3 OpenCL versions and profiles
2.4 OpenCL portability and backward compatibility
2.4.1 Program portability
2.4.2 Performance portability
2.4.3 Backward compatibility
3 OpenCL on Snapdragon
3.1 OpenCL on Snapdragon
3.2 Adreno GPU architecture
3.2.1 Adreno high-level architecture for OpenCL
3.2.2 Waves and fibers
3.2.3 Latency hiding
3.2.4 Workgroup assignment
3.3 Adreno A3x, A4x, and A5x differences on OpenCL
3.3.1 L2 cache
3.3.2 Local memory
3.4 Context switching between graphics and compute workload
3.4.1 Context switch
3.4.2 Limit kernel/workgroup execution time on GPU
3.5 OpenCL standard-related improvements
3.6 OpenCL extensions
4 Adreno OpenCL application development
4.1 OpenCL application development on Android
4.2 Debugging tools
4.3 Snapdragon Profiler
4.4 Performance profiling
4.4.1 CPU timer
4.4.2 GPU timer
4.4.3 GPU timer vs. CPU timer
4.4.4 Performance mode
4.4.5 GPU frequency controls
5 Overview of performance optimizations
5.1 Performance portability
5.2 High-level view of optimization
5.3 Initial evaluation for OpenCL porting
5.4 Port CPU code to OpenCL GPU
5.5 Parallelize GPU and CPU workloads
5.6 Bottleneck analysis
5.6.1 Identify bottlenecks
5.6.2 Resolve bottlenecks
5.7 API level performance optimization
5.7.1 Proper arrangement of API function calls
5.7.2 Use event-driven pipeline
5.7.3 Kernel loading and building
5.7.4 Use in-order command queues
6 Workgroup size performance optimization
6.1 Obtain the maximum workgroup size
6.2 Required and preferred workgroup size
6.3 Factors affecting the maximum workgroup size
6.4 Kernels without barrier
6.5 Workgroup size tuning
6.5.1 Avoid using default workgroup size
6.5.2 Large workgroup size, better performance?
6.5.3 Fixed vs. dynamic workgroup size
6.5.4 One vs. two vs. three-dimensional (1D/2D/3D) workgroup
6.6 Other topics on workgroup size
6.6.1 Global work size and padding
6.6.2 Brute force search
6.6.3 Avoid uneven workload across workgroups
6.6.4 Workgroup synchronization
7 Memory performance optimization
7.1 OpenCL memories in Adreno GPUs
7.1.1 Local memory
7.1.2 Constant memory
7.1.3 Private memory
7.1.4 Global memory
7.1.4.1 Buffer
7.1.4.2 Image
7.1.4.3 Using image object vs. buffer object
7.1.4.4 Use of both image and buffer objects
7.1.4.5 Global memory vs. local memory
7.2 Optimal memory load/store
7.2.1 Coalesced memory load/store
7.2.2 Vectorized load/store
7.2.3 Optimal data type
7.2.4 16-bit floating-point (half) vs. 32-bit floating-point
7.3 Atomic functions
7.4 Zero copy
7.4.1 Use map over copy
7.4.2 Avoid memory copy for objects not allocated by OpenCL
7.4.2.1 ION memory extensions
7.4.2.2 QTI Android native buffer (ANB) extension
7.4.2.3 Using standard EGL extensions
7.5 Improve cache usage
7.6 CPU cache operations
7.7 Use of SVM
7.8 Best practices to reduce power/energy consumption
8 Kernel performance optimization
8.1 Kernel fusion or splitting
8.2 Compiler options
8.3 Conformant vs. fast vs. native math functions
8.4 Loop unrolling
8.5 Avoid branch divergence
8.6 Handle image boundaries
8.7 32-bit vs. 64-bit GPU memory access
8.8 Avoid use of size_t
8.9 Generic memory address space
8.10 Miscellaneous
9 OpenCL optimization case studies
9.1 Application sample code
9.1.1 Improve algorithm
9.1.2 Vectorized load/store
9.1.3 Use image instead of buffer
9.2 Epsilon filter
9.2.1 Initial implementation
9.2.2 Data pack optimization
9.2.3 Vectorized load/store optimization
9.2.4 Further increase workload per work item
9.2.5 Use local memory optimization
9.2.6 Branch operations optimization
9.2.7 Summary
9.3 Sobel filter
9.3.1 Algorithm optimization
9.3.2 Data pack optimization
9.3.3 Vectorized load/store optimization
9.3.4 Performance and summary
9.4 Summary
10 Summary
A How to enable performance mode
A.1 Adreno A3x GPU
A.1.1 CPU settings
A.1.2 GPU settings
A.2 Adreno A4x GPU and Adreno A5x GPU
B References
B.1 Related documents
B.2 Acronyms and terms