NEON Programmer’s Guide
Contents
Preface
References
Typographical conventions
Feedback on this book
Glossary
1: Introduction
1.1 Data processing technologies
1.1.1 Single Instruction Single Data
1.1.2 Single Instruction Multiple Data (vector mode)
1.1.3 Single Instruction Multiple Data (packed data mode)
1.2 Comparison between ARM NEON technology and other implementations
1.2.1 Comparison between NEON technology and the ARMv6 SIMD instructions
1.2.2 Comparison between NEON technology and other SIMD solutions
1.2.3 Comparison between NEON technology and Digital Signal Processors
1.3 Architecture support for NEON technology
1.3.1 Instruction timings
1.3.2 Support for VFP-only systems
1.3.3 Support for the Half-precision extension
1.3.4 Support for the Fused Multiply-Add instructions
1.3.5 Security and virtualization
1.3.6 Undefined instructions
1.3.7 Support for ARMv6 SIMD instructions
1.4 Fundamentals of NEON technology
1.4.1 Registers, vectors, lanes and elements
1.4.2 NEON data type specifiers
1.4.3 VFP views of the NEON and floating-point register file
2: Compiling NEON Instructions
2.1 Vectorization
2.1.1 Enabling auto-vectorization in the ARM Compiler toolchain
2.1.2 Enabling auto-vectorization in the GCC compiler
2.1.3 C pointer aliasing
2.1.4 Natural types
2.1.5 Array grouping
2.1.6 Inside knowledge
2.1.7 Enabling the NEON unit in bare-metal applications
2.1.8 Enabling the NEON unit in a stock Linux kernel
2.1.9 Enabling the NEON unit in a custom Linux kernel
2.1.10 Optimizing for vectorization
2.2 Generating NEON code using the vectorizing compiler
2.2.1 Compiler command line options
2.3 Vectorizing examples
2.3.1 Vectorization example: unrolling an addition function
2.3.2 Vectorization example using the vectorizing compiler
2.3.3 Vectorization examples with different command line switches
2.4 NEON assembler and ABI restrictions
2.4.1 Passing arguments in NEON and floating-point registers
2.5 NEON libraries
2.6 Intrinsics
2.7 Detecting the presence of a NEON unit
2.7.1 Build-time NEON unit detection
2.7.2 Run-time NEON unit detection
2.8 Writing code to imply SIMD
2.8.1 Writing loops to imply SIMD
2.8.2 Telling the compiler where to unroll inner loops
2.8.3 Writing structures to imply SIMD
2.9 GCC command line options
2.9.1 Option to specify the CPU
2.9.2 Option to specify the FPU
2.9.3 Option to enable use of NEON and floating-point instructions
2.9.4 Vectorizing floating-point operations
2.9.5 Example GCC command line usage for NEON code optimization
2.9.6 GCC information dump
3: NEON Instruction Set Architecture
3.1 Introduction to the NEON instruction syntax
3.2 Instruction syntax
3.2.1 Instruction modifiers
3.2.2 Instruction shape
3.3 Specifying data types
3.4 Packing and unpacking data
3.5 Alignment
3.6 Saturation arithmetic
3.7 Floating-point operations
3.7.1 Floating-point exceptions
3.8 Flush-to-zero mode
3.8.1 Denormals
3.8.2 The effects of using flush-to-zero mode
3.8.3 Operations not affected by flush-to-zero mode
3.9 Shift operations
3.9.1 Shifting vectors
3.9.2 Shifting and inserting
3.9.3 Shifting and accumulating
3.9.4 Instruction modifiers
3.9.5 Table of shifts available
3.10 Polynomials
3.10.1 Polynomial arithmetic over {0,1}
3.10.2 NEON instructions that can perform polynomial arithmetic
3.10.3 Difference between polynomial multiply and conventional multiply
3.11 Instructions to permute vectors
3.11.1 Alternatives
3.11.2 Instructions
4: NEON Intrinsics
4.1 Introduction
4.2 Vector data types for NEON intrinsics
4.3 Prototypes of NEON intrinsics
4.4 Using NEON intrinsics
4.5 Variables and constants in NEON code
4.5.1 Declaring a variable
4.5.2 Using constants
4.5.3 Moving results back to normal C variables
4.5.4 Accessing D registers from a Q register
4.5.5 Casting NEON variables between different types
4.6 Accessing vector types from C
4.7 Loading data from memory into vectors
4.8 Constructing a vector from a literal bit pattern
4.9 Constructing multiple vectors from interleaved memory
4.10 Loading a single lane of a vector from memory
4.11 Programming using NEON intrinsics
4.12 Instructions without an equivalent intrinsic
5: Optimizing NEON Code
5.1 Optimizing NEON assembler code
5.1.1 NEON pipeline differences between Cortex-A processors
5.1.2 Memory access optimizations
5.2 Scheduling
5.2.1 NEON instruction scheduling
5.2.2 Mixed ARM and NEON instruction sequences
5.2.3 Passing data between ARM general-purpose registers and NEON registers
5.2.4 Dual issue for NEON instructions
5.2.5 Example of how to read NEON instruction tables
5.2.6 Optimizations by variable spreading
5.2.7 Optimizations when using lengthening instructions
6: NEON Code Examples with Intrinsics
6.1 Swapping color channels
6.1.1 How de-interleaving and interleaving work
6.1.2 Single or multiple elements
6.1.3 Addressing
6.1.4 Other loads and stores
6.2 Handling non-multiple array lengths
6.2.1 Leftovers
6.2.2 Example problem
6.2.3 Larger arrays
6.2.4 Overlapping
6.2.5 Single element processing
6.2.6 Alignment
6.2.7 Using ARM instructions
7: NEON Code Examples with Mixed Operations
7.1 Matrix multiplication
7.1.1 Algorithm
7.1.2 Code
7.2 Cross product
7.2.1 Definition
7.2.2 Single cross product
7.2.3 Four cross products
7.2.4 Arbitrary input length
8: NEON Code Examples with Optimization
8.1 Converting color depth
8.1.1 Converting from RGB565 to RGB888
8.1.2 Converting from RGB888 to RGB565
8.2 Median filter
8.2.1 Implementation
8.2.2 Basic principles and bitonic sorting
8.2.3 Bitonic merging
8.2.4 Partitioning
8.2.5 Color planes
8.2.6 Padding
8.2.7 Rolling window
8.2.8 First pass sorting (bitonic sort)
8.2.9 Transpose
8.2.10 Second pass sorting
8.2.11 Re-use
8.3 FIR filter
8.3.1 Using NEON intrinsics
8.3.2 Using the vectorizing compiler
8.3.3 Adding inside knowledge
A: NEON Microarchitecture
A.1 The Cortex-A5 processor
A.1.1 The Cortex-A5 Media Processing Engine
A.1.2 VFPv4 architecture hardware support
A.2 The Cortex-A7 processor
A.2.1 The Cortex-A7 NEON unit
A.3 The Cortex-A8 processor
A.3.1 The Cortex-A8 Media Processing Engine
A.3.2 Cortex-A8 data memory access
A.3.3 Cortex-A8 specific pipeline hazards
A.4 The Cortex-A9 processor
A.4.1 The Cortex-A9 Media Processing Engine
A.5 The Cortex-A15 processor
A.5.1 The Cortex-A15 Media Processing Engine
B: Operating System Support
B.1 FPSCR, the floating-point status and control register
B.2 FPEXC, the floating-point exception register
B.3 FPSID, the floating-point system ID register
B.4 MVFR0 and MVFR1, the Media and VFP Feature Registers
C: NEON and VFP Instruction Summary
C.1 List of all NEON and VFP instructions
C.2 List of doubling instructions
C.3 List of halving instructions
C.4 List of widening or long instructions
C.5 List of narrowing instructions
C.6 List of rounding instructions
C.7 List of saturating instructions
C.8 NEON general data processing instructions
C.8.1 VCVT (fixed-point or integer to floating-point)
C.8.2 VCVT (between half-precision and single-precision floating-point)
C.8.3 VDUP
C.8.4 VEXT
C.8.5 VMOV (immediate)
C.8.6 VMVN
C.8.7 VMOVL, V{Q}MOVN, VQMOVUN
C.8.8 VREV
C.8.9 VSWP
C.8.10 VTBL
C.8.11 VTBX
C.8.12 VTRN
C.8.13 VUZP
C.8.14 VZIP
C.9 NEON shift instructions
C.9.1 VSHL, VQSHL, VQSHLU, and VSHLL (by immediate)
C.9.2 V{Q}{R}SHL
C.9.3 V{R}SHR{N}, V{R}SRA
C.9.4 VQ{R}SHR{U}N
C.9.5 VSLI
C.9.6 VSRI
C.10 NEON logical and compare operations
C.10.1 VACGE and VACGT
C.10.2 VAND
C.10.3 VBIC (immediate)
C.10.4 VBIC (register)
C.10.5 VBIF
C.10.6 VBIT
C.10.7 VBSL
C.10.8 VCEQ, VCGE, VCGT, VCLE, and VCLT
C.10.9 VEOR
C.10.10 VMOV
C.10.11 VMVN
C.10.12 VORN
C.10.13 VORR (immediate)
C.10.14 VORR (register)
C.10.15 VTST
C.11 NEON arithmetic instructions
C.11.1 VABA{L}
C.11.2 VABD{L}
C.11.3 V{Q}ABS
C.11.4 V{Q}ADD, VADDL, and VADDW
C.11.5 V{R}ADDHN
C.11.6 VCLS
C.11.7 VCLZ
C.11.8 VCNT
C.11.9 V{R}HADD
C.11.10 VHSUB
C.11.11 VMAX and VMIN
C.11.12 V{Q}NEG
C.11.13 VPADD{L}, VPADAL
C.11.14 VPMAX and VPMIN
C.11.15 VRECPE
C.11.16 VRECPS
C.11.17 VRSQRTE
C.11.18 VRSQRTS
C.11.19 V{Q}SUB, VSUBL, and VSUBW
C.11.20 V{R}SUBHN
C.12 NEON multiply instructions
C.12.1 VFMA, VFMS
C.12.2 VMUL{L}, VMLA{L}, and VMLS{L}
C.12.3 VMUL{L}, VMLA{L}, and VMLS{L} (by scalar)
C.12.4 VQ{R}DMULH (by vector or by scalar)
C.12.5 VQDMULL, VQDMLAL, and VQDMLSL (by vector or by scalar)
C.13 NEON load and store instructions
C.13.1 Interleaving
C.13.2 Alignment restrictions in load and store element and structure instructions
C.13.3 VLDn and VSTn (single n-element structure to one lane)
C.13.4 VLDn (single n-element structure to all lanes)
C.13.5 VLDn and VSTn (multiple n-element structures)
C.13.6 VLDR and VSTR
C.13.7 VLDM, VSTM, VPOP, and VPUSH
C.13.8 VMOV (between two ARM registers and a NEON register)
C.13.9 VMOV (between an ARM register and a NEON scalar)
C.13.10 VMRS and VMSR (between an ARM register and a NEON or VFP system register)
C.14 VFP instructions
C.14.1 VABS
C.14.2 VADD
C.14.3 VCMP (Floating-point compare)
C.14.4 VCVT (between single-precision and double-precision)
C.14.5 VCVT (between floating-point and integer)
C.14.6 VCVT (between floating-point and fixed-point)
C.14.7 VCVTB, VCVTT (half-precision extension)
C.14.8 VDIV
C.14.9 VFMA, VFMS, VFNMA, VFNMS (Fused floating-point multiply accumulate and fused floating-point multiply subtract with optional negation)
C.14.10 VMOV (immediate and register)
C.14.11 VMOV (between an ARM register and a single-precision register)
C.14.12 VMUL, VMLA, VMLS, VNMUL, VNMLA, and VNMLS
C.14.13 VNEG
C.14.14 VSQRT
C.14.15 VSUB
C.15 NEON and VFP pseudo-instructions
C.15.1 VACLE and VACLT
C.15.2 VAND (immediate)
C.15.3 VCLE and VCLT
C.15.4 VLDR pseudo-instruction
C.15.5 VLDR and VSTR (post-increment and pre-decrement)
C.15.6 VMOV2
C.15.7 VORN (immediate)
D: NEON Intrinsics Reference
D.1 NEON intrinsics description
D.2 Intrinsics type conversion
D.2.1 VREINTERPRET
D.2.2 VCOMBINE
D.2.3 VGET_HIGH
D.2.4 VGET_LOW
D.3 Arithmetic
D.3.1 VADD
D.3.2 VADDL
D.3.3 VADDW
D.3.4 VHADD
D.3.5 VRHADD
D.3.6 VQADD
D.3.7 VADDHN
D.3.8 VRADDHN
D.3.9 VSUB
D.3.10 VSUBL
D.3.11 VSUBW
D.3.12 VHSUB
D.3.13 VQSUB
D.3.14 VSUBHN
D.3.15 VRSUBHN
D.4 Multiply
D.4.1 VMUL
D.4.2 VMLA
D.4.3 VMLAL
D.4.4 VMLS
D.4.5 VMLSL
D.4.6 VQDMULH
D.4.7 VQRDMULH
D.4.8 VQDMLAL
D.4.9 VQDMLSL
D.4.10 VMULL
D.4.11 VQDMULL
D.4.12 VMLA_LANE
D.4.13 VMLAL_LANE
D.4.14 VQDMLAL_LANE
D.4.15 VMLS_LANE
D.4.16 VMLSL_LANE
D.4.17 VQDMLSL_LANE
D.4.18 VMUL_N
D.4.19 VMULL_N
D.4.20 VMULL_LANE
D.4.21 VQDMULL_N
D.4.22 VQDMULL_LANE
D.4.23 VQDMULH_N
D.4.24 VQDMULH_LANE
D.4.25 VQRDMULH_N
D.4.26 VQRDMULH_LANE
D.4.27 VMLA_N
D.4.28 VMLAL_N
D.4.29 VQDMLAL_N
D.4.30 VMLSL_N
D.4.31 VQDMLSL_N
D.5 Data processing
D.5.1 VPADD
D.5.2 VPADDL
D.5.3 VPADAL
D.5.4 VPMAX
D.5.5 VPMIN
D.5.6 VABD
D.5.7 VABDL
D.5.8 VABA
D.5.9 VABAL
D.5.10 VMAX
D.5.11 VMIN
D.5.12 VABS
D.5.13 VQABS
D.5.14 VNEG
D.5.15 VQNEG
D.5.16 VCLS
D.5.17 VCLZ
D.5.18 VCNT
D.5.19 VRECPE
D.5.20 VRECPS
D.5.21 VRSQRTE
D.5.22 VRSQRTS
D.5.23 VMOVN
D.5.24 VMOVL
D.5.25 VQMOVN
D.5.26 VQMOVUN
D.6 Logical and compare
D.6.1 VCEQ
D.6.2 VCGE
D.6.3 VCLE
D.6.4 VCGT
D.6.5 VCLT
D.6.6 VCAGE
D.6.7 VCALE
D.6.8 VCAGT
D.6.9 VCALT
D.6.10 VTST
D.6.11 VMVN
D.6.12 VAND
D.6.13 VORR
D.6.14 VEOR
D.6.15 VBIC
D.6.16 VORN
D.6.17 VBSL
D.7 Shift
D.7.1 VSHL
D.7.2 VQSHL
D.7.3 VRSHL
D.7.4 VQRSHL
D.7.5 VSHR_N
D.7.6 VSHL_N
D.7.7 VRSHR_N
D.7.8 VSRA_N
D.7.9 VRSRA_N
D.7.10 VQSHL_N
D.7.11 VQSHLU_N
D.7.12 VSHRN_N
D.7.13 VQSHRUN_N
D.7.14 VQRSHRUN_N
D.7.15 VQSHRN_N
D.7.16 VRSHRN_N
D.7.17 VQRSHRN_N
D.7.18 VSHLL_N
D.7.19 VSRI_N
D.7.20 VSLI_N
D.8 Floating-point
D.8.1 VCVT
D.8.2 VCVT_N
D.8.3 VCVT_F32
D.8.4 VCVT_N_F32
D.8.5 VCVT_F16_F32
D.8.6 VCVT_F32_F16
D.8.7 VFMA
D.8.8 VFMS
D.9 Load and store
D.9.1 VLD1
D.9.2 VLD1_LANE
D.9.3 VLD1_DUP
D.9.4 VLD2
D.9.5 VLD2_LANE
D.9.6 VLD2_DUP
D.9.7 VLD3
D.9.8 VLD3_LANE
D.9.9 VLD3_DUP
D.9.10 VLD4
D.9.11 VLD4_LANE
D.9.12 VLD4_DUP
D.9.13 VST1
D.9.14 VST1_LANE
D.9.15 VST2
D.9.16 VST2_LANE
D.9.17 VST3
D.9.18 VST3_LANE
D.9.19 VST4
D.9.20 VST4_LANE
D.9.21 VGET_LANE
D.9.22 VSET_LANE
D.10 Permutation
D.10.1 VEXT
D.10.2 VTBL1
D.10.3 VTBL2
D.10.4 VTBL3
D.10.5 VTBL4
D.10.6 VTBX1
D.10.7 VTBX2
D.10.8 VTBX3
D.10.9 VTBX4
D.10.10 VREV64
D.10.11 VREV32
D.10.12 VREV16
D.10.13 VTRN
D.10.14 VZIP
D.10.15 VUZP
D.11 Miscellaneous
D.11.1 VCREATE
D.11.2 VDUP_N
D.11.3 VMOV_N
D.11.4 VDUP_LANE