Cortex-A Series Programmer’s Guide
Contents
Preface
References
Typographical conventions
Feedback on this book
Glossary
1: Introduction
1.1 History
1.2 System-on-Chip (SoC)
1.3 Embedded systems
2: ARM Architecture and Processors
2.1 Architecture versions
2.2 Architecture history and extensions
2.2.1 DSP multiply-accumulate and saturated arithmetic instructions
2.2.2 Jazelle
2.2.3 Thumb Execution Environment (ThumbEE)
2.2.4 Thumb-2
2.2.5 Security Extensions (TrustZone)
2.2.6 VFP
2.2.7 Advanced SIMD (NEON)
2.2.8 Large Physical Address Extension (LPAE)
2.2.9 Virtualization
2.2.10 big.LITTLE
2.3 Key architectural points of ARM Cortex-A series processors
2.4 Processors and pipelines
2.5 The Cortex-A series processors
2.5.1 The Cortex-A5 processor
2.5.2 The Cortex-A7 processor
2.5.3 The Cortex-A8 processor
2.5.4 The Cortex-A9 processor
2.5.5 The Cortex-A15 processor
2.5.6 Qualcomm Scorpion
3: Tools, Operating Systems and Boards
3.1 Linux distributions
3.1.1 Linux for ARM systems
3.1.2 Linux terminology
3.1.3 Embedded Linux
3.1.4 Board Support Package
3.1.5 Linaro
3.2 Useful tools
3.2.1 QEMU
3.2.2 BusyBox
3.2.3 Scratchbox
3.2.4 U-Boot
3.2.5 UEFI and Tianocore
3.3 Software toolchains for ARM processors
3.3.1 GNU toolchain
3.3.2 ARM Compiler toolchain
3.4 ARM DS-5
3.5 Example platforms
3.5.1 BeagleBoard
3.5.2 Pandora
3.5.3 ST-Ericsson Snowball
3.5.4 Gumstix
3.5.5 PandaBoard
4: ARM Registers, Modes and Instruction Sets
4.1 Instruction sets
4.2 Modes
4.3 Registers
4.3.1 Program Status Registers
4.4 Instruction pipelines
4.4.1 Multi-issue pipelines
4.4.2 Register renaming
4.5 Branch prediction
4.5.1 Return stack
4.5.2 Programmer’s view
5: Introduction to Assembly Language
5.1 Comparison with other assembly languages
5.2 Instruction sets
5.3 Introduction to the GNU Assembler
5.3.1 Invoking the GNU Assembler
5.3.2 GNU Assembler syntax
5.3.3 Sections
5.3.4 Assembler directives
5.3.5 Expressions
5.3.6 GNU tools naming conventions
5.4 ARM tools assembly language
5.4.1 ARM assembler syntax
5.4.2 Label
5.4.3 Directives
5.5 Interworking
5.6 Identifying assembly code
6: ARM/Thumb Unified Assembly Language Instructions
6.1 Instruction set basics
6.1.1 Constant values
6.1.2 Conditional execution
6.1.3 Status flags and condition codes
6.2 Data processing operations
6.2.1 Operand 2 and the barrel shifter
6.3 Multiplication operations
6.3.1 Additional multiplies
6.4 Memory instructions
6.4.1 Addressing modes
6.4.2 Multiple transfers
6.5 Branches
6.6 Integer SIMD instructions
6.6.1 Integer register SIMD instructions
6.6.2 Integer register SIMD multiplies
6.6.3 Sum of absolute differences
6.6.4 Data packing and unpacking
6.6.5 Byte selection
6.7 Saturating arithmetic
6.7.1 Saturated math instructions
6.8 Miscellaneous instructions
6.8.1 Coprocessor instructions
6.8.2 Coprocessor 15
6.8.3 SVC
6.8.4 PSR modification
6.8.5 Bit manipulation
6.8.6 Cache preload
6.8.7 Byte reversal
6.8.8 Other instructions
7: Floating-Point
7.1 Floating-point basics and the IEEE-754 standard
7.1.1 Rounding algorithms
7.1.2 ARM VFP
7.1.3 Instructions
7.1.4 Enabling VFP
7.2 VFP support in GCC
7.3 VFP support in the ARM Compiler
7.4 VFP support in Linux
7.4.1 Context switching
7.5 Floating-point optimization
8: Introducing NEON
8.1 SIMD
8.1.1 ARMv6 SIMD instructions
8.2 NEON architecture overview
8.2.1 Commonality with VFP
8.2.2 Data types
8.2.3 NEON registers
8.2.4 NEON instruction set
9: Caches
9.1 Why do caches help?
9.2 Cache drawbacks
9.3 Memory hierarchy
9.4 Cache architecture
9.4.1 Cache controller
9.4.2 Direct mapped caches
9.4.3 Set associative caches
9.4.4 Cache terminology
9.4.5 A real-life example
9.4.6 Virtual and physical tags and indexes
9.5 Cache policies
9.5.1 Allocation policy
9.5.2 Replacement policy
9.5.3 Write policy
9.6 Write and Fetch buffers
9.7 Cache performance and hit rate
9.8 Invalidating and cleaning cache memory
9.9 Point of coherency and unification
9.10 Level 2 cache controller
9.10.1 Level 2 cache maintenance
9.11 Parity and ECC in caches
10: Memory Management Unit
10.1 Virtual memory
10.2 Level 1 page tables
10.3 Level 2 page tables
10.4 The Translation Lookaside Buffer
10.5 TLB coherency
10.6 Choice of page sizes
10.7 Memory attributes
10.7.1 Memory Access Permissions
10.7.2 Memory types
10.7.3 Domains
10.8 Multi-tasking and OS usage of page tables
10.8.1 Address Space ID
10.8.2 Page Table Base Register 0 and 1
10.8.3 The Fast Context Switch Extension
10.9 Large Physical Address Extension
11: Memory Ordering
11.1 ARM memory ordering model
11.1.1 Strongly-ordered and Device memory
11.1.2 Normal memory
11.2 Memory barriers
11.2.1 Memory barrier use example
11.2.2 Avoiding deadlocks with a barrier
11.2.3 WFE and WFI Interaction with barriers
11.2.4 Linux use of barriers
11.3 Cache coherency implications
11.3.1 Issues with copying code
11.3.2 Compiler re-ordering optimizations
12: Exception Handling
12.1 Types of exception
12.2 Exception mode summary
12.2.1 Exception priorities
12.3 Entering an exception handler
12.4 Exit from an exception handler
12.5 Vector table
12.6 Return instruction
13: Interrupt Handling
13.1 External interrupt requests
13.1.1 Assigning interrupts
13.1.2 Simplistic interrupt handling
13.1.3 Nested interrupt handling
13.2 Generic Interrupt Controller
13.2.1 Configuration
13.2.2 Interrupt handling
14: Other Exception Handlers
14.1 Abort handler
14.2 Undefined instruction handling
14.3 SVC exception handling
14.4 Linux exception program flow
14.4.1 Boot process
14.4.2 Interrupt dispatch
15: Boot Code
15.1 Booting a bare-metal system
15.2 Configuration
15.3 Booting Linux
15.3.1 Reset handler
15.3.2 Bootloader
15.3.3 Initialize memory system
15.3.4 Kernel images
15.3.5 Kernel parameters
15.3.6 Kernel entry
15.3.7 Platform-specific actions
15.3.8 Kernel start-up code
16: Porting
16.1 Endianness
16.2 Alignment
16.3 Miscellaneous C porting issues
16.3.1 unsigned char and signed char
16.3.2 Compiler packing of structures
16.3.3 Use of the stack
16.3.4 Other issues
16.4 Porting ARM assembly code to ARMv7
16.4.1 Memory access ordering and memory barriers
16.5 Porting ARM code to Thumb
16.5.1 Use of PC as an operand
16.5.2 Branches and interworking
16.5.3 Operand combinations
16.5.4 Other ARM/Thumb differences
17: Application Binary Interfaces
17.1 Procedure Call Standard
17.1.1 VFP and NEON register usage
17.1.2 Linkage
17.1.3 Stack and heap
17.1.4 Returning results
17.2 Mixing C and assembly code
18: Profiling
18.1 Profiler output
18.1.1 Gprof
18.1.2 OProfile
18.1.3 DS-5 Streamline
18.1.4 ARM performance monitor
18.1.5 Linux perf events
18.1.6 Ftrace
18.1.7 Valgrind and Cachegrind
19: Optimizing Code to Run on ARM Processors
19.1 Compiler optimizations
19.1.1 Function inlining
19.1.2 Eliminating common sub-expressions
19.1.3 Loop unrolling
19.1.4 GCC optimization options
19.1.5 armcc optimization options
19.2 ARM memory system optimization
19.2.1 Data cache optimization
19.2.2 Loop tiling
19.2.3 Loop interchange
19.2.4 Structure alignment
19.2.5 Associativity effects
19.2.6 Optimizing instruction cache usage
19.2.7 Optimizing L2 and outer cache usage
19.2.8 Optimizing TLB usage
19.2.9 Data abort optimization
19.2.10 Prefetching a memory block access
19.3 Source code modifications
19.3.1 Loop termination
19.3.2 Loop fusion
19.3.3 Reducing stack and heap usage
19.3.4 Variable selection
19.3.5 Pointer aliasing
19.3.6 Division and modulo
19.3.7 Extern data
19.3.8 Inline or embedded assembler
19.3.9 Complex addressing modes
19.3.10 Unaligned access
19.3.11 Linker optimizations
20: Writing NEON Code
20.1 NEON C Compiler and assembler
20.1.1 Vectorization
20.1.2 NEON libraries
20.1.3 Intrinsics
20.1.4 NEON types in C
20.1.5 Variables and constants
20.1.6 Generating NEON instructions from C/C++ code
20.1.7 NEON assembler and ABI restrictions
20.1.8 Detecting NEON
20.2 Optimizing NEON assembler code
20.2.1 Memory access optimizations
20.2.2 Alignment
20.2.3 Scheduling
20.3 NEON power saving
21: Introduction to Multi-processing
21.1 Multi-processing ARM systems
21.2 Symmetric multi-processing
21.3 Asymmetric multi-processing
22: SMP Architectural Considerations
22.1 Cache coherency
22.1.1 MESI protocol
22.1.2 MOESI protocol
22.1.3 Accelerator Coherency Port (ACP)
22.2 TLB and cache maintenance broadcast
22.3 Handling interrupts in an SMP system
22.4 Exclusive accesses
22.5 Booting SMP systems
22.5.1 Processor ID
22.5.2 SMP boot in Linux
22.6 Private memory region
22.6.1 Timers and watchdogs
23: Parallelizing Software
23.1 Decomposition methods
23.2 Threading models
23.3 Threading libraries
23.3.1 Inter-thread communications
23.3.2 Threaded performance
23.3.3 Thread affinity
23.4 Synchronization mechanisms in the Linux kernel
23.4.1 Completions
23.4.2 Spinlocks
23.4.3 Semaphores
23.4.4 Lock-free synchronization
24: Issues with Parallelizing Software
24.1 Thread safety and reentrancy
24.2 Performance issues
24.2.1 Bandwidth concerns
24.2.2 Thread dependencies
24.2.3 Cache thrashing
24.2.4 False sharing
24.2.5 Deadlock and livelock
24.3 Profiling in SMP systems
25: Power Management
25.1 Power and clocking
25.1.1 Standby mode
25.1.2 Dormant mode
25.1.3 Assembly language power instructions
25.1.4 Dynamic Voltage and Frequency Scaling
26: Security
26.1 TrustZone hardware architecture
26.1.1 Multi-processor systems with security extensions
26.1.2 Interaction of Normal and Secure worlds
27: Virtualization
27.1 ARMv7-A Virtualization Extensions
27.1.1 Privilege model in ARMv7-A Virtualization Extensions
27.1.2 Hypervisor mode
27.1.3 Memory translation
27.2 Hypervisor exception model
27.3 Relationship between virtualization and ARM Security Extensions
28: Introducing big.LITTLE
28.1 big.LITTLE configuration
28.2 Structure of a big.LITTLE system
28.3 Execution models in big.LITTLE
28.3.1 big.LITTLE migration models
28.3.2 Cluster migration
28.3.3 CPU migration
28.4 big.LITTLE MP operation
29: Debug
29.1 ARM debug hardware
29.1.1 Debug events
29.2 ARM trace hardware
29.2.1 CoreSight
29.3 Debug monitor
29.4 Debugging Linux applications
29.5 DS-5 debug and trace
29.5.1 Debugging Linux applications using DS-5
29.5.2 Debugging Linux kernel modules
29.5.3 Debugging Linux kernels using DS-5
29.5.4 Debugging multi-threaded applications using DS-5
29.5.5 Debugging shared libraries
29.5.6 Trace support in DS-5
A: Instruction Summary
A.1 Instruction Summary
A.1.1 ADC
A.1.2 ADD
A.1.3 ADR
A.1.4 ADRL
A.1.5 AND
A.1.6 ASR
A.1.7 B
A.1.8 BFC
A.1.9 BFI
A.1.10 BIC
A.1.11 BKPT
A.1.12 BL
A.1.13 BLX
A.1.14 BX
A.1.15 BXJ
A.1.16 CBNZ
A.1.17 CBZ
A.1.18 CDP
A.1.19 CDP2
A.1.20 CHKA
A.1.21 CLREX
A.1.22 CLZ
A.1.23 CMN
A.1.24 CMP
A.1.25 CPS
A.1.26 DBG
A.1.27 DMB
A.1.28 DSB
A.1.29 ENTERX
A.1.30 EOR
A.1.31 ERET
A.1.32 HB
A.1.33 ISB
A.1.34 IT
A.1.35 LDC
A.1.36 LDC2
A.1.37 LDM
A.1.38 LDR
A.1.39 LDR (pseudo-instruction)
A.1.40 LDRD
A.1.41 LDREX
A.1.42 LEAVEX
A.1.43 LSL
A.1.44 LSR
A.1.45 MCR
A.1.46 MCR2
A.1.47 MCRR
A.1.48 MCRR2
A.1.49 MLA
A.1.50 MLS
A.1.51 MOV
A.1.52 MOVT
A.1.53 MOV32
A.1.54 MRC
A.1.55 MRC2
A.1.56 MRRC
A.1.57 MRRC2
A.1.58 MRS
A.1.59 MSR
A.1.60 MUL
A.1.61 MVN
A.1.62 NOP
A.1.63 ORN
A.1.64 ORR
A.1.65 PKHBT
A.1.66 PKHTB
A.1.67 PLD
A.1.68 PLDW
A.1.69 PLI
A.1.70 POP
A.1.71 PUSH
A.1.72 QADD
A.1.73 QADD8
A.1.74 QADD16
A.1.75 QASX
A.1.76 QDADD
A.1.77 QDSUB
A.1.78 QSAX
A.1.79 QSUB
A.1.80 QSUB8
A.1.81 QSUB16
A.1.82 RBIT
A.1.83 REV
A.1.84 REV16
A.1.85 REVSH
A.1.86 RFE
A.1.87 ROR
A.1.88 RRX
A.1.89 RSB
A.1.90 RSC
A.1.91 SADD8
A.1.92 SADD16
A.1.93 SASX
A.1.94 SBC
A.1.95 SBFX
A.1.96 SDIV
A.1.97 SEL
A.1.98 SETEND
A.1.99 SEV
A.1.100 SHADD8
A.1.101 SHADD16
A.1.102 SHASX
A.1.103 SHSAX
A.1.104 SHSUB8
A.1.105 SHSUB16
A.1.106 SMC
A.1.107 SMLAxy
A.1.108 SMLAD
A.1.109 SMLAL
A.1.110 SMLALxy
A.1.111 SMLALD
A.1.112 SMLAWy
A.1.113 SMLSLD
A.1.114 SMMLA
A.1.115 SMMLS
A.1.116 SMMUL
A.1.117 SMUAD
A.1.118 SMUSD
A.1.119 SMULxy
A.1.120 SMULL
A.1.121 SMULWy
A.1.122 SRS
A.1.123 SSAT
A.1.124 SSAT16
A.1.125 SSAX
A.1.126 SSUB8
A.1.127 SSUB16
A.1.128 STC
A.1.129 STC2
A.1.130 STM
A.1.131 STR
A.1.132 STRD
A.1.133 STREX
A.1.134 SUB
A.1.135 SVC
A.1.136 SWP
A.1.137 SXT
A.1.138 SXTA
A.1.139 SYS
A.1.140 TBB
A.1.141 TBH
A.1.142 TEQ
A.1.143 TST
A.1.144 UADD8
A.1.145 UADD16
A.1.146 UASX
A.1.147 UBFX
A.1.148 UDIV
A.1.149 UHADD8
A.1.150 UHADD16
A.1.151 UHASX
A.1.152 UHSAX
A.1.153 UHSUB8
A.1.154 UHSUB16
A.1.155 UMAAL
A.1.156 UMLAL
A.1.157 UMULL
A.1.158 UQADD8
A.1.159 UQADD16
A.1.160 UQASX
A.1.161 UQSAX
A.1.162 UQSUB8
A.1.163 UQSUB16
A.1.164 USAD8
A.1.165 USADA8
A.1.166 USAT
A.1.167 USAT16
A.1.168 USAX
A.1.169 USUB8
A.1.170 USUB16
A.1.171 UXT
A.1.172 UXTA
A.1.173 WFE
A.1.174 WFI
A.1.175 YIELD
B: NEON and VFP Instruction Summary
B.1 NEON general data processing instructions
B.1.1 VCVT (fixed-point or integer to floating-point)
B.1.2 VCVT (between half-precision and single-precision floating-point)
B.1.3 VDUP
B.1.4 VEXT
B.1.5 VMOV
B.1.6 VMVN
B.1.7 VMOVL, V{Q}MOVN, VQMOVUN
B.1.8 VREV
B.1.9 VSWP
B.1.10 VTBL
B.1.11 VTBX
B.1.12 VTRN
B.1.13 VUZP
B.1.14 VZIP
B.2 NEON shift instructions
B.2.1 VSHL, VQSHL, VQSHLU, and VSHLL (by immediate)
B.2.2 V{Q}{R}SHL
B.2.3 V{R}SHR{N}, V{R}SRA
B.2.4 VQ{R}SHR{U}N
B.2.5 VSLI
B.2.6 VSRI
B.3 NEON logical and compare operations
B.3.1 VACGE and VACGT
B.3.2 VAND
B.3.3 VBIC (immediate)
B.3.4 VBIC (register)
B.3.5 VBIF
B.3.6 VBIT
B.3.7 VBSL
B.3.8 VCEQ, VCGE, VCGT, VCLE, and VCLT
B.3.9 VEOR
B.3.10 VMOV
B.3.11 VMVN
B.3.12 VORN
B.3.13 VORR (immediate)
B.3.14 VORR (register)
B.3.15 VTST
B.4 NEON arithmetic instructions
B.4.1 VABA{L}
B.4.2 VABD{L}
B.4.3 V{Q}ABS
B.4.4 V{Q}ADD, VADDL, VADDW
B.4.5 V{R}ADDHN
B.4.6 VCLS
B.4.7 VCLZ
B.4.8 VCNT
B.4.9 V{R}HADD
B.4.10 VHSUB
B.4.11 VMAX and VMIN
B.4.12 V{Q}NEG
B.4.13 VPADD{L}, VPADAL
B.4.14 VPMAX and VPMIN
B.4.15 VRECPE
B.4.16 VRECPS
B.4.17 VRSQRTE
B.4.18 VRSQRTS
B.4.19 V{Q}SUB, VSUBL and VSUBW
B.4.20 V{R}SUBHN
B.5 NEON multiply instructions
B.5.1 VFMA, VFMS
B.5.2 VMUL{L}, VMLA{L}, and VMLS{L}
B.5.3 VMUL{L}, VMLA{L}, and VMLS{L} (by scalar)
B.5.4 VQ{R}DMULH (by vector or by scalar)
B.5.5 VQDMULL, VQDMLAL, and VQDMLSL (by vector or by scalar)
B.6 NEON load and store element and structure instructions
B.6.1 VLDn and VSTn (single n-element structure to one lane)
B.6.2 VLDn (single n-element structure to all lanes)
B.6.3 VLDn and VSTn (multiple n-element structures)
B.6.4 VLDR and VSTR
B.6.5 VLDM, VSTM, VPOP, and VPUSH
B.6.6 VMOV (between two ARM registers and an extension register)
B.6.7 VMOV (between an ARM register and a NEON scalar)
B.6.8 VMRS and VMSR
B.7 VFP instructions
B.7.1 VABS
B.7.2 VADD
B.7.3 VCMP
B.7.4 VCVT (between single-precision and double-precision)
B.7.5 VCVT (between floating-point and integer)
B.7.6 VCVT (between floating-point and fixed-point)
B.7.7 VCVTB, VCVTT (half-precision extension)
B.7.8 VDIV
B.7.9 VFMA, VFNMA, VFMS, VFNMS
B.7.10 VMOV
B.7.11 VMOV
B.7.12 VMUL, VMLA, VMLS, VNMUL, VNMLA, and VNMLS
B.7.13 VNEG
B.7.14 VSQRT
B.7.15 VSUB
B.8 NEON and VFP pseudo-instructions
B.8.1 VACLE and VACLT
B.8.2 VAND (immediate)
B.8.3 VCLE and VCLT
B.8.4 VLDR pseudo-instruction
B.8.5 VLDR and VSTR (post-increment and pre-decrement)
B.8.6 VMOV2
B.8.7 VORN
C: Building Linux for ARM Systems
C.1 Building the Linux kernel
C.2 Creating the Linux filesystem
C.3 Putting it together
Index