1 Introduction
1.1 About this manual
1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
2.1 Instructions are split into µops
2.2 Register renaming
3 Branch prediction (all processors)
3.1 Prediction methods for conditional jumps
Saturating counter
Two-level adaptive predictor with local history tables
Two-level adaptive predictor with global history table
The agree predictor
Loop counter
Indirect jump prediction
Subroutine return prediction
Hybrid predictors
Future branch prediction methods
3.2 Branch prediction in P1
BTB is looking ahead (P1)
Consecutive branches
3.3 Branch prediction in PMMX, PPro, P2, and P3
BTB organization
Misprediction penalty
Pattern recognition for conditional jumps
Tight loops (PMMX)
Indirect jumps and calls (PMMX, PPro, P2 and P3)
JECXZ and LOOP (PMMX)
3.4 Branch prediction in P4 and P4E
Pattern recognition for conditional jumps in P4
Alternating branches
Pattern recognition for conditional jumps in P4E
3.5 Branch prediction in PM and Core2
Misprediction penalty
Pattern recognition for conditional jumps
Pattern recognition for indirect jumps and calls
BTB organization
3.6 Branch prediction in Intel Nehalem
Misprediction penalty
Pattern recognition for conditional jumps
Pattern recognition for indirect jumps and calls
BTB organization
Prediction of function returns
3.7 Branch prediction in Intel Sandy Bridge and Ivy Bridge
Misprediction penalty
Pattern recognition for conditional jumps
Pattern recognition for indirect jumps and calls
BTB organization
Prediction of function returns
3.8 Branch prediction in Intel Haswell, Broadwell and Skylake
Misprediction penalty
Pattern recognition for conditional jumps
Pattern recognition for indirect jumps and calls
BTB organization
Prediction of function returns
3.9 Branch prediction in Intel Atom, Silvermont and Knights Landing
Misprediction penalty
Prediction of indirect branches
Return stack buffer
3.10 Branch prediction in VIA Nano
3.11 Branch prediction in AMD K8 and K10
BTB organization
Misprediction penalty
Pattern recognition for conditional jumps
Prediction of indirect branches
Return stack buffer
Literature:
3.12 Branch prediction in AMD Bulldozer, Piledriver and Steamroller
Misprediction penalty
Return stack buffer
Literature:
3.13 Branch prediction in AMD Bobcat and Jaguar
BTB organization
Misprediction penalty
Pattern recognition for conditional jumps
Prediction of indirect branches
Return stack buffer
Literature:
3.14 Indirect jumps on older processors
3.15 Returns (all processors except P1)
3.16 Static prediction
Static prediction in P1 and PMMX
Static prediction in PPro, P2, P3, P4, P4E
Static prediction in PM and Core2
Static prediction in AMD
3.17 Close jumps
Close jumps on PMMX
Chained jumps on PPro, P2 and P3
Chained jumps on P4, P4E and PM
Chained jumps on AMD
4 Pentium 1 and Pentium MMX pipeline
4.1 Pairing integer instructions
Perfect pairing
Imperfect pairing
4.2 Address generation interlock
4.3 Splitting complex instructions into simpler ones
4.4 Prefixes
4.5 Scheduling floating point code
5 Pentium 4 (NetBurst) pipeline
5.1 Data cache
5.2 Trace cache
Economizing trace cache use on P4
Trace cache use on P4E
Trace cache delivery rate
Branches in the trace cache
Guidelines for improving trace cache performance
5.3 Instruction decoding
5.4 Execution units
5.5 Do the floating point and MMX units run at half speed?
Hypothesis 1
Hypothesis 2
Hypothesis 3
Hypothesis 4
5.6 Transfer of data between execution units
Explanation A
Explanation B
Explanation C
5.7 Retirement
5.8 Partial registers and partial flags
5.9 Store forwarding stalls
5.10 Memory intermediates in dependency chains
Transferring parameters to procedures
Transferring data between floating point and other registers
Literature
5.11 Breaking dependency chains
5.12 Choosing the optimal instructions
INC and DEC
8-bit and 16-bit integers
Memory stores
Shifts and rotates
Integer multiplication
LEA
Register-to-register moves with FP, MMX and XMM registers
5.13 Bottlenecks in P4 and P4E
Memory access
Execution latency
Execution unit throughput
Port throughput
Trace cache delivery
Trace cache size
µop retirement
Instruction decoding
Branch prediction
Replaying of µops
6 Pentium Pro, II and III pipeline
6.1 The pipeline in PPro, P2 and P3
6.2 Instruction fetch
6.3 Instruction decoding
Instruction length decoding
The 4-1-1 rule
IFETCH block boundaries
Instruction prefixes
6.4 Register renaming
6.5 ROB read
6.6 Out of order execution
6.7 Retirement
6.8 Partial register stalls
Partial flags stalls
Flags stalls after shifts and rotates
6.9 Store forwarding stalls
6.10 Bottlenecks in PPro, P2, P3
7 Pentium M pipeline
7.1 The pipeline in PM
7.2 The pipeline in Core Solo and Duo
7.3 Instruction fetch
7.4 Instruction decoding
7.5 Loop buffer
7.6 Micro-op fusion
7.7 Stack engine
7.8 Register renaming
7.9 Register read stalls
7.10 Execution units
7.11 Execution units that are connected to both port 0 and 1
7.12 Retirement
7.13 Partial register access
Partial flags stall
7.14 Store forwarding stalls
7.15 Bottlenecks in PM
Memory access
Instruction fetch and decode
Micro-operation fusion
Register read stalls
Execution ports
Execution latencies and dependency chains
Partial register access
Branch prediction
Retirement
8 Core 2 and Nehalem pipeline
8.1 Pipeline
8.2 Instruction fetch and predecoding
Loopback buffer
Length-changing prefixes
8.3 Instruction decoding
8.4 Micro-op fusion
8.5 Macro-op fusion
8.6 Stack engine
8.7 Register renaming
8.8 Register read stalls
8.9 Execution units
Data bypass delays on Core2
Data bypass delays on Nehalem
Mixing µops with different latencies
8.10 Retirement
8.11 Partial register access
Partial access to general purpose registers
Partial flags stall
Partial access to XMM registers
8.12 Store forwarding stalls
8.13 Cache and memory access
Cache bank conflicts
Misaligned memory accesses
8.14 Breaking dependency chains
8.15 Multithreading in Nehalem
8.16 Bottlenecks in Core2 and Nehalem
Instruction fetch and predecoding
Instruction decoding
Register read stalls
Execution ports and execution units
Execution latency and dependency chains
Partial register access
Retirement
Branch prediction
Memory access
Literature
9 Sandy Bridge and Ivy Bridge pipeline
9.1 Pipeline
9.2 Instruction fetch and decoding
9.3 µop cache
9.4 Loopback buffer
9.5 Micro-op fusion
9.6 Macro-op fusion
9.7 Stack engine
9.8 Register allocation and renaming
Special cases of independence
Instructions that need no execution unit
Elimination of move instructions
9.9 Register read stalls
9.10 Execution units
Read and write bandwidth
Data bypass delays
Mixing µops with different latencies
256-bit vectors
Underflow and subnormals
9.11 Partial register access
Partial flags stall
Partial access to vector registers
9.12 Transitions between VEX and non-VEX modes
9.13 Cache and memory access
Cache bank conflicts
Misaligned memory accesses
Prefetch instructions
9.14 Store forwarding stalls
9.15 Multithreading
9.16 Bottlenecks in Sandy Bridge and Ivy Bridge
Instruction fetch and predecoding
µop cache
Register read stalls
Execution ports and execution units
Execution latency and dependency chains
Partial register access
Retirement
Branch prediction
Memory access
Multithreading
Literature
10 Haswell and Broadwell pipeline
10.1 Pipeline
10.2 Instruction fetch and decoding
10.3 µop cache
10.4 Loopback buffer
10.5 Micro-op fusion
10.6 Macro-op fusion
10.7 Stack engine
10.8 Register allocation and renaming
Special cases of independence
Instructions that need no execution unit
Elimination of move instructions
10.9 Execution units
Fused multiply and add
How many input dependencies can a µop have?
Read and write bandwidth
Data bypass delays
256-bit vectors
Mixing µops with different latencies
Underflow and subnormals
10.10 Partial register access
Partial flags access
Partial access to vector registers
10.11 Cache and memory access
Cache bank conflicts
Misaligned memory accesses
10.12 Store forwarding stalls
10.13 Multithreading
10.14 Bottlenecks in Haswell and Broadwell
Instruction fetch and predecoding
µop cache
Execution ports and execution units
Floating point addition has lower throughput than multiplication
Execution latency and dependency chains
Branch prediction
Memory access
Multithreading
Literature
11 Skylake pipeline
11.1 Pipeline
11.2 Instruction fetch and decoding
11.3 µop cache
11.4 Loopback buffer
11.5 Micro-op fusion
11.6 Macro-op fusion
11.7 Stack engine
11.8 Register allocation and renaming
Special cases of independence
Instructions that need no execution unit
Elimination of move instructions
11.9 Execution units
Fused multiply and add
How many input dependencies can a µop have?
Read and write bandwidth
Data bypass delays
256-bit vectors
Warm-up period for 256-bit vector operations
Underflow and subnormals
11.10 Partial register access
Partial flags access
Partial access to vector registers
11.11 Cache and memory access
Cache bank conflicts
11.12 Store forwarding stalls
11.13 Multithreading
11.14 Bottlenecks in Skylake
Instruction fetch and predecoding
µop cache
Execution ports and execution units
Execution latency and dependency chains
Branch prediction
Memory access
Multithreading
Literature
12 Intel Atom pipeline
12.1 Instruction fetch
12.2 Instruction decoding
12.3 Execution units
12.4 Instruction pairing
12.5 X87 floating point instructions
12.6 Instruction latencies
12.7 Memory access
12.8 Branches and loops
12.9 Multithreading
12.10 Bottlenecks in Atom
13 Intel Silvermont pipeline
13.1 Pipeline
13.2 Instruction fetch and decoding
13.3 Loop buffer
13.4 Macro-op fusion
13.5 Register allocation and out of order execution
13.6 Special cases of independence
13.7 Execution units
Read and write bandwidth
Data bypass delays
Underflow and subnormals
13.8 Partial register access
13.9 Cache and memory access
13.10 Store forwarding
13.11 Multithreading
13.12 Bottlenecks in Silvermont
Instruction fetch and decoding
Execution ports and execution units
Out of order execution
Branch prediction
Memory access
Multithreading
Literature
14 Intel Knights Corner pipeline
Literature
15 Intel Knights Landing pipeline
15.1 Pipeline
15.2 Instruction fetch and decoding
15.3 Loop buffer
15.4 Execution units
Latencies of the f.p./vector unit
Mask operations
Data bypass delays
Underflow and subnormals
Mathematical functions
15.5 Partial register access
15.6 Partial access to vector registers and VEX / non-VEX transitions
15.7 Special cases of independence
15.8 Cache and memory access
Read and write bandwidth
15.9 Store forwarding
15.10 Multithreading
15.11 Bottlenecks in Knights Landing
Instruction fetch and decoding
Microcode
Execution ports and execution units
Out of order execution
Branch prediction
Memory access
Multithreading
Literature
16 VIA Nano pipeline
16.1 Performance monitor counters
16.2 Instruction fetch
16.3 Instruction decoding
16.4 Instruction fusion
16.5 Out of order system
16.6 Execution ports
16.7 Latencies between execution units
Latencies between integer and floating point type XMM instructions
16.8 Partial registers and partial flags
16.9 Breaking dependence
16.10 Memory access
16.11 Branches and loops
16.12 VIA specific instructions
16.13 Bottlenecks in Nano
17 AMD K8 and K10 pipeline
17.1 The pipeline in AMD K8 and K10 processors
17.2 Instruction fetch
17.3 Predecoding and instruction length decoding
17.4 Single, double and vector path instructions
17.5 Stack engine
17.6 Integer execution pipes
17.7 Floating point execution pipes
17.8 Mixing instructions with different latency
17.9 64 bit versus 128 bit instructions
17.10 Data delay between differently typed instructions
17.11 Partial register access
17.12 Partial flag access
17.13 Store forwarding stalls
17.14 Loops
17.15 Cache
Level-2 cache
Level-3 cache
17.16 Bottlenecks in AMD K8 and K10
Instruction fetch
Out-of-order scheduling
Execution units
Mixed latencies
Dependency chains
Jumps and branches
Retirement
18 AMD Bulldozer, Piledriver and Steamroller pipeline
18.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
18.2 Instruction fetch
18.3 Instruction decoding
18.4 Loop buffer
18.5 Instruction fusion
18.6 Stack engine
18.7 Out-of-order schedulers
18.8 Integer execution pipes
18.9 Floating point execution pipes
Subnormal operands
Fused multiply and add
18.10 AVX instructions
18.11 Data delay between different execution domains
18.12 Instructions that use no execution units
18.13 Partial register access
18.14 Partial flag access
18.15 Dependency-breaking instructions
18.16 Branches and loops
18.17 Cache and memory access
18.18 Store forwarding stalls
18.19 Bottlenecks in AMD Bulldozer, Piledriver and Steamroller
Power saving
Shared resources
Instruction fetch
Instruction decoding
Out-of-order scheduling
Execution units
256-bit memory writes
Mixed latencies
Dependency chains
Jumps and branches
Memory and cache access
Retirement
18.20 Literature
19 AMD Bobcat and Jaguar pipeline
19.1 The pipeline in AMD Bobcat and Jaguar
19.2 Instruction fetch
19.3 Instruction decoding
19.4 Single, double and complex instructions
19.5 Integer execution pipes
19.6 Floating point execution pipes
19.7 Mixing instructions with different latency
19.8 Dependency-breaking instructions
19.9 Data delay between differently typed instructions
19.10 Partial register access
19.11 Cache
19.12 Store forwarding stalls
19.13 Bottlenecks in Bobcat and Jaguar
19.14 Literature
20 Comparison of microarchitectures
20.1 The AMD K8 and K10 kernel
20.2 The AMD Bulldozer, Piledriver and Steamroller kernel
20.3 The Pentium 4 kernel
20.4 The Pentium M kernel
20.5 Intel Core 2 and Nehalem microarchitecture
20.6 Intel Sandy Bridge and later microarchitectures
21 Comparison of low power microarchitectures
21.1 Intel Atom microarchitecture
21.2 VIA Nano microarchitecture
21.3 AMD Bobcat microarchitecture
21.4 Conclusion
22 Future trends
23 Literature
24 Copyright notice
The microarchitecture of Intel, AMD and VIA CPUs
An optimization guide for assembly programmers and compiler makers
By Agner Fog. Technical University of Denmark.
Copyright © 1996 - 2016. Last updated 2016-12-01.
178 16.4 Instruction fusion ................................................................................................... 178 16.5 Out of order system .............................................................................................. 179 16.6 Execution ports ..................................................................................................... 179 16.7 Latencies between execution units ....................................................................... 180 16.8 Partial registers and partial flags ........................................................................... 182 16.9 Breaking dependence ........................................................................................... 182 16.10 Memory access ................................................................................................... 183 16.11 Branches and loops ............................................................................................ 183 16.12 VIA specific instructions ...................................................................................... 183 16.13 Bottlenecks in Nano ............................................................................................ 184 17 AMD K8 and K10 pipeline ........................................................................................... 185 17.1 The pipeline in AMD K8 and K10 processors ........................................................ 185 17.2 Instruction fetch .................................................................................................... 187 17.3 Predecoding and instruction length decoding ........................................................ 187 17.4 Single, double and vector path instructions ........................................................... 188 17.5 Stack engine ......................................................................................................... 
189 17.6 Integer execution pipes ......................................................................................... 189 17.7 Floating point execution pipes ............................................................................... 189 17.8 Mixing instructions with different latency ............................................................... 191 17.9 64 bit versus 128 bit instructions ........................................................................... 192 17.10 Data delay between differently typed instructions................................................ 193 17.11 Partial register access ......................................................................................... 193 17.12 Partial flag access ............................................................................................... 194 17.13 Store forwarding stalls ........................................................................................ 194 17.14 Loops .................................................................................................................. 195 17.15 Cache ................................................................................................................. 195 17.16 Bottlenecks in AMD K8 and K10 ......................................................................... 197 18 AMD Bulldozer, Piledriver and Steamroller pipeline ..................................................... 198 18.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller ................................... 198 18.2 Instruction fetch .................................................................................................... 199 18.3 Instruction decoding .............................................................................................. 199 18.4 Loop buffer ........................................................................................................... 
200 18.5 Instruction fusion ................................................................................................... 200 18.6 Stack engine ......................................................................................................... 200 18.7 Out-of-order schedulers ........................................................................................ 200 18.8 Integer execution pipes ......................................................................................... 201 18.9 Floating point execution pipes ............................................................................... 201 18.10 AVX instructions.................................................................................................. 202 18.11 Data delay between different execution domains ................................................ 203 18.12 Instructions that use no execution units .............................................................. 204 18.13 Partial register access ......................................................................................... 205 18.14 Partial flag access ............................................................................................... 205 4
18.15 Dependency-breaking instructions ...................................................................... 205 18.16 Branches and loops ............................................................................................ 206 18.17 Cache and memory access ................................................................................. 206 18.18 Store forwarding stalls ........................................................................................ 207 18.19 Bottlenecks in AMD Bulldozer, Piledriver and Steamroller .................................. 208 18.20 Literature ............................................................................................................ 210 19 AMD Bobcat and Jaguar pipeline ................................................................................ 210 19.1 The pipeline in AMD Bobcat and Jaguar ............................................................... 210 19.2 Instruction fetch .................................................................................................... 211 19.3 Instruction decoding .............................................................................................. 211 19.4 Single, double and complex instructions ............................................................... 211 19.5 Integer execution pipes ......................................................................................... 211 19.6 Floating point execution pipes ............................................................................... 211 19.7 Mixing instructions with different latency ............................................................... 212 19.8 Dependency-breaking instructions ........................................................................ 212 19.9 Data delay between differently typed instructions ................................................. 
212 19.10 Partial register access ......................................................................................... 212 19.11 Cache ................................................................................................................. 212 19.12 Store forwarding stalls ........................................................................................ 213 19.13 Bottlenecks in Bobcat and Jaguar ....................................................................... 213 19.14 Literature: ........................................................................................................... 214 20 Comparison of microarchitectures ............................................................................... 214 20.1 The AMD K8 and K10 kernel ................................................................................ 214 20.2 The AMD Bulldozer, Piledriver and Steamroller kernel .......................................... 215 20.3 The Pentium 4 kernel ............................................................................................ 216 20.4 The Pentium M kernel ........................................................................................... 218 20.5 Intel Core 2 and Nehalem microarchitecture ......................................................... 218 20.6 Intel Sandy Bridge and later microarchitectures .................................................... 219 21 Comparison of low power microarchitectures .............................................................. 220 21.1 Intel Atom microarchitecture ................................................................................. 220 21.2 VIA Nano microarchitecture .................................................................................. 220 21.3 AMD Bobcat microarchitecture.............................................................................. 
221 21.4 Conclusion ............................................................................................................ 221 22 Future trends ............................................................................................................... 223 23 Literature ..................................................................................................................... 226 24 Copyright notice .......................................................................................................... 226 1 Introduction 1.1 About this manual This is the third in a series of five manuals: 1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. 2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms. 3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. 5
5. Calling conventions for different C++ compilers and operating systems.

The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed on page 226 below.

The present manual describes the details of the microarchitectures of x86 microprocessors from Intel and AMD. The Itanium processor is not covered.

The purpose of this manual is to enable assembly programmers and compiler makers to optimize software for a specific microprocessor. The main focus is on details that are relevant to calculating how much time a piece of code takes to execute, such as the latencies of the different execution units and the throughputs of the various parts of the pipelines. Branch prediction algorithms are also covered in detail.

This manual will also be interesting to students of microarchitecture. But it must be noted that the technical descriptions are mostly based on my own research, which is limited to what is measurable. The descriptions of the "mechanics" of the pipelines are therefore limited to what can be measured by counting clock cycles or micro-operations (µops) and what can be deduced from these measurements. Mechanistic explanations in this manual should be regarded as a model that is useful for predicting microprocessor behavior. I have no way of knowing with certainty whether it is in accordance with the actual physical structure of the microprocessors. The main purpose of providing this information is to enable programmers and compiler makers to optimize their code.

On the other hand, my method of deducing information from measurements rather than relying on information published by microprocessor vendors provides a lot of new information that cannot be found anywhere else. Technical details published by microprocessor vendors are often superficial, incomplete, selective and sometimes misleading. My findings are sometimes in disagreement with data published by microprocessor vendors.
Reasons for this discrepancy might be that such data are theoretical while my data are obtained experimentally under a particular set of testing conditions. I do not claim that all information in this manual is exact. Some timings etc. can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that microprocessor vendors base their technical manuals on.

The tests are done mostly in 32-bit and 64-bit protected mode. Most timing results are independent of the processor mode. Important differences are noted where appropriate. Far jumps, far calls and interrupts have mostly been tested in 16-bit mode for older processors. Call gates etc. have not been tested. The detailed timing results are listed in manual 4: "Instruction tables".

Most of the information in this manual is based on my own research. Many people have sent me useful information and corrections, which I am very thankful for. I keep updating the manual whenever I have new important information. This manual is therefore more detailed, comprehensive and exact than other sources of information, and it contains many details not found anywhere else.

This manual is not for beginners. It is assumed that the reader has a good understanding of assembly programming and microprocessor architecture. If not, then please read some books on the subject and get some programming experience before you begin doing complicated optimizations. See the literature list in manual 2: "Optimizing subroutines in assembly language", or follow the links from www.agner.org/optimize.

Readers may skip the chapters describing old microprocessor designs unless they are using these processors in embedded systems or are interested in the historical development of microarchitecture.
Please don't send your programming questions to me. I am not gonna do your homework for you! There are various discussion forums on the Internet where you can get answers to your programming questions if you cannot find the answers in the relevant books and manuals.

1.2 Microprocessor versions covered by this manual

The following families of x86 microprocessors are discussed in this manual:

Microprocessor name                                Microarchitecture    Abbreviation
                                                   code name
Intel Pentium (without name suffix)                P5                   P1
Intel Pentium MMX                                  P5                   PMMX
Intel Pentium Pro                                  P6                   PPro
Intel Pentium II                                   P6                   P2
Intel Pentium III                                  P6                   P3
Intel Pentium 4 (NetBurst)                         Netburst             P4
Intel Pentium 4 with EM64T, Pentium D, etc.        Netburst, Prescott   P4E
Intel Pentium M, Core Solo, Core Duo               Dothan, Yonah        PM
Intel Core 2                                       Merom, Wolfdale      Core2
Intel Core i7                                      Nehalem              Nehalem
Intel 2nd generation Core                          Sandy Bridge         Sandy Bridge
Intel 3rd generation Core                          Ivy Bridge           Ivy Bridge
Intel 4th generation Core                          Haswell              Haswell
Intel 5th generation Core                          Broadwell            Broadwell
Intel 6th generation Core                          Skylake              Skylake
Intel Atom 330                                     Diamondville         Atom
Intel Bay Trail                                    Silvermont           Silvermont
Intel Xeon Phi 7210                                Knights Landing      Knights Landing
AMD Athlon                                         K7                   AMD K7
AMD Athlon 64, Opteron, etc., 64-bit               K8                   AMD K8
AMD Family 10h, Phenom, third generation Opteron   K10                  AMD K10
AMD Family 15h, Bulldozer                          Bulldozer            Bulldozer
AMD Family 15h, Piledriver                         Piledriver           Piledriver
AMD Family 15h, Steamroller                        Steamroller          Steamroller
AMD Bobcat                                         Bobcat               Bobcat
AMD Kabini, Temash, etc.                           Jaguar               Jaguar
VIA Nano, 2000 series                              Isaiah               Nano 2000
VIA Nano, 3000 series                              Isaiah               Nano 3000

Table 1.1. Microprocessor families

The abbreviations here are intended to distinguish between different kernel microarchitectures, regardless of trade names. The commercial names of microprocessors often blur the distinctions between different kernel technologies. The name Celeron applies to a P2, P3, P4 or PM with less cache than the standard versions.
The name Xeon applies to a P2, P3, P4 or Core2 with more cache than the standard versions. The names Pentium D and Pentium Extreme Edition refer to P4E processors with multiple cores.

The brand name Pentium was originally applied to the P5 and P6 microarchitectures, but the same name was later reapplied to some processors with later microarchitectures.

The name Centrino applies to Pentium M, Core Solo and Core Duo processors. Core Solo is rather similar to the Pentium M. Core Duo is similar too, but with two cores.
The name Sempron applies to a low-end version of the Athlon 64 with less cache. Turion 64 is a mobile version, and Opteron is a server version with more cache. Some versions of the P4E, PM, Core2 and AMD processors have multiple cores.

The P1 and PMMX processors represent the fifth generation in the Intel x86 series of microprocessors, and their processor kernels are very similar. PPro, P2 and P3 all have the sixth generation kernel (P6). These three processors are almost identical except that new instructions were added with each new model.

The P4 is the first processor of the seventh generation which, for obscure reasons, is not called the seventh generation in Intel documents. Quite unexpectedly, the generation number returned by the CPUID instruction on the P4 is not 7 but 15. The confusion is complete when the subsequent Intel CPUs (Pentium M, Core, and later processors) all report generation number 6.

The reader should be aware that different generations of microprocessors behave very differently. Also, the Intel and AMD microarchitectures are very different. What is optimal for one generation or one brand may not be optimal for the others.
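The generation (family) number discussed above can be read directly with the CPUID instruction. The following sketch, which assumes GCC or Clang on an x86 machine and uses the compiler-provided `<cpuid.h>` wrapper, decodes the family field from CPUID leaf 1 in the way the Intel and AMD manuals specify; the function name `cpuid_family` is my own choice for illustration, not something from this manual's test programs.

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Return the "display family" from CPUID leaf 1: the base family is in
 * bits 8-11 of EAX; when the base family is 15 (as on the P4/NetBurst
 * and on modern AMD CPUs), the extended family in bits 20-27 is added. */
unsigned cpuid_family(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                        /* CPUID leaf 1 not available */
    unsigned family = (eax >> 8) & 0xF;  /* base family */
    if (family == 0xF)
        family += (eax >> 20) & 0xFF;    /* P4: 15 + 0 = 15 */
    return family;
}
```

On a P4 this returns 15, while any P6-descended core from the Pentium M onward returns 6, matching the behavior described above; recent AMD processors report families above 15 through the extended field in the same way.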