The microarchitecture of Intel and AMD CPUs.pdf

Published: 2022-05-29 | Uploaded by: admin | Category: Manuals | File size: 2.04 MB | Format: PDF | Pages: 226

1 Introduction

1.1 About this manual

1.2 Microprocessor versions covered by this manual

2 Out-of-order execution (All processors except P1, PMMX)

2.1 Instructions are split into µops

2.2 Register renaming

3 Branch prediction (all processors)

3.1 Prediction methods for conditional jumps

Saturating counter

Two-level adaptive predictor with local history tables

Two-level adaptive predictor with global history table

The agree predictor

Loop counter

Indirect jump prediction

Subroutine return prediction

Hybrid predictors

Future branch prediction methods

3.2 Branch prediction in P1

BTB is looking ahead (P1)

Consecutive branches

3.3 Branch prediction in PMMX, PPro, P2, and P3

BTB organization

Misprediction penalty

Pattern recognition for conditional jumps

Tight loops (PMMX)

Indirect jumps and calls (PMMX, PPro, P2 and P3)

JECXZ and LOOP (PMMX)

3.4 Branch prediction in P4 and P4E

Pattern recognition for conditional jumps in P4

Alternating branches

Pattern recognition for conditional jumps in P4E

3.5 Branch prediction in PM and Core2

Misprediction penalty

Pattern recognition for conditional jumps

Pattern recognition for indirect jumps and calls

BTB organization

3.6 Branch prediction in Intel Nehalem

Misprediction penalty

Pattern recognition for conditional jumps

Pattern recognition for indirect jumps and calls

BTB organization

Prediction of function returns

3.7 Branch prediction in Intel Sandy Bridge and Ivy Bridge

Misprediction penalty

Pattern recognition for conditional jumps

Pattern recognition for indirect jumps and calls

BTB organization

Prediction of function returns

3.8 Branch prediction in Intel Haswell, Broadwell and Skylake

Misprediction penalty

Pattern recognition for conditional jumps

Pattern recognition for indirect jumps and calls

BTB organization

Prediction of function returns

3.9 Branch prediction in Intel Atom, Silvermont and Knights Landing

Misprediction penalty

Prediction of indirect branches

Return stack buffer

3.10 Branch prediction in VIA Nano

3.11 Branch prediction in AMD K8 and K10

BTB organization

Misprediction penalty

Pattern recognition for conditional jumps

Prediction of indirect branches

Return stack buffer

Literature:

3.12 Branch prediction in AMD Bulldozer, Piledriver and Steamroller

Misprediction penalty

Return stack buffer

Literature:

3.13 Branch prediction in AMD Bobcat and Jaguar

BTB organization

Misprediction penalty

Pattern recognition for conditional jumps

Prediction of indirect branches

Return stack buffer

Literature:

3.14 Indirect jumps on older processors

3.15 Returns (all processors except P1)

3.16 Static prediction

Static prediction in P1 and PMMX

Static prediction in PPro, P2, P3, P4, P4E

Static prediction in PM and Core2

Static prediction in AMD

3.17 Close jumps

Close jumps on PMMX

Chained jumps on PPro, P2 and P3

Chained jumps on P4, P4E and PM

Chained jumps on AMD

4 Pentium 1 and Pentium MMX pipeline

4.1 Pairing integer instructions

Perfect pairing

Imperfect pairing

4.2 Address generation interlock

4.3 Splitting complex instructions into simpler ones

4.4 Prefixes

4.5 Scheduling floating point code

5 Pentium 4 (NetBurst) pipeline

5.1 Data cache

5.2 Trace cache

Economizing trace cache use on P4

Trace cache use on P4E

Trace cache delivery rate

Branches in the trace cache

Guidelines for improving trace cache performance

5.3 Instruction decoding

5.4 Execution units

5.5 Do the floating point and MMX units run at half speed?

Hypothesis 1

Hypothesis 2

Hypothesis 3

Hypothesis 4

5.6 Transfer of data between execution units

Explanation A

Explanation B

Explanation C

5.7 Retirement

5.8 Partial registers and partial flags

5.9 Store forwarding stalls

5.10 Memory intermediates in dependency chains

Transferring parameters to procedures

Transferring data between floating point and other registers

Literature

5.11 Breaking dependency chains

5.12 Choosing the optimal instructions

INC and DEC

8-bit and 16-bit integers

Memory stores

Shifts and rotates

Integer multiplication

LEA

5.13 Bottlenecks in P4 and P4E

Memory access

Execution latency

Execution unit throughput

Port throughput

Trace cache delivery

Trace cache size

µop retirement

Instruction decoding

Branch prediction

Replaying of µops

6 Pentium Pro, II and III pipeline

6.1 The pipeline in PPro, P2 and P3

6.2 Instruction fetch

6.3 Instruction decoding

Instruction length decoding

The 4-1-1 rule

IFETCH block boundaries

Instruction prefixes

6.4 Register renaming

6.5 ROB read

6.6 Out of order execution

6.7 Retirement

6.8 Partial register stalls

Partial flags stalls

Flags stalls after shifts and rotates

6.9 Store forwarding stalls

6.10 Bottlenecks in PPro, P2, P3

7 Pentium M pipeline

7.1 The pipeline in PM

7.2 The pipeline in Core Solo and Duo

7.3 Instruction fetch

7.4 Instruction decoding

7.5 Loop buffer

7.6 Micro-op fusion

7.7 Stack engine

7.8 Register renaming

7.9 Register read stalls

7.10 Execution units

7.11 Execution units that are connected to both port 0 and 1

7.12 Retirement

7.13 Partial register access

Partial flags stall

7.14 Store forwarding stalls

7.15 Bottlenecks in PM

Memory access

Instruction fetch and decode

Micro-operation fusion

Execution ports

Execution latencies and dependency chains

Partial register access

Branch prediction

Retirement

8 Core 2 and Nehalem pipeline

8.1 Pipeline

8.2 Instruction fetch and predecoding

Loopback buffer

Length-changing prefixes

8.3 Instruction decoding

8.4 Micro-op fusion

8.5 Macro-op fusion

8.6 Stack engine

8.7 Register renaming

8.8 Register read stalls

8.9 Execution units

Data bypass delays on Core2

Data bypass delays on Nehalem

Mixing µops with different latencies

8.10 Retirement

8.11 Partial register access

Partial access to general purpose registers

Partial flags stall

Partial access to XMM registers

8.12 Store forwarding stalls

8.13 Cache and memory access

Cache bank conflicts

Misaligned memory accesses

8.14 Breaking dependency chains

8.15 Multithreading in Nehalem

8.16 Bottlenecks in Core2 and Nehalem

Instruction fetch and predecoding

Instruction decoding

Execution ports and execution units

Execution latency and dependency chains

Partial register access

Retirement

Branch prediction

Memory access

Literature

9 Sandy Bridge and Ivy Bridge pipeline

9.1 Pipeline

9.2 Instruction fetch and decoding

9.3 µop cache

9.4 Loopback buffer

9.5 Micro-op fusion

9.6 Macro-op fusion

9.7 Stack engine

9.8 Register allocation and renaming

Special cases of independence

Instructions that need no execution unit

Elimination of move instructions

9.9 Register read stalls

9.10 Execution units

Read and write bandwidth

Data bypass delays

Mixing µops with different latencies

256-bit vectors

Underflow and subnormals

9.11 Partial register access

Partial flags stall

Partial access to vector registers

9.12 Transitions between VEX and non-VEX modes

9.13 Cache and memory access

Cache bank conflicts

Misaligned memory accesses

Prefetch instructions

9.14 Store forwarding stalls

9.15 Multithreading

9.16 Bottlenecks in Sandy Bridge and Ivy Bridge

Instruction fetch and predecoding

µop cache

Execution ports and execution units

Execution latency and dependency chains

Partial register access

Retirement

Branch prediction

Memory access

Multithreading

Literature

10 Haswell and Broadwell pipeline

10.1 Pipeline

10.2 Instruction fetch and decoding

10.3 µop cache

10.4 Loopback buffer

10.5 Micro-op fusion

10.6 Macro-op fusion

10.7 Stack engine

10.8 Register allocation and renaming

Special cases of independence

Instructions that need no execution unit

Elimination of move instructions

10.9 Execution units

Fused multiply and add

How many input dependencies can a µop have?

Read and write bandwidth

Data bypass delays

256-bit vectors

Mixing µops with different latencies

Underflow and subnormals

10.10 Partial register access

Partial flags access

Partial access to vector registers

10.11 Cache and memory access

Cache bank conflicts

Misaligned memory accesses

10.12 Store forwarding stalls

10.13 Multithreading

10.14 Bottlenecks in Haswell and Broadwell

Instruction fetch and predecoding

µop cache

Execution ports and execution units

Floating point addition has lower throughput than multiplication

Execution latency and dependency chains

Branch prediction

Memory access

Multithreading

Literature

11 Skylake pipeline

11.1 Pipeline

11.2 Instruction fetch and decoding

11.3 µop cache

11.4 Loopback buffer

11.5 Micro-op fusion

11.6 Macro-op fusion

11.7 Stack engine

11.8 Register allocation and renaming

Special cases of independence

Instructions that need no execution unit

Elimination of move instructions

11.9 Execution units

Fused multiply and add

How many input dependencies can a µop have?

Read and write bandwidth

Data bypass delays

256-bit vectors

Warm-up period for 256-bit vector operations

Underflow and subnormals

11.10 Partial register access

Partial flags access

Partial access to vector registers

11.11 Cache and memory access

Cache bank conflicts

11.12 Store forwarding stalls

11.13 Multithreading

11.14 Bottlenecks in Skylake

Instruction fetch and predecoding

µop cache

Execution ports and execution units

Execution latency and dependency chains

Branch prediction

Memory access

Multithreading

Literature

12 Intel Atom pipeline

12.1 Instruction fetch

12.2 Instruction decoding

12.3 Execution units

12.4 Instruction pairing

12.5 X87 floating point instructions

12.6 Instruction latencies

12.7 Memory access

12.8 Branches and loops

12.9 Multithreading

12.10 Bottlenecks in Atom

13 Intel Silvermont pipeline

13.1 Pipeline

13.2 Instruction fetch and decoding

13.3 Loop buffer

13.4 Macro-op fusion

13.5 Register allocation and out of order execution

13.6 Special cases of independence

13.7 Execution units

Read and write bandwidth

Data bypass delays

Underflow and subnormals

13.8 Partial register access

13.9 Cache and memory access

13.10 Store forwarding

13.11 Multithreading

13.12 Bottlenecks in Silvermont

Instruction fetch and decoding

Execution ports and execution units

Out of order execution

Branch prediction

Memory access

Multithreading

Literature

14 Intel Knights Corner pipeline

Literature

15 Intel Knights Landing pipeline

15.1 Pipeline

15.2 Instruction fetch and decoding

15.3 Loop buffer

15.4 Execution units

Latencies of the f.p./vector unit

Mask operations

Data bypass delays

Underflow and subnormals

Mathematical functions

15.5 Partial register access

15.6 Partial access to vector registers and VEX / non-VEX transitions

15.7 Special cases of independence

15.8 Cache and memory access

Read and write bandwidth

15.9 Store forwarding

15.10 Multithreading

15.11 Bottlenecks in Knights Landing

Instruction fetch and decoding

Microcode

Execution ports and execution units

Out of order execution

Branch prediction

Memory access

Multithreading

Literature

16 VIA Nano pipeline

16.1 Performance monitor counters

16.2 Instruction fetch

16.3 Instruction decoding

16.4 Instruction fusion

16.5 Out of order system

16.6 Execution ports

16.7 Latencies between execution units

Latencies between integer and floating point type XMM instructions

16.8 Partial registers and partial flags

16.9 Breaking dependence

16.10 Memory access

16.11 Branches and loops

16.12 VIA specific instructions

16.13 Bottlenecks in Nano

17 AMD K8 and K10 pipeline

17.1 The pipeline in AMD K8 and K10 processors

17.2 Instruction fetch

17.3 Predecoding and instruction length decoding

17.4 Single, double and vector path instructions

17.5 Stack engine

17.6 Integer execution pipes

17.7 Floating point execution pipes

17.8 Mixing instructions with different latency

17.9 64 bit versus 128 bit instructions

17.10 Data delay between differently typed instructions

17.11 Partial register access

17.12 Partial flag access

17.13 Store forwarding stalls

17.14 Loops

17.15 Cache

Level-2 cache

Level-3 cache

17.16 Bottlenecks in AMD K8 and K10

Instruction fetch

Out-of-order scheduling

Execution units

Mixed latencies

Dependency chains

Jumps and branches

Retirement

18 AMD Bulldozer, Piledriver and Steamroller pipeline

18.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller

18.2 Instruction fetch

18.3 Instruction decoding

18.4 Loop buffer

18.5 Instruction fusion

18.6 Stack engine

18.7 Out-of-order schedulers

18.8 Integer execution pipes

18.9 Floating point execution pipes

Subnormal operands

Fused multiply and add

18.10 AVX instructions

18.11 Data delay between different execution domains

18.12 Instructions that use no execution units

18.13 Partial register access

18.14 Partial flag access

18.15 Dependency-breaking instructions

18.16 Branches and loops

18.17 Cache and memory access

18.18 Store forwarding stalls

18.19 Bottlenecks in AMD Bulldozer, Piledriver and Steamroller

Power saving

Shared resources

Instruction fetch

Instruction decoding

Out-of-order scheduling

Execution units

256-bit memory writes

Mixed latencies

Dependency chains

Jumps and branches

Memory and cache access

Retirement

18.20 Literature

19 AMD Bobcat and Jaguar pipeline

19.1 The pipeline in AMD Bobcat and Jaguar

19.2 Instruction fetch

19.3 Instruction decoding

19.4 Single, double and complex instructions

19.5 Integer execution pipes

19.6 Floating point execution pipes

19.7 Mixing instructions with different latency

19.8 Dependency-breaking instructions

19.9 Data delay between differently typed instructions

19.10 Partial register access

19.11 Cache

19.12 Store forwarding stalls

19.13 Bottlenecks in Bobcat and Jaguar

19.14 Literature:

20 Comparison of microarchitectures

20.1 The AMD K8 and K10 kernel

20.2 The AMD Bulldozer, Piledriver and Steamroller kernel

20.3 The Pentium 4 kernel

20.4 The Pentium M kernel

20.5 Intel Core 2 and Nehalem microarchitecture

20.6 Intel Sandy Bridge and later microarchitectures

21 Comparison of low power microarchitectures

21.1 Intel Atom microarchitecture

21.2 VIA Nano microarchitecture

21.3 AMD Bobcat microarchitecture

21.4 Conclusion

22 Future trends

23 Literature

24 Copyright notice

3. The microarchitecture of Intel, AMD and An optimization guide for assembly programmers and VIA CPUs compiler makers By Agner Fog. Technical University of Denmark. Copyright © 1996 - 2016. Last updated 2016-12-01. Contents 1 Introduction ....................................................................................................................... 5 1.1 About this manual ....................................................................................................... 5 1.2 Microprocessor versions covered by this manual ........................................................ 7 2 Out-of-order execution (All processors except P1, PMMX) ................................................ 9 2.1 Instructions are split into µops ..................................................................................... 9 2.2 Register renaming .................................................................................................... 10 3 Branch prediction (all processors) ................................................................................... 12 3.1 Prediction methods for conditional jumps .................................................................. 12 3.2 Branch prediction in P1 ............................................................................................. 18 3.3 Branch prediction in PMMX, PPro, P2, and P3 ......................................................... 21 3.4 Branch prediction in P4 and P4E .............................................................................. 23 3.5 Branch prediction in PM and Core2 .......................................................................... 25 3.6 Branch prediction in Intel Nehalem ........................................................................... 27 3.7 Branch prediction in Intel Sandy Bridge and Ivy Bridge ............................................. 28 3.8 Branch prediction in Intel Haswell, Broadwell and Skylake ........................................ 
29 3.9 Branch prediction in Intel Atom, Silvermont and Knights Landing.............................. 29 3.10 Branch prediction in VIA Nano ................................................................................ 30 3.11 Branch prediction in AMD K8 and K10 .................................................................... 31 3.12 Branch prediction in AMD Bulldozer, Piledriver and Steamroller ............................. 33 3.13 Branch prediction in AMD Bobcat and Jaguar ......................................................... 34 3.14 Indirect jumps on older processors ......................................................................... 35 3.15 Returns (all processors except P1) ......................................................................... 35 3.16 Static prediction ...................................................................................................... 36 3.17 Close jumps ............................................................................................................ 37 4 Pentium 1 and Pentium MMX pipeline ............................................................................. 38 4.1 Pairing integer instructions ........................................................................................ 38 4.2 Address generation interlock ..................................................................................... 42 4.3 Splitting complex instructions into simpler ones ........................................................ 42 4.4 Prefixes ..................................................................................................................... 43 4.5 Scheduling floating point code .................................................................................. 44 5 Pentium 4 (NetBurst) pipeline .......................................................................................... 
47 5.1 Data cache ............................................................................................................... 47 5.2 Trace cache .............................................................................................................. 47 5.3 Instruction decoding .................................................................................................. 52 5.4 Execution units ......................................................................................................... 53 5.5 Do the floating point and MMX units run at half speed? ............................................ 56 5.6 Transfer of data between execution units .................................................................. 58 5.7 Retirement ................................................................................................................ 61 5.8 Partial registers and partial flags ............................................................................... 61 5.9 Store forwarding stalls .............................................................................................. 62 5.10 Memory intermediates in dependency chains ......................................................... 62 5.11 Breaking dependency chains .................................................................................. 64 5.12 Choosing the optimal instructions ........................................................................... 64 5.13 Bottlenecks in P4 and P4E ...................................................................................... 67

6 Pentium Pro, II and III pipeline......................................................................................... 70 6.1 The pipeline in PPro, P2 and P3 ............................................................................... 70 6.2 Instruction fetch ........................................................................................................ 70 6.3 Instruction decoding .................................................................................................. 71 6.4 Register renaming .................................................................................................... 75 6.5 ROB read .................................................................................................................. 75 6.6 Out of order execution .............................................................................................. 79 6.7 Retirement ................................................................................................................ 80 6.8 Partial register stalls .................................................................................................. 81 6.9 Store forwarding stalls .............................................................................................. 84 6.10 Bottlenecks in PPro, P2, P3 .................................................................................... 85 7 Pentium M pipeline .......................................................................................................... 87 7.1 The pipeline in PM .................................................................................................... 87 7.2 The pipeline in Core Solo and Duo ........................................................................... 88 7.3 Instruction fetch ........................................................................................................ 
88 7.4 Instruction decoding .................................................................................................. 88 7.5 Loop buffer ............................................................................................................... 90 7.6 Micro-op fusion ......................................................................................................... 90 7.7 Stack engine ............................................................................................................. 92 7.8 Register renaming .................................................................................................... 94 7.9 Register read stalls ................................................................................................... 94 7.10 Execution units ....................................................................................................... 96 7.11 Execution units that are connected to both port 0 and 1 .......................................... 96 7.12 Retirement .............................................................................................................. 98 7.13 Partial register access ............................................................................................. 98 7.14 Store forwarding stalls .......................................................................................... 100 7.15 Bottlenecks in PM ................................................................................................. 100 8 Core 2 and Nehalem pipeline ........................................................................................ 103 8.1 Pipeline ................................................................................................................... 103 8.2 Instruction fetch and predecoding ........................................................................... 
103 8.3 Instruction decoding ................................................................................................ 106 8.4 Micro-op fusion ....................................................................................................... 106 8.5 Macro-op fusion ...................................................................................................... 107 8.6 Stack engine ........................................................................................................... 108 8.7 Register renaming .................................................................................................. 109 8.8 Register read stalls ................................................................................................. 109 8.9 Execution units ....................................................................................................... 110 8.10 Retirement ............................................................................................................ 114 8.11 Partial register access ........................................................................................... 114 8.12 Store forwarding stalls .......................................................................................... 116 8.13 Cache and memory access ................................................................................... 117 8.14 Breaking dependency chains ................................................................................ 118 8.15 Multithreading in Nehalem .................................................................................... 118 8.16 Bottlenecks in Core2 and Nehalem ....................................................................... 119 9 Sandy Bridge and Ivy Bridge pipeline ............................................................................ 121 9.1 Pipeline ................................................................................................................... 
121 9.2 Instruction fetch and decoding ................................................................................ 121 9.3 µop cache ............................................................................................................... 122 9.4 Loopback buffer ...................................................................................................... 124 9.5 Micro-op fusion ....................................................................................................... 124 9.6 Macro-op fusion ...................................................................................................... 124 9.7 Stack engine ........................................................................................................... 125 9.8 Register allocation and renaming ............................................................................ 126 9.9 Register read stalls ................................................................................................. 127 9.10 Execution units ..................................................................................................... 127 9.11 Partial register access ........................................................................................... 131 9.12 Transitions between VEX and non-VEX modes .................................................... 131 9.13 Cache and memory access ................................................................................... 132 2

9.14 Store forwarding stalls .......................................................................................... 133 9.15 Multithreading ....................................................................................................... 133 9.16 Bottlenecks in Sandy Bridge and Ivy Bridge .......................................................... 134 10 Haswell and Broadwell pipeline ................................................................................... 136 10.1 Pipeline ................................................................................................................. 136 10.2 Instruction fetch and decoding .............................................................................. 136 10.3 µop cache ............................................................................................................. 136 10.4 Loopback buffer .................................................................................................... 137 10.5 Micro-op fusion ..................................................................................................... 137 10.6 Macro-op fusion .................................................................................................... 137 10.7 Stack engine ......................................................................................................... 138 10.8 Register allocation and renaming .......................................................................... 138 10.9 Execution units ..................................................................................................... 139 10.10 Partial register access ......................................................................................... 142 10.11 Cache and memory access ................................................................................. 143 10.12 Store forwarding stalls ........................................................................................ 
10.13 Multithreading
10.14 Bottlenecks in Haswell and Broadwell
11 Skylake pipeline
11.1 Pipeline
11.2 Instruction fetch and decoding
11.3 µop cache
11.4 Loopback buffer
11.5 Micro-op fusion
11.6 Macro-op fusion
11.7 Stack engine
11.8 Register allocation and renaming
11.9 Execution units
11.10 Partial register access
11.11 Cache and memory access
11.12 Store forwarding stalls
11.13 Multithreading
11.14 Bottlenecks in Skylake
12 Intel Atom pipeline
12.1 Instruction fetch
12.2 Instruction decoding
12.3 Execution units
12.4 Instruction pairing
12.5 X87 floating point instructions
12.6 Instruction latencies
12.7 Memory access
12.8 Branches and loops
12.9 Multithreading
12.10 Bottlenecks in Atom
13 Intel Silvermont pipeline
13.1 Pipeline
13.2 Instruction fetch and decoding
13.3 Loop buffer
13.4 Macro-op fusion
13.5 Register allocation and out of order execution
13.6 Special cases of independence
13.7 Execution units
13.8 Partial register access
13.9 Cache and memory access
13.10 Store forwarding
13.11 Multithreading
13.12 Bottlenecks in Silvermont
14 Intel Knights Corner pipeline

15 Intel Knights Landing pipeline
15.1 Pipeline
15.2 Instruction fetch and decoding
15.3 Loop buffer
15.4 Execution units
15.5 Partial register access
15.6 Partial access to vector registers and VEX / non-VEX transitions
15.7 Special cases of independence
15.8 Cache and memory access
15.9 Store forwarding
15.10 Multithreading
15.11 Bottlenecks in Knights Landing
16 VIA Nano pipeline
16.1 Performance monitor counters
16.2 Instruction fetch
16.3 Instruction decoding
16.4 Instruction fusion
16.5 Out of order system
16.6 Execution ports
16.7 Latencies between execution units
16.8 Partial registers and partial flags
16.9 Breaking dependence
16.10 Memory access
16.11 Branches and loops
16.12 VIA specific instructions
16.13 Bottlenecks in Nano
17 AMD K8 and K10 pipeline
17.1 The pipeline in AMD K8 and K10 processors
17.2 Instruction fetch
17.3 Predecoding and instruction length decoding
17.4 Single, double and vector path instructions
17.5 Stack engine
17.6 Integer execution pipes
17.7 Floating point execution pipes
17.8 Mixing instructions with different latency
17.9 64 bit versus 128 bit instructions
17.10 Data delay between differently typed instructions
17.11 Partial register access
17.12 Partial flag access
17.13 Store forwarding stalls
17.14 Loops
17.15 Cache
17.16 Bottlenecks in AMD K8 and K10
18 AMD Bulldozer, Piledriver and Steamroller pipeline
18.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
18.2 Instruction fetch
18.3 Instruction decoding
18.4 Loop buffer
18.5 Instruction fusion
18.6 Stack engine
18.7 Out-of-order schedulers
18.8 Integer execution pipes
18.9 Floating point execution pipes
18.10 AVX instructions
18.11 Data delay between different execution domains
18.12 Instructions that use no execution units
18.13 Partial register access
18.14 Partial flag access

18.15 Dependency-breaking instructions
18.16 Branches and loops
18.17 Cache and memory access
18.18 Store forwarding stalls
18.19 Bottlenecks in AMD Bulldozer, Piledriver and Steamroller
18.20 Literature
19 AMD Bobcat and Jaguar pipeline
19.1 The pipeline in AMD Bobcat and Jaguar
19.2 Instruction fetch
19.3 Instruction decoding
19.4 Single, double and complex instructions
19.5 Integer execution pipes
19.6 Floating point execution pipes
19.7 Mixing instructions with different latency
19.8 Dependency-breaking instructions
19.9 Data delay between differently typed instructions
19.10 Partial register access
19.11 Cache
19.12 Store forwarding stalls
19.13 Bottlenecks in Bobcat and Jaguar
19.14 Literature
20 Comparison of microarchitectures
20.1 The AMD K8 and K10 kernel
20.2 The AMD Bulldozer, Piledriver and Steamroller kernel
20.3 The Pentium 4 kernel
20.4 The Pentium M kernel
20.5 Intel Core 2 and Nehalem microarchitecture
20.6 Intel Sandy Bridge and later microarchitectures
21 Comparison of low power microarchitectures
21.1 Intel Atom microarchitecture
21.2 VIA Nano microarchitecture
21.3 AMD Bobcat microarchitecture
21.4 Conclusion
22 Future trends
23 Literature
24 Copyright notice

1 Introduction

1.1 About this manual

This is the third in a series of five manuals:

1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers.
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs.

5. Calling conventions for different C++ compilers and operating systems.

The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed on page 226 below.

The present manual describes the details of the microarchitectures of x86 microprocessors from Intel and AMD. The Itanium processor is not covered. The purpose of this manual is to enable assembly programmers and compiler makers to optimize software for a specific microprocessor. The main focus is on details that are relevant to calculations of how much time a piece of code takes to execute, such as the latencies of different execution units and the throughputs of various parts of the pipelines. Branch prediction algorithms are also covered in detail.

This manual will also be interesting to students of microarchitecture. But it must be noted that the technical descriptions are mostly based on my own research, which is limited to what is measurable. The descriptions of the "mechanics" of the pipelines are therefore limited to what can be measured by counting clock cycles or micro-operations (µops) and what can be deduced from these measurements. Mechanistic explanations in this manual should be regarded as a model which is useful for predicting microprocessor behavior. I have no way of knowing with certainty whether it is in accordance with the actual physical structure of the microprocessors. The main purpose of providing this information is to enable programmers and compiler makers to optimize their code.

On the other hand, my method of deducing information from measurements rather than relying on information published by microprocessor vendors provides a lot of new information that cannot be found anywhere else. Technical details published by microprocessor vendors are often superficial, incomplete, selective and sometimes misleading.

My findings are sometimes in disagreement with data published by microprocessor vendors. Reasons for this discrepancy might be that such data are theoretical while my data are obtained experimentally under a particular set of testing conditions. I do not claim that all information in this manual is exact. Some timings etc. can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that microprocessor vendors base their technical manuals on.

The tests are done mostly in 32-bit and 64-bit protected mode. Most timing results are independent of the processor mode. Important differences are noted where appropriate. Far jumps, far calls and interrupts have mostly been tested in 16-bit mode for older processors. Call gates etc. have not been tested. The detailed timing results are listed in manual 4: "Instruction tables".

Most of the information in this manual is based on my own research. Many people have sent me useful information and corrections, which I am very thankful for. I keep updating the manual whenever I have new important information. This manual is therefore more detailed, comprehensive and exact than other sources of information, and it contains many details not found anywhere else.

This manual is not for beginners. It is assumed that the reader has a good understanding of assembly programming and microprocessor architecture. If not, then please read some books on the subject and get some programming experience before you begin doing complicated optimizations. See the literature list in manual 2: "Optimizing subroutines in assembly language" or follow the links from www.agner.org/optimize.

Readers may skip the chapters describing old microprocessor designs unless they are using these processors in embedded systems or are interested in historical developments in microarchitecture.

Please don't send your programming questions to me. I am not going to do your homework for you! There are various discussion forums on the Internet where you can get answers to your programming questions if you cannot find the answers in the relevant books and manuals.

1.2 Microprocessor versions covered by this manual

The following families of x86 microprocessors are discussed in this manual:

Microprocessor name | Microarchitecture code name | Abbreviation
Intel Pentium (without name suffix) | P5 | P1
Intel Pentium MMX | P5 | PMMX
Intel Pentium Pro | P6 | PPro
Intel Pentium II | P6 | P2
Intel Pentium III | P6 | P3
Intel Pentium 4 (NetBurst) | Netburst | P4
Intel Pentium 4 with EM64T, Pentium D, etc. | Netburst, Prescott | P4E
Intel Pentium M, Core Solo, Core Duo | Dothan, Yonah | PM
Intel Core 2 | Merom, Wolfdale | Core2
Intel Core i7 | Nehalem | Nehalem
Intel 2nd generation Core | Sandy Bridge | Sandy Bridge
Intel 3rd generation Core | Ivy Bridge | Ivy Bridge
Intel 4th generation Core | Haswell | Haswell
Intel 5th generation Core | Broadwell | Broadwell
Intel 6th generation Core | Skylake | Skylake
Intel Atom 330 | Diamondville | Atom
Intel Bay Trail | Silvermont | Silvermont
Intel Xeon Phi 7210 | Knights Landing | Knights Landing
AMD Athlon | K7 | AMD K7
AMD Athlon 64, Opteron, etc., 64-bit | K8 | AMD K8
AMD Family 10h, Phenom, third generation Opteron | K10 | AMD K10
AMD Family 15h, Bulldozer | Bulldozer | Bulldozer
AMD Family 15h, Piledriver | Piledriver | Piledriver
AMD Family 15h, Steamroller | Steamroller | Steamroller
AMD Bobcat | Bobcat | Bobcat
AMD Kabini, Temash, etc. | Jaguar | Jaguar
VIA Nano, 2000 series | Isaiah | Nano 2000
VIA Nano, 3000 series | | Nano 3000

Table 1.1. Microprocessor families

The abbreviations here are intended to distinguish between different kernel microarchitectures, regardless of trade names. The commercial names of microprocessors often blur the distinctions between different kernel technologies.

The name Celeron applies to P2, P3, P4 or PM with less cache than the standard versions. The name Xeon applies to P2, P3, P4 or Core2 with more cache than the standard versions. The names Pentium D and Pentium Extreme Edition refer to P4E with multiple cores. The brand name Pentium was originally applied to the P5 and P6 microarchitectures, but the same name has later been reapplied to some processors with later microarchitectures. The name Centrino applies to Pentium M, Core Solo and Core Duo processors. Core Solo is rather similar to Pentium M. Core Duo is similar too, but with two cores.

The name Sempron applies to a low-end version of Athlon 64 with less cache. Turion 64 is a mobile version. Opteron is a server version with more cache. Some versions of P4E, PM, Core2 and AMD processors have multiple cores.

The P1 and PMMX processors represent the fifth generation in the Intel x86 series of microprocessors, and their processor kernels are very similar. PPro, P2 and P3 all have the sixth generation kernel (P6). These three processors are almost identical except for the fact that new instructions are added to each new model. P4 is the first processor in the seventh generation which, for obscure reasons, is not called seventh generation in Intel documents. Quite unexpectedly, the generation number returned by the CPUID instruction in the P4 is not 7 but 15. The confusion is complete when the subsequent Intel CPUs (Pentium M, Core and later processors) all report generation number 6.

The reader should be aware that different generations of microprocessors behave very differently. Also, the Intel and AMD microarchitectures are very different. What is optimal for one generation or one brand may not be optimal for the others.
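
The generation numbers mentioned above can be illustrated with a minimal sketch. It assumes the usual encoding of the value that CPUID leaf 1 returns in EAX: a 4-bit family field in bits 11:8 and an 8-bit extended family field in bits 27:20, which is added only when the base field is 15. The function name `cpuid_family` and the sample signature values are illustrative assumptions and are not taken from this manual. The sketch only decodes a value that is already known; it does not execute CPUID itself.

```python
def cpuid_family(eax: int) -> int:
    """Return the family ("generation") number encoded in a CPUID leaf 1 EAX value.

    Hypothetical helper for illustration only.
    """
    base_family = (eax >> 8) & 0xF         # bits 11:8
    extended_family = (eax >> 20) & 0xFF   # bits 27:20
    # The extended field counts only when the 4-bit base field is saturated at 15.
    if base_family == 0xF:
        return base_family + extended_family
    return base_family


# Sample signature values, assumed here purely as examples.
SAMPLES = {
    "P4-class signature 0x00000F29": 0x00000F29,
    "Skylake-class signature 0x000506E3": 0x000506E3,
    "AMD Family 15h-class signature 0x00600F12": 0x00600F12,
}

if __name__ == "__main__":
    for label, eax in SAMPLES.items():
        print(f"{label}: family {cpuid_family(eax)}")
```

With these sample values the P4-class signature decodes to family 15 and the Skylake-class signature to family 6, which matches the numbering described above. The AMD sample decodes to 21, which is 15h in hexadecimal, the notation used in Table 1.1.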
