## Advanced Computer Architecture

LECTURE 3: INSTRUCTION STREAMS [5.1] BRANCH PREDICTION EFFICIENT FETCHING EFFICIENT DECODING EFFICIENT DISPATCHING

JAN LEMEIRE

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

### **Program Control Flow**

### Control flows irregularly through a program

o conditional branches

unconditional branches
indirect branches (instruction

specifing the address of the next instruction to execute)

• Problem: how do we get instructions into the pipeline?



(a)

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Leo

### **Instruction Streams**

- Goal: issue as much as possible useful instructions as early as possible (to keep pipeline filled)
- Correct branch prediction is extremely important
  - Even more important when
    - × pipelines become deeper (mispredication penalty)
    - × width of architecture increases (superscalar)
    - × branch instructions are more complex
- Efficient fetching & decoding is important
  - o high bandwidth
  - high frequencies
  - also for CISC architectures!!!

### **Problems with Branches**

# • Potentially big pipeline bubbles



Advanced Computer Architecture – Jan Lemeire

### **Branch Prediction**

• Main idea: predict where control will be transferred, fetch and execute speculatively

#### Observation

- o temporal locality in branching (loops)
- o can predict if we keep track of past
- o often can predict really well
   (+95% for some programs)

### Three tasks

- 1. branch condition speculation/prediction
- 2. branch target speculation/prediction
- 3. branch mispredictions recovery

### Static vs Dynamic Condition Prediction

#### Static

one prediction per conditional branch in binary code
o determined by software (or hardware convention)
o statically at compile time<sup>1</sup>

### • Dynamic

- many predictions possible, based on local or global branching history
- o determined by hardware
- o dynamically at run time<sup>1</sup>

<sup>1</sup> In proper English, one writes "compile-time optimization" and "optimization at run time".

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

### Static Branch Condition Prediction (1)

• Determine statically for each branch what its predicted condition will be (taken or not-taken)

### Condition determined by

- o conventions
- o hint bit in the instruction encoding

### Three options

- o rule-based
- o program-based
- o profile-based

### Static Branch Condition Prediction (2)

#### • Pro

o Cheap, not complex, little hardware

### • Con

• Not adaptive to program input

• Not adaptive to dynamic program behavior

### Interesting for

o hybrid static-dynamic predictors

- o low-power embedded processors
- o compiler optimization such as code layout

### Static Branch Condition Prediction (3)

### Rule-based

#### • Never taken

× simple hardware, sequential fetching

#### o Always taken

- × more complex hardware, need to know (PC-relative) target address
- × need to fill branch delay slot (hard in OoO processors, hard for compilers)
- Backward taken, forward not taken (BTFNT)
  - only backward branches (corresponding to loops) are mostly taken
  - × for others, compiler can play with code layout

#### • In common: based on low-level (machine code) properties

### Static Branch Condition Prediction (4)

#### Program-based

• Requires a hint bit in instruction encoding

• Features and structure of the source language determine hints

- Ioop branch: predict taken
- × NULL-test for pointers: predict non-NULL
- × pointer comparison: predict not equal
- More accurate than rule-based because of high-level decision logic

### Static Branch Condition Prediction (5)

#### Profile-based

• Requires a hint bit in instruction encoding

- Profile application to collect condition statistics
- Feed back the statistics to 2<sup>nd</sup> compiler run, fills in bits
   \* hint is taken when branch was taken more than 50% of the time
- More accurate than program-based because program-based rules can be tuned
- Requires representative inputs

## Dynamic Branch Condition Prediction (1)

12

- More accurate 80%-97% (↔ static 70%-80%)
- Some branches are hard to predict statically, but easy dynamically
  - o First half of program not-taken, second hald taken
  - Alternating taken and non-taken
  - Input-dependent branches
- Adapts to dynamic behavior of a program
  - Prediction depends on context of branches
- Common in all predictors
  - Finite state machines keep track of recent histories to determine current prediction: pattern history tables
  - Some indirection scheme to choose a particular finite state machine

### Dynamic Branch Condition Prediction (2) [225]

13

- Finite state machines : 2-bit saturating counters
  - Design decision: favoring taken or not-taken
  - o 2-bit states
  - o lookup
    - × oo: predict not-taken
    - × 01: predict not-taken
    - × 10: predict taken
    - × 11: predict taken
  - o update
    - x taken: +1 (saturating arithmetic)
    - × not taken: -1



## Dynamic Branch Condition Prediction (3)

14

• 2-bit saturating counters: suppose the following sequence of branch directions

| branch direction | state before | updated state | prediction |
|------------------|--------------|---------------|------------|
| 0                | 00           | 00            | 0          |
| 0                | 00           | 00            | 0          |
| 1                | 00           | 01            | 0          |
| 1                | 01           | 10            | 0 /2       |
| 1                | 10           | 11            | 1          |
| 1                | 11           | 11            | 1          |
| 0                | 11           | 10            | 1 /2       |
| 1                | 10           | 11            | 1          |
| 1                | 11           | 11            | 1 🔶        |
| 0                | 11           | 10            | 1          |
| 0                | 10           | 01            | 1          |

## **Dynamic Branch Condition Prediction (4)**

15

Basic bimodal branch predictor



 $2^m$  2-bit saturating counters

as in a cache, multiple (branch) addresses are mapped - onto the same line - onto the same FSM

this can result in aliasing

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

## **Dynamic Branch Condition Prediction (5)**

16

- Two-level adaptive predictors
- BHSR (recently executed branches)
  - o global (G)
  - o individual (P)

#### • PHT

- o global: 1 table (g)
- individual: 1 table per branch (p)
- shared: 1 table for a small number of branches (s)

#### • history-based FSM

- o adaptive (A)
- Allows several designs
  - GAg, PAg, PAs, ... with varying table sizes
  - a large design space, try to find optimal cost/performance

#### When branches correlate with behavior of other branches

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3



Pattern history table (PHT)



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

## Dynamic Branch Condition Prediction (7)

18

- Two-level adaptive
- Gshare
- work because of correlation between branches
  - o e.g., branches that test same variable
  - o or just statistically correlated
  - o best with global predictors
- works because of recurring patterns
  - o best with local (individual) predictors
- different parameters for different correlations and patterns

## Hybrid Branch Condition Prediction (1)

19

### Sometimes prediction still goes wrong

#### • Branch can be hard to predict

- × Predictor is still being trained
- × Some branches behave truly random

#### • Tables are limited in size

- × Interference, conflicts or *aliasing* 
  - Can be negative, neutral or positive (correlated jumps)
- $\times$  Two possible reasons
  - Tables too small for number of branches
  - Hash function maps branches to same lines
- Behavior or branch does not fit type of predictor
- Solution: hybrid branch predictor

## Hybrid Branch Condition Prediction (2)

20

- General idea: combine different predictors
- Some branches will be predicted better by one of the predictors
- Remember which predictor is best for each branch
- Several types
  - o tournament
  - o static
  - o branch classification
  - o multi-hybrid
  - o etc.



### Tournament Predictor (2)

22

 Meta predictor is two-bit counter that decides which predictor is used
 O If <2 → P1; if ≥2 → P2</li>

• Update meta predictor

• Do nothing if both predictions correct

- Decrement if P1 correct and P2 incorrect
- Increment if P1 incorrect and P2 correct
- Update both predictors on update

### • Typically have a global and a local predictor



### Branch Classification (2)

24

- Use static predictor for branches that predict well statically (e.g., +95% changes)
- Predict other branches dynamically
- Pro:
  - o less branches in tables
  - hence less aliasing and better performance
- Con:
  - o hints are available in ID stage only

### Example 1: Alpha 21264

25

### Hybrid predictor consisting of

#### o Local PAg

- × 1st level: 1K 10-bit elements
- × 2nd level: 1K 3-bit saturating counters

#### o Globale GAg

- × 4K 2-bit saturating counters
- × 12 bits global branch history
- Meta predictor
  - × 4K 2-bit saturating counters
  - × Indexed as global predictor

### Vb. 2: IBM POWER4

 $\mathbf{26}$ 

#### • Hybride predictor consisting of

#### • Bimodal predictor

× 16K 1-bit saturating counters

#### • Gshare predictor

- × 16K 1-bit saturating counters
- × 11 bits global history

#### • Meta predictor

- × 16K 1-bit saturating counters
- × Indexed like gshare predictor

### Branch Target Buffers [226]

- need to know in time from where to fetch
  - when condition is speculated for conditional branches
  - always for non-conditional branches
- branch target buffer (or branch target access cache)
  - o cache
  - indexed by branch instruction address
  - lookup returns branch target address
- if address is not present, assume not taken
- very early in pipeline!
- update on retirement
- store all branches or only taken



### **Other Branch Prediction Techniques**

28

- many extensions exist
- trace cache (in a couple of slides)
- return address stack
  - keep a small stack of return addresses
  - o push on call
  - o pop return address on return
- to save time, different tables and caches are accessed together *and concurrently*, then choice is made !

## Misprediction Recovery [218-219]

29

- Speculative instructions are tagged (with tag specific for branch)
- When branch is really executed, prediction is validated
- Upon misprediction
  - mispredicted instructions are discarded
  - fetching at correct place is initiated



## Efficient Instruction Fetching (1) [504]

- Branch predictors reduce control flow dependencies
- Still fetching instructions from I\$ in program order
- Problem 1: what if fetch block spans more than one I\$ line
- Problem 2: together with a taken branch, non-executed instructions may be stored in cache



### Efficient Instruction Fetching (2)

31

- Solution 1: Compiler optimizes code layout to place basic blocks at good cache alignment
  - o problem: code generation becomes microarchitecture-dependent
  - far from optimal
- Solution 2: Auto-realignment hardware

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

## Efficient Instruction Fetching (3) [506]

32

### • Alternative: trace cache

- o instead of storing static instructions based on their address
- store dynamic instructions (traces) based on their address and on branch outcomes, higher bandwidth can be obtained

## Efficient Instruction Fetching (4)

33

### • Alternative: trace cache

- instead of storing static instructions based on their address
- store dynamic instructions (traces) based on their address and on branch outcomes, higher bandwidth can be obtained
- fetch-time storing or completion-time storing



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

## Efficient Instruction Fetching (5) [509]

#### High frequency fetching

- many techniques to speculate where to fetch
- o large tables of precise predictors are slow (multiple cycles)
- o overriding branch predictors (©2000)
  - × very accurate predictors are complex and slow
  - × hence first use a simple, single-cycle predictor
  - × override it one or more cycles later by complex, multi-cycle predictor

## Efficient Instruction Decoding (1) [195]

36

#### Decoding determines

- o what the individual instruction types in the fetch group are
- o what their types are, operands, etc...
- o identify dependencies & branch instructions
- => Comparators & multi-ported registers

#### Complexity depends on

- o ISA
- width of superscalar pipeline
- o frequency to be obtained

#### • RISC: easy

- o fixed instruction width
- o limited nr of instruction types
- CISC: much more complex -> several stages

### Efficient Instruction Decoding (2)

36

- CISC instruction widths vary
- Hence decoding is very difficult
- Requires multiple pipeline stages
  - Early on in pipeline -> bad for branch misprediction penalties
- Is very hard to parallelize (sequential dependence on width)
- Need to generate micro-ops
  - Intel: micro-operations
  - AMD: RISC operations
  - Intel: 1.5 2 micro-ops/instruction



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

## Efficient Instruction Decoding (3)

37

#### • Alternative: predecoding [198]

- o (partly) decode when instruction are brought in from memory
- Intel: trace cache in Pentium 4
- AMD: regular I\$

#### • Pro:

• decoding only once (more or less)

• much easier decoding

#### • Con:

- o larger caches
- higher cache-memory latency

#### • RISC? Yes, also

- to identify branches
- independent ops in fetch block



## Efficient Instruction Dispatching (1)

38

Routing instructions to functional units

#### Decentralizes

- previous pipeline stages are centralized
- FU pipelines are decentralized

### • Parallel

- o types are already known
- Dispatch instructions to
  - reservation station(s)
  - o temporary buffers
  - o waiting for operands



### Efficient Instruction Dispatching (2)

39

- Centralized reservation station
- Pro
  - o less blocking
  - higher IPC
- Con
  - o long and complex wiring
  - complex decision logic
- Example
  - Intel Pentium Pro



## Efficient Instruction Dispatching (3)

40

- Distributed reservation station
- Pro
  - o smaller structure, less wiring
  - simple decision logic
  - o low hardware complexity
- Con
  - Worse overall utilization
  - Saturation/blocking possible Iss
  - o lower IPC
- Example
  - IBM PowerPC 650
- Also combinations possible



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3

### Efficient Instruction Dispatching (4)

### Terminology

### Dispatch

o push instruction into reservation station

- o in OoO architecture: push into reorder buffer
- o decentralized reservation stations: routing to correct station

#### • Issue

• select an instruction from reservation station

• start its execution in the functional unit (pipeline)

### Acknowledgement

42

• Thanks for (parts of) slides

o Bjorn De Sutter
o Lieven Eeckhout
o Milde II Liperti

o Mikko H. Lipasti

• James C. Hoe

o John P. Shen

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 3