# Advanced Computer Architecture

#### LECTURE 4: DATA STREAMS INSTRUCTION EXECUTION INSTRUCTION COMPLETION & RETIREMENT DATA FLOW & REGISTER RENAMING DYNAMIC EXECUTION CORE

JAN LEMEIRE

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

### Instruction Execution (1)

3

• Trend towards more and more specialized FUs

#### • Typically

- o multiple integer FUs
- o multiple floating-point FUs
- o one or more branch units
- o multiple load/store-units
- SIMD units (SSEx)
- o very recently: cryptographic units

### Instruction Execution (2)

- Width determined by fetch, decode and complete width
- Typically #FUs > pipeline width

• Pro

• Handle dynamic mix of instructions (sometimes not matching mix of FUs), even with increased specialization

• Con

- Forwarding hardware becomes much more complex
  - × order n<sup>2</sup>, slow because of high fan-out (# output lines)
  - × need to forward in same cycle as actual computation
  - × solution:
    - often don't want need full cross-bar
    - deal with structural hazards instead

### **Instruction Completion**

#### After execution

o instruction leaves FU

- o enters reorder buffer
- o architectural state not yet updated

#### Instruction leaves reorder buffer

- o called "completion"
- o in-order
- o architectural state is updated
- store leaves store queue to enter store buffer

### **Instruction Retirement**

- Stores actually write to memory after completion, during so-called "retirement"
- For other instructions: completion = retirement

### Illustration of the issue

#### • Example code fragment

- $\times$  w: R4 = R0 + R8
- $\mathbf{x}: \mathbf{R2} = \mathbf{R0} * \mathbf{R4}$
- × y:  $R_4 = R_4 + R_8$
- × z: R8 = R4 \* R2
- In how many cycles can this be executed?
  Addition: 2 cycles
  Multiplication: 3 cycles

### Limits to ILP

8

• Control flow transfers and pipeline bubbles

#### • Data hazards

- RAW: true dependencies
- WAW: output dependencies (not in in-order processors)
- WAR: anti dependencies (not in in-order processors)

#### Structural hazards

• resources

#### **DEPENDENCY GRAPH**

9

#### • Tomasulo Removes WAW



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

# Tomasulo Algorithm (2)

• Improved IBM 360/91: added reservation stations, common data bus and tags



# Tomasulo Algorithm (3)

 Improved IBM 360/91: added Reservation Stations (RS), common data bus and tags
 Storage bus

# Tags need to encode **12 options**:

11 possible sources (6 FLB
+ 5 RS) for data that is not yet available
+ 1 for when data is present already







Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

# Tomasulo Algorithm (7)

15

#### 1. Structural (FU) dependence = > virtual FU's

- FLOS can hold and decode up to 8 instructions.
- Instructions are dispatched to the 5 reservation stations (virtual FU's) even though there are only two physical FU's.
- Hence, structural dependence does not stall dispatching.

#### 2. True dependence (RAW) = > pseudo operands + result forwarding

- If an operand is available in FLR, it is copied to a reservation station entry.
- If an operand is not available (i.e. there is pending write), then a tag is copied to the reservation station entry instead. This tag identifies the source of the pending write. This instruction then waits in its reservation station for the true dependence to be resolved.
- When the operand is finally produced by the source (ID of source = tag value), this source unit asserts its ID, i.e. its tag value, on the CDB followed by broadcasting of the operand on the CDB.
- All the reservation station entries and the FLR entries <u>and</u> SDB entries carrying this tag value in their tag fields will detect a match of tag values and latch in the broadcasted operand from the CDB.
- Hence, true dependence does not block subsequent independent instructions and does not stall a physical FU. Forwarding also minimizes delay due to true dependence.

### Tomasulo Algorithm (8)

16

- 3. **Anti-dependence (WAR)** = > operand copying
  - If an operand is available in FLR, it is copied to a reservation station entry.
  - By copying this operand to the reservation station, all anti-dependences due to future writes to this same register are resolved.
  - Hence, the reading of an operand is not delayed, possibly due to other dependences, and subsequent writes are also not delayed.

## Tomasulo Algorithm (9)

17

#### 4. **Output dependence (WAW)** = > register renaming + result forwarding

- If a register is waiting for a pending write, its tag field will contain the ID, or tag value, of the source for that pending write.
- When that source eventually produces the result, that result will be written into the register via the CDB.
- It is possible that prior to the completion of the pending write, another instruction can come along and also has that same register as its destination register.
- If this occurs, the operands (or pseudo operands) needed by this instruction are still copied to an available reservation station. In addition, the tag field of the destination register of this instruction is updated with the ID of this new reservation station, i.e. the old tag value is overwritten. This will ensure that the said register will get the latest value, i.e. the late completing earlier write cannot overwrite a later write.
- However, the forwarding to instructions between those two output dependent instructions will still use the old tag.
- Hence, the output dependence is resolved without stalling a physical functional unit, not requiring additional buffers to ensure sequential write back to the register file.

# Tomasulo Algorithm (10)

- Supports <u>out of order</u> execution of instructions.
- Resolves dependences dynamically using hardware.
- Attempts to delay the resolution of dependencies as late as possible.
- Structural dependence does not stall issuing; virtual FU's in the form of reservation stations are used.
- Output dependence does not stall issuing; copying of old tag to reservation station and updating of tag field of the register with pending write with the new tag.
- True dependence with a pending write operand does not stall the reading of operands; pseudo operand (tag) is copied to reservation station.
- Anti-dependence does not stall write back; earlier copying of operand awaiting read to the reservation station.
- Can support sequence of multiple output dependences.
- Forwarding from FU's to reservation stations bypasses the register file.

# Tomasulo Algorithm (11)

\_\_\_\_\_

|                    | IBM 360/91                 | Modern processors |
|--------------------|----------------------------|-------------------|
| Width              | Peak IPC = 1               | 4+                |
| Structural hazards | 2 FPU                      | Many FU           |
|                    | Single CDB                 | Many busses       |
| Anti-dependences   | Operand copy               | Reg. renaming     |
| Output dependences | Renamed reg. tag           | Reg. renaming     |
| True dependences   | Tag-based forw.            | Tag-based forw.   |
| Exceptions         | Imprecise                  | Precise (ROB)     |
| Implementation     | 3 x 66" x 15" x 78"        | 1 chip            |
|                    | 60ns cycle time            | 300ps             |
|                    | 11-12 gate delays per pipe | < \$100           |
|                    | stage                      |                   |
|                    | >\$1 million               |                   |
|                    |                            |                   |

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

### Data Flow ILP Limit (1) [244]

20

#### Suppose we have a program

w[i+k].ip = z[i].rp + z[m+i].rp; w[i+j].rp = e[k+1].rp \* (z[i].rp - z[m+i].rp) - e[k+1].ip \* (z[i].ip - z[m+i].ip);

(a)

i1: f2 ← load, 4(r2) i2:  $f0 \leftarrow load, 4(r5)$ i3:  $f0 \leftarrow fadd, f2, f0$ i4:  $4(r6) \leftarrow store, f0$ i5: f14 ← laod,8(r7) i6:  $f6 \leftarrow load, 0(r2)$ i7:  $f5 \leftarrow load, 0(r3)$ i8:  $f5 \leftarrow fsub, f6, f5$ i9: f4 ← fmul, f14, f5 i10: f15 ← load, 12(r7) ill:  $f7 \leftarrow load, 4(r2)$ i12:  $f8 \leftarrow load, 4(r3)$ i13: f8 ← fsub, f7, f8 i14: f8 ← fmul, f15, f8 i15: *f*8 ← *fsub*, *f*4, *f*8 i16: 0(r8) ← store, f8

(b)

# Data Flow ILP Limit (2) [245]

• Depth of data dependence graph determines minimal number of cycles on any machine

o only RAW dependencies

- Further limits imposed by finite number of resources
  - nr of FUs
  - o pipeline width
  - o nr of registers
- Registers are reused in code in flight
  - o because of register allocation
  - because of small loops
  - reuse poses additional limits because of WAW and WAR dependencies



### Register Renaming [237]

22

Reuse means storing more than one value in a register
 until older value is needed, it cannot be overwritten by newer value

#### • Central idea:

- o do not limit storage of operands to architectural registers
- o add other "renaming registers"
- this provides storage space to avoid reuse
- rename *architectural registers* (visible to programmer) in operands to avoid reuse in instruction window
  - × removes WAR dependencies
  - × removes WAW dependencies
- Idea: write to a register no more than once
- o all in hardware



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

# Register Renaming: (1) Source read

25

#### • Busy bit in ARF = 0

No instruction in flight that will write to this registerGet operand from this ARF register

#### • Busy bit in ARF = 1

- Instruction in flight that will write to this ARF registers
- Tag in map table identifies corresponding RRF register

#### o 2 possibilities

- × Valid bit in RRF = 1
  - Instruction is in flight but not yet completed, so read the operand from RRF
- $\times$  Valid bit in RRF = 0
  - Instruction that will write is not yet executed, feed tag of corresponding RRF to reservation buffer (where we will wait for the operand to arrive)

### Register Renaming: (2) Destination allocate

26

- Index ARF with register number
- Set busy bit in ARF to 1
- Find free register (busy bit = 0) in RRF
- Set busy bit in RRF to 1
- Set valid bit in RRF to o
- Store rename register tag in map table of ARF register

# Register Renaming: (3) Register Update

27

• When instruction leaves FU (finishes)

- Write result value in RRF at correct place
- Set valid bit in RRF to 1

#### • When instruction leaves the reorder (completes)

- o Copy the result value from RRF to ARF
- Set busy bit in RRF to o
- This happens in-order!

# • Both steps can be in consecutive cycles or with many cycles in between ...

# Register Renaming (7)

28

#### • Alternative: one combined register file (pooled or merged)

- architectural registers are part of one larger register file
- map table remembers which of the registers are considered as architectural ones (this may change)



- Con: before context switch, need to know which architectural registers to save
- Pro: no copying of data between RRF and ARF

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

# Register Renaming (9)

#### • Initialisation

- All physical registers that are mappings of architectural registers are in the "architectural register" state
- The other physical registers are "available"

#### • for the input operand

• Read physical register corresponding to the architectural register from the mapping table

#### • for the output operand

- Select an "available" physical register, and change its condition to "renaming register, content not available" update the mapping table
- If the original physical register was in the "architectural register" state, its new state becomes "available"
- If there is no "available" register, block until some become available

### Register Renaming (10)

31

- When an instruction finishes execution
  - Change the state to "renaming register, content available"

#### • When an instruction leaves the reorder buffer (completion)

- o change the state into "architectural register",
- the previous physical register associated with this architectural register becomes "available"
  - × Its previous value will not be necessary anymore in case of exceptions
  - × Its previous mapping was remembered in the reorder buffer





RR = renaming register

### Register Renaming (13)





AR = architectural register AV = available RR = renaming register

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



R3←R3+R5

#### R5←R3+R5

AR = architectural register AV = available RR = renaming register

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4

# Register Renaming (16)



Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Register renaming now blocks because there are no more available registers.

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



Now suppose the first two instructions leave the reorder buffer

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4



And the next two instructions leave the reorder buffer as well ...

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4









## Reservation Stations (2)

• Long latency operations can be reselected and reissued

• for example loads that miss in the cache

- o instead of blocking the pipeline, go back to waiting state
- o reissue the instruction later when data is available

## **Reorder Buffer**

46

#### • Tracks the state of all instructions in flight

- being executed
- finished execution but not completed
- awaiting execution in reservation slot
- Bits indicate state
  - possibly tags when speculating beyond multiple branches
- Implemented as a circular queue
- Completion bandwidth
  - determins number of instructions that can complete from head pointer
  - depends on logic and the number of register writeback ports



## Instruction Window

 Reservation station and reorder buffer can be combined in one structure: the instruction window

• Pro

- o no duplication of data (i.e. total area is smaller)
- o everything combined in one controller

#### • Con

- o one large structure instead of two smaller ones
- hence long connections
- o is problematic for high frequencies

# Dynamic Instruction Scheduling (1)

• Two variations on reservation station/instruction window

48

#### o data captured scheduling

- × entries in reservation station hold values of source operands
- × contents read from RF before dispatch
- see previous slides
- o non-data captured scheduling
  - × entries in reservation station do not hold values of source operands
  - × values are read from RF after selection, just prior to execution
  - × there is no OF stage in pipeline front-end
  - × only renaming registers occur in reservation stations
  - × values are fetched from renaming registes on wake-up
  - × is smaller and less complex
    - no register contents
    - no wires to distribute values over all instructions in reservation stations or in instruction window







## Dynamic Instruction Scheduling (5)

### Four pipeline stages

- Wake-up and select
- Read register file
- Execute instruction
- Write result in register file

#### Consequences

- Wake-up and selection in one cycle
- Instruction execution and bypassing in one cycle
- Hence no increased instruction execution latency

## So far: reach the data flow limit

53

### • Try all kinds of techniques

- o branch prediction
- o efficient fetching
- o efficient decoding (caching decoded info)
- dynamic execution
- o register renaming
- Is this it?
- Can't we do better?

# Breaking the data flow limit (1)

### Value prediction

- like branch prediction, predict operand value at beginning of pipeline
- o don't need to wait for operands
- o based on value locality
  - × some instructions always produce the same value
  - others generate sequences of values with patterns (loop indexes)
- o store past in a table, and predict future
- o need to validate prediction
- o misprediction should have minimal cost
- hybrid designs obtain 8-23% IPC increase on SPEC benchmarks

# Breaking the data flow limit (2)

#### Dynamic Instruction Reuse

- o often instructions are executed repeatedly on the same inputs
- don't need to to reexecute them if outcome was saved somewhere
- o even happens for sequences of instructions
- o dynamic instruction reuse
  - × eliminates nodes (computations) and edges in the data flow graph
  - × whereas value prediction only eliminated edges
- o is not speculative, so no validation required

# Breaking the data flow limit (3)

#### Micro-instruction fusion

- CISC ISA instructions are broken down into microinstructions
- ISA remains unchanged
- o micro-architecture changes over time (pipeline depth, etc.)
- optimal mapping between ISA instructions and microinstructions also varies over time
- o one-to-many mapping is not necessarily optimal

o solution:

- × first brake down CISC instructions into micro-operations
- than fuse micro-operations (from multiple CISC instructions) into micro-operations that fit pipeline better

## Breaking the data flow limit (4)

#### • SIMD (single instruction multiple data)

#### o observations

- many (multimedia) applications operate on smaller data (8-bit, 16-bit)
- × executing them separately on 32-bit FUs is a waste of resources
- o original solution on RISC/VLIWs
  - × adapt the FUs to do multiple narrow operations in parallel
  - × add corresponding instructions to the ISA
  - \* add supporting instructions (packing, unpacking, shuffling, ...)
  - \* are basically special-purpose instructions
  - some have very cheap implementation (How do you implement two 16-bit additions on 32-bit adders?)

o was originally called subword parallelism

## Breaking the data flow limit (5)

#### SIMD cont'd

#### o first solution on CISC (x86)

- × where in need of more registers anyway
- × but need backwards binary compatibility
- × add new MMX registers for MMX instructions only
- × so new registers besides new instructions

#### o later extensions

- × even if we don't really need new registers
- adding wider registers is a cheap way to optimize performance
   o complexity of most of pipeline stays the same
   o only part of the data path gets wider
- × SSEx (also for floating point, cryptography, ...)

### Acknowledgement

59

#### • Thanks for (parts of) slides

- Bjorn De Sutter
- o Lieven Eeckhout
- o Mikko H. Lipasti
- o James C. Hoe
- o John P. Shen

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 4