

### Computerarchitectuur

### **22 The GPU architecture**

Jan Lemeire 2024-2025



**MARCHER** 

### What to know from this chapter

- Why is the potential computational power of the GPU so much higher than that of a CPU?
- ▶ Why is it good to have more threads than cores?
- Why is a GPU called a thread engine? Why are fine-grained threads efficient, while not on CPU?
- What are potential performance bottlenecks that prevent from attaining full performance?
- What is the difference between vector processing (SIMD) and GPU thread processing (SIMT)?



#### versus



## **1. The Power of GPUs**

**GPU VS CPU PEAK PEAK PERFORMANCE TRENDS 350 Million triangles/second<br>
<b>3** Dillion transisters CDU  $H_{\rm eff}$  is the has kept up with  $M_{\rm eff}$ **2010 3 Billion transistors GPU**

## **1995**

**5.000 triangles/second 800.000 transistors GPU**

# **2016**

Source : NVIDIA

**14.000 Million triangles/second 15 Billion transistors GPU**



**AUTOMOTIVE** 

 $11$ 

 $70m$ 

455866

 $\mathcal{D}_{\mathcal{S}}$ 

**SEP** 





### Graphical Processing Units (GPUs)





## Supercomputing for free

### ▶ FASTRA at university of Antwerp



http://fastra.ua.ac.be

*Collection of 8 graphical cards in PC*

*FASTRA 8 cards = 8x128 processors = 4000 euro*

*Similar performance as University's supercomputer (512 regular desktop PCs) that costed 3.5 million euro in 2005*

"**Supercomputing in a box**": a high-end GPU cost 500 to 2500 euro and has equivalent power as 40 quadcore CPUs



## Why are GPUs faster?



GPU specialized for math-intensive highly parallel computation

So, more transistors can be devoted to data processing rather than data caching and flow control

*Both, about 15 billion transistors*



*No branch prediction, out-oforder execution,* 

*…*

#### Devote transistors to… computation



## GPU Architecture



## **TRUSSELT Streaming Multiprocessor**





## 1 multiprocessor





## Peak GPU Performance

- GPUs consist of MultiProcessors (MPs) grouping a number of Scalar Processors (SPs)
- ▶ 1 Multiply-Add instruction performs 2 operations at once
- Nvidia GTX 280:
	- 30MPs x 8 SPs/MP x 2FLOPs/instr/SP x 1 instr/clock x 1.3 GHz  $= 624$  GFlops
- Nvidia Tesla C2050:
	- 14 MPs x 32 SPs/MP x 2FLOPs/instr/SP x 1 instr/clock x 1.15 GHz (clocks per second)
	- $= 1030$  GFlops



## Other limit: memory bandwidth

- Nvidia GTX 280:
	- 1.1 GHz memory clock
	- 141 GB/s
- Nvidia Tesla C2050:
	- 1.5 GHz memory clock
	- 144 GB/s

#### **Example:** real-time image processing



**CPU gives only 4 fps next generation machines need 50 fps GPUs deliver 70 fps**



### Example: pixel transformation (FPN)

usgn\_8 transform(usgn\_8 in, sgn\_16 gain, sgn\_16 gain\_divide, sgn\_8 offset)

```
{
  sgn_32 x = (in * gain / gain\_divide) + offset;if (x < 0)
```

```
x = 0;
if (x > 255)x = 255;
return x;
```
}



## Pixel transformation

- **Performance on Tesla C2050**
- 1 pixel is represented by 1 byte [0-255]
	- Per pixel: read 4 bytes (pixel, gain, divide & offset) and write 1 byte
- Integer operations: performance is half of floating point operations
- ▶ Pixel transformation: typically 6 operations (1 index calculation, 3 integer calculations and 2 comparisons)





## Roofline model applied





## Host (CPU) – Device (GPU)





## Typical Sequence of Events



## **2. Massive thread engine**



## Multithreading

- Performing multiple threads of execution in parallel
	- Replicate registers, PC, etc.
	- Fast switching between threads
- ▶ Fine-grain multithreading
	- Switch threads after each cycle
	- Interleave instruction execution
	- If one thread stalls, others are executed
- Coarse-grain multithreading
	- Only switch on long stall (e.g., L2-cache miss)
	- Simplifies hardware, but doesn't hide short stalls (eg, data hazards)



**Overhead Overhead** 

## Multithreading on CPU

- ▶ 1 process/thread simultaneously active per core
- ▶ When activating another thread: *context switch* 
	- Stop program execution: flush pipeline (let all instructions finish)
	- Save state of process/thread into Process Control Block : registers, program counter and operating system-specific data
	- Restore state of activated thread
	- Restart program execution and refill the pipeline



### Fine multi-threading: Hardware threads

- ▶ In several modern CPUs
	- typically 2HW threads
- ▶ Devote extra hardware for process state
- ▶ Thread switching by hardware
	- (almost) no overhead
	- Within 1 cycle!
	- Instructions in flight from different threads



## Simultaneous Multithreading

- ▶ In multiple-issue dynamically scheduled processor
	- Schedule instructions from multiple threads
	- Instructions from independent threads execute when function units are available
	- Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HyperThreading
	- Two threads: duplicated registers, shared function units and caches



### Benefits of fine-grained multithreading

- **Independent instructions (no bubbles)**
- **More time between instructions: possibility** for *latency hiding* 
	- Hide memory accesses
- $\triangleright$  If pipeline full
	- Forwarding not necessary
	- Branch prediction not necessary



## Thread executes kernel

- Massively parallel programs are usually written so that each thread computes one part of a problem
	- For vector addition, we will add corresponding elements from two arrays, so each thread will perform one addition
	- If we think about the thread structure visually, the threads will usually be arranged in the same shape as the data



## Thread Structure

- ▶ Consider a simple vector addition of 16 elements
	- 2 input buffers (A, B) and 1 output buffer (C) are required





## Thread Structure

- ▶ Create thread structure to match the problem
	- 1-dimensional problem in this case Thread IDs





## Thread Structure

▶ Each thread is responsible for adding the indices corresponding to its ID





## The effect of parallelism





## **Concurrency**

- Keep all processing units busy!
	- Enough threads
- All Multiprocessors (MPs)
- All Scalar Processors (SPs)
- Full pipeline of scalar processor
	- Pipeline of up to 24 stages

## **3. GPU architectuur**



## GPU Architecture



## **TRUSSELT Streaming Multiprocessor**





## Execution Model

- Kernel = smallest unit of execution, like a C function, executed by each work item ( $\approx$  kernel thread)
- ▶ Data parallelism: kernel is run by a grid of work groups
- ▶ Work group consist of instances of same kernel: work items
- ▶ Different data elements are fed into the work items of the work groups

 $\rightarrow$ We talk about *stream computing* 



## Kernel execution

#### ▶ Simple scheduler

- Assigns work groups to available streaming MultiProcessors (MPs)
- Basically, a waiting queue for work groups
- Work groups (WGs) execute independently
	- Global Synchronization among work groups is not possible!



#### GPU with 2 MPs

GPU with 4 MPs



## Architecture – Memory Model







## Global memory

- ▶ Divided into partitions
	- NVIDIA GPUs typically have 8 partitions
- Memory controller can serve 1 segment ( $\approx$  cache line of 4x32 Bytes)
- Memory coalescing for warps
	- Accessed elements of a warp should belong to same aligned segment
	- $\circ$  if not (uncoalesced access), memory requests are serialized  $\Rightarrow$ wille take more time
- ▶ Active warps of different cores/multiprocessors simultaneously access global memory
	- **Partition camping** when they access the same partition  $\Rightarrow$ serialization of memory requests



## Local/Shared memory

- **Local/Shared memory is divided into banks**
- Each bank can service one address per cycle
- Multiple simultaneous accesses to a bank result in a bank conflict
	- Conflicting accesses are serialized
	- $\circ$  Cost = max # simultaneous accesses to a single bank
	- No bank conflict:
		- all threads of a half-warp access different banks,
		- all threads of a half-warp access identical address, (broadcast)

**Bank** 

Bank 7

Bank 6

Bank 5

Bank 4

Bank 3

Bank 2

Bank 1

Bank 0

# **Bank Addressing Examples**



## **Bank Addressing Examples**





### Worst case

▶ Threads of the same warp accessing the same column of a matrix having a width of a multiple of 16

## **4. Hardware Threads & SIMT**



## 1 Compute Unit (core)





## The execution on a GPU



- Work groups are scheduled on compute units (cores).
- Warps of active work groups are scheduled on the core



#### ▶ Hardware thread (called warp by Nvidia):

- Work items are executed together in groups, the instructions of the kernel are executed at the same time they will execute the same instruction
- $\circ$  Nvidia: 32; AMD: 64; Intel: variable number  $(8/16/24/32)$

#### Consequences:

- 1. Running 1 work item or 32 work items takes the same amount of time
	- Create work groups which are multiples of 32 or 64 (AMD)
- 2. Branching: if work items of the same warp take different branches, all branches will be executed after each other
	- Performance loss
- 3. Concurrent memory access: if work items access memory, all work items of the same warp do it simultaneously
	- Not all memory access can be done with the same speed



### When is  $SIMT = vector processing?$

#### ▶ Contiguous data access (See lesson 2)



▶ Warp execution of instructions on the data is similar to vector instructions operating on vector registers.



## Vectors versus SIMT

- Vectors
	- Data should be stored in vector register
	- Instructions are performed onto these registers
	- Harder to program
- SIMT
	- Each thread of a warp can choose on which data it works
	- Easier to program: programmer does not have to worry about work item-data mapping



## Warp execution

- Work items are sent into pipeline grouped in a warp
	- ALUs all execute the same instruction in `lockstep': Single  $\ddot{\phantom{1}}$ Instruction, Multiple Threads (SIMT)
	- Every cycle a new warp can issue an instruction  $\ddotmark$



**Cycle**



## Warp execution

On an Nvidia Kepler architecture, a single precision floating point instruction (add or multiplication) takes 9 cycles, which is the length of the pipeline.

- 8 other warps can be scheduled in the mean the mean time  $\leftarrow$
- After 9 cycles, the second instruction of the first warp (multiplication) can be issued, next the second warp and so on
- $\Rightarrow$  With 9 warps the pipeline is completely filled, no stalling/idling, the completion latency of 9 cycles is hidden.





### **SIMT Conditional Processing**

**O** If work items of a warp follow different branches, the instructions of both branches have to be executed, but are desactivated for some threads.

=> Performance loss!

Example: assume 8 threads, one instruction in if-clause, one in then-clause

← 3 cyles in which 24 instructions are executed, 8 lost cycles (66% usage)

Desactivated instructions (red)







## ARM's conditional instructions

- $\rightarrow$  The condition is tested against the current processor flags and if not met the instruction is treated as a no-op.
- $\triangleright$  removes the need to branch, avoiding pipeline stalls and increasing speed. It also increases code density.
- ▶ By using suffix EQ, NE, GT, ...

CMP r0,  $#5$  ; if  $(a == 5)$ BLEQ fn ; fn(10)

MOVEQ r0, #10 ; only executed if equality

## **5. Conclusions**

## Weguential' processor: super-scalar out-of-order pipeline





### **INIVERSITEIT** GPU architecture strategy

### ▶ Light-weight threads, supported by the hardware

- Thread processors, upto 96 threads per processing element
- Switching between threads can happen in 1 cycle!

#### ▶ No caching mechanism, branch prediction, ...

- GPU does not try to be efficient for every program, does not spend transistors on optimization
- Simple straight-forward sequential programming should be abandoned…
- **Less higher-level memory:** 
	- GPU: 16KB shared memory per SIMD multiprocessor
	- CPU: L2 cache contains several MB's
- Massively floating-point computation power
- ▶ RISC ISA instead of CISC
- **Transparent system organization**

◆ Modern (sequential) CPUs based on simple Von Neumann architecture



## GPU processor pipeline

- 6-24 stages
- in-order execution!!
- ▶ no branch prediction!!
- ▶ no forwarding!!
- ▶ no register renaming!!
- Memory system:
	- relatively small
	- Until recently no caching
	- On the other hand: much more registers (see later)
- ▶ No program call stack
	- All functions inlined
	- No recursion possible

#### Challenges of GPU computing





*GPU processing power is not for free*

# Obstacle 1

Hard(er) to implement

# Obstacle 2

Hard(er) to get efficiency

*Discussed in detail in my course GPU Computing (2nd semester)*