GPU Computing Labs
Lab 1 (20/2/2023)
Lab 2 (27/2/2023)
Assignment mini-project 2023
Goal: Develop and study some simple kernels and their performance
Deliverables: report + code
Deadline: 13 March
- Choose one/two basic operations that you will apply on 1 or 2
arrays
- addition - multiplication - mul-add - division - sqrt -
max - bitwise operation (and) - ...
- Choose one or two datatypes: integer, float, double
- Choose and compare either two basic operations
(e.g. compare multiplication and division) or two different
datatypes (e.g. compare float and double).
- 1. Measure the computational and memory performance as a
function of the array size (automate this!)
- performance will converge to its maximum
- it will be memory-bound, i.e. constrained by the memory
bandwidth
- 2. Let the kernel execute on multiple elements (there are
multiple ways to select which elements each thread will do)
- The number of elements is a parameter
- Plot the performance as a function of this parameter
- 3. Artificially increase the compute intensity by doing more
of the same operation on the same data
- Add a loop over the operation and/or use the idea of a
pseudo-random number generator (see this code)
- The compute intensity will depend on the loop count, which
you specify as a parameter
- By doing so, the computational performance will increase, as
demonstrated by the roofline model
- Draw the roofline model (which is hardware dependent) for
the operation and the GPU
- Specify which GPU you are using, its parameters (generation,
#multiprocessors, #scalar processors on each multiprocessor,
frequency, ...) and calculate the theoretical peak performance
- Several characteristics can be queried with the CUDA API.
- With the compute capability you know which generation
your GPU is (see lesson 3 or Wikipedia)
- Use the benchmark app of www.gpuperformance.org
to empirically measure the peak performance (computational and
memory)
- By clicking on the data you'll see the measurements (or
roofline)
- Compare the performances (theoretical, benchmark, kernels) and
try to explain them
Assignment 2021 & 2022
- Goal: Develop and study some simple kernels and their
performance
- Take basic operations like for the PPP mini-project
that you apply on 1 or 2 arrays
- check your choice with me before you start
- 1. Measure the computational and memory performance as a
function of the array size (automate this!)
- measure it on your GPUs and on some of the HYDRA GPUs
- performance will converge to its maximum
- it will be memory-bound, i.e. constrained by the memory
bandwidth
- 2. Let the kernel execute on multiple elements (see notes of
Chapter 2: "one-to-many mapping", which can
happen in two ways, and optionally the "Locally Spaced
One-to-many")
- the number of elements is a parameter
- to pass the parameter from the host (CPU) code to the kernel
(GPU) at compile time, add an options argument to the build
call:
- cl::Program program = jc::buildProgram(kernel_file,
context, device, "-D CONSTANT_PARAMETER=10");
- Now you can use CONSTANT_PARAMETER in the kernel code;
it will be replaced with 10 before compilation
- see project multiplyFloatsWithConstantParameter
- Plot the performance as a function of this parameter
- 3. Artificially increase the compute intensity by doing more
of the same operation on the same data
- Add a loop over the operation and/or use the idea of a
pseudo-random number generator (see the code of the PPP
mini-project)
- The compute intensity will depend on the loop count, which
you specify as a parameter
- I will show you how to pass the parameter from host
(CPU) code to the kernel (GPU) while compiling
- By doing so, the computational performance will increase
as is demonstrated by the roofline model
- Draw the roofline model (which is hardware dependent) for
the operation and every GPU (automate this!).
- You can try to write generic code for the previous three
analyses on the chosen operations.
Assignment 2020
- Develop and study a microbenchmark to measure the performance
(operations per second) of an individual instruction (or
combination of 2 instructions)
- choose one of the existing microbenchmarks (see documents on
theory page and gpuperformance.org)
- develop one of a few extensions
- test it, automate it (calculation of performance, iterations
and comparison of the variants)
- run it on all available GPUs
- compare the results of all variants and compare with the
theoretical performance (explained in the document
of lesson 1)
- put the results and comparison in the report (as a table)
- also: vary the number of work items (from 1 to ...) and plot
the performance as a function of the number of work items
- see project sumIntsOccupancy, in which the array size
and the number of work items are varied between 1 and
MAX_ARRAY_SIZE. With the function delta(array_size)
(defined in JC/util.h) the following sequence is generated:
1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, ...
- present it to the class and discuss
- Make available:
- Install Debian Linux and an NVIDIA GTX 280
- Find out how to use the GPUs of the supercomputing center
- Make it possible to run C & Java programs on the Linux GPU
servers
See documents of lesson 1 and document on levels 0 and 1 of
lesson 2.
Additional information 2020
- Fool the compiler: see document
of lesson 2.
- Query the clock frequency, the number of compute units and the
local memory size: functions added to openCLUtil.hpp
- Calculate performance (operations per second)
- number of operations = number of work items x number of
instructions per work item
- Execute the same code multiple times and take the average result
- see project sumIntsMultipleRuns
THIS WILL NOT BE CONSIDERED THIS YEAR (Only if you want)
- Control occupancy (number of concurrent threads):
- set workgroup size (number of work items in 1 work group)
with cl::NDRange local(wg_size); and pass local with
runAndTimeKernel (see listErosion project)
- set number of concurrent work groups via local memory as
kernel argument (see listErosion project)
- kernel.setArg<cl::LocalSpaceArg>(2,
cl::__local(localMemorySize))
- nbrConcurrentWorkgroups = localMemorySize /
memoryOfWorkgroup
- Calculate CPI for the Lambdas
- use clock frequency to convert seconds to cycles
- calculate how many thread/warp instructions are executed on
one core (compute unit):
- divide total number of instructions by the number of cores
and the warp size (Nvidia: 32, AMD: 64, Intel: undefined -
take 1)
- calculate CPI: cycles per thread instruction on one core
- after ranging occupancy from 1 to maximum:
- minimal CPI is issue latency
- maximal CPI is completion latency
- To Find Out Later: measuring time on the GPU - see ref ->
registers for clock ticks, did not seem correct, can be
reordered by the compiler - OpenCL inline assembly