GPU Computing Labs
Lab 1 (20/2/2023)
Lab 2 (27/2/2023)
Assignment mini-project 2023
Goal: Develop and study some simple kernels and their performance
Deliverables: report + code
Deadline: 13 March
- Choose one/two basic operations that you will apply on 1 or 2
arrays
- addition - multiplication - mul-add - division - sqrt -
max - bitwise operation (and) - ...
- Choose one or two datatypes: integer, float, double
- Choose and compare either two basic operations
(e.g. compare multiplication and division) or two different
datatypes (e.g. compare float and double).
- 1. Measure the computational and memory performance as a
function of the array size (automate this!)
- performance will converge to its maximum
- it will be memory-bound, i.e. constrained by the memory
bandwidth
- 2. Let the kernel execute on multiple elements (there are
multiple ways to select which elements each thread will do)
- The number of elements is a parameter
- Plot the performance as a function of this parameter
- 3. Artificially increase the compute intensity by doing more
of the same operation on the same data
- Add a loop over the operation and/or use the idea of a
pseudo-random number generator (see this code)
- The compute intensity will depend on the loop count, which
you specify as a parameter
- By doing so, the computational performance will increase, as
demonstrated by the roofline model
- Draw the roofline model (which is hardware dependent) for
the operation and the GPU
- Specify which GPU you are using, its parameters (generation,
#multiprocessors, #scalar processors on each multiprocessor,
frequency, ...) and calculate the theoretical peak performance
- Several characteristics can be queried with the CUDA API.
- With the compute capability you know which generation
your GPU is (see lesson 3 or Wikipedia)
- Use the benchmark app of www.gpuperformance.org
to empirically measure the peak performance (computational and
memory)
- By clicking on the data you'll see the measurements (or
roofline)
- Compare the performances (theoretical, benchmark, kernels) and
try to explain them
Assignment 2021 & 2022
- Goal: Develop and study some simple kernels and their
performance
- Take basic operations like for the PPP mini-project
that you apply on 1 or 2 arrays
- check your choice with me before you start
- 1. Measure the computational and memory performance as a
function of the array size (automate this!)
- measure it on your GPUs and on some of the HYDRA GPUs
- performance will converge to its maximum
- it will be memory-bound, i.e. constrained by the memory
bandwidth
- 2. Let the kernel execute on multiple elements (see notes of
Chapter 2: "one-to-many mapping", which can
happen in two ways, and optionally the "Locally Spaced
One-to-many")
- the number of elements is a parameter
- to pass the parameter from the host (CPU) code to the kernel
(GPU) at compile time, add an options argument to the build
call:
- cl::Program program = jc::buildProgram(kernel_file,
context, device, "-D CONSTANT_PARAMETER=10");
- Now you can use CONSTANT_PARAMETER in the kernel code;
it will be replaced with 10 before compilation
- see project multiplyFloatsWithConstantParameter
- Plot the performance as a function of this parameter
- 3. Artificially increase the compute intensity by doing more
of the same operation on the same data
- Add a loop over the operation and/or use the idea of a
pseudo-random number generator (see the code of the PPP
mini-project)
- The compute intensity will depend on the loop count, which
you specify as a parameter
- I will show you how to pass the parameter from host
(CPU) code to the kernel (GPU) while compiling
- By doing so, the computational performance will increase
as is demonstrated by the roofline model
- Draw the roofline model (which is hardware dependent) for
the operation and every GPU (automate this!).
- You can try to write generic code for the previous three
analyses on the chosen operations.
Assignment 2020
- Develop and study a microbenchmark to measure the performance
(operations per second) of an individual instruction (or
combination of 2 instructions)
- choose one of the existing microbenchmarks (see documents on
theory page and gpuperformance.org)
- develop one of a few extensions
- test it, automate it (calculation of performance, iterations
and comparison of the variants)
- run it on all available GPUs
- compare the results of all variants and compare with the
theoretical performance (explained in the document
of lesson 1)
- put the results and comparison in the report (as a table)
- also: vary the number of work items (from 1 to ...) and plot
the performance as a function of the number of work items
- see project sumIntsOccupancy, in which the array size
and the number of work items are varied between 1 and
MAX_ARRAY_SIZE. With the function delta(array_size)
(defined in JC/util.h) the following sequence is generated:
1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, ...
- present it to the class and discuss
- Make available:
- Install Debian Linux and an NVIDIA GTX 280
- Find out how to use the GPUs of the supercomputing center
- Make it possible to run C & Java programs on the Linux GPU
servers
See documents of lesson 1 and document on levels 0 and 1 of
lesson 2.
Additional information 2020
- Fool the compiler: see document
of lesson 2.
- Query the clock frequency, the number of compute units and the
local memory size: functions added to openCLUtil.hpp
- Calculate performance (operations per second)
- number of operations = number of work items x number of
instructions per work item
- Execute the same code multiple times and take the average result
- see project sumIntsMultipleRuns
THIS WILL NOT BE CONSIDERED THIS YEAR (Only if you want)
- Control occupancy (number of concurrent threads):
- set workgroup size (number of work items in 1 work group)
with cl::NDRange local(wg_size); and pass local with
runAndTimeKernel (see listErosion project)
- set number of concurrent work groups via local memory as
kernel argument (see listErosion project)
- kernel.setArg<cl::LocalSpaceArg>(2,
cl::__local(localMemorySize))
- nbrConcurrentWorkgroups = localMemorySize /
memoryOfWorkgroup
- Calculate CPI for the Lambdas
- use clock frequency to convert seconds to cycles
- calculate how many thread/warp instructions are executed on
one core (compute unit):
- divide total number of instructions by the number of cores
and the warp size (Nvidia: 32, AMD: 64, Intel: undefined -
take 1)
- calculate CPI: cycles per thread instruction on one core
- after ranging occupancy from 1 to maximum:
- minimal CPI is issue latency
- maximal CPI is completion latency
- To Find Out Later: measuring time on the GPU - see ref ->
registers for clock ticks, did not seem correct, can be
reordered by the compiler - OpenCL inline assembly