GPU Computing
Project
GOAL:
implement an algorithm on GPU, analyze and model its
performance and optimize it.
Project 2023
Assignment
Plane
Sweep Code
- Q&A project: 17 April
- deadline project: May 15
- exam: 13-14 June
Upto 2022
Topics
You are free in choosing a topic. For instance, you can parallelize
the algorithm of your thesis, or another one that interests you.
Discuss your choice with the professor.
Inspiration can be found in the project section of
the PPP course and below you also find GPU-specific topics.
Special topics
You can also choose among the following topics:
-
Test and use SYCL (see
khronos) for your GPU implementation
-
device code is specified
in source code with advanced C++ constructs
(Lambdas)
-
see also
coldplay.com ->
ComputeCpp based on SYCL - for linux
- Make a memory access logging and analysis system
- The memory access pattern greatly determines the
performance.
- Log at runtime each memory access (adding 'probes' to
the code that write memory address to an output array)
- Analyze the memory log: check the patterns for
coalesced/uncoalesced access and bank conflicts
- Predict the runtime
- Develop and test more intricate microbenchmarks
- e.g. detect the scheduler of GPUs
- in-depth analysis of Intel/AMD GPUs which are
different
Objectives
For a (frequently occurring) computational problem you should
develop an efficient GPU solution in OpenCL.
- Implement or get the sequential version.
- Create a naive, GPU implementation which is correct (compare
with sequential result).
- Also measure speedup, performance (flops) and bandwidth. We
advice you to make some management code (or script) that
automates this!
- Optimize the naive implementation by trying to overcome the
performance limits it contains.
- Study alternative optimizations and compare their
performance.
- Model the performance with the pipeline model, test how close
the simulated performance is to the real performance.
Make sure to invest enough time in the
comparisons and modeling. The focus is not only on the complexity
of the GPU implementation! It's better to have a less complex
implementation with a good analysis!
Organization
Here are the rules for the project:
- The project will be under the guidance of Jan Lemeire.
- You can work alone or in groups of two/three.
- The deadline to choose a topic is 1st April, 2020.
- Meet us somewhere halfway the project to discuss your current
problems, to let us give advice and to define the expected end
result. Make an individual appointment.
- The deadline for the project is half of May.
Programming and Optimization Tips: see end of lesson 5.
We expect the following deliverables:
- All relevant code related to the project.
- sequential code + parallel implementations
- the parallel versions should check their result with the one
of the sequential version to proof that the outcome is
correct!
- A short report that describes:
- The problem (brief).
- The different implementations. You can be brief here,
since we have your source code. A diagram or scheme might be
helpful here.
- Links to sources of information and of source code.
- Description of the GPUs used. Refer to the sharepoint
excel sheet.
- Most important: a discussion of the performance of the
different implementations
- Give speedups, computational performance (flops) and
bandwidth of the different experiments
- Discussion of results: compare implementations and try
to explain inefficiencies
- especially in case of bad speedups, try to find out
why it is performing so bad
- Model the performance and simulate it with the
www.gpuperformance.org simulator
Additional topics
a) Tree Construction and Tree Traversal (can be split into 2
separate topics)
- Given a mesh of vertices (forming triangles) describing a 3D
volume:
- We want to know in each point in space the distance to the
volume, i.e. the distance to the closest vertex/triangle:
- A naive way to do this is to traverse for each point in space
over all vertices and find the closest one.
- If the mesh contains about 1 million vertices and for the
space 512x512x512 points are considered, this will take ages...
- Finding the closest vertex fast happens by tree describing a
Bounding Volume Hierarchy:
- CUDA-code is provided. Use if for our problem, make it OpenCL
and think about optimizations. On a CPUquadcore i7 the above
mesh takes 15 minutes to complete... On a GPU?
- we will provide you the data.