The Visual Studio solution contains a header file util.h that contains a number of useful functions. To use it, add the inc directory to the Additional Include Directories and include the file with #include "util.h".
It contains the following functions:
std::string fileToString(const std::string&): takes a filename and returns a string with the file contents.
cl::Program buildProgram(const std::string&, const cl::Context&, const std::vector<cl::Device>& devices): takes a filename, an OpenCL context and a vector of OpenCL devices and tries to return an OpenCL program. In case of an error, a message is printed and an exception is thrown.
cl_ulong runAndTimeKernel(const cl::Kernel&, const cl::CommandQueue&, const cl::NDRange&, const cl::NDRange&): takes an OpenCL kernel, an OpenCL command queue and two OpenCL ND ranges, tries to run the kernel (via the command queue) and returns the execution time in nanoseconds.
const char* readableStatus(cl_int): takes an OpenCL error status and returns a human-readable (C-style) string.
closestMultiple(unsigned int, unsigned int): takes two numbers and returns the smallest number that is greater than the first number and that is a multiple of the second number.
template <class T> void showMatrix(T*, unsigned int, unsigned int): takes a pointer to an element T that is assumed to represent a one- or two-dimensional array, together with its width and height, and shows it on the screen.
bool closeEnough(float a, float b, float e): checks whether two floats are close enough to each other with precision e.
The Visual Studio solution contains a number of projects, each corresponding to a different exercise. They will be presented throughout the training. The solutions are here.
Of course, you need a computer. Furthermore, your computer should be equipped with a device that supports OpenCL. This could be a discrete GPU, an integrated GPU or a multi-core processor. Our main purpose is to accelerate programs using GPUs. Therefore, if you have the choice you should prefer a GPU over a processor and a discrete GPU over an integrated GPU.
You need the necessary OpenCL header files and an appropriate OpenCL library file for your system. The header files are part of the code that is provided, together with OpenCL libraries for Windows.
You need a way to edit, compile and link C++ or C code. The code that is provided is written in C++ and uses the OpenCL C++ wrapper API, but if you are a die-hard C-fan you may use C and the OpenCL C API. Note that you will have to tell your compiler where it can find the OpenCL header files. Similarly, your linker needs to know it should use the OpenCL library and where it can find this library. The code that is provided contains a Visual Studio solution that takes care of these things. If you use another system you will have to figure out a few things yourself.
The first way is to write the code yourself. As a matter of fact, a program is provided that does exactly this: for each OpenCL platform it lists the OpenCL devices that belong to it.
Alternatively, you can download GPU Caps Viewer, a program that, among many other things, tells you which OpenCL platforms are present on your computer together with the OpenCL devices belonging to them.
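The source of the provided program is not reproduced here, but a minimal device-listing sketch using the OpenCL C++ wrapper might look like this (it assumes the cl.hpp header and an OpenCL runtime are available):

```cpp
#include <iostream>
#include <vector>
#include <CL/cl.hpp>

int main() {
    // Enumerate all OpenCL platforms on this machine.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    for (const cl::Platform& platform : platforms) {
        std::cout << "Platform: "
                  << platform.getInfo<CL_PLATFORM_NAME>() << '\n';

        // List every device (GPU, CPU, ...) belonging to this platform.
        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (const cl::Device& device : devices) {
            std::cout << "  Device: "
                      << device.getInfo<CL_DEVICE_NAME>() << '\n';
        }
    }
    return 0;
}
```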
OpenCL programs are made up of two components that we will call the host code and the device code. The host code is regular C or C++ code that uses the OpenCL API. You can develop this code in your favorite development environment. If you use a computer in room 4K228, you have to use Visual Studio.
The device code is written in OpenCL C, which is based on C. Editing this code in, for example, Windows Notepad is not a good idea; at the very least you need syntax highlighting. Here are some useful resources for a number of popular editors:
The main hardware vendors provide their own tools to facilitate the development of OpenCL code. Those tools have many features not only for editing but also for debugging and profiling. For the project you should probably not use such a tool, as the investment required to learn it might be prohibitively high compared to using the tools already mentioned. Furthermore, if we chose one of these tools for our course we would favor one vendor over another. We mention them in alphabetical order:
Your first goal should be to write correct code. Both your host code written in C or C++ and your device code written in OpenCL should be correct. We advise you to always compare the results of the computation on the OpenCL device with the result of an equivalent sequential computation of which you are 100% certain it is correct. You will probably find that errors are easily made and that finding them can be very hard. Therefore, we give a few tips.
A useful debugging aid is printf. To use it, you need to enable the extension by including the directive #pragma OPENCL EXTENSION cl_amd_printf : enable in your OpenCL code. When using printf it is a good idea to let only a few work items print something and to include the work item id in what is printed.
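For instance, a kernel could restrict debug output to the first few work items like this (a sketch; the kernel name and argument are illustrative):

```c
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void debugExample(__global const float* input) {
    size_t gid = get_global_id(0);
    // Let only the first few work items print, and include the id,
    // so the output stays readable.
    if (gid < 4) {
        printf("work item %d: input = %f\n", (int)gid, input[gid]);
    }
}
```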
There are a number of commonly made mistakes. We will provide a list of them.
Your second concern is to develop fast code. This means that you should be able to measure the execution time of your program and to break down this time into its most important components. There are three OpenCL-related components that will contribute to the execution time of your program.
We present two ways to measure execution time; depending on your situation you should use the appropriate one. First, if you want to measure the execution time of a single kernel run on a device, you can use OpenCL events. This is exactly what is done by the function runAndTimeKernel, which runs a kernel on a device and returns the time in nanoseconds.
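Under the hood, event-based timing looks roughly like the fragment below. This is a sketch of the technique, not the actual runAndTimeKernel source, and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE:

```cpp
// Sketch: timing a single kernel run with an OpenCL event.
cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           globalRange, localRange, nullptr, &event);
event.wait(); // block until the kernel has finished

cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
cl_ulong nanoseconds = end - start;
```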
Secondly, you can, and sometimes you should, use an external timer. This is the case when you have a program that runs one or more kernels a great number of times. In this case it is a bad idea to use OpenCL events to measure each kernel run separately, because they add a certain amount of overhead that becomes forbiddingly high when accumulated. An external timer class jc::Timer is provided.
You measure time because you want to optimize your code. Therefore, it is important to know what exactly you are measuring. In particular you should at least make a separate measurement of the time spent executing kernels. You can also measure the time that is lost transferring data to and from the GPU and the time needed to set up OpenCL and compile the OpenCL code.
The roofline model is a graph that plots the maximum attainable performance of a code running on a given device as a function of the operational intensity of that code. This definition implies that there is one roofline model per device. To determine the maximum attainable performance of a code on a device, you draw the roofline model of the device, determine the operational intensity of the code, and read the maximum attainable performance from the graph.
Assume a device with memory bandwidth BW and operational peak performance OPP. Now assume this device needs to run a code for which M operations need to be executed and N bytes need to be accessed in memory. If Tc is the time necessary to execute the operations, Tm the time necessary to access the bytes in memory and Tmin the minimum time necessary to run the code on the device, then it is easy to see that:
Tc = M/OPP
Tm = N/BW
Tmin = max(Tc, Tm)
Furthermore, it should be clear that if M/N = OPP/BW, the computational time is equal to the memory access time. This leads us to the following formula for the maximum attainable performance P as a function of the operational intensity M/N:
M/N < OPP/BW : P = BW * M/N
M/N >= OPP/BW : P = OPP
Here is the roofline model of the NVIDIA Tesla C2050 for floating point operations. For this device OPP is 1030 GFlops/s and BW is 115 GB/s. On this device a code with an operational intensity larger than 9 is compute bound. Note that roofline graphs use a logarithmic scale for both the maximum attainable performance and the operational intensity.
[Figure: roofline of the Tesla C2050 — log-log plot of attainable performance (GFlops/s) versus operational intensity; the 115 GB/s memory line meets the 1030 GFlops/s compute ceiling at an operational intensity of about 9.]
Thus, we need to know two things to draw a roofline graph for a device: its memory bandwidth and its operational peak performance. The memory bandwidth is typically specified in the device specifications. It is also possible to determine the operational peak performance for typical operations from the device specifications. If you multiply the number of cores by their clock frequency you obtain the operational peak performance for integer operations and for floating point operations that are not multiply-add. For multiply-add operations the operational peak performance is twice this value.