The Visual Studio solution contains a header file util.h that contains a number of useful functions. To use it, add the inc directory to the Additional Include Directories and include the file with #include "util.h".
It contains the following functions:
std::string fileToString(const std::string&): takes a filename and returns a string with the file contents.
cl::Program buildProgram(const std::string&, const cl::Context&, const std::vector<cl::Device>& devices): takes a filename, an OpenCL context and a vector of OpenCL devices and tries to return an OpenCL program. In case of an error, a message is printed and an exception is thrown.
cl_ulong runAndTimeKernel(const cl::Kernel&, const cl::CommandQueue&, const cl::NDRange&, const cl::NDRange&): takes an OpenCL kernel, an OpenCL command queue and two OpenCL ND ranges, tries to run the kernel (via the command queue) and returns the execution time in nanoseconds.
const char* readableStatus(cl_int): takes an OpenCL error status and returns a human-readable (C-style) string.
closestMultiple(unsigned int, unsigned int): takes two numbers and returns the smallest number that is greater than the first number and that is a multiple of the second number.
template <class T> void showMatrix(T*, unsigned int, unsigned int): takes a pointer to an element T that is assumed to represent a one- or two-dimensional array, together with its width and height, and shows it on the screen.
bool closeEnough(float a, float b, float e): checks whether two floats are close enough to each other with precision e.
The Visual Studio solution contains a number of projects, each corresponding to a different exercise. They will be presented throughout the training. The solutions are here.
Of course, you need a computer. Furthermore, your computer should be equipped with a device that supports OpenCL. This could be a discrete GPU, an integrated GPU or a multi-core processor. Our main purpose is to accelerate programs using GPUs. Therefore, if you have the choice you should prefer a GPU over a processor and a discrete GPU over an integrated GPU.
You need the necessary OpenCL header files and an appropriate OpenCL library file for your system. The header files are part of the code that is provided, together with OpenCL libraries for Windows.
You need a way to edit, compile and link C++ or C code. The code that is provided is written in C++ and uses the OpenCL C++ wrapper API, but if you are a die-hard C-fan you may use C and the OpenCL C API. Note that you will have to tell your compiler where it can find the OpenCL header files. Similarly, your linker needs to know it should use the OpenCL library and where it can find this library. The code that is provided contains a Visual Studio solution that takes care of these things. If you use another system you will have to figure out a few things yourself.
The first way is to write the code yourself. As a matter of fact, a program is provided that does exactly this: for each OpenCL platform it lists the OpenCL devices that belong to it.
Alternatively, you can download GPU Caps Viewer, a program that, among many other things, tells you which OpenCL platforms are present on your computer together with the OpenCL devices belonging to them.
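The source of the provided program is not reproduced here, but a minimal device-listing sketch using the OpenCL C++ wrapper might look like this (it assumes the cl.hpp header and an OpenCL runtime are available):

```cpp
#include <iostream>
#include <vector>
#include <CL/cl.hpp>

int main() {
    // Enumerate all OpenCL platforms on this machine.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    for (const cl::Platform& platform : platforms) {
        std::cout << "Platform: "
                  << platform.getInfo<CL_PLATFORM_NAME>() << '\n';

        // List every device (GPU, CPU, ...) belonging to this platform.
        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (const cl::Device& device : devices) {
            std::cout << "  Device: "
                      << device.getInfo<CL_DEVICE_NAME>() << '\n';
        }
    }
    return 0;
}
```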
OpenCL programs are made up of two components that we will call the host code and the device code. The host code is regular C or C++ code that uses the OpenCL API. You can develop this code in your favorite development environment. If you use a computer in room 4K228, you have to use Visual Studio.
The device code is written in OpenCL C, which is based on C. Editing this code in, for example, Windows Notepad is not a good idea; at the very least you need syntax highlighting. Here are some useful resources for a number of popular editors:
The main hardware vendors provide their own tools to facilitate the development of OpenCL code. Those tools have many features not only for editing but also for debugging and profiling. For the project you should probably not use such a tool, as the investment required to learn it might be prohibitively high compared to using the tools already mentioned. Furthermore, if we chose one of these tools for our course we would favor one vendor over another. We mention them in alphabetical order:
Your first goal should be to write correct code. Both your host code written in C or C++ and your device code written in OpenCL should be correct. We advise you to always compare the results of the computation on the OpenCL device with the result of an equivalent sequential computation of which you are 100% certain it is correct. You will probably find that errors are easily made and that finding them can be very hard. Therefore, we give a few tips.
A useful debugging aid is printf. To use it, you need to enable the extension by including the directive #pragma OPENCL EXTENSION cl_amd_printf : enable in your OpenCL code. When using printf it is a good idea to let only a few work items print something and to include the work item id in what is printed.
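For instance, a kernel could restrict debug output to the first few work items like this (a sketch; the kernel name and argument are illustrative):

```c
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void debugExample(__global const float* input) {
    size_t gid = get_global_id(0);
    // Let only the first few work items print, and include the id,
    // so the output stays readable.
    if (gid < 4) {
        printf("work item %d: input = %f\n", (int)gid, input[gid]);
    }
}
```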
There are a number of commonly made mistakes. We will provide a list of them.
Your second concern is to develop fast code. This means that you should be able to measure the execution time of your program and to break down this time into its most important components. There are three OpenCL-related components that will contribute to the execution time of your program.
We present two ways to measure execution time; depending on your situation you should use the appropriate one. First, if you want to measure the execution time of a single kernel run on a device, you can use OpenCL events. This is exactly what is done by the function runAndTimeKernel, which runs a kernel on a device and returns the time in nanoseconds.
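Under the hood, event-based timing looks roughly like the fragment below. This is a sketch of the technique, not the actual runAndTimeKernel source, and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE:

```cpp
// Sketch: timing a single kernel run with an OpenCL event.
cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           globalRange, localRange, nullptr, &event);
event.wait(); // block until the kernel has finished

cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
cl_ulong nanoseconds = end - start;
```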
Secondly, you can, and sometimes you should, use an external timer. This is the case when you have a program that runs one or more kernels a great number of times. In this case it is a bad idea to use OpenCL events to measure each kernel run separately, because they add a certain amount of overhead that becomes forbiddingly high when accumulated. An external timer class jc::Timer is provided.
You measure time because you want to optimize your code. Therefore, it is important to know what exactly you are measuring. In particular you should at least make a separate measurement of the time spent executing kernels. You can also measure the time that is lost transferring data to and from the GPU and the time needed to set up OpenCL and compile the OpenCL code.
The roofline model is a graph that plots the maximum attainable performance of a code running on a given device as a function of the operational intensity of that code. This definition implies that there is one roofline model per device. To determine the maximum attainable performance of a code on a device, you draw the roofline model of the device, determine the operational intensity of the code, and read the maximum attainable performance from the graph.
Assume a device with memory bandwidth BW and operational peak performance OPP. Now assume this device needs to run a code for which M operations need to be executed and N bytes need to be accessed in memory. If Tc is the time necessary to execute the operations, Tm the time necessary to access the bytes in memory and Tmin the minimum time necessary to run the code on the device, then it is easy to see that:
Tc = M/OPP
Tm = N/BW
Tmin = max(Tc, Tm)
Furthermore, it should be clear that if M/N = OPP/BW, the computational time is equal to the memory access time. This leads us to the following formula for the maximum attainable performance P as a function of the operational intensity M/N:
M/N < OPP/BW : P = BW * M/N
M/N >= OPP/BW : P = OPP
Here is the roofline model of the NVIDIA Tesla C2050 for floating point operations. For this device OPP is 1030 GFlops/s and BW is 115 GB/s. On this device a code with an operational intensity larger than 9 is compute bound. Note that roofline graphs use a logarithmic scale for both the maximum attainable performance and the operational intensity.
[Figure: roofline of the Tesla C2050 — log-log plot of attainable performance (GFlops/s) versus operational intensity; the 115 GB/s memory line meets the 1030 GFlops/s compute ceiling at an operational intensity of about 9.]
Thus, we need to know two things to draw a roofline graph for a device: its memory bandwidth and its operational peak performance. The memory bandwidth is typically specified in the device specifications. It is also possible to determine the operational peak performance for typical operations from the device specifications. If you multiply the number of cores by their clock frequency you obtain the operational peak performance for integer operations and for floating point operations that are not multiply-add. For multiply-add operations the operational peak performance is twice this value.