This is a brief recapitulation of a typical OpenCL program. In this explanation we will use the OpenCL C++ wrapper library and our own `jc::GPU` wrapper library to explain the basics of OpenCL.
A nice tutorial can be found here. Note that that tutorial uses the original OpenCL API, which allows you to use OpenCL in a much more flexible way but also requires a lot more supporting code. The purpose of this practicum, however, is to test simple kernels running on a single device.
A program consists of two parts: the host program and the device program (the kernel). The host program is responsible for initializing the OpenCL infrastructure, transferring data to and from the device, and launching a kernel on the device.
The kernel code is typically contained in a text file with extension `.cl` or `.ocl`; one file may contain the code for many kernels. Although a kernel function looks like a C function, the following differences are noteworthy:

- A kernel function must be declared with the `__kernel` attribute.
- Pointer arguments must be declared with either the `__global` attribute or the `__local` attribute. The first points to data in global memory, the second to data in local memory.
- A number of built-in functions are available inside a kernel, among which:
  - `get_global_id(i)`: get the id of the work-item in dimension `i` of the global work-item arrangement.
  - `get_local_id(i)`: get the id of the work-item in dimension `i` of the work-group.
  - `get_group_id(i)`: get the id of the work-group in dimension `i` of the global work-group arrangement.
Remember that each work-item is part of a 1-, 2- or 3-dimensional work-group, which in turn is part of a 1-, 2- or 3-dimensional arrangement of work-groups.
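When the global work offset is zero (which is what the `cl::NullRange` offset used later in this practicum gives us), these ids are related by a simple identity. A minimal kernel sketch illustrating it:

```c
__kernel void show_ids(__global uint *out)
{
    // With a zero global work offset the following holds in every dimension i:
    //   get_global_id(i) == get_group_id(i) * get_local_size(i) + get_local_id(i)
    uint reconstructed = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[get_global_id(0)] = reconstructed; // identical to storing get_global_id(0)
}
```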
More information on kernel code can be found in the OpenCL Compiler section of the OpenCL reference pages.
Here is an example of a two-dimensional kernel which takes two matrices as input and returns a matrix as output containing the sum of the two input matrices multiplied by some scaling factor. The data dimensions are assumed to be equal to the global work-item dimensions, i.e. the width is equal to `get_global_size(0)` and the height to `get_global_size(1)`:
```c
__kernel void matrix_sum(__global float *A, __global float *B,
                         __global float *C, float scale)
{
    unsigned int column = get_global_id(0);
    unsigned int row    = get_global_id(1); // dimension 1, not 0
    unsigned int index  = row * get_global_size(0) + column;
    C[index] = scale * (A[index] + B[index]);
}
```
The following steps are typical in a host program:
1. Create a `jc::GPU` object. This is done by specifying the name of the device you want to use and the name of the file that contains the kernel code:

   ```cpp
   jc::GPU gpu(device_name, file_name);
   ```

   Alternatively, you can omit both arguments. In that case the first OpenCL-capable GPU will be chosen and you will have to add your source later on using the method `add_source`, i.e.:

   ```cpp
   jc::GPU gpu;
   gpu.add_source(file_name);
   ```
2. Create a `cl::Buffer` object for each data structure of which data needs to be copied to or from the device. For example, for the kernel shown in the previous section:

   ```cpp
   cl::Buffer A_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
   cl::Buffer B_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
   cl::Buffer C_buffer = gpu.createBuffer<cl_float>(CL_MEM_WRITE_ONLY, width*height);
   ```
3. Copy the input data from the host to the device:

   ```cpp
   // A and B are pointers to the data in the RAM of the CPU;
   // width*height is the number of floating-point elements.
   gpu.enqueueWriteBuffer<cl_float>(A_buffer, CL_TRUE, 0, width*height, A);
   gpu.enqueueWriteBuffer<cl_float>(B_buffer, CL_TRUE, 0, width*height, B);
   ```
4. Create a `cl::Kernel` object. This is done by specifying the name of the kernel function you want to run:

   ```cpp
   cl::Kernel kernel = gpu.createKernel(kernel_name);
   ```
5. Set the arguments of the `cl::Kernel` object:

   ```cpp
   kernel.setArg<cl::Buffer>(0, A_buffer); // corresponds to __global float *A
   kernel.setArg<cl::Buffer>(1, B_buffer); // corresponds to __global float *B
   kernel.setArg<cl::Buffer>(2, C_buffer); // corresponds to __global float *C
   cl_float scale = 0.75;                  // corresponds to float scale
   kernel.setArg<cl_float>(3, scale);
   ```
6. After launching the kernel (explained below), copy the result back from the device to the host:

   ```cpp
   // enqueueReadBuffer is assumed to be the read counterpart of enqueueWriteBuffer.
   gpu.enqueueReadBuffer<cl_float>(C_buffer, CL_TRUE, 0, width*height, C);
   ```
You can launch a kernel by calling the method `enqueueNDRangeKernel` on the `jc::GPU` object. This method returns the number of nanoseconds it took to run the kernel on the device and takes the following arguments:

- The `cl::Kernel` object you created earlier.
- A `cl::NDRange` object specifying the global work offset, which will always be `cl::NullRange` in our case.
- A `cl::NDRange` object that specifies the global work-item configuration, e.g. assume we want to run 4096 by 4096 work-items. The first argument specifies the number of work-items in the X-direction, the second in the Y-direction, etc.:

  ```cpp
  cl::NDRange global(4096, 4096);
  ```

- A `cl::NDRange` object that specifies the work-group configuration, e.g. assume we want our work-groups to contain 16 by 16 work-items. It is important to note that both global and local ranges need to have the same number of dimensions and, furthermore, that the sizes of the work-groups need to be whole divisors of the corresponding sizes of the global work-item configuration:

  ```cpp
  cl::NDRange local(16, 16);
  ```

Our kernel can then be launched as follows:

```cpp
cl_ulong t = gpu.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);
```
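Putting these steps together, a minimal host program could look as shown below. This is only a sketch: the header name `jc.hpp`, the device name, and the `enqueueReadBuffer` method are assumptions based on the API used above, and all error handling is omitted.

```cpp
#include <CL/cl.hpp>
#include <iostream>
#include <vector>
#include "jc.hpp" // assumed name of the header that provides jc::GPU

int main()
{
    const unsigned int width = 4096, height = 4096;
    std::vector<cl_float> A(width*height, 1.0f); // input matrix A
    std::vector<cl_float> B(width*height, 2.0f); // input matrix B
    std::vector<cl_float> C(width*height);       // output matrix C

    // Step 1: create the jc::GPU object (device and file names are examples).
    jc::GPU gpu("GeForce", "matrix_sum.cl");

    // Step 2: create a buffer for each matrix.
    cl::Buffer A_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
    cl::Buffer B_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
    cl::Buffer C_buffer = gpu.createBuffer<cl_float>(CL_MEM_WRITE_ONLY, width*height);

    // Step 3: copy the input matrices to the device.
    gpu.enqueueWriteBuffer<cl_float>(A_buffer, CL_TRUE, 0, width*height, &A[0]);
    gpu.enqueueWriteBuffer<cl_float>(B_buffer, CL_TRUE, 0, width*height, &B[0]);

    // Steps 4 and 5: create the kernel and set its arguments.
    cl::Kernel kernel = gpu.createKernel("matrix_sum");
    kernel.setArg<cl::Buffer>(0, A_buffer);
    kernel.setArg<cl::Buffer>(1, B_buffer);
    kernel.setArg<cl::Buffer>(2, C_buffer);
    cl_float scale = 0.75;
    kernel.setArg<cl_float>(3, scale);

    // Launch one work-item per matrix element, in work-groups of 16 by 16.
    cl::NDRange global(width, height);
    cl::NDRange local(16, 16);
    cl_ulong t = gpu.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);

    // Step 6: copy the result back (enqueueReadBuffer is assumed to be the
    // read counterpart of enqueueWriteBuffer).
    gpu.enqueueReadBuffer<cl_float>(C_buffer, CL_TRUE, 0, width*height, &C[0]);

    std::cout << "kernel took " << t << " ns" << std::endl;
    return 0;
}
```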
We define the occupancy of a kernel as the number of work-items that are concurrently active on a compute unit when this kernel is run, relative to the maximum number of work-items that can be active concurrently on a compute unit.
The occupancy is determined by a number of limited resources:

- the number of registers available on a compute unit;
- the amount of local memory available on a compute unit;
- the maximum number of work-items that can be active concurrently on a compute unit;
- the maximum number of work-groups that can be active concurrently on a compute unit.

The third resource represents the maximum number of work-items that may be active on a compute unit. On the devices you are using this number will probably be 1024. Because only 8 work-groups can be active at the same time on a compute unit, we can only attain maximum occupancy for work-groups of at least 128 work-items (8 × 128 = 1024); with work-groups of 64 work-items, for instance, at most 8 × 64 = 512 work-items can be active, i.e. an occupancy of 50%.
The occupancy of a kernel can be fixed (deliberately limited) by reserving local memory for each work-group in the following way:

```c
__kernel void matrix_sum(__global float *A, __global float *B,
                         __global float *C, float scale, __local float *dummy)
```

with the corresponding argument set on the host as:

```cpp
kernel.setArg<cl::LocalSpaceArg>(4, cl::__local(local_size));
```
Note that `local_size` corresponds to the number of bytes you want to reserve in local memory. By varying `local_size` you can vary the occupancy. The following table shows some typical occupancy figures for different values of `local_size` and the work-group size. Note that in this case it is assumed that the local memory available on one compute unit is 16KB and that no other factors are limiting the occupancy:
| local_size (bytes) | work-group size | occupancy (%) |
|---|---|---|
| 4000 | 256 | 100 |
| 5000 | 256 | 75 |
| 8000 | 256 | 50 |
| 16000 | 256 | 25 |
| 2000 | 128 | 100 |
| 2300 | 128 | 87.5 |
| 2600 | 128 | 75 |
| 3000 | 128 | 62.5 |
| 4000 | 128 | 50 |
| 5000 | 128 | 37.5 |
| 8000 | 128 | 25 |
| 16000 | 128 | 12.5 |
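The figures in this table follow from simple arithmetic. Here is a small sketch that reproduces them, assuming the limits mentioned above (16KB of local memory, at most 1024 work-items and 8 work-groups per compute unit) and no other limiting factors:

```cpp
#include <algorithm>
#include <cstdio>

// Occupancy (in percent) of a kernel that reserves local_size bytes of local
// memory per work-group, under the resource limits assumed in the text.
double occupancy(unsigned local_size, unsigned work_group_size)
{
    const unsigned local_mem  = 16 * 1024; // bytes of local memory per compute unit
    const unsigned max_items  = 1024;      // max active work-items per compute unit
    const unsigned max_groups = 8;         // max active work-groups per compute unit

    // The number of active work-groups is bounded by each resource in turn.
    unsigned groups = local_mem / local_size;
    groups = std::min(groups, max_groups);
    groups = std::min(groups, max_items / work_group_size);

    return 100.0 * groups * work_group_size / max_items;
}

int main()
{
    std::printf("%.1f\n", occupancy(5000, 256)); // prints 75.0
    std::printf("%.1f\n", occupancy(2300, 128)); // prints 87.5
    return 0;
}
```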