This is a brief recapitulation of a typical OpenCL program. In this explanation we will use the OpenCL C++ wrapper library and our own `jc::GPU` wrapper library to explain the basics of OpenCL.
A nice tutorial can be found here. Note that that tutorial uses the original OpenCL API, which allows you to use OpenCL in a much more flexible way but also requires a lot more supporting code. The purpose of this practicum, however, is to test simple kernels running on a single device.
A program consists of two parts: the host program and the device program (the kernel). The host program is responsible for initializing the OpenCL infrastructure, transferring data to and from the device, and launching a kernel on the device.
The kernel code is typically contained in a text file with extension `.cl` or `.ocl`; one file may contain the code for many kernels. Although a kernel function looks like a C function, the following differences are noteworthy:

- A kernel function must be declared with the `__kernel` attribute.
- Pointer arguments must be declared with either the `__global` attribute or the `__local` attribute. The first points to data in global memory, the second to data in local memory.
- A number of built-in functions are available inside a kernel, among which:
  - `get_global_id(i)`: get the id of the work-item in dimension `i` of the global work-item arrangement.
  - `get_local_id(i)`: get the id of the work-item in dimension `i` of the work-group.
  - `get_group_id(i)`: get the id of the work-group in dimension `i` of the global work-group arrangement.
Remember that each work-item is part of a 1-, 2- or 3-dimensional work-group, which in turn is part of a 1-, 2- or 3-dimensional arrangement of work-groups.
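When the global work offset is zero (which is what the `cl::NullRange` offset used later in this practicum gives us), these ids are related by a simple identity. A minimal kernel sketch illustrating it:

```c
__kernel void show_ids(__global uint *out)
{
    // With a zero global work offset the following holds in every dimension i:
    //   get_global_id(i) == get_group_id(i) * get_local_size(i) + get_local_id(i)
    uint reconstructed = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[get_global_id(0)] = reconstructed; // identical to storing get_global_id(0)
}
```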
More information on kernel code can be found in the OpenCL Compiler section of the OpenCL reference pages.
Here is an example of a two-dimensional kernel which takes two matrices as input and returns a matrix as output containing the sum of the two input matrices multiplied by some scaling factor. The data dimensions are assumed to be equal to the global work-item dimensions, i.e. the width is equal to `get_global_size(0)` and the height to `get_global_size(1)`:
```c
__kernel void matrix_sum(__global float *A, __global float *B,
                         __global float *C, float scale)
{
    unsigned int column = get_global_id(0);
    unsigned int row    = get_global_id(1); // dimension 1, not 0
    unsigned int index  = row * get_global_size(0) + column;
    C[index] = scale * (A[index] + B[index]);
}
```
The following steps are typical in a host program:
1. Create a `jc::GPU` object. This is done by specifying the name of the device you want to use and the name of the file that contains the kernel code:

   ```cpp
   jc::GPU gpu(device_name, file_name);
   ```

   Alternatively, you can omit both arguments. In that case the first OpenCL-capable GPU will be chosen and you will have to add your source later on using the method `add_source`, i.e.:

   ```cpp
   jc::GPU gpu;
   gpu.add_source(file_name);
   ```
2. Create a `cl::Buffer` object for each data structure of which data needs to be copied to or from the device. For example, for the kernel shown in the previous section:

   ```cpp
   cl::Buffer A_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
   cl::Buffer B_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
   cl::Buffer C_buffer = gpu.createBuffer<cl_float>(CL_MEM_WRITE_ONLY, width*height);
   ```
3. Copy the input data from the host to the device:

   ```cpp
   // A and B are pointers to the data in the RAM of the CPU;
   // width*height is the number of floating-point elements.
   gpu.enqueueWriteBuffer<cl_float>(A_buffer, CL_TRUE, 0, width*height, A);
   gpu.enqueueWriteBuffer<cl_float>(B_buffer, CL_TRUE, 0, width*height, B);
   ```
4. Create a `cl::Kernel` object. This is done by specifying the name of the kernel function you want to run:

   ```cpp
   cl::Kernel kernel = gpu.createKernel(kernel_name);
   ```
5. Set the arguments of the `cl::Kernel` object:

   ```cpp
   kernel.setArg<cl::Buffer>(0, A_buffer); // corresponds to __global float *A
   kernel.setArg<cl::Buffer>(1, B_buffer); // corresponds to __global float *B
   kernel.setArg<cl::Buffer>(2, C_buffer); // corresponds to __global float *C
   cl_float scale = 0.75;                  // corresponds to float scale
   kernel.setArg<cl_float>(3, scale);
   ```
6. After launching the kernel (explained below), copy the result back from the device to the host:

   ```cpp
   // enqueueReadBuffer is assumed to be the read counterpart of enqueueWriteBuffer.
   gpu.enqueueReadBuffer<cl_float>(C_buffer, CL_TRUE, 0, width*height, C);
   ```
You can launch a kernel by calling the method `enqueueNDRangeKernel` on the `jc::GPU` object. This method returns the number of nanoseconds it took to run the kernel on the device and takes the following arguments:

- The `cl::Kernel` object you created earlier.
- A `cl::NDRange` object specifying the global work offset, which will always be `cl::NullRange` in our case.
- A `cl::NDRange` object that specifies the global work-item configuration, e.g. assume we want to run 4096 by 4096 work-items. The first argument specifies the number of work-items in the X-direction, the second in the Y-direction, etc.:

  ```cpp
  cl::NDRange global(4096, 4096);
  ```

- A `cl::NDRange` object that specifies the work-group configuration, e.g. assume we want our work-groups to contain 16 by 16 work-items. It is important to note that both global and local ranges need to have the same number of dimensions and, furthermore, that the sizes of the work-groups need to be whole divisors of the corresponding sizes of the global work-item configuration:

  ```cpp
  cl::NDRange local(16, 16);
  ```

Our kernel can then be launched as follows:

```cpp
cl_ulong t = gpu.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);
```
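Putting these steps together, a minimal host program could look as shown below. This is only a sketch: the header name `jc.hpp`, the device name, and the `enqueueReadBuffer` method are assumptions based on the API used above, and all error handling is omitted.

```cpp
#include <CL/cl.hpp>
#include <iostream>
#include <vector>
#include "jc.hpp" // assumed name of the header that provides jc::GPU

int main()
{
    const unsigned int width = 4096, height = 4096;
    std::vector<cl_float> A(width*height, 1.0f); // input matrix A
    std::vector<cl_float> B(width*height, 2.0f); // input matrix B
    std::vector<cl_float> C(width*height);       // output matrix C

    // Step 1: create the jc::GPU object (device and file names are examples).
    jc::GPU gpu("GeForce", "matrix_sum.cl");

    // Step 2: create a buffer for each matrix.
    cl::Buffer A_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
    cl::Buffer B_buffer = gpu.createBuffer<cl_float>(CL_MEM_READ_ONLY, width*height);
    cl::Buffer C_buffer = gpu.createBuffer<cl_float>(CL_MEM_WRITE_ONLY, width*height);

    // Step 3: copy the input matrices to the device.
    gpu.enqueueWriteBuffer<cl_float>(A_buffer, CL_TRUE, 0, width*height, &A[0]);
    gpu.enqueueWriteBuffer<cl_float>(B_buffer, CL_TRUE, 0, width*height, &B[0]);

    // Steps 4 and 5: create the kernel and set its arguments.
    cl::Kernel kernel = gpu.createKernel("matrix_sum");
    kernel.setArg<cl::Buffer>(0, A_buffer);
    kernel.setArg<cl::Buffer>(1, B_buffer);
    kernel.setArg<cl::Buffer>(2, C_buffer);
    cl_float scale = 0.75;
    kernel.setArg<cl_float>(3, scale);

    // Launch one work-item per matrix element, in work-groups of 16 by 16.
    cl::NDRange global(width, height);
    cl::NDRange local(16, 16);
    cl_ulong t = gpu.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);

    // Step 6: copy the result back (enqueueReadBuffer is assumed to be the
    // read counterpart of enqueueWriteBuffer).
    gpu.enqueueReadBuffer<cl_float>(C_buffer, CL_TRUE, 0, width*height, &C[0]);

    std::cout << "kernel took " << t << " ns" << std::endl;
    return 0;
}
```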
We define the occupancy of a kernel as the number of work-items that are concurrently active on a compute unit when this kernel is run, relative to the maximum number of work-items that can be active concurrently on a compute unit.
The occupancy is determined by a number of limited resources:

- the number of registers available on a compute unit;
- the amount of local memory available on a compute unit;
- the maximum number of work-items that can be active concurrently on a compute unit;
- the maximum number of work-groups that can be active concurrently on a compute unit.

The third resource represents the maximum number of work-items that may be active on a compute unit. On the devices you are using this number will probably be 1024. Because only 8 work-groups can be active at the same time on a compute unit, we can only attain maximum occupancy for work-groups of at least 128 work-items (8 × 128 = 1024); with work-groups of 64 work-items, for instance, at most 8 × 64 = 512 work-items can be active, i.e. an occupancy of 50%.
The occupancy of a kernel can be fixed (deliberately limited) by reserving local memory for each work-group in the following way:

```c
__kernel void matrix_sum(__global float *A, __global float *B,
                         __global float *C, float scale, __local float *dummy)
```

with the corresponding argument set on the host as:

```cpp
kernel.setArg<cl::LocalSpaceArg>(4, cl::__local(local_size));
```
Note that `local_size` corresponds to the number of bytes you want to reserve in local memory. By varying `local_size` you can vary the occupancy. The following table shows some typical occupancy figures for different values of `local_size` and the work-group size. Note that in this case it is assumed that the local memory available on one compute unit is 16KB and that no other factors are limiting the occupancy:
| local_size (bytes) | work-group size | occupancy (%) |
|---|---|---|
| 4000 | 256 | 100 |
| 5000 | 256 | 75 |
| 8000 | 256 | 50 |
| 16000 | 256 | 25 |
| 2000 | 128 | 100 |
| 2300 | 128 | 87.5 |
| 2600 | 128 | 75 |
| 3000 | 128 | 62.5 |
| 4000 | 128 | 50 |
| 5000 | 128 | 37.5 |
| 8000 | 128 | 25 |
| 16000 | 128 | 12.5 |
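The figures in this table follow from simple arithmetic. Here is a small sketch that reproduces them, assuming the limits mentioned above (16KB of local memory, at most 1024 work-items and 8 work-groups per compute unit) and no other limiting factors:

```cpp
#include <algorithm>
#include <cstdio>

// Occupancy (in percent) of a kernel that reserves local_size bytes of local
// memory per work-group, under the resource limits assumed in the text.
double occupancy(unsigned local_size, unsigned work_group_size)
{
    const unsigned local_mem  = 16 * 1024; // bytes of local memory per compute unit
    const unsigned max_items  = 1024;      // max active work-items per compute unit
    const unsigned max_groups = 8;         // max active work-groups per compute unit

    // The number of active work-groups is bounded by each resource in turn.
    unsigned groups = local_mem / local_size;
    groups = std::min(groups, max_groups);
    groups = std::min(groups, max_items / work_group_size);

    return 100.0 * groups * work_group_size / max_items;
}

int main()
{
    std::printf("%.1f\n", occupancy(5000, 256)); // prints 75.0
    std::printf("%.1f\n", occupancy(2300, 128)); // prints 87.5
    return 0;
}
```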