OpenCL

This is a brief recapitulation of a typical OpenCL program. We will use the OpenCL C++ wrapper library and our own jc::GPU wrapper library to explain the basics of OpenCL.

A nice tutorial can be found here. Note that this tutorial uses the original OpenCL API. The original API lets you use OpenCL in a much more flexible way, but it also requires a lot more supporting code. The purpose of this practicum, however, is only to test simple kernels running on a single device.

Program Structure

A program consists of two parts: the host program and the device program (kernel). The host program is responsible for initializing the OpenCL infrastructure, transferring data to and from the device, and launching a kernel on the device; the device program contains the kernel code that is executed on the device.

Device program

The kernel code is typically contained in a text file with extension cl or ocl, and one file may contain the code for many kernels. Although a kernel function looks like a C function, there are some noteworthy differences: it must be marked with the __kernel qualifier and return void, its pointer arguments carry an address space qualifier such as __global or __local, and each invocation identifies the data it should work on through index functions such as get_global_id.

Remember that each work-item is part of a 1-, 2- or 3-dimensional work-group which in turn is part of a 1-, 2- or 3-dimensional group of work-groups.
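
Inside a kernel, the position of a work-item within this hierarchy can be queried with the built-in index functions. The snippet below is a standard OpenCL C illustration (the kernel name and output buffer are not part of the practicum) showing how they relate in dimension 0:

__kernel void show_indices(__global uint *out)
{
    uint lid = get_local_id(0);    // index of this work-item within its work-group
    uint grp = get_group_id(0);    // index of the work-group within the global range
    uint gid = get_global_id(0);   // index of this work-item within the global range

    // For every dimension d: get_global_id(d) == get_group_id(d)*get_local_size(d) + get_local_id(d)
    out[gid] = grp * get_local_size(0) + lid;   // always equal to gid
}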

More information on kernel code can be found in the OpenCL Compiler section of the OpenCL reference pages.

Here is an example of a two-dimensional kernel which takes two matrices as input and returns as output a matrix containing the sum of the two input matrices multiplied by some scaling factor. The data dimensions are assumed to be equal to the global work-item dimensions, i.e. the width is equal to get_global_size(0) and the height to get_global_size(1):

__kernel void matrix_sum(__global float *A, __global float *B, __global float *C, float scale)
{
    unsigned int column = get_global_id(0);    // dimension 0 indexes the columns
    unsigned int row = get_global_id(1);       // dimension 1 indexes the rows
    unsigned int width = get_global_size(0);   // the matrix width equals the global size in dimension 0

    C[row*width + column] = scale*(A[row*width + column] + B[row*width + column]);
}

Host program

The following steps are typical in a host program:

  1. Initialize the OpenCL infrastructure: select a platform and a device, and create a context and a command queue.
  2. Compile the device program and create a kernel object.
  3. Create buffers on the device and copy the input data to the device.
  4. Set the kernel arguments and launch the kernel.
  5. Copy the results back from the device to the host.
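
The sketch below illustrates these steps with the plain OpenCL C++ wrapper. The jc::GPU wrapper used in this practicum hides most of this boilerplate; the file name, matrix size, scale value and 16×16 work-group size are illustrative choices only:

#include <CL/cl.hpp>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

int main()
{
    // 1. Initialize OpenCL: pick a platform and device, create a context and a command queue.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);

    // 2. Read and build the device program, then create the kernel.
    std::ifstream file("matrix_sum.cl");
    std::string source((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    cl::Program program(context,
        cl::Program::Sources(1, std::make_pair(source.c_str(), source.size())));
    program.build(devices);
    cl::Kernel kernel(program, "matrix_sum");

    // 3. Create device buffers and copy the input matrices to the device.
    const size_t width = 1024, height = 1024, n = width * height;
    std::vector<float> A(n, 1.0f), B(n, 2.0f), C(n);
    cl::Buffer bufA(context, CL_MEM_READ_ONLY,  n * sizeof(float));
    cl::Buffer bufB(context, CL_MEM_READ_ONLY,  n * sizeof(float));
    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, n * sizeof(float));
    queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, n * sizeof(float), A.data());
    queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, n * sizeof(float), B.data());

    // 4. Set the kernel arguments and launch the kernel.
    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);
    kernel.setArg(3, 0.5f);                    // scale
    cl::NDRange global(width, height);         // one work-item per matrix element
    cl::NDRange local(16, 16);                 // 256 work-items per work-group
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);

    // 5. Copy the result back to the host.
    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, n * sizeof(float), C.data());
    return 0;
}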

Launching the Kernel

You can launch a kernel by calling the method enqueueNDRangeKernel on the jc::GPU object. This method returns the number of nanoseconds it took to run the kernel on the device and takes the following arguments: the kernel to launch, the offset of the global range (usually cl::NullRange), the global range (the total number of work-items in each dimension) and the local range (the number of work-items per work-group in each dimension).

Our kernel can then be launched as follows:

cl_ulong t = gpu.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);
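
Here global and local are cl::NDRange objects. For the matrix_sum kernel above they could, for example, be set up as follows (the names width and height and the 16×16 work-group size are illustrative, not prescribed):

cl::NDRange global(width, height);   // dimension 0 = width, dimension 1 = height: one work-item per element
cl::NDRange local(16, 16);           // 256 work-items per work-group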

Fixing the Occupancy

We define the occupancy of a kernel as the number of work-items that are concurrently active on a compute unit when this kernel is run, relative to the maximum number of work-items that can be concurrently active on a compute unit.

The occupancy is determined by a number of limited resources:

  1. The amount of local memory per compute unit.
  2. The number of registers per compute unit.
  3. The maximum number of work-items that may be active on one compute unit.
  4. The maximum number of work-groups that may be active on one compute unit.

On the devices you are using, the third limit (the maximum number of work-items that may be active on a compute unit) will probably be 1024. Because at most 8 work-groups can be active at the same time on a compute unit, maximum occupancy can only be attained with work-groups of at least 128 work-items (8 × 128 = 1024).

The occupancy of a kernel can be fixed by reserving local memory for a work-group in the following way:

  1. Add an extra argument to your kernel e.g.:
    __kernel void matrix_sum(__global float* A, __global float *B, __global float *C, float scale, __local float *dummy)
    
  2. Set this extra argument as follows in the host code:
    kernel.setArg<cl::LocalSpaceArg>(4, cl::__local(local_size));
    
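Putting the two steps together, the host code for the extended matrix_sum kernel could look roughly as follows (the buffer names and the chosen value of local_size are illustrative; argument index 4 matches the extra __local parameter added above):

size_t local_size = 5000;   // number of bytes of local memory to reserve per work-group
kernel.setArg(0, bufA);     // bufA, bufB, bufC: the input/output buffers created earlier
kernel.setArg(1, bufB);
kernel.setArg(2, bufC);
kernel.setArg(3, 0.5f);     // scale
kernel.setArg<cl::LocalSpaceArg>(4, cl::__local(local_size));
cl_ulong t = gpu.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);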

Note that local_size corresponds to the number of bytes of local memory you want to reserve per work-group. By varying local_size you can vary the occupancy. The following table shows some typical occupancy figures for different values of local_size and the work-group size. In this case it is assumed that the local memory available on one compute unit is 16 KB and that no other factors limit the occupancy:

local_size (bytes)   work-group size   occupancy (%)
4000                 256               100
5000                 256               75
8000                 256               50
16000                256               25
2000                 128               100
2300                 128               87.5
2600                 128               75
3000                 128               62.5
4000                 128               50
5000                 128               37.5
8000                 128               25
16000                128               12.5
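
The figures in this table can be reproduced with a small back-of-the-envelope calculation. The sketch below assumes 16 KB (16384 bytes) of local memory per compute unit, at most 1024 active work-items and at most 8 active work-groups per compute unit, as stated above, and ignores register usage:

#include <algorithm>
#include <cstdio>

// Occupancy (in percent) of a kernel whose work-groups contain wg_size work-items
// and reserve local_size bytes of local memory, limited only by local memory,
// the work-item limit and the work-group limit.
double occupancy(unsigned local_size, unsigned wg_size)
{
    const unsigned local_mem       = 16384;  // bytes of local memory per compute unit (assumed)
    const unsigned max_work_items  = 1024;   // max active work-items per compute unit (assumed)
    const unsigned max_work_groups = 8;      // max active work-groups per compute unit (assumed)

    unsigned groups = std::min({ local_mem / local_size,     // limited by local memory
                                 max_work_items / wg_size,   // limited by work-items
                                 max_work_groups });         // limited by work-groups
    return 100.0 * groups * wg_size / max_work_items;
}

int main()
{
    std::printf("%.1f\n", occupancy(5000, 256));   // prints 75.0, matching the table
    std::printf("%.1f\n", occupancy(2300, 128));   // prints 87.5
    return 0;
}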