Occupancy

Definitions

The occupancy of a GPU is the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently (Nvidia).

The occupancy of a GPU is the ratio of active wavefronts to the maximum number of possible wavefronts supported by the hardware (AMD).

Although the definitions are slightly different they boil down to the same thing. When running a kernel on a GPU in a certain execution configuration the actual occupancy will be determined by one of the following limited resources.

For both Nvidia and AMD it can be interesting to experimentally determine the maximum number of warps or wavefronts that can be run concurrently on one multiprocessor or compute unit in OpenCL terminology.

In the next section we will give a method to determine the maximum number of work-items that can be active on a single compute unit at the same time.

Determine the maximum number of work-items

Consider the following OpenCL kernel that copies data across the global memory:

__kernel void copy(__global int* src, __global int* dst, __local int* loc)
{
    dst[get_global_id(0)] = src[get_global_id(0)];
}

The interesting part of this kernel is the third argument which carries the attribute __local. By varying the size of this argument we will be able to control the number of work-groups that will be active at the same time. If you use the OpenCL C++ wrapper library, you can specify the size of the __local memory as follows:

kernel.setArg<cl::LocalSpaceArg>(2, cl::__local(local_size))

If you specify the size of the __local memory to be equal to the size of the local memory arena for the device you are testing, only one work-group can be active at the same time on a single compute unit. You should then determine the occupancy of running this kernel using work-groups of a known size. To determine the occupancy you can use NVIDIA Computer Visual Profiler for NVIDIA GPUs or the AMD APP Profiler for AMD GPUs. The maximum number of work-items that can run on a single compute unit, then, is equal to the work-group size divided by the occupancy reported by the profiler.

Note that you can determine the size of the local memory arena for the device by calling clGetDeviceInfo for the parameter CL_DEVICE_LOCAL_MEM_SIZE.

We get the following results for our cards:

GPU local memory size work-group size occupancy maximum
GeForce GTX280 49152 256 16.67 % 1536
Tesla C2050/C2070 16384 256 25 % 1024
Radeon HD6950 32768 256    

Note that we were not able to get correct occupancy results for the AMD Radeon HD6950: the AMD APP Profiler reported the occupancy to be 0! Without further ado we assume that our card can have 1536 work-items in flight on a single compute unit.