GPU Performance Analysis

1 Introduction

This text is roughly based on a number of webinars from NVIDIA. It discusses how you can analyze, understand, and improve the performance of your GPU kernels. Although the webinars talk about CUDA, it should be straightforward to translate the concepts to OpenCL.

The first section explains how you can find out what is limiting the performance of your kernel. The subsequent sections discuss instruction-limited and bandwidth-limited kernels. Finally, there is a section on register spilling.

2 Identifying Performance Limiters

Typically, the performance of a GPU kernel is limited by one of the following factors: memory bandwidth, instruction throughput, or latency.

There exist three ways to determine which of those limits the performance of your kernel. We will discuss each of them in a separate subsection. First, we define what we mean by a memory-bound or instruction-bound kernel.

For a GPU with memory bandwidth M (bytes per second) and instruction bandwidth I (instructions per second), all kernels with an instruction:byte ratio lower than I/M are memory-bound, while all kernels with a higher ratio are instruction-bound. To see why: a kernel that executes i instructions and moves m bytes needs i/I seconds of compute time and m/M seconds of memory time, and the memory time dominates exactly when i/m < I/M.

2.1 Algorithmic

The first way to determine what is holding the performance of your kernel back is to count the number of instructions and memory accesses yourself. Let i be the number of instructions executed and m the number of bytes read from or written to memory; then i/m is the instruction:byte ratio, which you compare against the I/M threshold defined above.
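As a rough, hypothetical illustration, consider a simple vector-add kernel:

__global__ void vecAdd(float *out, const float *A, const float *B)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  out[idx] = A[idx] + B[idx];  // 2 loads + 1 store = 12 bytes per thread
}

Each thread moves 12 bytes while executing only a handful of instructions (index arithmetic, one addition, and the memory operations), so its instruction:byte ratio is well below the I/M threshold of current GPUs and the kernel is memory-bound.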

The disadvantage of this method is that you will typically count fewer instructions or memory accesses than are actually executed. To alleviate this problem you can check the generated assembly instructions. For example, you can compile CUDA code to PTX, an intermediate assembly language for NVIDIA GPUs, by invoking nvcc with the -ptx option. AMD's kernel analyzer allows you to do the same for different AMD cards via a user-friendly GUI.
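For example, assuming a hypothetical source file kernel.cu:

nvcc -ptx kernel.cu -o kernel.ptx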

2.2 Profiler

You can use a profiler to determine the instruction:byte ratio. In this case the ratio is based on profiler-collected memory and instruction counters. Note, however, that this method does not account for overlap between memory transfers and computation. Furthermore, the counters needed to determine the ratio might not be available on your GPU.

The counters we concern ourselves with are instructions_issued for the instructions and dram_reads and dram_writes for memory accesses. Note that the first counter is incremented by 1 per warp (32 threads), while the DRAM counters are incremented by 1 per 32-byte access. Each counter should therefore be scaled by 32, but since the same factor appears in the numerator and the denominator of the ratio, you can just as well omit it:

instruction:byte ratio = (32 * instructions_issued) / (32 * (dram_reads + dram_writes))
                       = instructions_issued / (dram_reads + dram_writes)

You can also look at the instruction and memory throughputs reported by the profiler, in particular IPC (instructions per clock) for instructions and achieved GB/s for memory. Compare the reported values with the theoretical maxima of your device; whichever value is close to its maximum points to the limiting factor.

2.3 Code Modification

Finally, the most interesting method is to change the source code of your kernel so that it becomes either a memory-only or a compute-only kernel. This method not only helps you determine whether your kernel is memory-bound or instruction-bound, it also shows how well memory operations are overlapped with computation. Unfortunately, this method cannot be applied to every kernel.

A number of things need to be taken into account when you modify your code. The following paragraphs discuss how to do this correctly for each type of modification.

Memory-only

You should remove as much arithmetic as possible without changing the memory access pattern. Use the profiler to verify that the load/store counts are the same as in the original kernel.
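As a minimal sketch, assume a hypothetical kernel heavy whose arithmetic dominates; its memory-only variant keeps the loads and stores but drops the computation:

__global__ void heavy(float *out, const float *in, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    float v = in[idx];
    for (int i = 0; i < 64; ++i)  // arithmetic-heavy part
      v = v * v + 0.5f;
    out[idx] = v;
  }
}

__global__ void heavy_memonly(float *out, const float *in, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n)
    out[idx] = in[idx];           // same loads and stores, no arithmetic
}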

Store-only

In this case you should remove the loads as well, keeping only the stores.
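Continuing the hypothetical sketch above, the store-only variant becomes:

__global__ void heavy_storeonly(float *out, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n)
    out[idx] = 1.0f;              // same store pattern, no loads
}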

Compute-only

You should remove the global memory accesses. In this case, however, you need to trick the compiler, because it eliminates all code that it detects as not contributing to stores. A trick is to put the stores inside a conditional that never evaluates to true at run time (e.g., by passing flag = 0) but that the compiler cannot evaluate statically. A typical way to do this is as follows:

__global__ void add(float *output, float *A, float *B, int flag)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  float value = A[idx] + B[idx];
  // the conditional must depend on value; otherwise the compiler
  // could move the computation into the then-branch
  if (1 == value * flag)
    output[idx] = value;
}

Another possible problem is that the occupancy changes when you modify your kernel, for example because the modified version uses fewer registers. To avoid this you can pass extra dynamic shared memory in the kernel invocation; choose the amount of shared memory such that you achieve the same occupancy as before the modification.

kernel<<<grid, block, smem, ...>>>(...)
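For example, if the modified kernel would otherwise run at a higher occupancy, an extra (hypothetical) 4 KB of dynamic shared memory per block can bring it back down to the original level:

kernel<<<grid, block, 4096>>>(...);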

3 Instruction Limited Kernels

The following factors could be limiting for instruction-limited kernels: the instruction mix, serialization, and insufficient latency hiding.

It is important to differentiate between instructions that are executed and instructions that are issued. A difference between the two indicates that instructions have been serialized and reissued, for example because of divergence or shared memory bank conflicts. You should find out why and try to minimize this effect.

Some optimizations are proposed for each type of problem. They are discussed in the following subsections.

3.1 Instruction Mix

3.2 Serialization Issues

Warp divergence can be measured with the divergent_branch and branch counters. Because these counters only count branch instructions, it is better to look at the thread_instructions_executed counter: if there is no divergence, then thread_instructions_executed = 32 * instructions_executed; if there is divergence, the left-hand side is smaller than the right-hand side.
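As a hypothetical illustration of divergence:

__global__ void divergent(float *out, const float *in)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // even and odd threads of the same warp take different paths,
  // so the warp executes both branches one after the other
  if (threadIdx.x % 2 == 0)
    out[idx] = in[idx] * 2.0f;
  else
    out[idx] = in[idx] + 1.0f;
}

A condition such as (threadIdx.x & 32) would instead be uniform within each warp (for one-dimensional blocks), so no divergence occurs.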

Shared memory bank conflicts can be measured using the l1_shared_bank_conflict, shared_load, and shared_store counters. Bank conflicts are significant for an instruction-limited kernel when l1_shared_bank_conflict is large compared to instructions_issued.
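A minimal sketch of the classic padding trick, assuming a Fermi-class GPU with 32 banks of 4-byte words:

__global__ void bankConflictDemo(float *out)
{
  // with a row width of 32 floats, tile[threadIdx.x][0] would map all
  // 32 threads of a warp to bank 0 (a 32-way conflict); padding each
  // row to 33 floats spreads the accesses over all banks
  __shared__ float tile[32][33];

  tile[threadIdx.x][0] = (float)threadIdx.x;
  __syncthreads();
  out[threadIdx.x] = tile[0][threadIdx.x];  // row access, conflict-free
}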

The proposed optimizations roughly correspond to those presented in our MSc thesis.

3.3 Latency Hiding Issues

There might be insufficient latency hiding because there are too few concurrent threads per SM. This can be due to poor occupancy or to a small overall number of threads in your execution configuration. Typical advice to alleviate the problem is to increase the occupancy (e.g., by adjusting the block size or reducing register and shared memory usage), to launch more threads overall, or to increase instruction-level parallelism by processing several independent elements per thread.

4 Bandwidth Limited Kernels

A number of factors will determine the bandwidth achieved by your kernel. The following subsections discuss the most important ones.

4.1 Launch Configuration

The exact size of your workgroups and the number of elements that are mapped onto one work item can have an enormous impact on the resulting memory throughput. For example, processing more than one element per work item can lead to a higher memory throughput.
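As a sketch of this idea, a (hypothetical) copy kernel using a grid-stride loop, so that each work item processes several elements:

__global__ void copyMulti(float *out, const float *in, int n)
{
  // each thread copies every (gridDim.x * blockDim.x)-th element,
  // amortizing the index arithmetic over more memory traffic
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    out[i] = in[i];
}

Launching this kernel with fewer blocks than n / blockDim.x makes every thread handle more than one element.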

4.2 Memory Access Patterns

The exact pattern of your memory accesses strongly influences the memory throughput. For example, on NVIDIA GPUs memory access is optimal when data is read or written in a coalesced fashion, i.e., when the threads of a warp access consecutive addresses.
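To make this concrete, a hedged sketch of a coalesced and a strided access pattern:

__global__ void coalesced(float *out, const float *in)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // consecutive threads read consecutive words: few transactions per warp
  out[idx] = in[idx];
}

__global__ void strided(float *out, const float *in, int stride)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // stride > 1 scatters the warp's reads over many memory segments
  out[idx] = in[idx * stride];
}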

Global Memory Operations

Note that this discussion only concerns NVIDIA GPUs with the newer Fermi architecture. The GPUs in the parallel systems lab are unfortunately older.

Loads are either caching (the default) or non-caching (selected through a compiler flag). For the first the load granularity is a 128-byte line; for the second it is a 32-byte segment.
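On Fermi the non-caching mode is selected with a PTX assembler option passed through nvcc (kernel.cu is a hypothetical file name):

nvcc -Xptxas -dlcm=cg kernel.cu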

As a consequence, the impact of misaligned memory access is greater for caching loads than for non-caching loads. In the first case the bus utilization drops to 50 % (a misaligned warp touches two 128-byte lines, so 256 bytes are fetched for the 128 bytes needed), while in the second case it only drops to 80 % (at most five 32-byte segments, or 160 bytes, are fetched).

Likewise, if all threads of a warp access bytes within the same 4-byte word, the bus utilization drops to 3.125 % for caching loads (4 useful bytes out of a 128-byte line) and only to 12.5 % for non-caching loads (4 useful bytes out of a 32-byte segment).

When a warp reads 32 scattered 4-byte words that fall within N segments, the bus utilization becomes 128 / (N * 128) = 1/N for caching loads and 128 / (N * 32) = 4/N for non-caching loads (the warp needs 128 bytes, while N segments of 128 or 32 bytes are fetched).

On-Chip Memory

Should I use explicit shared memory or implicit caching? On Fermi the 64 KB of on-chip memory per SM can be split between shared memory and L1 cache as either 16 KB / 48 KB or 48 KB / 16 KB.
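In CUDA the split can be selected per kernel with cudaFuncSetCacheConfig (myKernel is a hypothetical kernel name):

// prefer 48 KB shared memory / 16 KB L1 for this kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

// or prefer 48 KB L1 / 16 KB shared memory instead
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);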

Additional Memory

It can still be useful to use texture and constant memory, for example for read-only data with 2D locality (texture memory) or for values broadcast to all threads (constant memory).

5 Register Spilling

Local memory refers to memory that is private to a thread but stored in global memory. The only differences from global memory are that addressing is resolved by the compiler and that stores are cached in L1.

Local memory is used for register spilling and for arrays declared inside a kernel. Register spilling is not necessarily a bad thing, since spilled loads and stores may hit in the L1 cache.

There are a number of ways to check local memory usage: the compiler can report it (nvcc -Xptxas -v prints the number of registers and the amount of local memory, lmem, used per kernel, as in the example below), and the profiler exposes counters for local memory loads and stores.
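For example (kernel.cu is a hypothetical file name):

nvcc -Xptxas -v kernel.cu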

There are a number of optimizations you can apply when register spilling is problematic: trade occupancy against spilling by adjusting the per-thread register limit (see the sketch below), prefer the 48 KB L1 configuration so that more spilled accesses hit the cache, and use non-caching loads for global memory so that global traffic does not evict spilled data from L1.
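As a hedged sketch of the register-limit knobs (myKernel is a hypothetical kernel name):

// ask the compiler to support 256 threads per block with at least
// 4 resident blocks per SM, which bounds the registers it may use
__global__ void __launch_bounds__(256, 4) myKernel(float *out, const float *in)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  out[idx] = in[idx];
}

The same limit can also be imposed per compilation unit with nvcc -maxrregcount=32.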