One of the oft-cited difficulties of GPU programming is dealing with memory layout and access patterns. To achieve maximum memory bandwidth, it is important to structure your application so that different threads do not access the same bank of memory at the same time. In other words, you need to avoid bank conflicts.

I was recently given a test question that presented a simple CUDA kernel and asked me to rewrite it to increase parallelism and to minimize thread divergence. Translated to OpenCL, the kernel looks like this:

#define get(A, N, i, j) ((A)[((i) * (N)) + (j)])
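// get(A, N, i, j) expands to A[i * N + j]: element (i, j) of a row-major matrix with N columns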

__kernel void MyAdd(const double __global *A, const double __global *B, double __global *C, unsigned int N) {
    unsigned int i = get_global_id(0);

    for(unsigned int j = 0; j < N; j++) {
        if (i % 2 == 0)
            get(C, N, i, j) = get(A, N, i, j) + get(A, N, i, j);
        else
            get(C, N, i, j) = get(A, N, i, j) - get(A, N, i, j);
    }
}

Let's see how we can transform this program, and more importantly, let's see how different variations perform.

First, let's try to increase parallelism. This kernel operates over a two-dimensional array: each thread processes one row, looping over every column in that row. The even-numbered threads add the corresponding elements, while the odd-numbered threads subtract them. There are no dependences between any iterations of these loops, so it is straightforward to parallelize them by using a two-dimensional kernel:

EDIT: I had intended this code to operate on elements of both A and B, but due to what was probably sloppy copy-and-pasting on my part, the code only uses values from A. Thanks to reader Gasche for pointing this out!

__kernel void MyAdd_2D(const double __global *A, const double __global *B, double __global *C, unsigned int N) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);

    if (i % 2 == 0)
        get(C, N, i, j) = get(A, N, i, j) + get(A, N, i, j);
    else
        get(C, N, i, j) = get(A, N, i, j) - get(A, N, i, j);
}

Obviously, you'd need to adjust the host code to issue a 2D kernel instead; a sketch of that change follows below. Now each thread processes exactly one element of the arrays, which is exactly the kind of parallelism GPUs are well suited for. I was able to test this on three different GPUs, and for the most part it yielded a substantial performance improvement. The table below shows the time to execute the kernel for a 1024×1024 matrix.
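
The original host code for this post is written in Rust with the rust-opencl bindings (linked at the end), so the actual call looks different there; in terms of the standard OpenCL C API, though, the change amounts to switching from a one-dimensional to a two-dimensional NDRange. This is only a sketch: queue, kernel_1d, kernel_2d, and N are placeholders that would come from the usual platform, context, and program setup.

// Launching the original kernel: a 1D NDRange with one work-item per row.
size_t global_1d[1] = { N };
clEnqueueNDRangeKernel(queue, kernel_1d, 1, NULL, global_1d, NULL, 0, NULL, NULL);

// Launching MyAdd_2D: an N x N NDRange with one work-item per matrix element.
// Passing NULL for the local work size lets the runtime choose the work-group shape.
size_t global_2d[2] = { N, N };
clEnqueueNDRangeKernel(queue, kernel_2d, 2, NULL, global_2d, NULL, 0, NULL, NULL);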

Kernel                 Tesla C1060   GeForce GTX 460   ATI Radeon HD 6750M
MyAdd                     1.848 ms          2.196 ms             21.776 ms
MyAdd_2D                  7.130 ms          0.520 ms              2.238 ms

The two consumer-grade GPUs give between roughly a 4x and a 10x improvement. Oddly, the professional-grade Tesla C1060 is nearly 4x slower. This is our first example of another common frustration in optimizing GPU code: architectures vary wildly and behave very differently.

Now let's see if we can do something about the thread divergence. Right now every other thread takes a different branch. GPUs work best when threads can execute in lock step, so the current situation is pretty bad. What if instead we had the first \(\frac{N}{2}\) threads handle the even rows and the rest handle the odd rows? If we did this, our code would look like so:

__kernel void MyAdd_2D_unweave(const double __global *A, const double __global *B, double __global *C, unsigned int N) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);

    if(i < N / 2) {
        i = 2 * i;
        get(C, N, i, j) = get(A, N, i, j) + get(A, N, i, j);
    }
    else {
        i = 2 * (i - N / 2) + 1;
        get(C, N, i, j) = get(A, N, i, j) - get(A, N, i, j);
    }
}

It's mostly the same, except now we have some extra index arithmetic to map threads onto rows in a different order. For example, with N = 8, threads 0 through 3 handle rows 0, 2, 4, and 6, while threads 4 through 7 handle rows 1, 3, 5, and 7, so neighboring threads in each half take the same branch. The table below shows the performance for this version.

Kernel                 Tesla C1060   GeForce GTX 460   ATI Radeon HD 6750M
MyAdd_2D_unweave          0.406 ms          0.092 ms              1.920 ms

All of the GPUs speed up on this version, but the two NVIDIA processors show especially large gains. This fits well with the way NVIDIA GPUs divide a thread block into warps of 32 threads, which execute in lock step. The fact that we didn't see as extreme a performance improvement on the ATI GPU suggests that it may not execute threads in warps the same way that NVIDIA GPUs do.
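
As a side note, the warp or wavefront width is not visible in the kernel source itself. If you want to check what a particular device uses, OpenCL 1.1 exposes a hint through clGetKernelWorkGroupInfo. The sketch below assumes kernel and device handles from the usual setup code; on NVIDIA hardware the reported multiple is typically the warp size (32), while AMD devices typically report their wavefront size.

// Query the preferred work-group size multiple for this kernel/device pair.
// On NVIDIA this typically matches the warp size; on AMD, the wavefront size.
size_t multiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(multiple), &multiple, NULL);
printf("preferred work-group size multiple: %zu\n", multiple);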

Overall, these two transformations greatly improved performance. Even though we initially took a hit on the high-end C1060, after reducing branch divergence we end up with about a 4x improvement over our original code, and the GTX 460 showed a more than 20x improvement! One surprising thing is that the fairly inexpensive GTX 460 ended up being about four times faster than the C1060. This is a testament to how quickly GPUs are improving. I suspect the main issue here is memory bandwidth: this benchmark does very little computation, so we should once again be memory limited. The GTX 460 has GDDR5 memory, while the C1060 only has GDDR3. According to Wikipedia's handy list of device bandwidths, GDDR5 should be about 2.4x faster than GDDR3, which would account for much of the performance difference.

Column-wise access

At first, I didn't expect the thread divergence transformation to improve performance significantly. The reason is that GPUs read memory a block at a time (similar to how CPUs read in entire cache lines at once). Before, each block was read by only one warp, whereas afterwards two warps had to read each block. In the absence of a cache big enough to hold the whole matrices, this means we need to read twice as much data.

To test this hypothesis, I wrote versions that alternate on the j dimension instead of the i dimension. To save space here, I've put all three of the column-wise kernels in a gist, but a sketch of the 2D variant appears below to give the flavor. The performance for all three variants is given in the table that follows.
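
This sketch is my reconstruction rather than a copy from the gist: it is the same as MyAdd_2D except that the even/odd test is on j rather than i, which is presumably all the column-wise variant amounts to.

// Reconstruction of the 2D column-wise kernel: identical to MyAdd_2D except
// that the branch alternates on the column index j instead of the row index i.
// (Uses the get() indexing macro defined at the top of the post.)
__kernel void MyAdd_2D_col(const double __global *A, const double __global *B, double __global *C, unsigned int N) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);

    if (j % 2 == 0)
        get(C, N, i, j) = get(A, N, i, j) + get(A, N, i, j);
    else
        get(C, N, i, j) = get(A, N, i, j) - get(A, N, i, j);
}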

Kernel                 Tesla C1060   GeForce GTX 460   ATI Radeon HD 6750M
MyAdd_col                 1.847 ms          2.200 ms             20.437 ms
MyAdd_2D_col              6.863 ms          0.712 ms              3.327 ms
MyAdd_2D_unweave_col      8.416 ms          0.478 ms              2.480 ms

My hypothesis holds for the C1060: it actually got slower going from the regular 2D kernel to the 2D kernel that was reordered to reduce branch divergence. In all cases, the original kernel with the sequential inner loop (MyAdd_col) performs about the same as it did in the row-wise version. Among the column-wise kernels, the GTX 460 and the Radeon HD 6750M both improve with less thread divergence, but the GTX 460 does not improve to nearly the same extent as it did in the row-wise version.

Summary

We've looked at several ways of doing very similar computations, and seen that their performance depends greatly on the memory access pattern and amount of thread divergence. Here's all the data we looked at in one convenient table:

Kernel                 Tesla C1060   GeForce GTX 460   ATI Radeon HD 6750M
MyAdd                     1.848 ms          2.196 ms             21.776 ms
MyAdd_2D                  7.130 ms          0.520 ms              2.238 ms
MyAdd_2D_unweave          0.406 ms          0.092 ms              1.920 ms
MyAdd_col                 1.847 ms          2.200 ms             20.437 ms
MyAdd_2D_col              6.863 ms          0.712 ms              3.327 ms
MyAdd_2D_unweave_col      8.416 ms          0.478 ms              2.480 ms

One thing I still have not figured out is why the C1060 takes such a performance hit when going to the 2D kernel, while the GeForce and the Radeon do not show the same trend. In the GeForce case, I suspect it might have something to do with the new thread scheduler in the Fermi series. I am not as familiar with the Radeon architecture, but it shows some intriguing performance characteristics and would be worth studying further.

The code for this post is available at https://github.com/eholk/bench-thread-diverge. It's written in Rust, using the rust-opencl bindings.

UPDATE: I added a follow-up post taking into account several suggestions from readers.