Some Simple GPU Optimizations

One of the goals of designing a high level GPU programming language is to allow the compiler to perform optimizations on your code. One optimization we've been doing for a while in Harlan is one I've been calling "kernel fusion." This is a pretty obvious transformation to do, and many other GPU languages do it. However, kernel fusion comes in several different variants that I'd like to discuss.

Kernel Inlining🔗

Let's consider an example program written in Harlan:

(let ((temp* (kernel ((x x*) (y y*))
               (+ (* x x) (* y y)))))
  (kernel ((t temp*))
    (sqrt t)))

I believe this is the first time I've included Harlan code snippets in my blog, so some introduction is in order. Harlan is a high level language for data parallel computing that I've been working on for a while now as part of my Ph.D. research. You can read our initial paper about Harlan here, although it will be immediately apparent from the drastically different syntax that the language has evolved significantly in that time.

The primary construct for parallelism in Harlan is the kernel form. Syntactically, this looks like Scheme's let, in that it takes a list of variable names and expressions to bind the names to, and then a body describing what computation to do on those variables. The difference is that each of the bound expressions must be a vector, and the result of a kernel is another vector. Furthermore, each input vector must be of the same length, and the result of the kernel will have the same length as its inputs. Kernels operate on parallel, with the body being applied to each corresponding set of elements in the inputs. You can think of kernels as a parallel map or zipWith.

In the example above, if we imagine x* and y* as containing the X and Y coordinates of a set of 2D vectors, then the code snippet calculates the magnitude of each of these vectors. As written, however, we have a couple of obvious inefficiencies. This example allocates space for the result of the first kernel, temp*, which could be significant if x* and y* are very long. Secondly, both kernels are tiny--that is, they have low arithmetic intensity. Arithmetic intensity is a measure of how many arithmetic operations we have for each memory reference. If it is low, we won't be using the full computational power of the GPU.

Fortunately, we can solve both of these problems with the first form of fusion, which I feel is better called kernel inlining. With kernel inlining, we would transform our program into the following:

(kernel ((x x*) (y y*))
  (let ((t (+ (* x x) (* y y))))
    (sqrt t)))

In case it's not clear what happened, we took the body of the first kernel and inserted it into the second kernel. In doing this, we also got rid of the parameters to the second kernel and replaced them with the parameters from the first kernel. Now there is only one kernel, the temp* vector is completely gone, and the arithmetic intensity of the newly fused kernel is higher than we had before.

This transformation is usually a win when it's possible. We may not want to do it if, for example, the temp* array is also referenced somewhere else. Also, while kernels with higher arithmetic intensity tend to work better because there is more computation for the GPU to use to hide memory access latency, it can also increase the number of registers needed by each thread and thereby limit how much parallelism is possible.

2D Kernel Fusion🔗

For the next type of fusion, let's consider the following code that would appear in a hypothetical program to render the Mandelbrot set:

(kernel ((i (iota height)))
  (kernel ((j (iota width)))
    (compute-mandelbrot i j)))

Harlan allows nested kernels like this, but Harlan compiles to OpenCL, which does not support nested kernels. Harlan generally deals with this by sequentializing all but one of the kernels. In this case, we could either transform the out or inner kernel into a loop. However, since the kernels form a grid with no dependencies between them, the compiler can fuse these into a single, two dimensional kernel, which OpenCL does support. The Harlan compiler is able to do this for simple cases like this one.

Kernel Concatenation🔗

Let's look at one more form of fusion, although the Harlan compiler does not currently do this form. Consider a program like this:

(let ((a* (kernel ((i inputs))
            (do-something i)))
      (b* (kernel ((i inputs))
            (do-something-else i))))
  (write-outputs a* b*))

Here we have two kernels that execute in sequence with no dependencies between them. Furthermore, they have the same set of inputs. If we pretend for a moment that Harlan has a multiple return value form, like Scheme's values, we could combine these two kernels as follows:

(let-values (((a* b*) (kernel ((i inputs))
                        (let ((a (do-something i))
                              (b (do-something-else j)))
                          (values a b)))))
  (write-outputs a* b*))

In keeping with the general theme of fusion transformations, we now have only one kernel that performs more work instead of two smaller kernels.

Conclusion🔗

We've looked at three different program transformations that could be called fusion. A key point is that they are transformations, not necessarily optimizations. In my experience, these sorts of transformations almost always improve performance, but that need not always be the case. By combining small pieces of code into larger chunks, we reduce the overhead of launching new kernels and also give the GPU more code to use for overlapping memory accesses with computation. On the other hand, larger code could lead to less efficient instruction cache usage, and also lead to more memory accesses if each thread's working set can no longer fit in registers. It's good to be able to do these transformations, but it's important to benchmark and see what works best for each particular program.