One of the goals of designing a high level GPU programming language is to allow the compiler to perform optimizations on your code. One optimization we’ve been doing for a while in Harlan is one I’ve been calling “kernel fusion.” This is a pretty obvious transformation to do, and many other GPU languages do it. However, kernel fusion comes in several different variants that I’d like to discuss.
Kernel Inlining
Let’s consider an example program written in Harlan:
1 2 3 4 

I believe this is the first time I’ve included Harlan code snippets in my blog, so some introduction is in order. Harlan is a high level language for data parallel computing that I’ve been working on for a while now as part of my Ph.D. research. You can read our initial paper about Harlan here, although it will be immediately apparent from the drastically different syntax that the language has evolved significantly in that time.
The primary construct for parallelism in Harlan is the kernel
form. Syntactically, this looks like Scheme’s let
, in that it takes
a list of variable names and expressions to bind the names to, and
then a body describing what computation to do on those variables. The
difference is that each of the bound expressions must be a vector, and
the result of a kernel is another vector. Furthermore, each input
vector must be of the same length, and the result of the kernel will
have the same length as its inputs. Kernels operate on parallel, with
the body being applied to each corresponding set of elements in the
inputs. You can think of kernels as a parallel map or zipWith.
In the example above, if we imagine x*
and y*
as containing the X
and Y coordinates of a set of 2D vectors, then the code snippet
calculates the magnitude of each of these vectors. As written,
however, we have a couple of obvious inefficiencies. This example
allocates space for the result of the first kernel, temp*
, which
could be significant if x*
and y*
are very long. Secondly, both
kernels are tiny—that is, they have low arithmetic
intensity. Arithmetic intensity is a measure of how many arithmetic
operations we have for each memory reference. If it is low, we won’t
be using the full computational power of the GPU.
Fortunately, we can solve both of these problems with the first form of fusion, which I feel is better called kernel inlining. With kernel inlining, we would transform our program into the following:
1 2 3 

In case it’s not clear what happened, we took the body of the first
kernel and inserted it into the second kernel. In doing this, we also
got rid of the parameters to the second kernel and replaced them with
the parameters from the first kernel. Now there is only one kernel,
the temp*
vector is completely gone, and the arithmetic intensity of
the newly fused kernel is higher than we had before.
This transformation is usually a win when it’s possible. We may not
want to do it if, for example, the temp*
array is also referenced
somewhere else. Also, while kernels with higher arithmetic intensity
tend to work better because there is more computation for the GPU to
use to hide memory access latency, it can also increase the number of
registers needed by each thread and thereby limit how much parallelism
is possible.
2D Kernel Fusion
For the next type of fusion, let’s consider the following code that would appear in a hypothetical program to render the Mandelbrot set:
1 2 3 

Harlan allows nested kernels like this, but Harlan compiles to OpenCL, which does not support nested kernels. Harlan generally deals with this by sequentializing all but one of the kernels. In this case, we could either transform the out or inner kernel into a loop. However, since the kernels form a grid with no dependencies between them, the compiler can fuse these into a single, two dimensional kernel, which OpenCL does support. The Harlan compiler is able to do this for simple cases like this one.
Kernel Concatenation
Let’s look at one more form of fusion, although the Harlan compiler does not currently do this form. Consider a program like this:
1 2 3 4 5 

Here we have two kernels that execute in sequence with no dependencies
between them. Furthermore, they have the same set of inputs. If we
pretend for a moment that Harlan has a multiple return value form,
like Scheme’s values
, we could combine these two kernels as follows:
1 2 3 4 5 

In keeping with the general theme of fusion transformations, we now have only one kernel that performs more work instead of two smaller kernels.
Conclusion
We’ve looked at three different program transformations that could be called fusion. A key point is that they are transformations, not necessarily optimizations. In my experience, these sorts of transformations almost always improve performance, but that need not always be the case. By combining small pieces of code into larger chunks, we reduce the overhead of launching new kernels and also give the GPU more code to use for overlapping memory accesses with computation. On the other hand, larger code could lead to less efficient instruction cache usage, and also lead to more memory accesses if each thread’s working set can no longer fit in registers. It’s good to be able to do these transformations, but it’s important to benchmark and see what works best for each particular program.