One important, though surprisingly uncommon, feature of macro systems is that of hygiene. I mentioned in a previous post that I would eventually say something about hygiene. It turns out macro hygiene is somewhat tricky to define precisely, and I know a couple of people who are actively working on a formal definition of hygiene. The intuition behind hygiene isn’t too bad though. Basically, we want our macros to not break our code. So how can macros break code?
Some Simple GPU Optimizations
One of the goals of designing a high level GPU programming language is to allow the compiler to perform optimizations on your code. One optimization we’ve been doing for a while in Harlan is one I’ve been calling “kernel fusion.” This is a pretty obvious transformation to do, and many other GPU languages do it. However, kernel fusion comes in several different variants that I’d like to discuss.
Using Scheme With Travis CI
Early on in the development of the Harlan compiler, my collaborators and I realized we were spending a lot of time writing compilers that translate Scheme-like languages into C or C++. A lot of this code should be common between projects, so we decided to factor some of this code into the Elegant Weapons project. Elegant Weapons even had a trivial test suite. Unfortunately, because the primary consumer of Elegant Weapons was Harlan, the design was still far to specific to Harlan. As we realized when Nada Amin submitted a fix for the Elegant Weapons tests, we weren’t even running our own tests anymore. Clearly we needed to do something better if Elegant Weapons were truly going to be a project worthy of existing on its own.
Lately I’ve seen a lot of GitHub repositories that include an image like this in their Readme:
It turns out this is from a free service called Travis CI, which makes it easy to integrate continuous integration with GitHub projects. Basically, this means you can define a set of tests that run every time you push a commit to GitHub, and you’ll be notified if something breaks.
Scheme was not one of the languages with first class support, but fortunately we can make it work. Here’s how.
Some Picky Presentation Tips
I just spent the last week at IPDPS in Boston. It was a good time. I got to meet a few new people, and connect with a lot of friends who are now living in the Boston area. I also presented our work on Rust for the GPU at HIPS. In the course of watching a lot of presentations, I came up with a few tips. I admit I did not follow all of these in my own presentation, but hopefully all of us can learn from these.
Data Parallel Operators
In my previous post, we discussed some of the data structures that support data parallel programming. Now we’ll turn our attention to the common operators that manipulate these data structures. I’ll discuss several of them: map, reduce, scan, permute, back-permute and filter.
Data Parallel Data Structures
Data parallelism is a style of programming where essentially the same operation is applied to a large collection of values. This style became popular during the 80s and early 90s as a convenient way of programming large vector processors. Data parallelism has remained popular, especially in light of the rise of GPGPU programming. Often, data parallel programming is used for fine-grained parallelism, but it works at larger granularity too. For example, MapReduce is a restricted example of data parallelism.
Systems that support data parallelism typically provide a handful of data structures that can be manipulated with a set of parallel operators. These data structures include vectors, segmented vectors, and arrays. In this post, we’ll take a look at these different structures and in a later post we’ll discuss some of the parallel operators that manipulate them.
Beware the Logarithms
Logarithms are great. They let you talk about incredibly wide ranges of numbers, and they transform multiplication into addition. Algorithms with logarithmic running times are so fast they might as well be constant time algorithms. Logarithmic scales on graphs can also make your results look much better. Let’s see how.
Patterns With Ellipses
Last time, we talked about matching patterns in Scheme. Now we will look at how to extend the pattern matcher and template instantiation code to handle patterns with ellipses.
Matching Patterns With Scheme
A while back, I wrote a
post
about macros in Scheme. Today I want to take a look at how one might
begin to implement a macro system. In Scheme, whether you use
syntax-rules or syntax-case do write your macros, at some point
you’ll write patterns and templates. Macros match their input against
a pattern and then use this to instantiate a template. Let’s consider
a two-way or macro:
1 2 3 4 5 | |
We can see how this works at the REPL by using expand:
1 2 | |
This matches the expression (or2 1 2) against the pattern (_ e1
e2). The underscore means we don’t care what’s in that position (it’s
always or2, since this is part of the or2 macro definition). The
names e1 and e2 allow us to refer to what occurs in this pattern
in the template. When instantiating the template, the macro system
replaces all references to e1 with the chunk of syntax from that
position (1 in this case), and likewise with e2. The result is the
expression, (let ([t.8 1]) (if t.8 t.8 2)). Here we let-bind the
first expression because we do not want to accidentally evaluate any
side effects it might contain twice. Don’t worry too much about why
the variable name is t.8 instead of just t. This has to do with
something called hygiene, which I will hopefully say more on in a
later post.
Now let’s take a look at how we would write a function to match
patterns like this. We’ll start with a procedure called match that
takes four arguments: the pattern to match against, the expression to
match, a success continuation, and a failure continuation. The success
continuation is called if the expression matches the pattern, and the
failure continuation is called otherwise. When I first started writing
this, it seemed like it would be cleaner to use success and failure
continuations, although I’m not entirely sure that’s true anymore. At
the moment, we will support three things for patterns:
- Pairs – A pair pattern matches an expression if and only if the expression is a pair, and the car of the pattern matches the car of the expression and the cdr of the pattern matches the cdr of the expression.
- Symbols – A symbol pattern binds a pattern variable. This matches any expression, and binds that expression to the variable.
- Null – A null pattern matches an expression if and only if the expression is null.
Scheme’s pattern matching supports several other features, like auxiliary keywords, which we’re ignoring for the now for the sake of simplicity. Using our three kinds of patterns so far, we can translate this pretty directly into a Scheme program:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | |
Notice that we pass a list of bindings to the success continuation. These will be used by the template instantiation code. Let’s try an example:
1 2 | |
For the most part, this does what we want. The matcher says the
expression matches the pattern, and tells us that e1 was bound to
1 and e2 was bound to 2. However, we also bound _ to or2. In
reality, we want _ to mean ignore, which we can do by adding one
line before the symbol? line in our cond clause:
1 2 | |
Now we get only the bindings we care about.
The next part is to use the resulting bindings to instantiate the
template. We’ll do this with a function called instantiate, which
takes a template (called p), and a list of bindings. This is a
straightforward tree walk that tries to replace symbols with
expressions when they are bound:
1 2 3 4 5 6 7 | |
If we did this right, we should be able to instantiate the template
from the or2 macro above.
1 2 3 4 5 6 | |
It looks like we got what we wanted, although without hygiene.
At this point we have a pattern matcher and template instantiation function that is suitable for making a very simple macro system. One very nice feature that we are missing at the moment is ellipsis patterns, which allow us to match repeated patterns. For example, we could use this to write a custom let macro like this:
1 2 3 4 | |
In a post at some point in the hopefully near future, I will show how
to handle patterns with ... in them.
Access Patterns Matter, Part 2
A couple of readers pointed out some improvements and corrections to my last post on GPU access patterns. These were pretty significant, so I thought it’d be worth doing a follow up post to see how the change things.
First of all, I meant to operate on both arrays, A and B, but
through some sloppy coding I ended up only using A. Incidentally, I
did some back-of-the-envelope calculations to figure out the memory
bandwidth I was getting, and I was surprised to see that I was getting
close to twice the theoretical peak for the cards I was working
with. It looks like it’s because I was only reading half the data I
thought I was. Here are the corrected figures (the experiment is the
same other than the small corrections to my code):
| Kernel | Tesla C1060 | GeForce GTX 460 | ATI Radeon HD 6750M |
| MyAdd | 2.764 ms | 4.524 ms | 36.325 ms |
| MyAdd_2D | 10.560 ms | 0.763 ms | 4.273 ms |
| MyAdd_2D_unweave | 0.740 ms | 0.100 ms | 2.170 ms |
| MyAdd_col | 2.777 ms | 4.527 ms | 26.686 ms |
| MyAdd_2D_col | 10.391 ms | 0.961 ms | 7.723 ms |
| MyAdd_2D_unweave_col | 12.398 ms | 0.708 ms | 3.413 ms |
We’re slower across the board, but the overall shape of the data is about the same. Interestingly, the fastest kernels are not much slower than the fastest kernels from before.
Next, reddit user ser999 pointed out that we could forego the branch entirely by doing some clever arithmetic. Instead of doing
1 2 3 4 | |
we could instead do this:
1
| |
This isn’t exactly as clear as the code we had before, but perhaps a sufficiently smart compiler could perform this optimization so the programmer could still write the nicer code. To be fair, my thread divergence optimization wasn’t great for code readability either. The important this is, how does this “no branching” version perform? The table below shows the performance along with the “unweaved” version from before for comparison.
| Kernel | Tesla C1060 | GeForce GTX 460 | ATI Radeon HD 6750M |
| MyAdd_2D_unweave | 0.740 ms | 0.100 ms | 2.170 ms |
| MyAdd_2D_nobranch | 9.969 ms | 0.731 ms | 3.381 ms |
| MyAdd_2D_unweave_col | 12.398 ms | 0.708 ms | 3.413 ms |
| MyAdd_2D_col_nobranch | 9.982 ms | 0.729 ms | 3.465 ms |
For row-wise access, the “unweave” variant always wins. For column-wise access, the “nobranch” version wins on the C1060, while the GTX 460 and the ATI card do better with the “unweave” variant. In both the ATI and GTX 460 column-wise case, however, the two perform basically the same.
So why is this? Branches are often pretty expensive, especially when they create thread divergence. However, by removing the branch we always have to do a multiplication, and we also have to convert an integer value into a floating point value. Multiplication in particular is a fairly expensive operation. In the case, the branch isn’t so bad.
As before, the code from this post is available at https://github.com/eholk/bench-thread-diverge.
