Eric Holk

You wouldn't like me when I'm not coding.

What Is Macro Hygiene?

One important, though surprisingly uncommon, feature of macro systems is that of hygiene. I mentioned in a previous post that I would eventually say something about hygiene. It turns out macro hygiene is somewhat tricky to define precisely, and I know a couple of people who are actively working on a formal definition of hygiene. The intuition behind hygiene isn’t too bad though. Basically, we want our macros to not break our code. So how can macros break code?

Some Simple GPU Optimizations

One of the goals of designing a high level GPU programming language is to allow the compiler to perform optimizations on your code. One optimization we’ve been doing for a while in Harlan is one I’ve been calling “kernel fusion.” This is a pretty obvious transformation to do, and many other GPU languages do it. However, kernel fusion comes in several different variants that I’d like to discuss.

Using Scheme With Travis CI

Early on in the development of the Harlan compiler, my collaborators and I realized we were spending a lot of time writing compilers that translate Scheme-like languages into C or C++. A lot of this code should be common between projects, so we decided to factor some of this code into the Elegant Weapons project. Elegant Weapons even had a trivial test suite. Unfortunately, because the primary consumer of Elegant Weapons was Harlan, the design was still far to specific to Harlan. As we realized when Nada Amin submitted a fix for the Elegant Weapons tests, we weren’t even running our own tests anymore. Clearly we needed to do something better if Elegant Weapons were truly going to be a project worthy of existing on its own.

Lately I’ve seen a lot of GitHub repositories that include an image like this in their Readme:

Build Status

It turns out this is from a free service called Travis CI, which makes it easy to integrate continuous integration with GitHub projects. Basically, this means you can define a set of tests that run every time you push a commit to GitHub, and you’ll be notified if something breaks.

Scheme was not one of the languages with first class support, but fortunately we can make it work. Here’s how.

Some Picky Presentation Tips

I just spent the last week at IPDPS in Boston. It was a good time. I got to meet a few new people, and connect with a lot of friends who are now living in the Boston area. I also presented our work on Rust for the GPU at HIPS. In the course of watching a lot of presentations, I came up with a few tips. I admit I did not follow all of these in my own presentation, but hopefully all of us can learn from these.

Data Parallel Operators

In my previous post, we discussed some of the data structures that support data parallel programming. Now we’ll turn our attention to the common operators that manipulate these data structures. I’ll discuss several of them: map, reduce, scan, permute, back-permute and filter.

Data Parallel Data Structures

Data parallelism is a style of programming where essentially the same operation is applied to a large collection of values. This style became popular during the 80s and early 90s as a convenient way of programming large vector processors. Data parallelism has remained popular, especially in light of the rise of GPGPU programming. Often, data parallel programming is used for fine-grained parallelism, but it works at larger granularity too. For example, MapReduce is a restricted example of data parallelism.

Systems that support data parallelism typically provide a handful of data structures that can be manipulated with a set of parallel operators. These data structures include vectors, segmented vectors, and arrays. In this post, we’ll take a look at these different structures and in a later post we’ll discuss some of the parallel operators that manipulate them.

Beware the Logarithms

Logarithms are great. They let you talk about incredibly wide ranges of numbers, and they transform multiplication into addition. Algorithms with logarithmic running times are so fast they might as well be constant time algorithms. Logarithmic scales on graphs can also make your results look much better. Let’s see how.

Patterns With Ellipses

Last time, we talked about matching patterns in Scheme. Now we will look at how to extend the pattern matcher and template instantiation code to handle patterns with ellipses.

Matching Patterns With Scheme

A while back, I wrote a post about macros in Scheme. Today I want to take a look at how one might begin to implement a macro system. In Scheme, whether you use syntax-rules or syntax-case do write your macros, at some point you’ll write patterns and templates. Macros match their input against a pattern and then use this to instantiate a template. Let’s consider a two-way or macro:

1
2
3
4
5
(define-syntax or2
  (syntax-rules ()
    ((_ e1 e2)
     (let ((t e1))
       (if t t e2)))))

We can see how this works at the REPL by using expand:

1
2
> (expand '(or2 1 2))
(let ([t.8 1]) (if t.8 t.8 2))

This matches the expression (or2 1 2) against the pattern (_ e1 e2). The underscore means we don’t care what’s in that position (it’s always or2, since this is part of the or2 macro definition). The names e1 and e2 allow us to refer to what occurs in this pattern in the template. When instantiating the template, the macro system replaces all references to e1 with the chunk of syntax from that position (1 in this case), and likewise with e2. The result is the expression, (let ([t.8 1]) (if t.8 t.8 2)). Here we let-bind the first expression because we do not want to accidentally evaluate any side effects it might contain twice. Don’t worry too much about why the variable name is t.8 instead of just t. This has to do with something called hygiene, which I will hopefully say more on in a later post.

Now let’s take a look at how we would write a function to match patterns like this. We’ll start with a procedure called match that takes four arguments: the pattern to match against, the expression to match, a success continuation, and a failure continuation. The success continuation is called if the expression matches the pattern, and the failure continuation is called otherwise. When I first started writing this, it seemed like it would be cleaner to use success and failure continuations, although I’m not entirely sure that’s true anymore. At the moment, we will support three things for patterns:

  1. Pairs – A pair pattern matches an expression if and only if the expression is a pair, and the car of the pattern matches the car of the expression and the cdr of the pattern matches the cdr of the expression.
  2. Symbols – A symbol pattern binds a pattern variable. This matches any expression, and binds that expression to the variable.
  3. Null – A null pattern matches an expression if and only if the expression is null.

Scheme’s pattern matching supports several other features, like auxiliary keywords, which we’re ignoring for the now for the sake of simplicity. Using our three kinds of patterns so far, we can translate this pretty directly into a Scheme program:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
(define (match p e sk fk)
  (cond
    ((and (pair? p) (pair? e))
     (match (car p) (car e)
            (lambda (b)
              (match (cdr p) (cdr e)
                     (lambda (b^) (sk (append b b^)))
                     fk))
            fk))
    ((and (symbol? p))
     (sk (list (cons p e))))
    ((and (null? p) (null? e))
     (sk '()))
    (else (fk))))

Notice that we pass a list of bindings to the success continuation. These will be used by the template instantiation code. Let’s try an example:

1
2
> (match '(_ e1 e2) '(or2 1 2) (lambda (b) b) (lambda () #f))
((_ . or2) (e1 . 1) (e2 . 2))

For the most part, this does what we want. The matcher says the expression matches the pattern, and tells us that e1 was bound to 1 and e2 was bound to 2. However, we also bound _ to or2. In reality, we want _ to mean ignore, which we can do by adding one line before the symbol? line in our cond clause:

1
2
((eq? p '_)
 (sk '()))

Now we get only the bindings we care about.

The next part is to use the resulting bindings to instantiate the template. We’ll do this with a function called instantiate, which takes a template (called p), and a list of bindings. This is a straightforward tree walk that tries to replace symbols with expressions when they are bound:

1
2
3
4
5
6
7
(define (instantiate p bindings)
  (cond
    ((pair? p)
     (cons (instantiate (car p) bindings)
           (instantiate (cdr p) bindings)))
    ((assq p bindings) => cdr)
    (else p)))

If we did this right, we should be able to instantiate the template from the or2 macro above.

1
2
3
4
5
6
> (instantiate
      '(let ((t e1))
         (if t t e2))
      '((e1 . 1) (e2 . 2)))

(let ([t 1]) (if t t 2))

It looks like we got what we wanted, although without hygiene.

At this point we have a pattern matcher and template instantiation function that is suitable for making a very simple macro system. One very nice feature that we are missing at the moment is ellipsis patterns, which allow us to match repeated patterns. For example, we could use this to write a custom let macro like this:

1
2
3
4
(define-syntax my-let
  (syntax-rules ()
    ((_ ((x e) ...) b)
     ((lambda (x ...) b) e ...))))

In a post at some point in the hopefully near future, I will show how to handle patterns with ... in them.

Access Patterns Matter, Part 2

A couple of readers pointed out some improvements and corrections to my last post on GPU access patterns. These were pretty significant, so I thought it’d be worth doing a follow up post to see how the change things.

First of all, I meant to operate on both arrays, A and B, but through some sloppy coding I ended up only using A. Incidentally, I did some back-of-the-envelope calculations to figure out the memory bandwidth I was getting, and I was surprised to see that I was getting close to twice the theoretical peak for the cards I was working with. It looks like it’s because I was only reading half the data I thought I was. Here are the corrected figures (the experiment is the same other than the small corrections to my code):

KernelTesla C1060GeForce GTX 460ATI Radeon HD 6750M
MyAdd 2.764 ms 4.524 ms 36.325 ms
MyAdd_2D 10.560 ms 0.763 ms 4.273 ms
MyAdd_2D_unweave 0.740 ms 0.100 ms 2.170 ms
MyAdd_col 2.777 ms 4.527 ms 26.686 ms
MyAdd_2D_col 10.391 ms 0.961 ms 7.723 ms
MyAdd_2D_unweave_col 12.398 ms 0.708 ms 3.413 ms

We’re slower across the board, but the overall shape of the data is about the same. Interestingly, the fastest kernels are not much slower than the fastest kernels from before.

Next, reddit user ser999 pointed out that we could forego the branch entirely by doing some clever arithmetic. Instead of doing

1
2
3
4
if (i % 2 == 0)
    get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
else
    get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);

we could instead do this:

1
get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j)*(1 - ((i&1)<<1));

This isn’t exactly as clear as the code we had before, but perhaps a sufficiently smart compiler could perform this optimization so the programmer could still write the nicer code. To be fair, my thread divergence optimization wasn’t great for code readability either. The important this is, how does this “no branching” version perform? The table below shows the performance along with the “unweaved” version from before for comparison.

KernelTesla C1060GeForce GTX 460ATI Radeon HD 6750M
MyAdd_2D_unweave 0.740 ms 0.100 ms 2.170 ms
MyAdd_2D_nobranch 9.969 ms 0.731 ms 3.381 ms
MyAdd_2D_unweave_col 12.398 ms 0.708 ms 3.413 ms
MyAdd_2D_col_nobranch 9.982 ms 0.729 ms 3.465 ms

For row-wise access, the “unweave” variant always wins. For column-wise access, the “nobranch” version wins on the C1060, while the GTX 460 and the ATI card do better with the “unweave” variant. In both the ATI and GTX 460 column-wise case, however, the two perform basically the same.

So why is this? Branches are often pretty expensive, especially when they create thread divergence. However, by removing the branch we always have to do a multiplication, and we also have to convert an integer value into a floating point value. Multiplication in particular is a fairly expensive operation. In the case, the branch isn’t so bad.

As before, the code from this post is available at https://github.com/eholk/bench-thread-diverge.