<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title><![CDATA[Eric Holk]]></title>
  <link href="http://blog.theincredibleholk.org/atom.xml" rel="self"/>
  <link href="http://blog.theincredibleholk.org/"/>
  <updated>2013-06-17T11:36:08-06:00</updated>
  <id>http://blog.theincredibleholk.org/</id>
  <author>
    <name><![CDATA[Eric Holk]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
    <entry>
      




<title type="html"><![CDATA[What is Macro Hygiene?]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/06/17/what-is-macro-hygiene/"/>
<updated>2013-06-17T10:50:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/06/17/what-is-macro-hygiene</id>

      <content type="html"><![CDATA[<p>One important, though surprisingly uncommon, feature of macro systems
is that of <em>hygiene</em>. I mentioned in a <a href="http://blog.theincredibleholk.org/blog/2013/02/11/matching-patterns-with-scheme/">previous post</a> that I would
eventually say something about hygiene. It turns out macro hygiene is
somewhat tricky to define precisely, and I know a couple of people who
are actively working on a formal definition of hygiene. The intuition
behind hygiene isn&rsquo;t too bad though. Basically, we want our macros to
not break our code. So how can macros break code?</p>

<!-- MORE -->


<p><a href="http://blog.theincredibleholk.org/blog/2012/12/02/a-look-at-macros-in-scheme/">Recall</a> that macros are basically programs that transform your
program&rsquo;s code, rather than runtime values. In doing so, they may
introduce new <em>variable bindings</em>. If we&rsquo;re not careful, these new
bindings can end up <em>capturing</em> variables in your own code. That is,
the new binding might <em>shadow</em> a variable you as the programmer have
already created. All of a sudden, the variable you thought you were
referring to is no longer the same thing, and because all of this code
is hidden in a macro expansion, it will be very hard to figure out
what&rsquo;s going on.</p>

<p>Consider the following Scheme macro for <code>or</code>.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define-syntax </span><span class="nv">or</span>
</span><span class='line'>  <span class="p">(</span><span class="k">syntax-rules </span><span class="p">()</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">e1</span> <span class="nv">e2</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="nv">e1</span><span class="p">))</span>
</span><span class='line'>       <span class="p">(</span><span class="k">if </span><span class="nv">t</span> <span class="nv">t</span> <span class="nv">e2</span><span class="p">)))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Like a good macro, this uses a temporary variable, <code>t</code>, to avoid
calculating <code>e1</code> twice and potentially duplicating any side effects in
that expression. Unfortunately, if we&rsquo;re not careful, this binding can
capture an existing binding of <code>t</code>. Consider the following program.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="mi">5</span><span class="p">))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">or </span><span class="no">#f</span> <span class="nv">t</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>If you run this in the Scheme REPL, you should get <code>5</code>. Let&rsquo;s see what
happens if we blindly expand the <code>or</code> macro without regard for
hygiene. We would end up with the following program.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="mi">5</span><span class="p">))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="no">#f</span><span class="p">))</span>
</span><span class='line'>    <span class="p">(</span><span class="k">if </span><span class="nv">t</span> <span class="nv">t</span> <span class="nv">t</span><span class="p">)))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This program evaluates to <code>#f</code>, which is the exact opposite of what we
were supposed to get! Expanding our macro has shadowed the binding of
<code>t</code> to <code>5</code> with a binding of <code>t</code> to <code>#f</code>.</p>

<p>One way to work around this, which was a common trick for LISP
programmers of yore, is to choose variable names that a programmer is
unlikely to guess. We could rewrite our <code>or</code> macro like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define-syntax </span><span class="nv">or</span>
</span><span class='line'>  <span class="p">(</span><span class="k">syntax-rules </span><span class="p">()</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">e1</span> <span class="nv">e2</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">this-is-my-super-secret-name-which-you-will-never-guess</span> <span class="nv">e1</span><span class="p">))</span>
</span><span class='line'>       <span class="p">(</span><span class="k">if </span><span class="nv">this-is-my-super-secret-name-which-you-will-never-guess</span>
</span><span class='line'>           <span class="nv">this-is-my-super-secret-name-which-you-will-never-guess</span>
</span><span class='line'>           <span class="nv">e2</span><span class="p">)))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This works okay, except it&rsquo;s a lot more typing. Eventually, some
overly clever programmer is going to name their own variable
<code>this-is-my-super-secret-name-which-you-will-never-guess</code>, and then
their program will break in really unexpected ways. It&rsquo;d really be
great if our macro expander could take care of these issues on its
own. That way, the macro writers could type less and use names they
like, and macro users don&rsquo;t have to worry about their variables being
captured unexpectedly.</p>

<p>We could modify the macro expander to automatically rename any
variables bound by a macro expansion. In this case, our simple test
program would expand as follows, using our first definition of <code>or</code>.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="mi">5</span><span class="p">))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">1</span> <span class="no">#f</span><span class="p">))</span>
</span><span class='line'>    <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">t</span><span class="p">)))</span>
</span></code></pre></td></tr></table></div></figure>
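
<p>The renaming step itself can be sketched in a few lines of ordinary Scheme. This is only an illustration that treats code as quoted lists; <code>rename-counter</code>, <code>fresh-name</code> and <code>expand-or</code> are hypothetical names, not part of any real expander.</p>

<figure class='code'><div class="highlight"><pre><code class='scheme'>;; Counter used to make each generated name distinct.
(define rename-counter 0)

;; Build a new symbol such as t.1, t.2, ... from a base symbol.
(define (fresh-name base)
  (set! rename-counter (+ rename-counter 1))
  (string->symbol
    (string-append (symbol->string base)
                   "."
                   (number->string rename-counter))))

;; Expand (or e1 e2) by hand, renaming the temporary on every expansion.
(define (expand-or e1 e2)
  (let ((tmp (fresh-name 't)))
    `(let ((,tmp ,e1))
       (if ,tmp ,tmp ,e2))))
</code></pre></div></figure>

<p>Starting from a fresh counter, <code>(expand-or #f 't)</code> returns <code>(let ((t.1 #f)) (if t.1 t.1 t))</code>, which is the body of the expanded test program shown above.</p>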


<p>This program evaluates as expected, so things are looking good. But,
what if someone writes this program?</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="k">if </span><span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">a</span> <span class="nv">b</span> <span class="nv">c</span><span class="p">)</span> <span class="nv">b</span><span class="p">)))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">or </span><span class="no">#f</span> <span class="mi">5</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>We&rsquo;ll use our variable-renaming expander and see that we end up with
the following program:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="k">if </span><span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">a</span> <span class="nv">b</span> <span class="nv">c</span><span class="p">)</span> <span class="nv">b</span><span class="p">)))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">1</span> <span class="no">#f</span><span class="p">))</span>
</span><span class='line'>    <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="mi">5</span><span class="p">)))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This program once again evaluates to <code>#f</code> instead of <code>5</code> as we&rsquo;d
like it to. This illustrates the second, and more subtle, class of
hygiene error. The problem is that the programmer&rsquo;s definition of <code>if</code>
has captured the <code>if</code> used by the expansion of <code>or</code>. Now, many
languages treat keywords like <code>if</code> specially and don&rsquo;t let you name
your variables after them. Enforcing a rule like that in Scheme would
solve this particular case, but at the cost of a lot of the power that
Scheme programmers love. The proper solution is to find some way of
tracking what <code>if</code> was bound to when the <code>or</code> macro was defined and
using that version in the expansion of <code>or</code>.</p>
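
<p>To see what a hygienic expander buys us, both problem programs can be run against the first definition in a Scheme whose <code>syntax-rules</code> is hygienic (Chez Scheme is one such system). The macro is called <code>my-or</code> here, a made-up name chosen only to avoid redefining the built-in <code>or</code>; <code>result-1</code> and <code>result-2</code> are likewise just names for this sketch.</p>

<figure class='code'><div class="highlight"><pre><code class='scheme'>;; Same shape as the first definition of or, under a different name.
(define-syntax my-or
  (syntax-rules ()
    ((_ e1 e2)
     (let ((t e1))
       (if t t e2)))))

;; The user's t must not be captured by the macro's temporary t.
(define result-1
  (let ((t 5))
    (my-or #f t)))

;; The user's if must not capture the if used inside the macro.
(define result-2
  (let ((if (lambda (a b c) b)))
    (my-or #f 5)))
</code></pre></div></figure>

<p>Under a hygienic expander both <code>result-1</code> and <code>result-2</code> are <code>5</code>: the temporary <code>t</code> is kept distinct from the user&rsquo;s <code>t</code>, and the <code>if</code> in the template still means the <code>if</code> that was in scope where the macro was defined.</p>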

<p>Properly maintaining hygiene in macro systems turns out to be really
tricky. The reward is worth it, however, as programmers can then
reason much more easily about the behavior of macros and start to rely
on them in large projects. Macros are a powerful feature of
programming languages, and many newer languages have some form of macro
system. Sadly, these are often not hygienic, or their designers say &ldquo;Oh, we&rsquo;ll add
hygiene later.&rdquo; Given how important hygiene is, and how tricky it is
to get right, language designers should really implement hygiene in
their macro systems from the very beginning.</p>

<p><section class="related"></p>

<h2>You might also like&hellip;</h2>

<ul>
<li><a href="http://blog.theincredibleholk.org/blog/2012/12/02/a-look-at-macros-in-scheme/">A Look at Macros in Scheme</a></li>
<li><a href="http://blog.theincredibleholk.org/blog/2013/02/11/matching-patterns-with-scheme/">Matching Patterns with Scheme</a></li>
</ul>


<p></section></p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Some Simple GPU Optimizations]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/06/10/some-simple-gpu-optimizations/"/>
<updated>2013-06-10T15:31:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/06/10/some-simple-gpu-optimizations</id>

      <content type="html"><![CDATA[<p>One of the goals of designing a high level GPU programming language is
to allow the compiler to perform optimizations on your code. One
optimization we&rsquo;ve been doing for a while in Harlan is one I&rsquo;ve been
calling &ldquo;kernel fusion.&rdquo; This is a pretty obvious transformation to
do, and many other GPU languages do it. However, kernel fusion comes
in several different variants that I&rsquo;d like to discuss.</p>

<!-- MORE -->


<h2>Kernel Inlining</h2>

<p>Let&rsquo;s consider an example program written in Harlan:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(let ((temp* (kernel ((x x*) (y y*))
</span><span class='line'>               (+ (* x x) (* y y)))))
</span><span class='line'>  (kernel ((t temp*))
</span><span class='line'>    (sqrt t)))</span></code></pre></td></tr></table></div></figure>


<p>I believe this is the first time I&rsquo;ve included Harlan code snippets in
my blog, so some introduction is in order. Harlan is a high level
language for data parallel computing that I&rsquo;ve been working on for a
while now as part of my Ph.D. research. You can read our initial paper
about Harlan <a href="http://www.cs.indiana.edu/~eholk/papers/parco2011.pdf">here</a>, although it will be immediately apparent from the
drastically different syntax that the language has evolved
significantly in that time.</p>

<p>The primary construct for parallelism in Harlan is the <code>kernel</code>
form. Syntactically, this looks like Scheme&rsquo;s <code>let</code>, in that it takes
a list of variable names and expressions to bind the names to, and
then a body describing what computation to do on those variables. The
difference is that each of the bound expressions must be a vector, and
the result of a kernel is another vector. Furthermore, each input
vector must be of the same length, and the result of the kernel will
have the same length as its inputs. Kernels operate in parallel, with
the body being applied to each corresponding set of elements in the
inputs. You can think of kernels as a parallel map or zipWith.</p>

<p>In the example above, if we imagine <code>x*</code> and <code>y*</code> as containing the X
and Y coordinates of a set of 2D vectors, then the code snippet
calculates the magnitude of each of these vectors. As written,
however, we have a couple of obvious inefficiencies. This example
allocates space for the result of the first kernel, <code>temp*</code>, which
could be significant if <code>x*</code> and <code>y*</code> are very long. Secondly, both
kernels are tiny&mdash;that is, they have low arithmetic
intensity. Arithmetic intensity is a measure of how many arithmetic
operations we have for each memory reference. If it is low, we won&rsquo;t
be using the full computational power of the GPU.</p>

<p>Fortunately, we can solve both of these problems with the first form
of fusion, which I feel is better called <em>kernel inlining</em>. With
kernel inlining, we would transform our program into the following:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(kernel ((x x*) (y y*))
</span><span class='line'>  (let ((t (+ (* x x) (* y y))))
</span><span class='line'>    (sqrt t)))</span></code></pre></td></tr></table></div></figure>


<p>In case it&rsquo;s not clear what happened, we took the body of the first
kernel and inserted it into the second kernel. In doing this, we also
got rid of the parameters to the second kernel and replaced them with
the parameters from the first kernel. Now there is only one kernel,
the <code>temp*</code> vector is completely gone, and the arithmetic intensity of
the newly fused kernel is higher than we had before.</p>
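
<p>The same rewrite can be mimicked in plain Scheme, using lists and <code>map</code> as a sequential stand-in for Harlan vectors and <code>kernel</code>. This is only a sketch of the transformation, not Harlan code; <code>magnitudes-unfused</code> and <code>magnitudes-fused</code> are names made up for this example.</p>

<figure class='code'><div class="highlight"><pre><code class='scheme'>;; Two passes: the first builds an intermediate list (like temp*),
;; the second takes the square root of each element.
(define (magnitudes-unfused x* y*)
  (let ((temp* (map (lambda (x y) (+ (* x x) (* y y))) x* y*)))
    (map (lambda (t) (sqrt t)) temp*)))

;; One pass: the body of the first map is inlined into the second,
;; so no intermediate list is allocated.
(define (magnitudes-fused x* y*)
  (map (lambda (x y)
         (let ((t (+ (* x x) (* y y))))
           (sqrt t)))
       x* y*))
</code></pre></div></figure>

<p>Both procedures compute the same list of magnitudes; only the second avoids materializing <code>temp*</code>.</p>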

<p>This transformation is usually a win when it&rsquo;s possible. We may not
want to do it if, for example, the <code>temp*</code> array is also referenced
somewhere else. Also, while kernels with higher arithmetic intensity
tend to work better because there is more computation for the GPU to
use to hide memory access latency, it can also increase the number of
registers needed by each thread and thereby limit how much parallelism
is possible.</p>

<h2>2D Kernel Fusion</h2>

<p>For the next type of fusion, let&rsquo;s consider the following code that
would appear in a hypothetical program to render the Mandelbrot set:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(kernel ((i (iota height)))
</span><span class='line'>  (kernel ((j (iota width)))
</span><span class='line'>    (compute-mandelbrot i j)))</span></code></pre></td></tr></table></div></figure>


<p>Harlan allows nested kernels like this, but Harlan compiles to OpenCL,
which does not support nested kernels. Harlan generally deals with
this by sequentializing all but one of the kernels. In this case, we
could transform either the outer or the inner kernel into a loop. However,
since the kernels form a grid with no dependencies between them, the
compiler can fuse these into a single, two dimensional kernel, which
OpenCL does support. The Harlan compiler is able to do this for simple
cases like this one.</p>
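
<p>The idea of replacing two nested index spaces with one can be illustrated in plain Scheme. The sketch below is an assumption-laden stand-in: it flattens the grid to a single one-dimensional index for simplicity, whereas OpenCL would keep two dimensions, and <code>range</code>, <code>pixel</code>, <code>nested</code> and <code>fused</code> are hypothetical names (<code>pixel</code> plays the role of <code>compute-mandelbrot</code>).</p>

<figure class='code'><div class="highlight"><pre><code class='scheme'>;; (range n) returns the list (0 1 ... n-1), similar to iota.
(define (range n)
  (let loop ((i (- n 1)) (acc '()))
    (if (negative? i)
        acc
        (loop (- i 1) (cons i acc)))))

;; Stand-in for compute-mandelbrot: any function of the two indices.
(define (pixel i j)
  (+ (* 10 i) j))

;; Nested form: one map per dimension, producing a list of rows.
(define (nested height width)
  (map (lambda (i)
         (map (lambda (j) (pixel i j)) (range width)))
       (range height)))

;; Fused form: a single map over height * width work items,
;; recovering i and j from the flat index k.
(define (fused height width)
  (map (lambda (k)
         (pixel (quotient k width) (remainder k width)))
       (range (* height width))))
</code></pre></div></figure>

<p>Appending the rows of <code>(nested 2 3)</code> gives the same elements, in the same order, as <code>(fused 2 3)</code>. The fusion is only valid because no element depends on any other.</p>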

<h2>Kernel Concatenation</h2>

<p>Let&rsquo;s look at one more form of fusion, although the Harlan compiler
does not currently do this form. Consider a program like this:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(let ((a* (kernel ((i inputs))
</span><span class='line'>            (do-something i)))
</span><span class='line'>      (b* (kernel ((i inputs))
</span><span class='line'>            (do-something-else i))))
</span><span class='line'>  (write-outputs a* b*))</span></code></pre></td></tr></table></div></figure>


<p>Here we have two kernels that execute in sequence with no dependencies
between them. Furthermore, they have the same set of inputs. If we
pretend for a moment that Harlan has a multiple return value form,
like Scheme&rsquo;s <code>values</code>, we could combine these two kernels as follows:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(let-values (((a* b*) (kernel ((i inputs))
</span><span class='line'>                        (let ((a (do-something i))
</span><span class='line'>                              (b (do-something-else i)))
</span><span class='line'>                          (values a b)))))
</span><span class='line'>  (write-outputs a* b*))</span></code></pre></td></tr></table></div></figure>


<p>In keeping with the general theme of fusion transformations, we now
have only one kernel that performs more work instead of two smaller
kernels.</p>
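
<p>Since Scheme really does have <code>values</code>, the shape of this transformation can be tried out on ordinary lists. This is a sequential sketch under stated assumptions, not Harlan: <code>do-something</code> and <code>do-something-else</code> are given arbitrary bodies here, and <code>separate</code> and <code>concatenated</code> are names invented for the example.</p>

<figure class='code'><div class="highlight"><pre><code class='scheme'>;; Arbitrary placeholder bodies for the two per-element computations.
(define (do-something i) (* i 2))
(define (do-something-else i) (+ i 1))

;; Two independent passes over the same inputs.
(define (separate inputs)
  (values (map do-something inputs)
          (map do-something-else inputs)))

;; One pass that computes both results for each element.
(define (concatenated inputs)
  (let loop ((xs inputs) (a* '()) (b* '()))
    (if (null? xs)
        (values (reverse a*) (reverse b*))
        (loop (cdr xs)
              (cons (do-something (car xs)) a*)
              (cons (do-something-else (car xs)) b*)))))
</code></pre></div></figure>

<p>Both procedures return the same pair of lists; <code>concatenated</code> simply walks <code>inputs</code> once instead of twice.</p>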

<h2>Conclusion</h2>

<p>We&rsquo;ve looked at three different program transformations that could be
called fusion. A key point is that they are transformations, not
necessarily optimizations. In my experience, these sorts of
transformations almost always improve performance, but that need not
always be the case. By combining small pieces of code into larger
chunks, we reduce the overhead of launching new kernels and also give
the GPU more code to use for overlapping memory accesses with
computation. On the other hand, larger code could lead to less
efficient instruction cache usage, and also lead to more memory
accesses if each thread&rsquo;s working set can no longer fit in
registers. It&rsquo;s good to be able to do these transformations, but it&rsquo;s
important to benchmark and see what works best for each particular
program.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Using Scheme with Travis CI]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/05/28/using-scheme-with-travis-ci/"/>
<updated>2013-05-28T14:21:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/05/28/using-scheme-with-travis-ci</id>

      <content type="html"><![CDATA[<p>Early on in the development of the Harlan compiler, my collaborators
and I realized we were spending a lot of time writing compilers that
translate Scheme-like languages into C or C++. A lot of this code
should be common between projects, so we decided to factor some of
this code into the <a href="https://github.com/eholk/elegant-weapons">Elegant Weapons</a> project. Elegant Weapons even had
a trivial test suite. Unfortunately, because the primary consumer of
Elegant Weapons was Harlan, the design was still far too specific to
Harlan. As we realized when <a href="http://namin.github.io/">Nada Amin</a> submitted a <a href="https://github.com/eholk/elegant-weapons/pull/5">fix</a> for the
Elegant Weapons tests, we weren&rsquo;t even running our own tests
anymore. Clearly we needed to do something better if Elegant Weapons
were truly going to be a project worthy of existing on its own.</p>

<p>Lately I&rsquo;ve seen a lot of GitHub repositories that include an image
like this in their Readme:</p>

<p><a href="https://travis-ci.org/eholk/elegant-weapons"><img src="https://travis-ci.org/eholk/elegant-weapons.png" alt="Build Status" /></a></p>

<p>It turns out this is from a free service called <a href="https://travis-ci.org/">Travis CI</a>, which
makes it easy to integrate continuous integration with GitHub
projects. Basically, this means you can define a set of tests that run
every time you push a commit to GitHub, and you&rsquo;ll be notified if
something breaks.</p>

<p>Scheme was not one of the languages with first class support, but
fortunately we can make it work. Here&rsquo;s how.</p>

<!-- MORE -->


<p>Travis CI gives a lot of flexibility with the scripts you can run on
their servers. Thus, we will pretend that we are a C project, and that
our build process installs <a href="http://www.scheme.com/">Petite Chez Scheme</a>. Once that&rsquo;s done, we
run a small wrapper script which launches our Scheme test suite.</p>

<p>First, we need a script to install Petite. You can find our script
<a href="https://github.com/eholk/elegant-weapons/blob/master/travis/install-petite.sh">here</a>,
but I&rsquo;ve included the contents below.</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>#!/bin/bash
</span><span class='line'>
</span><span class='line'>wget http://www.scheme.com/download/pcsv8.4-a6le.tar.gz
</span><span class='line'>
</span><span class='line'>tar xzf pcsv8.4-a6le.tar.gz
</span><span class='line'>
</span><span class='line'>cd csv8.4/custom
</span><span class='line'>
</span><span class='line'>./configure
</span><span class='line'>make
</span><span class='line'>sudo make install</span></code></pre></td></tr></table></div></figure>


<p>Notice that we can use sudo within our script. This is because Travis
CI gives you your very own virtual machine and allows passwordless
sudo. Once your tests are done, the VM is destroyed and your next
build will happen on an entirely new instance.</p>

<p>Next, let&rsquo;s assume our test suite is called <code>run-tests.scm</code>. We can
then create a simple <a href="https://github.com/eholk/elegant-weapons/blob/master/run-tests">wrapper script</a> to execute these. Be sure to
mark it as executable before you check it in.</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>#!/bin/bash
</span><span class='line'>
</span><span class='line'>/usr/bin/petite --libdirs lib --script run-tests.scm</span></code></pre></td></tr></table></div></figure>


<p>Some of the parameters will be specific to your project. For example,
the Elegant Weapons code lives in the <code>lib</code> subdirectory, so we are
sure to add this to the compiler&rsquo;s load path.</p>
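
<p>For completeness, here is a minimal sketch of the kind of helpers a <code>run-tests.scm</code> might contain. It is not the actual Elegant Weapons test suite; <code>failures</code>, <code>check</code> and <code>exit-code</code> are hypothetical names used only for illustration.</p>

<figure class='code'><div class="highlight"><pre><code class='scheme'>;; Number of failed checks seen so far.
(define failures 0)

;; Compare an expected value with an actual one and report the result.
(define (check name expected actual)
  (if (equal? expected actual)
      (begin (display "PASS ") (display name) (newline))
      (begin (set! failures (+ failures 1))
             (display "FAIL ") (display name) (newline))))

;; 0 when every check passed, 1 otherwise.
(define (exit-code)
  (if (= failures 0) 0 1))

;; A placeholder test; a real suite would exercise the library here.
(check "addition" 4 (+ 2 2))
</code></pre></div></figure>

<p>A script built this way would end with <code>(exit (exit-code))</code>, so that a failing check should give the wrapper script a nonzero exit status, which Travis CI treats as a failed build.</p>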

<p>Finally, with all the necessary pieces, we can create the
<a href="https://github.com/eholk/elegant-weapons/blob/master/.travis.yml"><code>.travis.yml</code></a>
file that will actually enable our project with Travis CI:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>language: c
</span><span class='line'>
</span><span class='line'>compiler:
</span><span class='line'> - gcc
</span><span class='line'>
</span><span class='line'>install: ./travis/install-petite.sh
</span><span class='line'>script: ./run-tests</span></code></pre></td></tr></table></div></figure>


<p>Once you add this file to your repository, push it to GitHub, create
your Travis CI account and enable Travis CI for your repository, you
should have your tests automatically running.</p>

<p>Obviously, these instructions have been specific to Petite Chez
Scheme. It should not take too much trouble to adapt these to your
Scheme implementation of choice. In the future, I hope to learn how to
automatically run the tests on a variety of Scheme implementations to
help keep us honest as far as portability is concerned.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Some Picky Presentation Tips]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/05/24/some-picky-presentation-tips/"/>
<updated>2013-05-24T22:06:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/05/24/some-picky-presentation-tips</id>

      <content type="html"><![CDATA[<p>I just spent the last week at <a href="http://www.ipdps.org/">IPDPS</a> in Boston. It was a good time. I
got to meet a few new people, and connect with a lot of friends who
are now living in the Boston area. I also presented our work on
<a href="http://blog.theincredibleholk.org/blog/2012/12/05/compiling-rust-for-gpus/">Rust for the GPU</a> at <a href="https://computation.llnl.gov/hips2013/">HIPS</a>. In the course of watching a lot of
presentations, I came up with a few tips. I admit I did not follow all
of these in my own presentation, but hopefully all of us can
learn from these.</p>

<!-- MORE -->


<h2>I know where I am.</h2>

<p>I was amazed at how many people felt the need to put &ldquo;IPDPS 2013&rdquo; on
every single slide in their presentation, as if I might forget what
conference I&rsquo;m attending. I suspect this is the result of some Beamer
template being helpful, but really it&rsquo;s unnecessary. Some helpful
presenters even thought it&rsquo;d be useful to remind me on <em>every slide</em>
that I&rsquo;m in Boston. Even if I were somehow unaware of my location, the
fact that we are in Boston likely does nothing to help me understand
the information you are trying to convey to me. In the same way, if I
wanted to know the date, I&rsquo;d look at the cell phone in my pocket,
rather than at your slides.</p>

<p>That said, often slides are posted on web sites afterwards for
archival purposes, and in this case the conference name and date
actually are somewhat useful. Putting this information on the title
slide is sufficient. It doesn&rsquo;t need to go on every slide.</p>

<h2>I have probably forgotten what your talk is about.</h2>

<p>One piece of information that I realized would actually be useful to
have on every slide is the title of the talk. I was surprised at how
many times we were on about slide 3, chest deep in technical details
and I realized I had completely forgotten what the point of it all
was. The fact is, most of us in the audience are not giving you our
complete attention. Make it easy for us to try to tune back in by
keeping some kind of a reminder of the big picture on every slide.</p>

<h2>If I wanted to read, I&rsquo;d read your paper.</h2>

<p>Put as little text as possible on your slides. If you can, come up
with a clear graphic or tasteful animation that makes your main point
clear. The sooner I can understand your main point, the easier it will
be for me to put the rest of your slides in context and avoid the
existential crisis mentioned in the previous point.</p>

<h2>I don&rsquo;t want to think about your graphs either.</h2>

<p>Make your graphs easy to understand. Often they will have some of the
smallest text in your presentation, which means I won&rsquo;t be able to
read it. That&rsquo;s fine, since graphs are about presenting information
visually. Generally, the only thing I&rsquo;m looking for in your graphs is
how much better your system is than the ones you compare against. Make
this easy for me. Clearly indicate whether up is better or down is
better. Even better, be consistent and set up all your graphs so that
up is always better or down is always better. Make it really clear
which bars or lines are your system. Make them a brighter color, bold
your system&rsquo;s name in the legend, or put an asterisk next to it. This
is particularly important if you are comparing a bunch of systems
whose names are all acronyms.</p>

<p>You don&rsquo;t need to include every graph from your paper. Focus on the
key ideas. What are the major results that will make me want to read
the rest of your paper? I can get all the nitty gritty details once I
decide to read your paper.</p>

<p>For each graph you do include, spend some time telling me what I am
supposed to see in it. Remember the saying, a picture is worth a
thousand words. You can&rsquo;t adequately explain your picture with a dozen
words. You need to say more than &ldquo;here is our memory usage.&rdquo; Why is
this result significant? Does it show your system is far faster than
anything else? Does it show you have excellent scaling
characteristics? Does the chart show you are only marginally worse
than other systems, but yours provides other important benefits to
make it worth the sacrifice? You should be trying to convince me to
adopt your ideas in my own work. How does your graph help convince me
of that?</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Data Parallel Operators]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/05/14/data-parallel-operators/"/>
<updated>2013-05-14T14:05:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/05/14/data-parallel-operators</id>

      <content type="html"><![CDATA[<p>In my <a href="http://blog.theincredibleholk.org/blog/2013/04/11/data-parallel-data-structures/">previous post</a>, we discussed some of the data structures that
support data parallel programming. Now we&rsquo;ll turn our attention to the
common operators that manipulate these data structures. I&rsquo;ll discuss
several of them: map, reduce, scan, permute, back-permute and filter.</p>

<!--MORE-->


<h3>Map</h3>

<p>Map takes as input a vector and an operation, and returns a new vector
that is the result of applying the operation to each element of the
vector. For example, we can map <code>add1</code> over a vector to increment each
element by 1:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>map(add1, [1 2 3 4])
</span><span class='line'>
</span><span class='line'>=&gt; [2 3 4 5]</span></code></pre></td></tr></table></div></figure>


<p>Related to this is <code>zip-with</code>, which takes some number of equal-length
vectors and applies an operator to corresponding elements in each
vector. For example, we can add two vectors using the <code>+</code> operation:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>zip-with(+, [1 2 3 4], [0 1 0 1])
</span><span class='line'>
</span><span class='line'>=&gt; [1 3 3 5]</span></code></pre></td></tr></table></div></figure>


<p>Map and zip-with can work on segmented vectors as well, and the result
will have the same shape as the original:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>map(add1, [[1 2 3] [4 5]])
</span><span class='line'>
</span><span class='line'>=&gt; [[2 3 4] [5 6]]</span></code></pre></td></tr></table></div></figure>


<p>However, if we try to apply <code>zip-with</code> on vectors with differing
shapes, we will end up with an error.</p>
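<p>To make the semantics concrete, here is a minimal sequential sketch in
Python. The names <code>vmap</code>, <code>zip_with</code> and <code>segmented_map</code> are ones I made up
for illustration, and nested lists stand in for segmented vectors. A real
data parallel system would run the per-element work in parallel.</p>

```python
from typing import Callable, List, Sequence


def vmap(f: Callable, xs: Sequence) -> List:
    """Apply f to every element; each application is independent."""
    return [f(x) for x in xs]


def zip_with(f: Callable, *vectors: Sequence) -> List:
    """Apply f to corresponding elements of equal-length vectors."""
    lengths = {len(v) for v in vectors}
    if len(lengths) > 1:
        raise ValueError("zip_with needs vectors of the same shape")
    return [f(*elems) for elems in zip(*vectors)]


def segmented_map(f: Callable, segments: Sequence[Sequence]) -> List[List]:
    """Map within each segment, so the result keeps the input's shape."""
    return [vmap(f, seg) for seg in segments]


add1 = lambda x: x + 1
print(vmap(add1, [1, 2, 3, 4]))                                  # [2, 3, 4, 5]
print(zip_with(lambda a, b: a + b, [1, 2, 3, 4], [0, 1, 0, 1]))  # [1, 3, 3, 5]
print(segmented_map(add1, [[1, 2, 3], [4, 5]]))                  # [[2, 3, 4], [5, 6]]
```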

<h3>Reduce</h3>

<p>Reduce takes a vector of elements and combines them into a single
element using a given operation. The operator \(\oplus\) must be
an associative binary operator. It helps if the operator has a 0
element, that is, one for which \(a \oplus 0 = a\) and \(0 \oplus
a = a \). We can do even better if we know the operator is
commutative.</p>

<p>Below is an example of how to add all elements in a vector:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>reduce(+, [1 1 1 1 1 1 1 1])
</span><span class='line'>
</span><span class='line'>=&gt; 8</span></code></pre></td></tr></table></div></figure>


<p>We can similarly define a segmented variant of reduce, which results
in a vector containing the reduction of each segment. For example:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>reduce(+, [[1 1 1 1] [1 1 1 1]])
</span><span class='line'>
</span><span class='line'>=&gt; [4 4]</span></code></pre></td></tr></table></div></figure>
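<p>The reason associativity matters is that a parallel reduce is free to
regroup the work. The sketch below, in Python, combines neighbouring
pairs level by level, which is one common way to do it and not
necessarily how any particular system works. The names <code>tree_reduce</code> and
<code>segmented_reduce</code> are hypothetical, and the identity element is passed in
explicitly.</p>

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(op: Callable[[T, T], T], xs: Sequence[T], identity: T) -> T:
    """Reduce by combining neighbouring pairs, level by level.

    Every pair within a level is independent, so a parallel machine could
    process a whole level at once. Regrouping like this only gives the same
    answer as a left-to-right loop when op is associative.
    """
    level = list(xs)
    if not level:
        return identity  # the identity is the natural result for an empty vector
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(identity)  # pad odd levels; x op identity == x
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def segmented_reduce(op, segments, identity):
    """Reduce each segment on its own, yielding one value per segment."""
    return [tree_reduce(op, seg, identity) for seg in segments]


add = lambda a, b: a + b
print(tree_reduce(add, [1] * 8, 0))                            # 8
print(segmented_reduce(add, [[1, 1, 1, 1], [1, 1, 1, 1]], 0))  # [4, 4]
```

<p>Having an identity element is what lets the sketch pad odd-length
levels and give a sensible answer for an empty vector.</p>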


<p>The two operators we&rsquo;ve seen so far are very powerful. We can define
many linear algebra operations in terms of them, and they also form the
basis for MapReduce. Still, there are cases where more operators are
extremely helpful.</p>

<h3>Scan</h3>

<p>Scan is like reduce, except that it returns a vector of all of the
intermediate results. Here&rsquo;s an example:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>scan(+, [1 1 1 1 1 1 1 1])
</span><span class='line'>
</span><span class='line'>=&gt; [0 1 2 3 4 5 6 7]</span></code></pre></td></tr></table></div></figure>


<p>Each entry in the resulting vector is the sum of all of the elements
to the left of the corresponding element in the source vector.</p>

<p>Scan is often handy as an intermediate step in a larger
computation. For example, we may have a vector of vectors where the
lengths of the inner vectors vary wildly. In this case, scan can be
used to help balance the work between some number of worker threads.</p>
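<p>Here is a small Python sketch of the exclusive scan used in the
example above. It is a plain sequential loop that only pins down what
the result means. Parallel scan algorithms exist but are more involved.
The name <code>exclusive_scan</code> and the explicit <code>identity</code> argument are my own
choices for illustration.</p>

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def exclusive_scan(op: Callable[[T, T], T], xs: Sequence[T], identity: T) -> List[T]:
    """Entry i combines every element strictly to the left of position i."""
    out = []
    acc = identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out


add = lambda a, b: a + b
print(exclusive_scan(add, [1] * 8, 0))   # [0, 1, 2, 3, 4, 5, 6, 7]

# Scanning the lengths of the inner vectors gives each one's starting
# offset in a flattened work list, which is a first step towards dividing
# the elements evenly among workers.
lengths = [3, 1, 4]
print(exclusive_scan(add, lengths, 0))   # [0, 3, 4]
```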

<h3>Permute</h3>

<p>Permute is used to rearrange elements in a vector. It takes a vector,
of course, and also a function that maps indices to indices. The
element at the input index will be stored in the output index. Here&rsquo;s
a basic example that reverses a vector:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>permute([1 2 3 4], i -&gt; 3 - i)
</span><span class='line'>
</span><span class='line'>=&gt; [4 3 2 1]</span></code></pre></td></tr></table></div></figure>


<p>There are a couple of caveats. Suppose our mapping function were <code>i -&gt;
0</code>. If we apply this to the vector <code>[1 2 3 4]</code>, all of the elements
would be mapped to the first position and there would be nothing for
the later positions. To solve the first problem, we can provide a
combining function that is used when multiple elements map to the same
location. For example, we may want the maximum value of all elements
that map to a given location. For the second problem, we can provide a
default element that is used when none of the inputs are mapped to a
certain output location. Of course, now our permutation function is
more complicated to use and also more complicated to implement.</p>
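<p>A sequential Python sketch of permute with both extras might look like
the following. The signature is hypothetical, indices are assumed to be
0-based, and the output is assumed to have the same length as the
input.</p>

```python
def permute(xs, index_map, combine=None, default=None):
    """Scatter: the element at input index i is stored at index_map(i)."""
    out = [default] * len(xs)
    written = [False] * len(xs)
    for i, x in enumerate(xs):
        j = index_map(i)
        if written[j]:
            if combine is None:
                raise ValueError(f"two inputs map to output index {j}")
            out[j] = combine(out[j], x)   # resolve the collision
        else:
            out[j] = x
            written[j] = True
    return out


print(permute([1, 2, 3, 4], lambda i: 3 - i))                      # [4, 3, 2, 1]
print(permute([1, 2, 3, 4], lambda i: 0, combine=max, default=0))  # [4, 0, 0, 0]
```

<p>In a parallel setting the combining step is where the difficulty
lies, since several workers may try to write to the same location at
once.</p>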

<h3>Back-permute</h3>

<p>In many ways, back-permute is a simpler version of
permute. Back-permute also takes an input vector and an index mapping
function, but instead of mapping input indices to output indices, it
maps output indices to input indices. Because of this, we don&rsquo;t have
the problem of multiple values mapping to the same location, as the
mapping function produces exactly one input index for each output
location. This means we can avoid both the combining function and the
default element. That said, because this function is simpler, it is
also less powerful than permute.</p>
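<p>For comparison, a sketch of back-permute in Python is just a gather.
The name <code>back_permute</code> and the optional <code>out_len</code> parameter are assumptions
of mine, and indices are again 0-based.</p>

```python
def back_permute(xs, index_map, out_len=None):
    """Gather: output position j reads from input index index_map(j)."""
    n = len(xs) if out_len is None else out_len
    return [xs[index_map(j)] for j in range(n)]


print(back_permute([1, 2, 3, 4], lambda j: 3 - j))   # [4, 3, 2, 1]
print(back_permute([1, 2, 3, 4], lambda j: 0))       # [1, 1, 1, 1]
```

<p>Every output position is written exactly once, so each one can be
computed independently of the others.</p>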

<h3>Filter</h3>

<p>All of our operators so far except for reduce have essentially mapped
some number of inputs to the same number of outputs. Sometimes we need
to remove elements that we are working with. An example is when
writing Quicksort. We&rsquo;d need to split a vector into one vector
containing those elements greater than the pivot and another
containing those elements no greater than the pivot. Using filter, we could
do this as follows:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>let pivot = 5;
</span><span class='line'>let data = [5 2 8 6 4];
</span><span class='line'>
</span><span class='line'>filter(x -&gt; x &gt; pivot, data)
</span><span class='line'>
</span><span class='line'>=&gt; [8 6]
</span><span class='line'>
</span><span class='line'>
</span><span class='line'>filter(x -&gt; x &lt;= pivot, data)
</span><span class='line'>
</span><span class='line'>=&gt; [5 2 4]</span></code></pre></td></tr></table></div></figure>


<h2>What&rsquo;s primitive?</h2>

<p>Many of these operators can be written in terms of other ones. For
example, you can get reduce from scan by reading the last element of
the result and adding the last element of the input. This version will
probably not be efficient, as it may allocate far more storage than
necessary. In designing a data parallel system, one important decision
is which operators will be considered primitive, and which will be
built in terms of others. Ideally, these decisions should consider the
characteristics of the underlying hardware. In Guy Blelloch&rsquo;s book,
<a href="http://www.cs.cmu.edu/~blelloch/papers/Ble90.pdf">Vector Models for Data-Parallel Computing</a>, he argues in favor of a
set of primitives that can execute in about the same amount of time it
would take to read the input vector from memory. Similarly, when
implementing a data parallel system for GPUs, it&rsquo;s important to
consider what operators are easy to implement. Map and back-permute
are pretty straightforward, but others are more difficult. Ideally,
upon choosing a good set of primitives, the compiler and runtime can
optimize the other operators into efficient code.</p>
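<p>As a rough illustration of building operators out of other ones, the
Python sketch below derives reduce from an exclusive scan, and then
filter from a map, a scan and a scatter. All of the function names are
hypothetical, and everything runs sequentially here.</p>

```python
def exclusive_scan(op, xs, identity):
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out


def reduce_via_scan(op, xs, identity):
    """Last scan entry combined with the last input element."""
    if not xs:
        return identity
    return op(exclusive_scan(op, xs, identity)[-1], xs[-1])


def filter_via_scan(pred, xs):
    """Filter expressed as a map, a scan, a reduce and a scatter."""
    add = lambda a, b: a + b
    flags = [1 if pred(x) else 0 for x in xs]      # map
    dest = exclusive_scan(add, flags, 0)           # output slot for each kept element
    out = [None] * reduce_via_scan(add, flags, 0)  # number of kept elements
    for x, keep, j in zip(xs, flags, dest):        # scatter the kept elements
        if keep:
            out[j] = x
    return out


pivot, data = 5, [5, 2, 8, 6, 4]
print(reduce_via_scan(lambda a, b: a + b, [1] * 8, 0))   # 8
print(filter_via_scan(lambda x: x > pivot, data))        # [8, 6]
print(filter_via_scan(lambda x: not x > pivot, data))    # [5, 2, 4]
```

<p>Note how <code>reduce_via_scan</code> builds a whole intermediate vector just to
read one entry from it, which is the extra storage cost mentioned
above.</p>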
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Data Parallel Data Structures]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/04/11/data-parallel-data-structures/"/>
<updated>2013-04-11T13:28:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/04/11/data-parallel-data-structures</id>

      <content type="html"><![CDATA[<p>Data parallelism is a style of programming where essentially the same
operation is applied to a large collection of values. This style
became popular during the 80s and early 90s as a convenient way of
programming large vector processors. Data parallelism has remained
popular, especially in light of the rise of GPGPU programming. Often,
data parallel programming is used for fine-grained parallelism, but it
works at larger granularity too. For example, <a href="http://research.google.com/archive/mapreduce.html">MapReduce</a> is a
restricted example of data parallelism.</p>

<p>Systems that support data parallelism typically provide a handful of
data structures that can be manipulated with a set of parallel
operators. These data structures include vectors, segmented vectors,
and arrays. In this post, we&rsquo;ll take a look at these different
structures and in a later post we&rsquo;ll discuss some of the parallel
operators that manipulate them.</p>

<!-- MORE -->


<p>Data parallel systems typically provide some combination of vectors,
segmented vectors and arrays. In the abstract, we&rsquo;ll consider a vector
to be an ordered, finite sequence of elements of the same type. Most
often, these are scalar values, but some systems allow vectors of
vectors. A segmented vector is a vector that has been divided into
segments (surprise!). Below is an example of a vector (<code>X</code>) and a
segmented vector (<code>Xs</code>).</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>X  = [1 2 3 4 5 6 7 8]
</span><span class='line'>Xs = [[1 2 3] [4 5] [6 7 8]]</span></code></pre></td></tr></table></div></figure>


<p>The first vector just includes the numbers one through eight, while
the second is made up of three segments. Notice how the segments don&rsquo;t
have to be the same length. You might also notice that the segmented
vector looks a lot like a vector of vectors of numbers. Often,
languages that support nested vectors implement them as segmented
vectors. Representing nested vectors as segmented vectors is
essentially the idea behind flattening, which is used heavily in the
implementations of <a href="http://www.cs.cmu.edu/~scandal/nesl.html">NESL</a> and <a href="http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell">Data Parallel Haskell</a>, for example.</p>

<p>Hardware rarely supports segmented vectors directly, so they are
instead represented as a pair of a data vector and a segment
descriptor. There are several possibilities for segment descriptors,
and different representations have different performance
characteristics depending on the hardware. One representation for the
segment descriptor is the flag vector, which we see below.</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Xdata = [1 2 3 4 5 6 7 8]
</span><span class='line'>Xseg  = [1 0 0 1 0 1 0 0]</span></code></pre></td></tr></table></div></figure>


<p>This example is the same as <code>Xs</code> from above. The segment descriptor,
<code>Xseg</code>, includes a 1 in each element that starts a new segment and a 0
otherwise. This representation can simplify the implementation of some
operations, such as <em>scan</em>, but this representation also cannot
describe 0-length segments.</p>
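<p>To see why flags suit scan, here is a sequential Python sketch of a
segmented exclusive scan. The function name is made up for this
example. The only change from an ordinary scan is that the running
total resets wherever a flag is set.</p>

```python
def segmented_exclusive_scan(op, data, flags, identity):
    """Exclusive scan that restarts wherever flags[i] == 1."""
    out, acc = [], identity
    for x, starts_segment in zip(data, flags):
        if starts_segment:
            acc = identity   # a new segment begins: forget the running total
        out.append(acc)
        acc = op(acc, x)
    return out


xdata = [1, 2, 3, 4, 5, 6, 7, 8]
xseg = [1, 0, 0, 1, 0, 1, 0, 0]
print(segmented_exclusive_scan(lambda a, b: a + b, xdata, xseg, 0))
# [0, 1, 3, 0, 4, 0, 6, 13]
```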

<p>Another option is to store the length of each segment in the segment
descriptor:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Xdata = [1 2 3 4 5 6 7 8]
</span><span class='line'>Xseg  = [3 2 3]</span></code></pre></td></tr></table></div></figure>


<p>This representation makes it easy to tell how many segments there are,
and how long each one is, but it makes accessing individual elements
more difficult because it&rsquo;s not obvious where a segment
starts. Instead, we could have the segment descriptor store the
indexes of the beginning of each segment:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Xdata = [1 2 3 4 5 6 7 8]
</span><span class='line'>Xseg  = [0 3 5]</span></code></pre></td></tr></table></div></figure>


<p>As one final option (I&rsquo;m sure there are many more), we&rsquo;ll consider a
case where each element in the segment descriptor indicates the
segment that the data lies in:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Xdata = [1 2 3 4 5 6 7 8]
</span><span class='line'>Xseg  = [0 0 0 1 1 2 2 2]</span></code></pre></td></tr></table></div></figure>


<p>Each of these representations has different tradeoffs. Some take more
space, others enable more efficient implementation of certain
operations. Which operations are important will help inform the best
choice of representation. Fortunately, many of these can be converted to
another form with a constant number of operations. You might notice
some similarities between these representations and various
<a href="http://en.wikipedia.org/wiki/Sparse_matrix#Storing_a_sparse_matrix">sparse matrix formats</a>.</p>
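<p>As a sketch of what those conversions can look like, here are three
of them in Python. The function names are hypothetical and the loops
are sequential, but each one is essentially a scan or a scatter, which
is why a data parallel system can do them in a constant number of
operations.</p>

```python
def lengths_to_starts(lengths):
    """Exclusive +-scan of the lengths: [3, 2, 3] becomes [0, 3, 5]."""
    starts, offset = [], 0
    for n in lengths:
        starts.append(offset)
        offset += n
    return starts


def starts_to_flags(starts, total):
    """Put a 1 at each segment start. Empty segments share a start index,
    so they collapse here, which is the flag vector's limitation."""
    flags = [0] * total
    for s in starts:
        if s != total:   # a trailing empty segment starts past the data
            flags[s] = 1
    return flags


def flags_to_segment_ids(flags):
    """Inclusive +-scan of the flags, minus one, numbers the segments."""
    ids, count = [], 0
    for f in flags:
        count += f
        ids.append(count - 1)
    return ids


starts = lengths_to_starts([3, 2, 3])
flags = starts_to_flags(starts, 8)
print(starts)                        # [0, 3, 5]
print(flags)                         # [1, 0, 0, 1, 0, 1, 0, 0]
print(flags_to_segment_ids(flags))   # [0, 0, 0, 1, 1, 2, 2, 2]
```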

<p>Finally, we consider the array type. This is similar to a segmented
vector in which each segment has the same size. Below is an example.</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>    [1 2 3]
</span><span class='line'>A = [4 5 6]
</span><span class='line'>    [7 8 9]</span></code></pre></td></tr></table></div></figure>


<p>We can certainly represent this exactly as we would a segmented
vector, but we can do many operations more efficiently if we instead
store the array as a data vector and a list of dimensions. For example:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Adata = [1 2 3 4 5 6 7 8 9]
</span><span class='line'>Adim  = [3 3]</span></code></pre></td></tr></table></div></figure>
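<p>One reason this layout is efficient is that finding an element is
simple arithmetic on the dimensions, with no segment descriptor to
consult. The Python sketch below assumes row-major order, which matches
how <code>Adata</code> is laid out above. The helper name <code>flat_index</code> is my own.</p>

```python
def flat_index(dims, idx):
    """Row-major offset of a multi-dimensional index into the data vector."""
    offset = 0
    for size, i in zip(dims, idx):
        if i not in range(size):
            raise IndexError(f"index {i} out of range for dimension of size {size}")
        offset = offset * size + i
    return offset


adata = [1, 2, 3, 4, 5, 6, 7, 8, 9]
adim = [3, 3]
print(adata[flat_index(adim, (1, 2))])   # 6, i.e. row 1, column 2
print(adata[flat_index(adim, (2, 0))])   # 7
```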


<p>These structures provide a solid basis for data parallel
computing. Though there are many possible representations, one
commonality is that the actual data elements are stored in a
contiguous block of memory. This feature means that many data parallel
operators, such as those we&rsquo;ll discuss in an upcoming post, can be
implemented in a way that maps efficiently onto data parallel
processors like GPUs.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Beware the Logarithms]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/04/02/beware-the-logarithms/"/>
<updated>2013-04-02T14:24:00-06:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/04/02/beware-the-logarithms</id>

      <content type="html"><![CDATA[<p>Logarithms are great. They let you talk about incredibly wide ranges
of numbers, and they transform multiplication into
addition. Algorithms with logarithmic running times are so fast they
might as well be constant time algorithms. Logarithmic scales on
graphs can also make your results look much better. Let&rsquo;s see how.</p>

<!-- MORE -->


<p>On a recent paper submission, I included a graph comparing Harlan to
CUBLAS on a trivial vector addition benchmark. Below is a recreation
of that graph.</p>

<div id="add-vec-linear"></div>


<p>The green dots represent the execution time for the Harlan program,
while the blue dots indicate the same for the CUBLAS version. This
graph clearly shows that the Harlan version takes about twice as long
as the CUBLAS version. On the one hand, being slower than a highly
tuned linear algebra library by only a factor of two isn&rsquo;t bad. On the
other hand, this graph doesn&rsquo;t cast Harlan in a great light.</p>

<p>Let&rsquo;s see what happens if we put the Y axis on a logarithmic scale.</p>

<div id="add-vec-log"></div>


<p>This looks much better! Here it looks like we only have an additive
factor overhead, and there&rsquo;s really not that much room between the
Harlan and CUBLAS points. We also get the nice logarithmic shape,
which makes it look like the time doesn&rsquo;t get that much worse as the
vectors get larger.</p>

<p>As a third option, we can use a logarithmic scale on the X axis as well.</p>

<div id="add-vec-loglog"></div>


<p>Now we see two lines, which suggests the execution time for each
program increases linearly in the size of the vector, which is
true. But now, the lines are essentially parallel, again with only an
additive factor between the two. As we move to the right, the lines
seem to converge, suggesting that for extremely large vectors, Harlan
might even outperform CUBLAS. Not bad!</p>

<p>The point of all this is that I have presented the exact same data in
all three cases, yet the choice of scales leads us visually to very
different conclusions. In this case, I believe the linear-linear plot
is the most honest. For other data sets, one of the other plots may be
more appropriate. As an author, it&rsquo;s important to choose visual
presentations of data that faithfully highlight the key features of
the data. Likewise, as a reader, it&rsquo;s important to pay attention to
the scales and understand what the shape of a graph means in light of
those scales.</p>
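<p>The arithmetic behind the effect is short. On a logarithmic axis the
vertical distance between two points is the logarithm of their ratio,
so a multiplicative slowdown shows up as a small constant offset. The
Python sketch below checks this on four rows taken from the data used
for the graphs in this post. Which rows to sample was an arbitrary
choice on my part.</p>

```python
import math

# (vector size, Harlan time, CUBLAS time), same units as the plotted data
samples = [
    (1000000, 23077, 13606),
    (10000000, 159353, 81697),
    (50000000, 589661, 349890),
    (89000000, 1092646, 622260),
]

for size, harlan_t, cublas_t in samples:
    ratio = harlan_t / cublas_t
    gap = math.log10(harlan_t) - math.log10(cublas_t)   # equals log10(ratio)
    print(f"{size:>9}  ratio {ratio:.2f}  log-scale gap {gap:.2f}")
```

<p>For these rows the ratio stays between roughly 1.7 and 2, yet the gap
on the log scale is only about a quarter of a decade, which is why the
two series look so close together.</p>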

<script>
var harlan = [
[1000000 ,  23077  ],
[2000000 ,  48684  ],
[3000000 ,  65515  ],
[4000000 ,  78188  ],
[5000000 ,  105106 ],
[6000000 ,  108419 ],
[7000000 ,  125485 ],
[8000000 ,  106143 ],
[9000000 ,  144744 ],
[10000000,  159353 ],
[11000000,  161706 ],
[12000000,  170757 ],
[13000000,  179616 ],
[14000000,  187397 ],
[15000000,  196132 ],
[16000000,  206307 ],
[17000000,  275525 ],
[18000000,  285997 ],
[19000000,  296748 ],
[20000000,  302022 ],
[21000000,  308845 ],
[22000000,  316751 ],
[23000000,  326517 ],
[24000000,  334805 ],
[25000000,  342043 ],
[26000000,  350458 ],
[27000000,  360741 ],
[28000000,  368774 ],
[29000000,  376220 ],
[30000000,  384946 ],
[31000000,  391921 ],
[32000000,  400844 ],
[33000000,  408938 ],
[34000000,  542333 ],
[35000000,  548890 ],
[36000000,  555630 ],
[37000000,  562858 ],
[38000000,  577735 ],
[39000000,  581910 ],
[40000000,  590120 ],
[41000000,  601395 ],
[42000000,  605323 ],
[43000000,  630901 ],
[44000000,  634690 ],
[45000000,  632842 ],
[46000000,  640262 ],
[47000000,  648560 ],
[48000000,  635637 ],
[49000000,  569287 ],
[50000000,  589661 ],
[51000000,  611440 ],
[52000000,  595433 ],
[53000000,  595200 ],
[54000000,  602066 ],
[55000000,  613101 ],
[56000000,  725412 ],
[57000000,  734666 ],
[58000000,  740548 ],
[59000000,  750494 ],
[60000000,  757265 ],
[61000000,  768061 ],
[62000000,  775407 ],
[63000000,  785751 ],
[64000000,  791515 ],
[65000000,  802256 ],
[66000000,  808854 ],
[67000000,  818214 ],
[68000000,  1064549],
[69000000,  1076275],
[70000000,  1084976],
[71000000,  951381 ],
[72000000,  951877 ],
[73000000,  1105713],
[74000000,  1125657],
[75000000,  1007359],
[76000000,  986703 ],
[77000000,  1058398],
[78000000,  987392 ],
[79000000,  1021013],
[80000000,  1009507],
[81000000,  1008894],
[82000000,  1020269],
[83000000,  1025176],
[84000000,  1053568],
[85000000,  1069070],
[86000000,  1044142],
[87000000,  1086830],
[88000000,  1103057],
[89000000,  1092646]
];

var cublas = [
[ 1000000,  13606 ],
[ 2000000,  20028 ],
[ 3000000,  26616 ],
[ 4000000,  33674 ],
[ 5000000,  44516 ],
[ 6000000,  47676 ],
[ 7000000,  58095 ],
[ 8000000,  68026 ],
[ 9000000,  79605 ],
[10000000,  81697 ],
[11000000,  86043 ],
[12000000,  91144 ],
[13000000,  96901 ],
[14000000,  101678],
[15000000,  113157],
[16000000,  115044],
[17000000,  122450],
[18000000,  137815],
[19000000,  136512],
[20000000,  144248],
[21000000,  169558],
[22000000,  166937],
[23000000,  163610],
[24000000,  170082],
[25000000,  176403],
[26000000,  199547],
[27000000,  191796],
[28000000,  221804],
[29000000,  204675],
[30000000,  211186],
[31000000,  246638],
[32000000,  223958],
[33000000,  239349],
[34000000,  252776],
[35000000,  268027],
[36000000,  253459],
[37000000,  260748],
[38000000,  268572],
[39000000,  294740],
[40000000,  288621],
[41000000,  318650],
[42000000,  305181],
[43000000,  307327],
[44000000,  321227],
[45000000,  315209],
[46000000,  321323],
[47000000,  335639],
[48000000,  335366],
[49000000,  386889],
[50000000,  349890],
[51000000,  364246],
[52000000,  374258],
[53000000,  371455],
[54000000,  408985],
[55000000,  390835],
[56000000,  391368],
[57000000,  397232],
[58000000,  403077],
[59000000,  411199],
[60000000,  422391],
[61000000,  438317],
[62000000,  430081],
[63000000,  446064],
[64000000,  448848],
[65000000,  451843],
[66000000,  470267],
[67000000,  631753],
[68000000,  473640],
[69000000,  526125],
[70000000,  487006],
[71000000,  499891],
[72000000,  501403],
[73000000,  506017],
[74000000,  513806],
[75000000,  532705],
[76000000,  552288],
[77000000,  581489],
[78000000,  540496],
[79000000,  548331],
[80000000,  556008],
[81000000,  570752],
[82000000,  588361],
[83000000,  581793],
[84000000,  604275],
[85000000,  637841],
[86000000,  598621],
[87000000,  607459],
[88000000,  635888],
[89000000,  622260]
];

function make_xScale(width, border, min, max) {
    return d3.scale.linear()
        .domain([min, max])
        .range([0 + border, width - 5]);
}

function make_log_xScale(width, border, min, max) {
    return d3.scale.log()
        .domain([min, max])
        .range([0 + border, width - 5]);
}

function make_yScale(height, border, min, max) {
    return d3.scale.linear()
        .domain([min, max])
        .range([height - border, 5]);
}

function make_log_yScale(height, border, min, max) {
    return d3.scale.log()
        .domain([min, max])
        .range([height - border, 5]);
}

function make_graph(chart, xlog, ylog) {
    var border = 60;
    var width = 600;
    var height = 400;

    chart = d3.select(chart);

    var legend = chart.append("ul").attr("class", "legend")
        .attr("style", "position: relative; top: 0; left: 85px;");
    legend.append("li").attr("style", "color: green")
        .append("span").text("Harlan");
    legend.append("li").attr("style", "color: blue")
        .append("span").text("CUBLAS");    

    var svg = chart.append("svg")
        .attr("style", "position: relative; top: -85px; left: 0;")
        .attr("width", width)
        .attr("height", height);
    
    var harlan_min_x = d3.min(harlan, function(d) { return d[0]; });
    var harlan_max_x = d3.max(harlan, function(d) { return d[0]; });
    var harlan_min_y = d3.min(harlan, function(d) { return (d[1]) * 1e-6; });
    var harlan_max_y = d3.max(harlan, function(d) { return (d[1]) * 1e-6; });

    var cublas_min_x = d3.min(cublas, function(d) { return d[0]; });
    var cublas_max_x = d3.max(cublas, function(d) { return d[0]; });
    var cublas_min_y = d3.min(cublas, function(d) { return (d[1]) * 1e-6; });
    var cublas_max_y = d3.max(cublas, function(d) { return (d[1]) * 1e-6; });

    var min_x = Math.min(harlan_min_x, cublas_min_x);
    var max_x = Math.max(harlan_max_x, cublas_max_x);
    var min_y = Math.min(harlan_min_y, cublas_min_y);
    var max_y = Math.max(harlan_max_y, cublas_max_y);

    var xScale;
    if(xlog) {
        xScale = make_log_xScale(width, border, min_x, max_x);
    }
    else {
        xScale = make_xScale(width, border, min_x, max_x);
    }

    var yScale;
    if(ylog) {
        yScale = make_log_yScale(height, border, min_y, max_y);
    }
    else {
        yScale = make_yScale(height, border, min_y, max_y);
    }

    svg.append("g")
        .attr("class", "axis")
        .attr("transform", "translate(0, " + (height - border) + ")")
        .call(d3.svg.axis().scale(xScale).orient("bottom"));

    svg.append("g")
        .attr("class", "axis")
        .attr("transform", "translate(" + border + ", 0)")
        .call(d3.svg.axis().scale(yScale).orient("left"));

    svg.append("g").selectAll("circle")
        .data(cublas).enter().append("circle")
        .attr("cx", function(d) { return xScale(d[0]); })
        .attr("cy", function(d) { return yScale((d[1]) * 1e-6); })
        .attr("r", 2)
        .attr("fill", "blue");

    svg.append("g").selectAll("circle")
        .data(harlan).enter().append("circle")
        .attr("cx", function(d) { return xScale(d[0]); })
        .attr("cy", function(d) { return yScale((d[1]) * 1e-6); })
        .attr("r", 2)
        .attr("fill", "green");
        
    svg.append("text").text("Vector Size")
        .attr("text-anchor", "center")
        .attr("x", width / 2)
        .attr("y", height - 12);
    svg.append("text").text("Execution Time (s)")
        .attr("text-anchor", "center")
        .attr("transform", "translate(" + 15 + ", " + (height / 2 + border) + ") rotate(270)");
}

make_graph("#add-vec-linear", false, false);
make_graph("#add-vec-log", false, true);
make_graph("#add-vec-loglog", true, true);

</script>

]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Patterns with Ellipses]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/02/12/patterns-with-ellipses/"/>
<updated>2013-02-12T15:11:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/02/12/patterns-with-ellipses</id>

      <content type="html"><![CDATA[<p><a href="http://blog.theincredibleholk.org/blog/2013/02/11/matching-patterns-with-scheme/">Last time</a>,
we talked about matching patterns in Scheme. Now we will look at how
to extend the pattern matcher and template instantiation code to
handle patterns with ellipses.</p>

<!-- MORE -->


<p>To start with, let&rsquo;s look at an example macro. Imagine we wanted to
take the <code>or2</code> macro from last time, but we want to extend it to
support any number of arguments. Certainly, writing <code>(or2 a (or2 b
c))</code> is a bit of a pain. It&rsquo;d be much better to simply write <code>(or a b
c)</code>. We can do this by using an ellipsis in our patterns, as seen in
this macro, called <code>my-or</code> to avoid clashing with the built-in <code>or</code>:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define-syntax </span><span class="nv">my-or</span>
</span><span class='line'>  <span class="p">(</span><span class="k">syntax-rules </span><span class="p">()</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">e</span><span class="p">)</span> <span class="nv">e</span><span class="p">)</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">e</span> <span class="nv">e*</span> <span class="o">...</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="nv">e</span><span class="p">))</span>
</span><span class='line'>       <span class="p">(</span><span class="k">if </span><span class="nv">t</span> <span class="nv">t</span> <span class="p">(</span><span class="nf">my-or</span> <span class="nv">e*</span> <span class="o">...</span><span class="p">))))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This macro uses two patterns. The first matches <code>my-or</code> with exactly
one argument. The second matches <code>my-or</code> with one argument followed by any
number of remaining arguments. The second pattern expands into a
recursive use of <code>my-or</code>, but with one fewer argument
each time, until we finally hit the base case. We might trace the
expansion like this, writing <code>t.1</code> and <code>t.2</code> for the distinct <code>t</code> introduced by each expansion step:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="nf">my-or</span> <span class="nv">a</span> <span class="nv">b</span> <span class="nv">c</span><span class="p">)</span>
</span><span class='line'>
</span><span class='line'><span class="k">=&gt; </span>
</span><span class='line'>
</span><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">a</span><span class="p">))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="p">(</span><span class="nf">my-or</span> <span class="nv">b</span> <span class="nv">c</span><span class="p">)))</span>
</span><span class='line'>
</span><span class='line'><span class="nv">=&gt;</span>
</span><span class='line'>
</span><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">a</span><span class="p">))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">t</span><span class="o">.</span><span class="mi">1</span>
</span><span class='line'>      <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">2</span> <span class="nv">b</span><span class="p">))</span>
</span><span class='line'>        <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">2</span> <span class="nv">t</span><span class="o">.</span><span class="mi">2</span> <span class="p">(</span><span class="nf">my-or</span> <span class="nv">c</span><span class="p">)))))</span>
</span><span class='line'>
</span><span class='line'><span class="nv">=&gt;</span>
</span><span class='line'>
</span><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">a</span><span class="p">))</span>
</span><span class='line'>  <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">1</span> <span class="nv">t</span><span class="o">.</span><span class="mi">1</span>
</span><span class='line'>      <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span><span class="o">.</span><span class="mi">2</span> <span class="nv">b</span><span class="p">))</span>
</span><span class='line'>        <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">2</span> <span class="nv">t</span><span class="o">.</span><span class="mi">2</span> <span class="nv">c</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>
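
<p>As a quick sanity check, here is a minimal sketch that repeats the <code>my-or</code> definition, so it can be pasted into a REPL by itself, and then tries it on a few inputs. The inputs and the names <code>r1</code>, <code>r2</code> and <code>r3</code> are made up for illustration. The last one binds a user variable that happens to be called <code>t</code>, which is the situation the renamed <code>t.1</code> and <code>t.2</code> in the trace are meant to depict.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; Same definition as above, repeated so this snippet is self-contained.
(define-syntax my-or
  (syntax-rules ()
    ((_ e) e)
    ((_ e e* ...)
     (let ((t e))
       (if t t (my-or e* ...))))))

;; Illustrative inputs (not from the original post).
(define r1 (my-or #f #f 3))             ; falls through to the last argument
(define r2 (my-or #f 'first 'second))   ; stops at the first true value
(define r3 (let ((t 5)) (my-or #f t)))  ; the user's t is not captured
</code></pre></div></figure>

<p>With a hygienic <code>syntax-rules</code>, <code>r1</code> is <code>3</code>, <code>r2</code> is <code>first</code>, and <code>r3</code> is <code>5</code>.</p>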


<p>Unfortunately, pattern matching and template instantiation become much
trickier in the presence of ellipses. As we will see, however, when
all is said and done we&rsquo;ll have added just one more <code>cond</code> clause to
both the matcher and the instantiation function.</p>

<p>The basic idea is that we want to keep everything more or less the
same, but when we encounter a <code>...</code>, we want to recursively match or
instantiate a pattern as many times as we can and then pick up where
we left off. In my mind, the most complicated part is choosing how to
represent the environment that the matcher hands to the instantiation
code. Last time, we represented the environment as an association
list. It&rsquo;s tempting to simply pair each pattern variable name
with a list of the things it matched instead of a single value, but
this doesn&rsquo;t preserve quite enough information. Suppose we
matched <code>(my-or (+ 1 2) a b c)</code> against <code>(_ e e* ...)</code>, yielding the
bindings <code>((e . (+ 1 2)) (e* . (a b c)))</code>. Nothing in these bindings
records that <code>e</code> matched a single expression while <code>e*</code> matched a
sequence of three. Now let&rsquo;s see what happens
if for some reason we used this to instantiate the template <code>'((e e*)
...)</code>. We might use a rule like &ldquo;when instantiating a <code>...</code> template,
look up every variable in the environment and zip them together.&rdquo; This
rule would lead to the following instantiation:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="o">&#39;</span><span class="p">((</span><span class="nb">+ </span><span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="mi">1</span> <span class="nv">b</span><span class="p">)</span> <span class="p">(</span><span class="mi">2</span> <span class="nv">c</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This rule wouldn&rsquo;t be too hard to implement, but it&rsquo;s also probably
not what we want. In this case, we&rsquo;ve broken apart the <code>(+ 1 2)</code>
expression into its individual components, when instead we&rsquo;d probably
like to keep this together. The semantics we&rsquo;d like instead lead to
the following result:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="o">&#39;</span><span class="p">(((</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">)</span> <span class="nv">a</span><span class="p">)</span> <span class="p">((</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span> <span class="p">((</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">)</span> <span class="nv">c</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Admittedly, this is a somewhat contrived example, but as we write more
macros it will become apparent that this is usually the behavior we
want.</p>
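
<p>To make the difference concrete, both behaviors can be reproduced with ordinary list operations. This is only a sketch: <code>flat-env</code>, <code>zipped</code> and <code>wanted</code> are names invented here, and the code hard-wires the template <code>(e e*)</code> instead of doing real template instantiation.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; The flat association-list environment from the text.
(define flat-env '((e . (+ 1 2)) (e* . (a b c))))

(define e-val  (cdr (assq 'e flat-env)))    ; (+ 1 2)
(define e*-val (cdr (assq 'e* flat-env)))   ; (a b c)

;; The naive rule: treat both values as lists and zip them.
(define zipped (map list e-val e*-val))

;; What we would like: keep e whole and pair it with each e*.
(define wanted (map (lambda (x) (list e-val x)) e*-val))
</code></pre></div></figure>

<p>Here <code>zipped</code> is <code>((+ a) (1 b) (2 c))</code> and <code>wanted</code> is <code>(((+ 1 2) a) ((+ 1 2) b) ((+ 1 2) c))</code>. The second version only works because we, not the environment, knew that <code>e</code> should be treated as a single value.</p>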

<p>In order to preserve the information we need, we&rsquo;ll represent the
environment more like a tree, which includes a <code>...</code> node that
describes pattern variables bound under a <code>...</code>. Thus, if we matched
<code>(my-or (+ 1 2) a b c)</code> against <code>(_ e e* ...)</code>, we&rsquo;d end up with the
environment <code>((e . (+ 1 2)) (... ((e* . a)) ((e* . b)) ((e*
. c))))</code>. Each <code>...</code> node in the environment holds a list of
association lists, one for each time the sub-pattern under the <code>...</code>
matched. Keeping a separate association list per match lets us keep
track of which variables were matched together.</p>

<p>We will see that building this environment in the matcher is fairly
straightforward, but more care is needed with template
instantiation. We will define a flatten operation on environments,
which takes in one environment, strips off a level of ellipses, and
returns a list of environments. We then recursively apply the template
instantiation function to each of these new environments. In
our running example, <code>((e . (+ 1 2)) (... ((e* . a)) ((e* . b)) ((e*
. c))))</code> would flatten to this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(((</span><span class="nf">e</span> <span class="o">.</span> <span class="p">(</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">))</span> <span class="p">(</span><span class="nf">e*</span> <span class="o">.</span> <span class="nv">a</span><span class="p">))</span>
</span><span class='line'> <span class="p">((</span><span class="nf">e</span> <span class="o">.</span> <span class="p">(</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">))</span> <span class="p">(</span><span class="nf">e*</span> <span class="o">.</span> <span class="nv">b</span><span class="p">))</span>
</span><span class='line'> <span class="p">((</span><span class="nf">e</span> <span class="o">.</span> <span class="p">(</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">))</span> <span class="p">(</span><span class="nf">e*</span> <span class="o">.</span> <span class="nv">c</span><span class="p">)))</span>
</span></code></pre></td></tr></table></div></figure>


<p>It&rsquo;s easy to see now how instantiating <code>'((e e*) ...)</code> just requires
mapping the instantiate function with <code>(e e*)</code> over this list of
environments.</p>
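
<p>The flatten-then-map idea can be sketched in a few lines. This is a simplified illustration, not the code used later in this post: it assumes the environment has at most one <code>...</code> node, and the names <code>plain-bindings</code>, <code>ellipsis-subenvs</code>, <code>flatten-one</code> and <code>instantiate-simple</code> are all invented here. <code>instantiate-simple</code> is a stand-in for an ellipsis-free instantiate that just looks variables up in an association list.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; Hypothetical helper: the bindings in env that are not ... nodes.
(define (plain-bindings env)
  (cond
    ((null? env) '())
    ((eq? (car (car env)) '...) (plain-bindings (cdr env)))
    (else (cons (car env) (plain-bindings (cdr env))))))

;; Hypothetical helper: the sub-environments of the first ... node,
;; or the empty list if env has no ... node.
(define (ellipsis-subenvs env)
  (cond
    ((null? env) '())
    ((eq? (car (car env)) '...) (cdr (car env)))
    (else (ellipsis-subenvs (cdr env)))))

;; Strip one level of ellipsis: one flat environment per sub-environment.
(define (flatten-one env)
  (let ((plain (plain-bindings env)))
    (map (lambda (sub) (append plain sub))
         (ellipsis-subenvs env))))

;; Stand-in for an ellipsis-free instantiate: replace bound symbols
;; by their values and copy everything else.
(define (instantiate-simple t env)
  (cond
    ((pair? t)
     (cons (instantiate-simple (car t) env)
           (instantiate-simple (cdr t) env)))
    ((assq t env) =&gt; cdr)
    (else t)))

(define env '((e . (+ 1 2)) (... ((e* . a)) ((e* . b)) ((e* . c)))))
(define flat (flatten-one env))
(define result
  (map (lambda (env^) (instantiate-simple '(e e*) env^)) flat))
</code></pre></div></figure>

<p>With these definitions, <code>flat</code> is the list of three environments shown above, and <code>result</code> is <code>(((+ 1 2) a) ((+ 1 2) b) ((+ 1 2) c))</code>.</p>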

<p>We&rsquo;re not quite done yet though. What happens if we had the pattern
<code>(_ (a ...) (b ...))</code>, and matched this against a form such as <code>(m (1 2 3) (x y))</code>?
We&rsquo;d end up with this environment:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">((</span><span class="o">...</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">1</span><span class="p">))</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">2</span><span class="p">))</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">3</span><span class="p">)))</span>
</span><span class='line'> <span class="p">(</span><span class="o">...</span> <span class="p">((</span><span class="nf">b</span> <span class="o">.</span> <span class="nv">x</span><span class="p">))</span> <span class="p">((</span><span class="nf">b</span> <span class="o">.</span> <span class="nv">y</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>If we try to flatten this, we&rsquo;ll see that we won&rsquo;t have enough <code>b</code>s to
correspond with each <code>a</code>. To avoid this problem, we will make the
flatten procedure take a template as input, and only consider <code>...</code>
nodes that contain pattern variables referenced by the template. This
way, we only expand the environment for either <code>a</code> or <code>b</code>, but not
both.</p>

<p>So with all of this out of the way, we can look at an extension of
<code>match</code>, called <code>match*</code>, that handles ellipsis patterns:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="p">(</span><span class="nf">match*</span> <span class="nv">p</span> <span class="nv">e</span> <span class="nv">sk</span> <span class="nv">fk</span><span class="p">)</span>
</span><span class='line'>  <span class="p">(</span><span class="nf">cond</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">pair? </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">pair? </span><span class="p">(</span><span class="nb">cdr </span><span class="nv">p</span><span class="p">))</span> <span class="p">(</span><span class="nb">eq? </span><span class="o">&#39;...</span> <span class="p">(</span><span class="nb">cadr </span><span class="nv">p</span><span class="p">)))</span>
</span><span class='line'>     <span class="p">(</span><span class="k">let </span><span class="nv">loop</span> <span class="p">((</span><span class="nf">e</span> <span class="nv">e</span><span class="p">)</span>
</span><span class='line'>                <span class="p">(</span><span class="nf">b</span> <span class="o">&#39;</span><span class="p">()))</span>
</span><span class='line'>       <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">null? </span><span class="nv">e</span><span class="p">)</span>
</span><span class='line'>           <span class="p">(</span><span class="nf">match*</span> <span class="p">(</span><span class="nb">cddr </span><span class="nv">p</span><span class="p">)</span> <span class="nv">e</span>
</span><span class='line'>                   <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b^</span><span class="p">)</span> <span class="p">(</span><span class="nf">sk</span> <span class="o">`</span><span class="p">((</span><span class="o">...</span> <span class="o">.</span> <span class="o">,</span><span class="nv">b</span><span class="p">)</span> <span class="o">.</span> <span class="o">,</span><span class="nv">b^</span><span class="p">)))</span>
</span><span class='line'>                   <span class="nv">fk</span><span class="p">)</span>
</span><span class='line'>           <span class="p">(</span><span class="nf">match*</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">car </span><span class="nv">e</span><span class="p">)</span>
</span><span class='line'>                   <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b^</span><span class="p">)</span>
</span><span class='line'>                     <span class="p">(</span><span class="nf">loop</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">e</span><span class="p">)</span> <span class="p">(</span><span class="nf">snoc</span> <span class="nv">b</span> <span class="nv">b^</span><span class="p">)))</span>
</span><span class='line'>                   <span class="p">(</span><span class="k">lambda </span><span class="p">()</span>
</span><span class='line'>                     <span class="p">(</span><span class="nf">match*</span> <span class="p">(</span><span class="nb">cddr </span><span class="nv">p</span><span class="p">)</span> <span class="nv">e</span>
</span><span class='line'>                             <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b^</span><span class="p">)</span> <span class="p">(</span><span class="nf">sk</span> <span class="o">`</span><span class="p">((</span><span class="o">...</span> <span class="o">.</span> <span class="o">,</span><span class="nv">b</span><span class="p">)</span> <span class="o">.</span> <span class="o">,</span><span class="nv">b^</span><span class="p">)))</span>
</span><span class='line'>                             <span class="nv">fk</span><span class="p">))))))</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">pair? </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">pair? </span><span class="nv">e</span><span class="p">))</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">match*</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">car </span><span class="nv">e</span><span class="p">)</span>
</span><span class='line'>             <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b</span><span class="p">)</span>
</span><span class='line'>               <span class="p">(</span><span class="nf">match*</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">e</span><span class="p">)</span>
</span><span class='line'>                       <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b^</span><span class="p">)</span> <span class="p">(</span><span class="nf">sk</span> <span class="p">(</span><span class="nb">append </span><span class="nv">b</span> <span class="nv">b^</span><span class="p">)))</span>
</span><span class='line'>                       <span class="nv">fk</span><span class="p">))</span>
</span><span class='line'>             <span class="nv">fk</span><span class="p">))</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">eq? </span><span class="nv">p</span> <span class="ss">&#39;_</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">sk</span> <span class="o">&#39;</span><span class="p">()))</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">symbol? </span><span class="nv">p</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">sk</span> <span class="p">(</span><span class="nb">list </span><span class="p">(</span><span class="nb">cons </span><span class="nv">p</span> <span class="nv">e</span><span class="p">))))</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">null? </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">null? </span><span class="nv">e</span><span class="p">))</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">sk</span> <span class="o">&#39;</span><span class="p">()))</span>
</span><span class='line'>    <span class="p">(</span><span class="k">else </span><span class="p">(</span><span class="nf">fk</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This is almost identical to <code>match</code> from before, but we&rsquo;ve added lines
3 through 16. These lines first check if the pattern is a <code>...</code> pattern, and
if so, we go into a loop where we try to match the pattern as many
times as possible. As we do this, we build up a list of environments
from each successful match. Once we run out of input or get a failure, instead of calling
the failure continuation we were given, we just continue on matching
the rest of the pattern. We can try this out on a pattern shaped like <code>let</code>
as follows:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">match*</span> <span class="o">&#39;</span><span class="p">(</span><span class="nv">_</span> <span class="p">((</span><span class="nf">x</span> <span class="nv">e</span><span class="p">)</span> <span class="o">...</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span> <span class="o">&#39;</span><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">x</span> <span class="mi">5</span><span class="p">)</span> <span class="p">(</span><span class="nf">y</span> <span class="mi">6</span><span class="p">))</span> <span class="p">(</span><span class="nb">+ </span><span class="nv">x</span> <span class="nv">y</span><span class="p">))</span> <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span> <span class="p">(</span><span class="k">lambda </span><span class="p">()</span> <span class="no">#f</span><span class="p">))</span>
</span><span class='line'><span class="p">((</span><span class="o">...</span> <span class="p">((</span><span class="nf">x</span> <span class="o">.</span> <span class="nv">x</span><span class="p">)</span> <span class="p">(</span><span class="nf">e</span> <span class="o">.</span> <span class="mi">5</span><span class="p">))</span> <span class="p">((</span><span class="nf">x</span> <span class="o">.</span> <span class="nv">y</span><span class="p">)</span> <span class="p">(</span><span class="nf">e</span> <span class="o">.</span> <span class="mi">6</span><span class="p">)))</span> <span class="p">(</span><span class="nf">b</span> <span class="nv">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Based on our previous discussion of environments, this is what we wanted.</p>
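
<p><code>match*</code> calls a helper named <code>snoc</code> that is not defined in the code shown here. Judging from how it is used (each new sub-environment <code>b^</code> is added to the accumulated list <code>b</code>, and the results come out in source order), it presumably adds a single element to the end of a list. A minimal sketch under that assumption:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; Assumed behavior: (snoc ls x) returns a new list with x added at the end.
(define (snoc ls x)
  (append ls (list x)))

;; Accumulating the two sub-environments from the let example, in order.
(define envs
  (snoc (snoc '() '((x . x) (e . 5)))
        '((x . y) (e . 6))))
</code></pre></div></figure>

<p>Here <code>envs</code> is <code>(((x . x) (e . 5)) ((x . y) (e . 6)))</code>, the same order as in the output above. Appending at the end copies the list each time, so the loop is quadratic in the number of matches, which should not matter for macro-sized inputs.</p>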

<p>Now let&rsquo;s look at how we combine a pattern with a template. We&rsquo;ll
start with a helper, called <code>extract-...</code>, which pulls out the
relevant ellipsis bindings from the environment:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="p">(</span><span class="nf">extract-</span><span class="o">...</span> <span class="nv">p</span> <span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>  <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">null? </span><span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>      <span class="o">&#39;</span><span class="p">()</span>
</span><span class='line'>      <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">rest</span> <span class="p">(</span><span class="nf">extract-</span><span class="o">...</span> <span class="nv">p</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">bindings</span><span class="p">)))</span>
</span><span class='line'>            <span class="p">(</span><span class="nf">b</span> <span class="p">(</span><span class="nb">car </span><span class="nv">bindings</span><span class="p">)))</span>
</span><span class='line'>        <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="k">and </span><span class="p">(</span><span class="nb">eq? </span><span class="p">(</span><span class="nb">car </span><span class="nv">b</span><span class="p">)</span> <span class="o">&#39;...</span><span class="p">)</span> <span class="p">(</span><span class="nb">not </span><span class="p">(</span><span class="nb">null? </span><span class="p">(</span><span class="nb">cdr </span><span class="nv">b</span><span class="p">))))</span>
</span><span class='line'>            <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">names</span> <span class="p">(</span><span class="nb">map </span><span class="nv">car</span> <span class="p">(</span><span class="nb">cadr </span><span class="nv">b</span><span class="p">))))</span>
</span><span class='line'>              <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nf">ormap</span> <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">x</span><span class="p">)</span> <span class="p">(</span><span class="nf">mem*</span> <span class="nv">x</span> <span class="nv">p</span><span class="p">))</span> <span class="nv">names</span><span class="p">)</span>
</span><span class='line'>                  <span class="p">(</span><span class="nb">cons </span><span class="p">(</span><span class="nb">cdr </span><span class="nv">b</span><span class="p">)</span> <span class="nv">rest</span><span class="p">)</span>
</span><span class='line'>                  <span class="nv">rest</span><span class="p">))</span>
</span><span class='line'>            <span class="nv">rest</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This function works by walking over the list of bindings. When we
encounter a <code>...</code> node, we check if any of the variables bound in that
node (taken from its first sub-environment) are referenced in the
pattern <code>p</code>, which here is the template being instantiated (this is what the
<code>(ormap (lambda (x) (mem* x p)) names)</code> test does). If any variables
are relevant, we include this node&rsquo;s list of sub-environments in the list that is
returned.</p>
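
<p><code>ormap</code> is provided by some Schemes (Chez Scheme and Racket, for example) but is not part of R5RS. If yours lacks it, a single-list version is enough for <code>extract-...</code>. The definition below is a sketch for that use only. The built-in versions also accept several lists, which this one does not.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; Single-list stand-in for ormap: returns the first true result of
;; (f x) for x in ls, or #f if there is none.
(define (ormap f ls)
  (and (pair? ls)
       (or (f (car ls))
           (ormap f (cdr ls)))))

(define some-symbol (ormap symbol? '(1 2 x)))   ; an element satisfies f
(define no-symbol   (ormap symbol? '(1 2 3)))   ; no element satisfies f
(define empty-case  (ormap symbol? '()))        ; nothing to test
</code></pre></div></figure>

<p>Here <code>some-symbol</code> is <code>#t</code>, while <code>no-symbol</code> and <code>empty-case</code> are both <code>#f</code>.</p>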

<p>Doing this relies on another helper, <code>mem*</code>, which determines whether
a variable is relevant to a pattern. As its name suggests, this was
originally a blind tree walk that checked if the symbol occurs
anywhere in the pattern. Sadly, it&rsquo;s not that simple. Suppose we had
the environment <code>((... ((a . 1)) ((a . 2)) ((a . 3))) (... ((b . x))
((b . y))))</code> and we were instantiating the pattern <code>((a b ...)
...)</code>. Since there are two levels of <code>...</code>, we only want to consider
<code>a</code> on the first level, and then when we get to the second <code>...</code>, we
need to consider only <code>b</code>. This means we need to cut off the search when
we see a sub-pattern followed by an ellipsis. That&rsquo;s what lines 3 and 4 do in
this code:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="p">(</span><span class="nf">mem*</span> <span class="nv">x</span> <span class="nv">ls</span><span class="p">)</span>
</span><span class='line'>  <span class="p">(</span><span class="nf">cond</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">pair? </span><span class="nv">ls</span><span class="p">)</span> <span class="p">(</span><span class="nb">pair? </span><span class="p">(</span><span class="nb">cdr </span><span class="nv">ls</span><span class="p">))</span> <span class="p">(</span><span class="nb">eq? </span><span class="p">(</span><span class="nb">cadr </span><span class="nv">ls</span><span class="p">)</span> <span class="o">&#39;...</span><span class="p">))</span>
</span><span class='line'>     <span class="no">#f</span><span class="p">)</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">pair? </span><span class="nv">ls</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">or </span><span class="p">(</span><span class="nf">mem*</span> <span class="nv">x</span> <span class="p">(</span><span class="nb">car </span><span class="nv">ls</span><span class="p">))</span> <span class="p">(</span><span class="nf">mem*</span> <span class="nv">x</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">ls</span><span class="p">))))</span>
</span><span class='line'>    <span class="p">(</span><span class="k">else </span><span class="p">(</span><span class="nb">eq? </span><span class="nv">x</span> <span class="nv">ls</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>
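
<p>To see the cutoff in action, here is a small sketch that repeats the <code>mem*</code> definition, so the snippet runs by itself, and applies it to the sub-templates from the example. The names <code>outer</code>, <code>a-outer</code>, <code>b-outer</code> and <code>b-inner</code> are invented for illustration.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; Same definition as above, repeated so this snippet is self-contained.
(define (mem* x ls)
  (cond
    ((and (pair? ls) (pair? (cdr ls)) (eq? (cadr ls) '...))
     #f)
    ((pair? ls)
     (or (mem* x (car ls)) (mem* x (cdr ls))))
    (else (eq? x ls))))

;; The sub-template under the outer ... in ((a b ...) ...).
(define outer '(a b ...))

(define a-outer (mem* 'a outer))  ; a is visible at this level
(define b-outer (mem* 'b outer))  ; b is hidden behind its own ...
(define b-inner (mem* 'b 'b))     ; at the inner ..., the sub-template is just b
</code></pre></div></figure>

<p>Here <code>a-outer</code> is <code>#t</code>, <code>b-outer</code> is <code>#f</code>, and <code>b-inner</code> is <code>#t</code>, so the outer <code>...</code> selects only the node that binds <code>a</code> and the inner one selects only the node that binds <code>b</code>.</p>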


<p>With our helper functions out of the way, we can look at the extended
version of <code>instantiate</code>:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="p">(</span><span class="nf">instantiate*</span> <span class="nv">p</span> <span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>  <span class="p">(</span><span class="nf">cond</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">pair? </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">pair? </span><span class="p">(</span><span class="nb">cdr </span><span class="nv">p</span><span class="p">))</span> <span class="p">(</span><span class="nb">eq? </span><span class="o">&#39;...</span> <span class="p">(</span><span class="nb">cadr </span><span class="nv">p</span><span class="p">)))</span>
</span><span class='line'>     <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">bindings</span><span class="o">...</span> <span class="p">(</span><span class="nf">extract-</span><span class="o">...</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)))</span>
</span><span class='line'>       <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">null? </span><span class="nv">bindings</span><span class="o">...</span><span class="p">)</span>
</span><span class='line'>           <span class="p">(</span><span class="nf">instantiate*</span> <span class="p">(</span><span class="nb">cddr </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>           <span class="p">(</span><span class="nf">append</span>
</span><span class='line'>            <span class="p">(</span><span class="nb">apply </span><span class="nv">map</span> <span class="p">(</span><span class="nb">cons </span><span class="p">(</span><span class="k">lambda </span><span class="nv">b*</span>
</span><span class='line'>                               <span class="p">(</span><span class="nf">instantiate*</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span>
</span><span class='line'>                                             <span class="p">(</span><span class="nb">append </span><span class="p">(</span><span class="nb">apply </span><span class="nv">append</span> <span class="nv">b*</span><span class="p">)</span>
</span><span class='line'>                                                     <span class="nv">bindings</span><span class="p">)))</span>
</span><span class='line'>                             <span class="nv">bindings</span><span class="o">...</span><span class="p">))</span>
</span><span class='line'>            <span class="p">(</span><span class="nf">instantiate*</span> <span class="p">(</span><span class="nb">cddr </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)))))</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">pair? </span><span class="nv">p</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="nb">cons </span><span class="p">(</span><span class="nf">instantiate*</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>           <span class="p">(</span><span class="nf">instantiate*</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)))</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">assq </span><span class="nv">p</span> <span class="nv">bindings</span><span class="p">)</span> <span class="k">=&gt; </span><span class="nv">cdr</span><span class="p">)</span>
</span><span class='line'>    <span class="p">(</span><span class="k">else </span><span class="nv">p</span><span class="p">)))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Once again, this includes the same code as before, extended with lines
3&ndash;13 to handle ellipses. This clause works by finding the relevant
sub-environments. If it finds none, we skip the ellipsis sub-template
and continue with the rest of the template. Otherwise, we append each of the
sub-environments to the full environment and use each of these new
environments to instantiate the sub-template.</p>
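
<p>The <code>apply map</code> step can be hard to read at first. As a rough sketch, assuming <code>bindings...</code> is a list of groups where each group is a list of small environments (the <code>groups</code> data and the name <code>merged</code> below are made up for illustration, not actual output from <code>extract-...</code>), the idiom walks the groups in lockstep and merges one environment from each:</p>

```scheme
;; Hypothetical data: two ellipsis groups of equal length, each a list of
;; small environments (association lists).
(define groups
  '((((x . x)) ((x . y)))
    (((e . 5)) ((e . 6)))))

;; Walk the groups in lockstep. The variadic lambda receives one
;; environment from each group as the list b*, and merges them.
(define merged
  (apply map (cons (lambda b* (apply append b*)) groups)))
;; merged is (((x . x) (e . 5)) ((x . y) (e . 6)))
```

<p>In <code>instantiate*</code>, each merged environment is additionally appended to the full environment before the sub-template is instantiated.</p>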

<p>We can pass this the result of our <code>match*</code> call above, and see
that it is able to reconstruct a <code>let</code> expression:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">instantiate*</span> <span class="o">&#39;</span><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">x</span> <span class="nv">e</span><span class="p">)</span> <span class="o">...</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span>
</span><span class='line'>                <span class="o">&#39;</span><span class="p">((</span><span class="o">...</span> <span class="p">((</span><span class="nf">x</span> <span class="o">.</span> <span class="nv">x</span><span class="p">)</span> <span class="p">(</span><span class="nf">e</span> <span class="o">.</span> <span class="mi">5</span><span class="p">))</span> <span class="p">((</span><span class="nf">x</span> <span class="o">.</span> <span class="nv">y</span><span class="p">)</span> <span class="p">(</span><span class="nf">e</span> <span class="o">.</span> <span class="mi">6</span><span class="p">)))</span> <span class="p">(</span><span class="nf">b</span> <span class="nv">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">)))</span>
</span><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">([</span><span class="nv">x</span> <span class="mi">5</span><span class="p">]</span> <span class="p">[</span><span class="nv">y</span> <span class="mi">6</span><span class="p">])</span> <span class="p">(</span><span class="nb">+ </span><span class="nv">x</span> <span class="nv">y</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Of course, we can also use this to convert <code>let</code> expressions to
function applications:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">instantiate*</span> <span class="o">&#39;</span><span class="p">((</span><span class="k">lambda </span><span class="p">(</span><span class="nf">x</span> <span class="o">...</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span> <span class="nv">e</span> <span class="o">...</span><span class="p">)</span>
</span><span class='line'>                <span class="o">&#39;</span><span class="p">((</span><span class="o">...</span> <span class="p">((</span><span class="nf">x</span> <span class="o">.</span> <span class="nv">x</span><span class="p">)</span> <span class="p">(</span><span class="nf">e</span> <span class="o">.</span> <span class="mi">5</span><span class="p">))</span> <span class="p">((</span><span class="nf">x</span> <span class="o">.</span> <span class="nv">y</span><span class="p">)</span> <span class="p">(</span><span class="nf">e</span> <span class="o">.</span> <span class="mi">6</span><span class="p">)))</span> <span class="p">(</span><span class="nf">b</span> <span class="nv">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">)))</span>
</span><span class='line'><span class="p">((</span><span class="k">lambda </span><span class="p">(</span><span class="nf">x</span> <span class="nv">y</span><span class="p">)</span> <span class="p">(</span><span class="nb">+ </span><span class="nv">x</span> <span class="nv">y</span><span class="p">))</span> <span class="mi">5</span> <span class="mi">6</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure>


<p>And we can handle some harder cases too:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">instantiate*</span> <span class="o">&#39;</span><span class="p">((</span><span class="nf">a</span> <span class="nv">b</span> <span class="o">...</span><span class="p">)</span> <span class="o">...</span><span class="p">)</span>
</span><span class='line'>                 <span class="o">&#39;</span><span class="p">((</span><span class="o">...</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">1</span><span class="p">))</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">2</span><span class="p">))</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">3</span><span class="p">)))</span>
</span><span class='line'>                   <span class="p">(</span><span class="o">...</span> <span class="p">((</span><span class="nf">b</span> <span class="o">.</span> <span class="nv">x</span><span class="p">))</span> <span class="p">((</span><span class="nf">b</span> <span class="o">.</span> <span class="nv">y</span><span class="p">)))))</span>
</span><span class='line'><span class="p">((</span><span class="mi">1</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">)</span> <span class="p">(</span><span class="mi">2</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">)</span> <span class="p">(</span><span class="mi">3</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>And here&rsquo;s my personal favorite:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">instantiate*</span> <span class="o">&#39;</span><span class="p">((</span><span class="nf">a</span> <span class="nv">a</span> <span class="o">...</span><span class="p">)</span> <span class="o">...</span><span class="p">)</span> <span class="o">&#39;</span><span class="p">((</span><span class="o">...</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">1</span><span class="p">))</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">2</span><span class="p">))</span> <span class="p">((</span><span class="nf">a</span> <span class="o">.</span> <span class="mi">3</span><span class="p">)))))</span>
</span><span class='line'><span class="p">((</span><span class="mi">1</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">)</span> <span class="p">(</span><span class="mi">2</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">)</span> <span class="p">(</span><span class="mi">3</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>

]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Matching Patterns with Scheme]]></title>
<link href="http://blog.theincredibleholk.org/blog/2013/02/11/matching-patterns-with-scheme/"/>
<updated>2013-02-11T11:47:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2013/02/11/matching-patterns-with-scheme</id>

      <content type="html"><![CDATA[<p>A while back, I wrote a
<a href="http://blog.theincredibleholk.org/blog/2012/12/02/a-look-at-macros-in-scheme/">post</a>
about macros in Scheme. Today I want to take a look at how one might
begin to implement a macro system. In Scheme, whether you use
<code>syntax-rules</code> or <code>syntax-case</code> to write your macros, at some point
you&rsquo;ll write patterns and templates. Macros match their input against
a pattern and then use this to instantiate a template. Let&rsquo;s consider
a two-way <code>or</code> macro:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define-syntax </span><span class="nv">or2</span>
</span><span class='line'>  <span class="p">(</span><span class="k">syntax-rules </span><span class="p">()</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">e1</span> <span class="nv">e2</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="nv">e1</span><span class="p">))</span>
</span><span class='line'>       <span class="p">(</span><span class="k">if </span><span class="nv">t</span> <span class="nv">t</span> <span class="nv">e2</span><span class="p">)))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>We can see how this works at the REPL by using <code>expand</code>:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">expand</span> <span class="o">&#39;</span><span class="p">(</span><span class="nv">or2</span> <span class="mi">1</span> <span class="mi">2</span><span class="p">))</span>
</span><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">([</span><span class="nv">t</span><span class="o">.</span><span class="mi">8</span> <span class="mi">1</span><span class="p">])</span> <span class="p">(</span><span class="k">if </span><span class="nv">t</span><span class="o">.</span><span class="mi">8</span> <span class="nv">t</span><span class="o">.</span><span class="mi">8</span> <span class="mi">2</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This matches the expression <code>(or2 1 2)</code> against the pattern <code>(_ e1
e2)</code>. The underscore means we don&rsquo;t care what&rsquo;s in that position (it&rsquo;s
always <code>or2</code>, since this is part of the <code>or2</code> macro definition). The
names <code>e1</code> and <code>e2</code> let the template refer to whatever appears at
those positions in the input. When instantiating the template, the macro system
replaces all references to <code>e1</code> with the chunk of syntax from that
position (<code>1</code> in this case), and likewise with <code>e2</code>. The result is the
expression <code>(let ([t.8 1]) (if t.8 t.8 2))</code>. Here we let-bind the
first expression so that it, along with any side effects it might
contain, is evaluated only once. Don&rsquo;t worry too much about why
the variable name is <code>t.8</code> instead of just <code>t</code>. This has to do with
something called <em>hygiene</em>, which I will hopefully say more on in a
later post.</p>

<p>Now let&rsquo;s take a look at how we would write a function to match
patterns like this. We&rsquo;ll start with a procedure called <code>match</code> that
takes four arguments: the pattern to match against, the expression to
match, a success continuation, and a failure continuation. The success
continuation is called if the expression matches the pattern, and the
failure continuation is called otherwise. When I first started writing
this, it seemed like it would be cleaner to use success and failure
continuations, although I&rsquo;m not entirely sure that&rsquo;s true anymore. At
the moment, we will support three things for patterns:</p>

<ol>
<li>Pairs &ndash; A pair pattern matches an expression if and only if the
expression is a pair, and the car of the pattern matches the car of
the expression and the cdr of the pattern matches the cdr of the
expression.</li>
<li>Symbols &ndash; A symbol pattern binds a pattern variable. This matches
any expression, and binds that expression to the variable.</li>
<li>Null &ndash; A null pattern matches an expression if and only if the
expression is null.</li>
</ol>


<p>Scheme&rsquo;s pattern matching supports several other features, like
auxiliary keywords, which we&rsquo;re ignoring for now for the sake of
simplicity. Using our three kinds of patterns so far, we can translate
this pretty directly into a Scheme program:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="p">(</span><span class="nf">match</span> <span class="nv">p</span> <span class="nv">e</span> <span class="nv">sk</span> <span class="nv">fk</span><span class="p">)</span>
</span><span class='line'>  <span class="p">(</span><span class="nf">cond</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">pair? </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">pair? </span><span class="nv">e</span><span class="p">))</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">match</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">car </span><span class="nv">e</span><span class="p">)</span>
</span><span class='line'>            <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b</span><span class="p">)</span>
</span><span class='line'>              <span class="p">(</span><span class="nf">match</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">e</span><span class="p">)</span>
</span><span class='line'>                     <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b^</span><span class="p">)</span> <span class="p">(</span><span class="nf">sk</span> <span class="p">(</span><span class="nb">append </span><span class="nv">b</span> <span class="nv">b^</span><span class="p">)))</span>
</span><span class='line'>                     <span class="nv">fk</span><span class="p">))</span>
</span><span class='line'>            <span class="nv">fk</span><span class="p">))</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">symbol? </span><span class="nv">p</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">sk</span> <span class="p">(</span><span class="nb">list </span><span class="p">(</span><span class="nb">cons </span><span class="nv">p</span> <span class="nv">e</span><span class="p">))))</span>
</span><span class='line'>    <span class="p">((</span><span class="k">and </span><span class="p">(</span><span class="nb">null? </span><span class="nv">p</span><span class="p">)</span> <span class="p">(</span><span class="nb">null? </span><span class="nv">e</span><span class="p">))</span>
</span><span class='line'>     <span class="p">(</span><span class="nf">sk</span> <span class="o">&#39;</span><span class="p">()))</span>
</span><span class='line'>    <span class="p">(</span><span class="k">else </span><span class="p">(</span><span class="nf">fk</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Notice that we pass a list of bindings to the success
continuation. These will be used by the template instantiation
code. Let&rsquo;s try an example:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">match</span> <span class="o">&#39;</span><span class="p">(</span><span class="nv">_</span> <span class="nv">e1</span> <span class="nv">e2</span><span class="p">)</span> <span class="o">&#39;</span><span class="p">(</span><span class="nv">or2</span> <span class="mi">1</span> <span class="mi">2</span><span class="p">)</span> <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">b</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span> <span class="p">(</span><span class="k">lambda </span><span class="p">()</span> <span class="no">#f</span><span class="p">))</span>
</span><span class='line'><span class="p">((</span><span class="nf">_</span> <span class="o">.</span> <span class="nv">or2</span><span class="p">)</span> <span class="p">(</span><span class="nf">e1</span> <span class="o">.</span> <span class="mi">1</span><span class="p">)</span> <span class="p">(</span><span class="nf">e2</span> <span class="o">.</span> <span class="mi">2</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>For the most part, this does what we want. The matcher says the
expression matches the pattern, and tells us that <code>e1</code> was bound to
<code>1</code> and <code>e2</code> was bound to <code>2</code>. However, we also bound <code>_</code> to <code>or2</code>. In
reality, we want <code>_</code> to mean ignore, which we can do by adding one
clause before the <code>symbol?</code> clause in our <code>cond</code>:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">((</span><span class="nb">eq? </span><span class="nv">p</span> <span class="ss">&#39;_</span><span class="p">)</span>
</span><span class='line'> <span class="p">(</span><span class="nf">sk</span> <span class="o">&#39;</span><span class="p">()))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Now we get only the bindings we care about.</p>
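
<p>For reference, here is a sketch of the whole matcher with the wildcard clause in place, along with a pattern that fails to match. The names <code>or2-bindings</code> and <code>mismatch</code> are only for illustration.</p>

```scheme
;; The matcher with the wildcard clause added.
(define (match p e sk fk)
  (cond
    ((and (pair? p) (pair? e))
     (match (car p) (car e)
            (lambda (b)
              (match (cdr p) (cdr e)
                     (lambda (b^) (sk (append b b^)))
                     fk))
            fk))
    ;; `_` matches anything and binds nothing. This clause must come
    ;; before the symbol? clause, since `_` is itself a symbol.
    ((eq? p '_)
     (sk '()))
    ((symbol? p)
     (sk (list (cons p e))))
    ((and (null? p) (null? e))
     (sk '()))
    (else (fk))))

;; A successful match: only e1 and e2 are bound.
(define or2-bindings
  (match '(_ e1 e2) '(or2 1 2) (lambda (b) b) (lambda () #f)))
;; or2-bindings is ((e1 . 1) (e2 . 2))

;; Too few arguments: the failure continuation is called instead.
(define mismatch
  (match '(_ e1 e2) '(or2 1) (lambda (b) b) (lambda () #f)))
;; mismatch is #f
```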

<p>The next part is to use the resulting bindings to instantiate the
template. We&rsquo;ll do this with a function called <code>instantiate</code>, which
takes a template (called <code>p</code>), and a list of bindings. This is a
straightforward tree walk that tries to replace symbols with
expressions when they are bound:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="p">(</span><span class="nf">instantiate</span> <span class="nv">p</span> <span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>  <span class="p">(</span><span class="nf">cond</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">pair? </span><span class="nv">p</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="nb">cons </span><span class="p">(</span><span class="nf">instantiate</span> <span class="p">(</span><span class="nb">car </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)</span>
</span><span class='line'>           <span class="p">(</span><span class="nf">instantiate</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">p</span><span class="p">)</span> <span class="nv">bindings</span><span class="p">)))</span>
</span><span class='line'>    <span class="p">((</span><span class="nb">assq </span><span class="nv">p</span> <span class="nv">bindings</span><span class="p">)</span> <span class="k">=&gt; </span><span class="nv">cdr</span><span class="p">)</span>
</span><span class='line'>    <span class="p">(</span><span class="k">else </span><span class="nv">p</span><span class="p">)))</span>
</span></code></pre></td></tr></table></div></figure>


<p>If we did this right, we should be able to instantiate the template
from the <code>or2</code> macro above.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="nv">&gt;</span> <span class="p">(</span><span class="nf">instantiate</span>
</span><span class='line'>      <span class="o">&#39;</span><span class="p">(</span><span class="k">let </span><span class="p">((</span><span class="nf">t</span> <span class="nv">e1</span><span class="p">))</span>
</span><span class='line'>         <span class="p">(</span><span class="k">if </span><span class="nv">t</span> <span class="nv">t</span> <span class="nv">e2</span><span class="p">))</span>
</span><span class='line'>      <span class="o">&#39;</span><span class="p">((</span><span class="nf">e1</span> <span class="o">.</span> <span class="mi">1</span><span class="p">)</span> <span class="p">(</span><span class="nf">e2</span> <span class="o">.</span> <span class="mi">2</span><span class="p">)))</span>
</span><span class='line'>
</span><span class='line'><span class="p">(</span><span class="k">let </span><span class="p">([</span><span class="nv">t</span> <span class="mi">1</span><span class="p">])</span> <span class="p">(</span><span class="k">if </span><span class="nv">t</span> <span class="nv">t</span> <span class="mi">2</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>It looks like we got what we wanted, although without hygiene.</p>
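
<p>To see why the missing hygiene matters, here is a small sketch that restates <code>instantiate</code> and feeds it bindings in which the second argument happens to be a variable named <code>t</code>, as if we had expanded <code>(or2 #f t)</code>. The name <code>captured</code> is just for illustration.</p>

```scheme
;; instantiate, restated so this sketch is self-contained.
(define (instantiate p bindings)
  (cond
    ((pair? p)
     (cons (instantiate (car p) bindings)
           (instantiate (cdr p) bindings)))
    ((assq p bindings) => cdr)
    (else p)))

;; Bindings we would get from matching (or2 #f t): the user's second
;; argument is a variable that happens to be called t.
(define captured
  (instantiate '(let ((t e1)) (if t t e2))
               '((e1 . #f) (e2 . t))))
;; captured is (let ((t #f)) (if t t t))
```

<p>The user&rsquo;s <code>t</code> now refers to the temporary introduced by the template, so this expression evaluates to <code>#f</code> no matter what the user&rsquo;s <code>t</code> was bound to. A hygienic expander avoids this by renaming the introduced variable, as in the <code>t.8</code> we saw earlier.</p>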

<p>At this point we have a pattern matcher and a template instantiation
function that are suitable for making a very simple macro system. One
very nice feature that we are missing at the moment is ellipsis
patterns, which allow us to match repeated patterns. For example, we
could use this to write a custom let macro like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define-syntax </span><span class="nv">my-let</span>
</span><span class='line'>  <span class="p">(</span><span class="k">syntax-rules </span><span class="p">()</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="p">((</span><span class="nf">x</span> <span class="nv">e</span><span class="p">)</span> <span class="o">...</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span>
</span><span class='line'>     <span class="p">((</span><span class="k">lambda </span><span class="p">(</span><span class="nf">x</span> <span class="o">...</span><span class="p">)</span> <span class="nv">b</span><span class="p">)</span> <span class="nv">e</span> <span class="o">...</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>In a post at some point in the hopefully near future, I will show how
to handle patterns with <code>...</code> in them.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Access Patterns Matter, Part 2]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/12/24/access-patterns-matter-part-2/"/>
<updated>2012-12-24T14:56:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/24/access-patterns-matter-part-2</id>

      <content type="html"><![CDATA[<p>A couple of readers pointed out some improvements and corrections to
my last post on <a href="http://blog.theincredibleholk.org/blog/2012/12/23/access-patterns-matter/">GPU access patterns</a>. These were pretty significant,
so I thought it&rsquo;d be worth doing a follow-up post to see how they
change things.</p>

<p>First of all, I meant to operate on both arrays, <code>A</code> and <code>B</code>, but
through some sloppy coding I ended up only using <code>A</code>. Incidentally, I
did some back-of-the-envelope calculations to figure out the memory
bandwidth I was getting, and I was surprised to see that I was getting
close to twice the theoretical peak for the cards I was working
with. It looks like it&rsquo;s because I was only reading half the data I
thought I was. Here are the corrected figures (the experiment is the
same other than the small corrections to my code):</p>

<table>
<thead>
<tr><td>Kernel</td><td>Tesla C1060</td><td>GeForce GTX 460</td><td>ATI Radeon HD 6750M</td></tr>
</thead>
<tbody>

<tr>
<td>MyAdd</td>
<td style="text-align:right">2.764 ms</td>
<td style="text-align:right">4.524 ms</td>
<td style="text-align:right">36.325 ms</td>
</tr>

<tr>
<td>MyAdd_2D</td>
<td style="text-align:right">10.560 ms</td>
<td style="text-align:right">0.763 ms</td>
<td style="text-align:right">4.273 ms</td>
</tr>

<tr>
<td>MyAdd_2D_unweave</td>
<td style="text-align:right">0.740 ms</td>
<td style="text-align:right">0.100 ms</td>
<td style="text-align:right">2.170 ms</td>
</tr>

<tr>
<td>MyAdd_col</td>
<td style="text-align:right">2.777 ms</td>
<td style="text-align:right">4.527 ms</td>
<td style="text-align:right">26.686 ms</td>
</tr>

<tr>
<td>MyAdd_2D_col</td>
<td style="text-align:right">10.391 ms</td>
<td style="text-align:right">0.961 ms</td>
<td style="text-align:right">7.723 ms</td>
</tr>

<tr>
<td>MyAdd_2D_unweave_col</td>
<td style="text-align:right">12.398 ms</td>
<td style="text-align:right">0.708 ms</td>
<td style="text-align:right">3.413 ms</td>
</tr>

</tbody>
</table>


<p>We&rsquo;re slower across the board, but the overall shape of the data is
about the same. Interestingly, the fastest kernels are not much slower
than the fastest kernels from before.</p>

<p>Next, Reddit user <a href="http://www.reddit.com/user/ser999">ser999</a> pointed out that we could forgo the branch
entirely by doing some clever arithmetic. Instead of doing</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'>    <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">+</span> <span class="n">get</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'><span class="k">else</span>
</span><span class='line'>    <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">-</span> <span class="n">get</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span></code></pre></td></tr></table></div></figure>


<p>we could instead do this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">+</span> <span class="n">get</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="p">((</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">)</span><span class="o">&lt;&lt;</span><span class="mi">1</span><span class="p">));</span>
</span></code></pre></td></tr></table></div></figure>


<p>This isn&rsquo;t exactly as clear as the code we had before, but perhaps a
sufficiently smart compiler could perform this optimization so the
programmer could still write the nicer code. To be fair, my thread
divergence optimization wasn&rsquo;t great for code readability either. The
important thing is, how does this &ldquo;no branching&rdquo; version perform? The
table below shows the performance along with the &ldquo;unweaved&rdquo; version
from before for comparison.</p>

<table>
<thead>
<tr><td>Kernel</td><td>Tesla C1060</td><td>GeForce GTX 460</td><td>ATI Radeon HD 6750M</td></tr>
</thead>
<tbody>

<tr>
<td>MyAdd_2D_unweave</td>
<td style="text-align:right">0.740 ms</td>
<td style="text-align:right">0.100 ms</td>
<td style="text-align:right">2.170 ms</td>
</tr>

<tr>
<td>MyAdd_2D_nobranch</td>
<td style="text-align:right">9.969 ms</td>
<td style="text-align:right">0.731 ms</td>
<td style="text-align:right">3.381 ms</td>
</tr>

<tr>
<td>MyAdd_2D_unweave_col</td>
<td style="text-align:right">12.398 ms</td>
<td style="text-align:right">0.708 ms</td>
<td style="text-align:right">3.413 ms</td>
</tr>

<tr>
<td>MyAdd_2D_col_nobranch</td>
<td style="text-align:right">9.982 ms</td>
<td style="text-align:right">0.729 ms</td>
<td style="text-align:right">3.465 ms</td>
</tr>
</tbody>
</table>


<p>For row-wise access, the &ldquo;unweave&rdquo; variant always wins. For
column-wise access, the &ldquo;nobranch&rdquo; version wins on the C1060, while
the GTX 460 and the ATI card do better with the &ldquo;unweave&rdquo; variant. In
both the ATI and GTX 460 column-wise case, however, the two perform
basically the same.</p>

<p>So why is this? Branches are often pretty expensive, especially when
they create thread divergence. However, by removing the branch we
always have to do a multiplication, and we also have to convert an
integer value into a floating point value. Multiplication in
particular is a fairly expensive operation. In this case, the branch
isn&rsquo;t so bad.</p>

<p>As before, the code from this post is available at
<a href="https://github.com/eholk/bench-thread-diverge">https://github.com/eholk/bench-thread-diverge</a>.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Access patterns matter]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/12/23/access-patterns-matter/"/>
<updated>2012-12-23T01:01:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/23/access-patterns-matter</id>

      <content type="html"><![CDATA[<p>One of the oft-cited difficulties of GPU programming is dealing with
memory layout and access patterns. In order to achieve maximum memory
bandwidth, it is important to structure your application so that
different threads do not access the same bank of memory at the same
time. In other words, you need to avoid <em>bank conflicts</em>.</p>

<p>I was recently given a test question that presented a simple CUDA
kernel and asked us to rewrite it to increase parallelism and to
minimize thread divergence. Translated to OpenCL, the kernel looks
like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="cp">#define get(A, N, i, j) ((A)[((i) * (N)) + (j)])</span>
</span><span class='line'>
</span><span class='line'><span class="n">__kernel</span> <span class="kt">void</span> <span class="nf">MyAdd</span><span class="p">(</span><span class="k">const</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">get_global_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</span><span class='line'>
</span><span class='line'>    <span class="k">for</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>        <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'>            <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">+</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'>        <span class="k">else</span>
</span><span class='line'>            <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">-</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'>    <span class="p">}</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Let&rsquo;s see how we can transform this program, and more importantly,
let&rsquo;s see how different variations perform.</p>

<!-- MORE -->


<p>First, let&rsquo;s try to increase parallelism. This kernel is operating
over a two dimensional array. Each thread processes one row, using a
for loop to process each column in that row. The even-numbered threads
add the appropriate rows from <code>A</code> and <code>A</code>, while the odd-numbered
threads subtract. There are no dependences between the iterations of
these loops, so it is straightforward to parallelize them by using a
two dimensional kernel:</p>

<p><em>EDIT: I had intended this code to operate on elements of both <code>A</code> and
<code>B</code>, but due to what was probably me being sloppy with copy and paste,
the code only uses values from <code>A</code>. Thanks to reader Gasche for
pointing this out!</em></p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="n">__kernel</span> <span class="kt">void</span> <span class="nf">MyAdd_2D</span><span class="p">(</span><span class="k">const</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">get_global_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</span><span class='line'>    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">get_global_id</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</span><span class='line'>
</span><span class='line'>    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'>        <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">+</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'>    <span class="k">else</span>
</span><span class='line'>        <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">-</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Obviously, you&rsquo;d need to adjust the host code to issue a 2D kernel
instead. Now, each thread processes exactly one element of each array,
which is the exact kind of parallelism that GPUs are very well suited
for. I was able to test this on three different GPUs, and for the most
part this yielded a substantial performance improvement. The table
below shows the time to execute the kernel for a 1024×1024 matrix.</p>

<table>
<thead>
<tr><td>Kernel</td><td>Tesla C1060</td><td>GeForce GTX 460</td><td>ATI Radeon HD 6750M</td></tr>
</thead>
<tbody>

<tr>
<td>MyAdd</td>
<td style="text-align:right"> 1.848 ms</td>
<td style="text-align:right"> 2.196 ms</td>
<td style="text-align:right">21.776 ms</td>
</tr>

<tr>
<td>MyAdd_2D</td>
<td style="text-align:right"> 7.130 ms</td>
<td style="text-align:right"> 0.520 ms</td>
<td style="text-align:right"> 2.238 ms</td>
</tr>

</tbody>
</table>


<p>The two consumer grade GPUs give between about a 4x and 10x
improvement. Oddly, the professional grade Tesla C1060 is nearly 4x
<em>slower</em>. This is our first example of another common frustration in
optimizing GPU codes: architectures vary wildly and behave very
differently.</p>

<p>Now let&rsquo;s see if we can do something about the thread
divergence. Right now every other thread runs a different
program. GPUs work best when threads can execute in lock step, so
right now the situation is pretty bad. What if instead we had the
first \(\frac{N}{2}\) threads handle the even rows and the rest
handle the odd rows? If we did this, our code would look like so:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="n">__kernel</span> <span class="kt">void</span> <span class="nf">MyAdd_2D_unweave</span><span class="p">(</span><span class="k">const</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">double</span> <span class="n">__global</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">get_global_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</span><span class='line'>    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">get_global_id</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</span><span class='line'>
</span><span class='line'>    <span class="k">if</span><span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
</span><span class='line'>        <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">row</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">row</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">+</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">row</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'>    <span class="p">}</span>
</span><span class='line'>    <span class="k">else</span> <span class="p">{</span>
</span><span class='line'>        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="n">N</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
</span><span class='line'>        <span class="n">get</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">row</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">=</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">row</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="o">-</span> <span class="n">get</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">row</span><span class="p">,</span> <span class="n">j</span><span class="p">);</span>
</span><span class='line'>    <span class="p">}</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>It&rsquo;s mostly the same, except now we have some extra index arithmetic
to map each thread onto each row in a different order. The table below
shows the performance for this version.</p>

<table>
<thead>
<tr><td>Kernel</td><td>Tesla C1060</td><td>GeForce GTX 460</td><td>ATI Radeon HD 6750M</td></tr>
</thead>
<tbody>

<tr>
<td>MyAdd_2D_unweave</td>
<td style="text-align:right"> 0.406 ms</td>
<td style="text-align:right"> 0.092 ms</td>
<td style="text-align:right"> 1.920 ms</td>
</tr>

</tbody>
</table>


<p>All GPUs speed up on this version, but the two NVIDIA processors have
especially high performance increases. This fits well with the way
that NVIDIA GPUs divide a thread block into <em>warps</em> of 32 threads,
which each execute in lock step. The fact that we didn&rsquo;t see as
extreme of a performance improvement on the ATI GPU suggests that they
may not execute threads in warps the same way that NVIDIA GPUs do.</p>

<p>Overall, these two transformations greatly improved performance. Even
though we took a hit initially on the high end C1060, after reducing
branch divergence we have about a 4x improvement over our original
code. The GTX 460 showed a more than 20x improvement! One surprising
thing is that the fairly inexpensive GTX 460 ended up being about four
times faster than the C1060. This is a testament to how fast GPUs are
improving. I suspect the main issue here is memory bandwidth. This
benchmark does very little computation, so we should once again be
memory limited. The GTX 460 has GDDR5 memory, while the C1060 only has
GDDR3. According to Wikipedia&rsquo;s handy <a href="http://en.wikipedia.org/wiki/List_of_device_bandwidths#Video_RAM">list of device bandwidths</a>,
GDDR5 should be about 2.4x faster than GDDR3, which would account for
much of the performance difference.</p>

<h2>Column-wise access</h2>

<p>At first, I didn&rsquo;t expect the thread divergence transformation to
improve the performance significantly. The reason is that GPUs read
memory a block at a time (similar to how CPUs read in entire cache
lines at once). Before, each block was only read by one warp, whereas
afterwards two warps had to read each block. In the absence of a cache
that is big enough to hold the whole matrices, this means we&rsquo;ll need
to read twice as much data.</p>

<p>To test this hypothesis, I wrote versions that alternated on the <code>j</code>
dimension instead of the <code>i</code> dimension. To save space here, I&rsquo;ve put
all three of the column-wise kernels in a <a href="https://gist.github.com/4362602">gist</a>. The performance for
all three variants is given below.</p>

<table>
<thead>
<tr><td>Kernel</td><td>Tesla C1060</td><td>GeForce GTX 460</td><td>ATI Radeon HD 6750M</td></tr>
</thead>
<tbody>

<tr>
<td>MyAdd_col</td>
<td style="text-align:right">  1.847 ms</td>
<td style="text-align:right">  2.200 ms</td>
<td style="text-align:right"> 20.437 ms</td>
</tr>

<tr>
<td>MyAdd_2D_col</td>
<td style="text-align:right"> 6.863 ms</td>
<td style="text-align:right"> 0.712 ms</td>
<td style="text-align:right"> 3.327 ms</td>
</tr>

<tr>
<td>MyAdd_2D_unweave_col</td>
<td style="text-align:right"> 8.416 ms</td>
<td style="text-align:right"> 0.478 ms</td>
<td style="text-align:right"> 2.480 ms</td>
</tr>

</tbody>
</table>


<p>My hypothesis holds for the C1060, as it actually got slower between
the regular 2D kernel and the 2D kernel that has been reordered to
reduce branch divergence. In all cases, the sequential code performs
about the same as it did in the row-wise version. In the column-wise
kernels, the GTX 460 and the Radeon HD 6750M both improve with less
thread divergence, but the GTX 460 does not improve to nearly the same
extent as it did in the row-wise version.</p>

<h2>Summary</h2>

<p>We&rsquo;ve looked at several ways of doing very similar computations, and
seen that their performance depends greatly on the memory access
pattern and amount of thread divergence. Here&rsquo;s all the data we looked
at in one convenient table:</p>

<table>
<thead>
<tr><td>Kernel</td><td>Tesla C1060</td><td>GeForce GTX 460</td><td>ATI Radeon HD 6750M</td></tr>
</thead>
<tbody>

<tr>
<td>MyAdd</td>
<td style="text-align:right"> 1.848 ms</td>
<td style="text-align:right"> 2.196 ms</td>
<td style="text-align:right">21.776 ms</td>
</tr>

<tr>
<td>MyAdd_2D</td>
<td style="text-align:right"> 7.130 ms</td>
<td style="text-align:right"> 0.520 ms</td>
<td style="text-align:right"> 2.238 ms</td>
</tr>

<tr>
<td>MyAdd_2D_unweave</td>
<td style="text-align:right"> 0.406 ms</td>
<td style="text-align:right"> 0.092 ms</td>
<td style="text-align:right"> 1.920 ms</td>
</tr>

<tr>
<td>MyAdd_col</td>
<td style="text-align:right">  1.847 ms</td>
<td style="text-align:right">  2.200 ms</td>
<td style="text-align:right"> 20.437 ms</td>
</tr>

<tr>
<td>MyAdd_2D_col</td>
<td style="text-align:right"> 6.863 ms</td>
<td style="text-align:right"> 0.712 ms</td>
<td style="text-align:right"> 3.327 ms</td>
</tr>

<tr>
<td>MyAdd_2D_unweave_col</td>
<td style="text-align:right"> 8.416 ms</td>
<td style="text-align:right"> 0.478 ms</td>
<td style="text-align:right"> 2.480 ms</td>
</tr>

</tbody>
</table>


<p>One thing I still have not figured out is why the C1060 takes such a
performance hit when going to the 2D kernel, but the GeForce and the
Radeon do not show this same trend. In the GeForce case, I suspect it
might have something to do with the new thread scheduler in the Fermi
series. I am not as familiar with the Radeon architecture, but it
seems to have interesting performance characteristics that would be
worth studying further.</p>

<p>The code for this post is available at
<a href="https://github.com/eholk/bench-thread-diverge">https://github.com/eholk/bench-thread-diverge</a>. It&rsquo;s
written in <a href="http://www.rust-lang.org/">Rust</a>, using the <a href="https://github.com/luqmana/rust-opencl">rust-opencl bindings</a>.</p>

<p><em>UPDATE: I added a <a href="http://blog.theincredibleholk.org/blog/2012/12/24/access-patterns-matter-part-2/">follow up post</a> taking into account several suggestions from readers.</em></p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Modeling How Programmers Read Code (via Mike Hansen) &rarr;]]></title>
<link href="http://synesthesiam.com/?p=218"/>
<updated>2012-12-18T23:01:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/18/modeling-how-programmers-read-code</id>

      <content type="html"><![CDATA[<p>My <a href="http://blog.theincredibleholk.org/blog/2012/12/18/how-do-we-read-code/">last post</a> includes a video of my eye movements as I read and
interpret a piece of code. I mentioned that this was part of an
experiment being conducted by Mike Hansen. He just put up a new post
with more details about his work and a video of another programmer
reading a similar program. Check it out!</p>

<!-- MORE -->


<p>Here&rsquo;s the video. Be sure to check <a href="http://synesthesiam.com/?p=218">his post</a> for commentary.</p>

<iframe width="560" height="315"
         src="http://www.youtube.com/embed/VtuO9un2Vyg"
         frameborder="0" allowfullscreen>
</iframe>



<p><a rel="bookmark" href="http://blog.theincredibleholk.org/blog/2012/12/18/modeling-how-programmers-read-code/">&infin; Permalink</a></p>]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[How do we read code?]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/12/18/how-do-we-read-code/"/>
<updated>2012-12-18T16:09:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/18/how-do-we-read-code</id>

      <content type="html"><![CDATA[<p>I recently got to participate in a psychological experiment for
programmers. A friend of mine, <a href="http://synesthesiam.com/">Mike Hansen</a>, is doing research on how
people comprehend programs. The goal is to figure out some way of
measuring what features in programming systems help programmers
understand what they are doing, and how this can be used to make
systems that lead to higher quality software. Mike is currently
running an experiment where he shows people several short Python
programs and asks them to give the output of each program. Each test
subject sits in front of an eye tracker, so afterwards Mike can
see where they were looking at various times during the experiment.</p>

<p>I was one of the test subjects, and Mike was kind enough to let me
have a video of the eye tracker data superimposed on my screen. I&rsquo;ve
shared a small section for you to watch.</p>

<!-- MORE -->




<iframe width="560" height="315"
    src="http://www.youtube.com/embed/Jc8M9-LoEuo"
    frameborder="0" allowfullscreen>
</iframe>


<p>One of the things that stood out to me in watching the video was how
much my mind seems to work like a computer. First I read over the
whole program, and then I start interpreting it. The program in
question consists of two calls to a function called <code>between</code>,
followed by a call to <code>common</code>. For the first call to <code>between</code>, I
spend a lot of time moving my eyes between the call site and the
function definition. For the second call, however, I only glance up at
the function definition once.</p>

<p>In programming language terms, I seem to be doing some kind of
just-in-time compilation. The first time through, I read and interpret
every instruction. Afterwards, it seems like I remember what this
function does and am able to determine its output much
quicker. Interpreting the first call takes about 24 seconds, while I
blow through the second one in about 10 seconds.</p>

<p>Another observation is that naming things accurately seems to help. I
was able to work through the call to <code>common</code> very quickly. While
reading this program, I remember thinking &ldquo;this should return the
elements that are in both arrays.&rdquo; I read over the program to verify
that it does what its name suggests, and then I can do the equivalent
operation in my head rather than by interpreting the code.</p>

<p>I&rsquo;m excited to see what else Mike&rsquo;s research uncovers. One aspect he&rsquo;s
interested in is how the approach of inexperienced programmers differs
from that of experienced programmers. For example, there seems to be
some evidence that following variable naming conventions helps
experienced programmers understand the code much quicker, while
breaking these conventions leads to a severe penalty. On the other
hand, inexperienced programmers seem to take about as long regardless
of how the variables are named.</p>

<p>If you happen to live in Bloomington, consider volunteering for Mike&rsquo;s
experiment. It&rsquo;s a lot of fun, and you get $10 for
participating. He&rsquo;ll be collecting data all through next semester, and
the more people he gets, the better. If you want to participate, send
him an e-mail at <a href="mailto:mihansen@indiana.edu">mihansen@indiana.edu</a>
and he can schedule a time for you.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Optimizing Dot Product]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/12/10/optimizing-dot-product/"/>
<updated>2012-12-10T14:14:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/10/optimizing-dot-product</id>

      <content type="html"><![CDATA[<p>Lately I&rsquo;ve seen quite a few papers on GPU programming languages that
use dot product as a benchmark, including a <a href="http://www.cs.indiana.edu/~eholk/papers/parco2011.pdf">paper</a> I&rsquo;ve
written. As I&rsquo;ve thought about it some more, it seems like this may
not be the most useful benchmark. The reason is that dot product does
very little actual computation, but accesses a lot of data. Any decent
dot product implementation should be bound by the memory
bandwidth. This is true of many algorithms, but many offer
opportunities to exploit caches due to data reuse. Because dot product
only reads each value once, we do not have this benefit.</p>

<p>Theoretically, GPUs should still do quite well. According to
<a href="http://stackoverflow.com/questions/12360861/reaching-theoretical-gpu-global-memory-bandwidth">this post</a>, an NVIDIA GTX480 should be able to reach 177.4 GBps
of global memory bandwidth. On the other hand, a CPU with PC17000
memory can only theoretically hit 17 GBps of main memory
bandwidth. This means a GPU ought to be able to outperform the CPU by
about a factor of 10. Unfortunately, unless you are computing on data
that was generated on the GPU or that is part of a larger computation,
you must consider the time to copy the data from the CPU memory to the
GPU. This copy goes over the PCI Express bus, which currently tops out
at about 8 GBps.</p>

<p>This means that when you also include the time for data transfer,
besides just the computation time, it should be almost impossible for
a GPU implementation of dot product to beat a CPU implementation. To
test this, I wrote several different variants of dot product and
compared their speed. The results are below.</p>
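
<p>To make that estimate concrete, here is a back-of-the-envelope sketch in
C++. The helper <code>min_transfer_ms</code> and the constant
<code>kDotProductBytes</code> are names made up for this illustration. The
sketch assumes the bandwidth figures above are in binary units (1 GB =
2<sup>30</sup> bytes) and that the two input vectors hold 33,554,432 single
precision floats each.</p>

```cpp
#include <cassert>

// Lower bound, in milliseconds, on the time needed to move `bytes` of data
// over a link that sustains `gib_per_second` (assuming 1 GiB = 2^30 bytes).
double min_transfer_ms(double bytes, double gib_per_second) {
    assert(gib_per_second > 0.0);
    const double kGiB = 1024.0 * 1024.0 * 1024.0;
    return bytes / (gib_per_second * kGiB) * 1000.0;
}

// Two vectors of 33,554,432 single precision floats: 256 MiB in total.
const double kDotProductBytes = 2.0 * 33554432.0 * sizeof(float);
```

<p>Under those assumptions, <code>min_transfer_ms(kDotProductBytes, 8.0)</code>
comes to 31.25 ms for the PCI Express copy alone, while
<code>min_transfer_ms(kDotProductBytes, 17.0)</code> is a little under 15 ms
for a CPU streaming the same data from main memory.</p>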

<!-- MORE -->




<figure>
<svg id="dot-prod-results" class="barchart"></svg>
<figcaption>Execution time for dot product on 33,554,432 element vectors
(shorter bars are better).</figcaption>
</figure>


<p>The code for these tests is available on
<a href="https://github.com/eholk/bench-dot-product">my GitHub</a>. I chose vectors of 33 million
single precision floats so that the total amount of data for the two
vectors would be about 256 MB. This is large enough to take a
measurable amount of time, while being small enough to comfortably fit
in most GPU memories. As usual, these tests were conducted on a Core
i7 2600K with PC17000 memory and an NVIDIA GTX460 GPU.</p>

<p>The Simple version is the most obvious way to write a dot product:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='c++'><span class='line'><span class="kt">float</span> <span class="n">simple_dot</span><span class="p">(</span><span class="kt">int</span> <span class="n">N</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>    <span class="kt">float</span> <span class="n">dot</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span><span class='line'>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>        <span class="n">dot</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span><span class='line'>    <span class="p">}</span>
</span><span class='line'>
</span><span class='line'>    <span class="k">return</span> <span class="n">dot</span><span class="p">;</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>My hypothesis was that it&rsquo;d actually be hard to do much better than
this, given how fast CPUs are relative to main memory. I was surprised
to find that you could actually do significantly better making use of
the CPU&rsquo;s SIMD instructions.</p>

<p>The bars are not actually organized in the order I tried them. The
next test I did was the SSE version. This was mostly handled by the
compiler. I just had to add <code>__attribute__ ((vector_size
(sizeof(float) * 4)))</code> to the types I was working on, and
add a few casts here and there. I checked the generated assembly code
to be sure it was generating SSE instructions.</p>
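
<p>For reference, here is a minimal sketch of what a dot product written with
this attribute might look like. It is not the code from the benchmark
repository: the names <code>v4sf</code> and <code>sse_dot</code> are
hypothetical, and it loads values with <code>memcpy</code> instead of pointer
casts so that it makes no assumption about how <code>A</code> and
<code>B</code> are aligned.</p>

```cpp
#include <cassert>
#include <cstring>

// Four packed floats, declared with the GCC/Clang vector extension.
typedef float v4sf __attribute__((vector_size(sizeof(float) * 4)));

float sse_dot(int N, const float *A, const float *B) {
    assert(N >= 0);
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;
    // Main loop: multiply and accumulate four elements per iteration.
    for (; i + 4 <= N; i += 4) {
        v4sf a, b;
        // memcpy sidesteps alignment and aliasing concerns for the loads.
        std::memcpy(&a, A + i, sizeof a);
        std::memcpy(&b, B + i, sizeof b);
        acc += a * b;
    }
    // Combine the four lanes, then handle any leftover elements.
    float dot = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < N; ++i) {
        dot += A[i] * B[i];
    }
    return dot;
}
```

<p>Whether the compiler actually emits packed SSE instructions for a loop
like this depends on the target and the flags used, so inspecting the
generated assembly is still worthwhile.</p>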

<p>Generating AVX was a little bit trickier. For the most part, I just
needed to change the vector size from 4 to 8. However, I was using a
slightly old version of GCC that I needed to update. Also, I needed to
add the <code>-march=corei7-avx</code> and <code>-mtune=corei7-avx</code> compiler
flags. The AVX version did perform slightly better, but we only gained
about a millisecond. This suggests that we&rsquo;re pretty close to maxing
out the memory bandwidth. Dividing 256 MB by 16.2 ms gives us 15.43
GBps, which is approaching the 17 GBps theoretical peak.</p>

<p>At this point, I had exhausted most of what I could think of to
improve the performance. I decided to compare with the performance of
<a href="http://math-atlas.sourceforge.net/">ATLAS</a>. ATLAS is a collection of linear algebra routines. The
package includes several variants of each algorithm, and at install
time the build system measures the performance of each variant on the
machine you are running. Based on this, it decides the best variant to
use. The ATLAS version performed quite a bit better than the simple
version, but not quite as well as the SSE or AVX versions.</p>

<p>I pulled out <code>gdb</code>, tracked down, and disassembled the core
computational loop for the ATLAS version of dot product. If you&rsquo;re
interested, the assembly is <a href="https://gist.github.com/4230942">here</a>. I noticed two
things. First, it was using the <code>prefetchnta</code> instruction to prefetch values into
cache before they were needed. Second, and more important, the loop had
been unrolled four times. I had tried a similar thing using the
<code>-funroll-loops</code> compiler flag, but this made only a marginal
difference. This version of unrolling looks like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='c++'><span class='line'><span class="n">dot</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span><span class='line'><span class="n">dot</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
</span><span class='line'><span class="n">dot</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">2</span><span class="p">];</span>
</span><span class='line'><span class="n">dot</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">3</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">3</span><span class="p">];</span>
</span></code></pre></td></tr></table></div></figure>


<p>The problem is, this code must still run basically sequentially
because each statement depends on the previous statement. What ATLAS
did, and what I copied in the Unrolled version, is this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class='c++'><span class='line'><span class="n">dot1</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span><span class='line'><span class="n">dot2</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
</span><span class='line'><span class="n">dot3</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">2</span><span class="p">];</span>
</span><span class='line'><span class="n">dot4</span> <span class="o">+=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">3</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">3</span><span class="p">];</span>
</span></code></pre></td></tr></table></div></figure>


<p>This meant there was no longer any dependency between statements, and
the CPU&rsquo;s out of order execution engine could run all four of these as
soon as the data was available. In a sense, the CPU is vectorizing
loops on the fly.</p>
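
<p>Filled out into a complete function, the idea might look like the sketch
below. The name <code>unrolled_dot</code>, the tail loop for lengths that are
not a multiple of four, and the order in which the partial sums are combined
are assumptions of this sketch, not details taken from ATLAS or from the
benchmark code.</p>

```cpp
#include <cassert>

float unrolled_dot(int N, const float *A, const float *B) {
    assert(N >= 0);
    // Four independent accumulators, so no statement in the loop body
    // depends on the result of another statement in the same iteration.
    float dot1 = 0, dot2 = 0, dot3 = 0, dot4 = 0;
    int i = 0;
    for (; i + 4 <= N; i += 4) {
        dot1 += A[i] * B[i];
        dot2 += A[i + 1] * B[i + 1];
        dot3 += A[i + 2] * B[i + 2];
        dot4 += A[i + 3] * B[i + 3];
    }
    // Combine the partial sums, then pick up any leftover elements.
    float dot = (dot1 + dot2) + (dot3 + dot4);
    for (; i < N; ++i) {
        dot += A[i] * B[i];
    }
    return dot;
}
```

<p>Because floating point addition is not associative, this version can round
differently from the simple loop. That is also a plausible reason why a
compiler will not make this transformation on its own without a flag such as
<code>-ffast-math</code>.</p>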

<p>Based on these observations, I tried to see if we could do better
using this unrolling technique and prefetching on the AVX
version. This got about another 0.1 milliseconds, but clearly we are
at the point of diminishing returns.</p>
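
<p>As an illustration of the prefetching half of that experiment, here is a
sketch that adds software prefetching to the simple loop using GCC&rsquo;s
<code>__builtin_prefetch</code>. The name <code>prefetch_dot</code> and the
look-ahead distance of 64 elements are hypothetical and not tuned. The third
argument of 0 requests a hint with no temporal locality, which on x86 GCC is
expected to become <code>prefetchnta</code>.</p>

```cpp
#include <cassert>

float prefetch_dot(int N, const float *A, const float *B) {
    assert(N >= 0);
    const int kAhead = 64;  // hypothetical look-ahead distance, in elements
    float dot = 0;
    for (int i = 0; i < N; ++i) {
        // Only form addresses that stay inside the arrays.
        if (i < N - kAhead) {
            // Arguments: address, 0 = read access, 0 = no temporal locality.
            __builtin_prefetch(A + i + kAhead, 0, 0);
            __builtin_prefetch(B + i + kAhead, 0, 0);
        }
        dot += A[i] * B[i];
    }
    return dot;
}
```

<p>Issuing a prefetch on every iteration is more than is needed, since one per
cache line would do. A tuned version would likely combine this with the
unrolled loop.</p>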

<p>Finally, I decided to see what a GPU implementation would be like. I
used NVIDIA&rsquo;s CUBLAS library, which should be a very highly tuned
implementation for NVIDIA GPUs. Using the library was pretty easy. I
just had to remember to allocate GPU buffers and copy data into
them. As I predicted, however, the CUBLAS version was far slower than
even the simple version.</p>

<p>This exercise proved interesting for me. It confirmed one of my
hypotheses, that it should be very hard if not impossible to make a
GPU implementation of dot product outperform the CPU when the data
starts in the CPU memory. I was surprised, however, to find that you
<em>can</em> do better than the simple version on the CPU. It was also
interesting to see how minor changes in the C++ source code can cause
the compiler to generate very different assembly language.</p>

<script type="text/javascript">
function make_dot_prod_chart() {
    var data = [
        ["Simple", 0.0303998],
        ["Simple with Prefetching", 0.0296495],
        ["Unrolled", 0.0195647],
        ["SSE", 0.0172545],
        ["AVX", 0.0162481],
        ["Unrolled AVX", 0.0161166],
        ["Unrolled AVX with Prefetching", 0.0160725],
        ["ATLAS", 0.0183641],
        ["CUBLAS", 0.0541215],
    ];
    
    var height = 25;
    var bar_height = 20;
    var label_width = 250;
    
    var chart = d3.select("#dot-prod-results");
    chart.attr("width", 450 + label_width);
    chart.attr("height", height * data.length + 10);
    
    var x = d3.scale.linear()
        .domain([0, 0.06])
        .range([0, 420]);
        
    chart.selectAll("rect").data(data).enter().append("rect")
        .attr("x", label_width + 3)
        .attr("y", function(d, i) { return i * height + 5; })
        .attr("width", function(d) { return x(d[1]); } )
        .attr("height", bar_height);

    chart.selectAll("text").data(data).enter().append("text")
        .attr("x", function(d) { return x(d[1]) + label_width; })
        .attr("y", function(d, i) { return i * height + bar_height + 2; })
        .attr("dx", 7)
        .text(function(d) { return Math.round(d[1] * 10000) / 10 + " ms"; });

    chart.selectAll("text.label").data(data).enter().append("text")
        .attr("class", "label")
        .attr("x", label_width - 5)
        .attr("y", function(d, i) { return i * height + bar_height + 2; })
        .attr("text-anchor", "end")
        .text(function(d) { return d[0]; });
}

make_dot_prod_chart();
</script>

]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Compiling Rust for GPUs]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/12/05/compiling-rust-for-gpus/"/>
<updated>2012-12-05T11:45:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/05/compiling-rust-for-gpus</id>

      <content type="html"><![CDATA[<p>A couple of days back, I <a href="https://twitter.com/theinedibleholk/status/275713435000520704">tweeted</a> that I had just run code written in
Rust on the GPU. It&rsquo;s about time I provided some more details. This is
a project I worked on with <a href="http://milinda.pathirage.org/">Milinda Pathirage</a>, a fellow student at
IU. I should emphasize that this is very much in the proof of concept
stage. I doubt it will work well enough to do anything useful, but it
does work well enough to do something and it would certainly be
possible to extend this. That said, I will include links to our code
so the valiant hackers out there can try it out if they wish. For
posterity&rsquo;s sake, here is, to my knowledge, the first fragment of Rust
code to ever execute on a GPU:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>#[kernel]
</span><span class='line'>fn add_float(x: &float, y: &float, z: &mut float) {
</span><span class='line'>    *z = *x + *y;
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<p>There are two main parts to this project. The first is compiling Rust
code into something suitable for running on the GPU. We do this using
the PTX backend that is part of LLVM. The second part is loading and
executing the kernel. For this, we use OpenCL and its
<a href="http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateProgramWithBinary.html"><code>clCreateProgramWithBinary</code></a> API. In this
post, I&rsquo;ll focus on the issues encountered with generating PTX code.</p>

<!-- MORE -->


<p>The bulk of the work to generate PTX code was already done by the
NVPTX backend that was <a href="http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-April/049215.html">recently</a> contributed to
LLVM by NVIDIA. We started out with a very manual process. First we
used the <code>--emit-llvm</code> flag for <code>rustc</code> to save the generated LLVM
bitcode. From there, we attempted to compile it to PTX using <code>llc</code>:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>llc -march=nvptx -mcpu=sm_13 trivial-kernel.ll -o trivial-kernel.ptx</span></code></pre></td></tr></table></div></figure>


<p>I wasn&rsquo;t terribly surprised to see this fail with one of LLVM&rsquo;s
typically opaque error messages. You can see it <a href="https://gist.github.com/4217726">here</a> if
you wish. Basically, Rust was generating code that the NVPTX backend
didn&rsquo;t know how to handle. This makes sense; I expect NVIDIA primarily
tests the backend on code generated by CUDA, which looks different
from code that Rust generates. The next step was to pare down the <a href="https://gist.github.com/4217710">generated LLVM</a> to something a little more manageable:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
</pre></td><td class='code'><pre><code class='llvm'><span class='line'><span class="k">target</span> <span class="k">datalayout</span> <span class="p">=</span> <span class="s">&quot;e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64&quot;</span>
</span><span class='line'><span class="k">target</span> <span class="k">triple</span> <span class="p">=</span> <span class="s">&quot;x86_64-apple-darwin&quot;</span>
</span><span class='line'>
</span><span class='line'><span class="nv">%&quot;~enum intrinsic::TyDesc[#0]&quot;</span> <span class="p">=</span> <span class="k">type</span> <span class="p">{</span> <span class="p">[</span><span class="m">16</span> <span class="k">x</span> <span class="k">i8</span><span class="p">]</span> <span class="p">}</span>
</span><span class='line'><span class="nv">%tydesc</span> <span class="p">=</span> <span class="k">type</span> <span class="p">{</span> <span class="k">i64</span><span class="p">,</span> <span class="k">i64</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="k">i1</span><span class="p">*,</span> <span class="k">i1</span><span class="p">*,</span> <span class="nv">%tydesc</span><span class="p">**,</span> <span class="k">i8</span><span class="p">*)*,</span> <span class="kt">void</span> <span class="p">(</span><span class="k">i1</span><span class="p">*,</span> <span class="k">i1</span><span class="p">*,</span> <span class="nv">%tydesc</span><span class="p">**,</span> <span class="k">i8</span><span class="p">*)*,</span> <span class="kt">void</span> <span class="p">(</span><span class="k">i1</span><span class="p">*,</span> <span class="k">i1</span><span class="p">*,</span> <span class="nv">%tydesc</span><span class="p">**,</span> <span class="k">i8</span><span class="p">*)*,</span> <span class="kt">void</span> <span class="p">(</span><span class="k">i1</span><span class="p">*,</span> <span class="k">i1</span><span class="p">*,</span> <span class="nv">%tydesc</span><span class="p">**,</span> <span class="k">i8</span><span class="p">*)*,</span> <span class="k">i8</span><span class="p">*,</span> <span class="k">i8</span><span class="p">*</span> <span class="p">}</span>
</span><span class='line'>
</span><span class='line'><span class="k">define</span> <span class="kt">void</span> <span class="vg">@_ZN9add_float16_cb9e1b436595b333_00E</span><span class="p">(</span><span class="k">i1</span><span class="p">*,</span> <span class="p">{</span> <span class="k">i64</span><span class="p">,</span> <span class="nv">%tydesc</span><span class="p">*,</span> <span class="k">i8</span><span class="p">*,</span> <span class="k">i8</span><span class="p">*,</span> <span class="k">i8</span> <span class="p">}</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*,</span> <span class="kt">double</span><span class="p">*,</span> <span class="kt">double</span><span class="p">*,</span> <span class="kt">double</span><span class="p">*)</span> <span class="err">uwtable</span> <span class="p">{</span>
</span><span class='line'><span class="nl">static_allocas:</span>
</span><span class='line'>  <span class="nv-Anonymous">%5</span> <span class="p">=</span> <span class="k">alloca</span> <span class="kt">double</span><span class="p">*</span>
</span><span class='line'>  <span class="nv-Anonymous">%6</span> <span class="p">=</span> <span class="k">alloca</span> <span class="kt">double</span><span class="p">*</span>
</span><span class='line'>  <span class="nv-Anonymous">%7</span> <span class="p">=</span> <span class="k">alloca</span> <span class="kt">double</span><span class="p">*</span>
</span><span class='line'>  <span class="k">br</span> <span class="kt">label</span> <span class="nv-Anonymous">%8</span>
</span><span class='line'>
</span><span class='line'><span class="nl">return:</span>                                           <span class="c">; preds = %8</span>
</span><span class='line'>  <span class="k">ret</span> <span class="kt">void</span>
</span><span class='line'>
</span><span class='line'><span class="c">; &lt;label&gt;:8                                       ; preds = %static_allocas</span>
</span><span class='line'>  <span class="k">store</span> <span class="kt">double</span><span class="p">*</span> <span class="nv-Anonymous">%2</span><span class="p">,</span> <span class="kt">double</span><span class="p">**</span> <span class="nv-Anonymous">%5</span>
</span><span class='line'>  <span class="k">store</span> <span class="kt">double</span><span class="p">*</span> <span class="nv-Anonymous">%3</span><span class="p">,</span> <span class="kt">double</span><span class="p">**</span> <span class="nv-Anonymous">%6</span>
</span><span class='line'>  <span class="k">store</span> <span class="kt">double</span><span class="p">*</span> <span class="nv-Anonymous">%4</span><span class="p">,</span> <span class="kt">double</span><span class="p">**</span> <span class="nv-Anonymous">%7</span>
</span><span class='line'>  <span class="k">call</span> <span class="kt">void</span> <span class="k">asm</span> <span class="s">&quot;# *z = *x + *y; (trivial-kernel.rs:3:4: 3:16)&quot;</span><span class="p">,</span> <span class="s">&quot;&quot;</span><span class="p">()</span>
</span><span class='line'>  <span class="nv-Anonymous">%9</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">double</span><span class="p">**</span> <span class="nv-Anonymous">%5</span>
</span><span class='line'>  <span class="nv-Anonymous">%10</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">double</span><span class="p">**</span> <span class="nv-Anonymous">%6</span>
</span><span class='line'>  <span class="nv-Anonymous">%11</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">double</span><span class="p">*</span> <span class="nv-Anonymous">%9</span>
</span><span class='line'>  <span class="nv-Anonymous">%12</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">double</span><span class="p">*</span> <span class="nv-Anonymous">%10</span>
</span><span class='line'>  <span class="nv-Anonymous">%13</span> <span class="p">=</span> <span class="k">fadd</span> <span class="kt">double</span> <span class="nv-Anonymous">%11</span><span class="p">,</span> <span class="nv-Anonymous">%12</span>
</span><span class='line'>  <span class="nv-Anonymous">%14</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">double</span><span class="p">**</span> <span class="nv-Anonymous">%7</span>
</span><span class='line'>  <span class="k">store</span> <span class="kt">double</span> <span class="nv-Anonymous">%13</span><span class="p">,</span> <span class="kt">double</span><span class="p">*</span> <span class="nv-Anonymous">%14</span>
</span><span class='line'>  <span class="k">br</span> <span class="kt">label</span> <span class="nv">%return</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Of course, LLVM still fails:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='llvm'><span class='line'><span class="err">Assertion</span> <span class="err">failed:</span> <span class="p">(</span><span class="nv">!isLiteral</span><span class="p">()</span> <span class="err">&amp;&amp;</span> <span class="s">&quot;Literal structs never have names&quot;</span><span class="p">),</span> <span class="err">function</span> <span class="err">getName</span><span class="p">,</span> <span class="err">file</span> <span class="err">/usr/local/sr</span><span class="k">c</span><span class="err">/llvm/lib/VMCore/Type</span><span class="p">.</span><span class="err">cpp</span><span class="p">,</span> <span class="err">li</span><span class="k">ne</span> <span class="m">605</span><span class="p">.</span>
</span></code></pre></td></tr></table></div></figure>


<p>It seems that NVPTX was having trouble with the anonymous struct in
the function arguments (<code>{ i64, %tydesc*, i8*, i8*, i8 }</code>). To test
this theory, I replaced that type with an <code>i8 *</code>. The argument was
ignored anyway, so this shouldn&rsquo;t cause problems. With this change, we
ended up with a <a href="https://gist.github.com/4217791">PTX file</a>.</p>

<p>At this point, we could either hack the Rust compiler to avoid generating
code that the NVPTX backend couldn&rsquo;t handle, or we could improve the
NVPTX backend. I opted for the latter, and ended up submitting my
first ever <a href="http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20121126/157139.html">patch to LLVM</a>.</p>

<p>After another minor fix or two, it became clear that we were going to
have to modify the way Rust generates code as well. For example, the
PTX code I linked to above does not include a <code>.entry</code> line, which is
required to indicate where a kernel function begins. One option would
be to add a new PTX target for Rust, and basically set it up as a
cross compiler. This isn&rsquo;t quite what we want. We don&rsquo;t want to run
all of Rust on the GPU, just a few portions of a program. Other than
the code generator, we want the PTX code to agree with the
architectural details of the host system. Instead, I added a <code>-Zptx</code>
flag to <code>rustc</code> and started making minor changes to the translation
pass. Functions that have the <code>#[kernel]</code> attribute get compiled to
use the <code>ptx_kernel</code> calling convention, which tells NVPTX to add the
<code>.entry</code> line. According to <a href="http://pcwalton.github.com/">Patrick</a>, we should probably use a new
ABI setting instead, as arbitrary attributes aren&rsquo;t part of the
function&rsquo;s type.</p>

<p>At any rate, we could now pretty reliably go from Rust to PTX without
any manual intervention. The next challenge was to execute the
kernel. When we first tried to load the PTX file, OpenCL complained
about an &ldquo;invalid binary.&rdquo; We had previously been able to load a PTX
file generated with OpenCL and extracted using
<a href="http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetProgramInfo.html"><code>clGetProgramInfo</code></a>, so we decided to compare the
Rust-generated code with the OpenCL-generated code. It turns out that
the parameters to the kernel were not being annotated with an address
space. We manually added <code>.global</code> to the parameters in the
Rust-generated code, and we were able to load and execute the
kernel. Furthermore, we could manually annotate the LLVM code with
<code>addrspace(1)</code> to get the same behavior.</p>

<p>For some types, Rust would have the <code>addrspace(1)</code> annotation, but for
others it wouldn&rsquo;t. It turns out Rust was already using address spaces
for something related to garbage collection. Unfortunately, Rust and
NVPTX disagree on what these address spaces mean. To work around this,
I had Rust generate different address spaces when the <code>-Zptx</code> flag is
given. At the moment these changes only take effect for <code>&amp;</code>
pointers. Others, such as <code>@</code> pointers, will be more difficult to get
working.</p>

<p>The final missing piece on the code generation side of things is to
have threads be able to do different things. This means providing
equivalents of the <code>blockIdx</code>, <code>blockDim</code> and <code>threadIdx</code>
variables. These show up in LLVM as intrinsic functions, so all we
need to do is expose those as new Rust intrinsics. We expect to have
this part working soon.</p>

<p>Our work here shows it&rsquo;s possible to compile Rust to run on the
GPU. We support an extremely limited subset of Rust at the
moment. Most of the remaining challenges have to do with the way data
is arranged in memory and how Rust provides safety at runtime. Rust
uses a lot of pointer structures, and moving these between host and
device memory can be difficult. Perhaps the best thing to do for now
is simply to be careful about what data types we use in GPU code. Even
if we use relatively flat types, however, we will still need to handle
a few more things. For example, Rust does array bounds checks at
runtime. If we want to allow arbitrary array indexing safely in GPU
code, we&rsquo;ll need a way to do bounds checks and report failures from
kernel code. There are clearly a lot of design issues left, but the
initial results for compiling Rust to run on the GPU seem very
promising.</p>

<p>If you want to try it out, here are links to the code.</p>

<p><a href="https://github.com/eholk/rust/tree/nvptx">https://github.com/eholk/rust/tree/nvptx</a> <br>
<a href="https://github.com/eholk/llvm/tree/nvptx-rust">https://github.com/eholk/llvm/tree/nvptx-rust</a></p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[A Look at Macros in Scheme]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/12/02/a-look-at-macros-in-scheme/"/>
<updated>2012-12-02T15:00:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/12/02/a-look-at-macros-in-scheme</id>

      <content type="html"><![CDATA[<p>One of the features that sets Scheme apart as a programming language
is its powerful macro system. In the same way that procedures allow
you to reuse bits of code, macros allow you to reuse <em>syntax</em>. Macros
and procedures can express many of the same things, but macros are
particularly useful when you want to be careful about control flow and
effects. Consider the following program.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="nf">choose-arg</span> <span class="mi">1</span> <span class="p">(</span><span class="nf">factorial</span> <span class="mi">5</span><span class="p">)</span> <span class="p">(</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">)</span> <span class="p">(</span><span class="nf">fibonacci</span> <span class="mi">40</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Here, <code>choose-arg</code> takes a number and some arguments. The whole
expression should evaluate to the argument selected by the number
argument. The example above should return 3. As a procedure,
<code>choose-arg</code> might be written as follows:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define </span><span class="nv">choose-arg</span>
</span><span class='line'>  <span class="p">(</span><span class="k">lambda </span><span class="p">(</span><span class="nf">i</span> <span class="o">.</span> <span class="nv">args</span><span class="p">)</span>
</span><span class='line'>    <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">zero? </span><span class="nv">i</span><span class="p">)</span>
</span><span class='line'>        <span class="p">(</span><span class="nb">car </span><span class="nv">args</span><span class="p">)</span>
</span><span class='line'>        <span class="p">(</span><span class="nb">apply </span><span class="nv">choose-arg</span> <span class="p">(</span><span class="nb">- </span><span class="nv">i</span> <span class="mi">1</span><span class="p">)</span> <span class="p">(</span><span class="nb">cdr </span><span class="nv">args</span><span class="p">)))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>Using this definition (and appropriate definitions for <code>factorial</code> and
<code>fibonacci</code>), we see that our starting example returns 3 as
expected. Unfortunately, it takes a lot longer than we&rsquo;d like:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(time (choose-arg 1 ...))
</span><span class='line'>    no collections
</span><span class='line'>    11167 ms elapsed cpu time
</span><span class='line'>    11168 ms elapsed real time
</span><span class='line'>    144 bytes allocated</span></code></pre></td></tr></table></div></figure>


<p>That&rsquo;s over 11 seconds to decide the answer is 3. What&rsquo;s going on?</p>

<p>This takes so long because Scheme is an eager language, meaning it
must fully evaluate all procedure arguments before applying a
procedure. My version of Fibonacci takes exponential time, so
<code>(fibonacci 40)</code> is somewhat nontrivial. What we&rsquo;d really like is to
only evaluate the argument we actually selected with <code>choose-arg</code>.</p>
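
<p>To see the eager evaluation directly, here is a minimal sketch. The names <code>choose-arg-proc</code> and <code>noisy</code> are hypothetical, introduced only for this illustration; <code>choose-arg-proc</code> is the procedure version from above under a different name, and <code>noisy</code> announces when an argument expression is evaluated.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; The procedure version of choose-arg, renamed for this sketch.
(define choose-arg-proc
  (lambda (i . args)
    (if (zero? i)
        (car args)
        (apply choose-arg-proc (- i 1) (cdr args)))))

;; Hypothetical helper: print a label when the argument expression
;; is evaluated, then return the value.
(define noisy
  (lambda (label value)
    (display label)
    (newline)
    value))

;; All three labels are printed before choose-arg-proc starts running,
;; in an order the Scheme standard leaves unspecified.
(choose-arg-proc 1 (noisy "first" 'a) (noisy "second" 'b) (noisy "third" 'c))
</code></pre></div></figure>

<p>Only the second argument is selected, yet every argument is evaluated. With an expensive argument in place of <code>noisy</code>, that wasted work is where the time goes.</p>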

<p>One way to do this is by using <em>thunks</em>. We just wrap all of the
arguments in lambdas with no arguments, and wrap an extra set of
parentheses around the call to <code>choose-arg</code> to evaluate the thunk we
selected:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">((</span><span class="nf">choose-arg</span> <span class="mi">1</span> <span class="p">(</span><span class="k">lambda </span><span class="p">()</span> <span class="p">(</span><span class="nf">factorial</span> <span class="mi">5</span><span class="p">))</span> <span class="p">(</span><span class="k">lambda </span><span class="p">()</span> <span class="p">(</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">))</span> <span class="p">(</span><span class="k">lambda </span><span class="p">()</span> <span class="p">(</span><span class="nf">fibonacci</span> <span class="mi">40</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This performs much better:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(time ((choose-arg 1 ...)))
</span><span class='line'>    no collections
</span><span class='line'>    0 ms elapsed cpu time
</span><span class='line'>    0 ms elapsed real time
</span><span class='line'>    192 bytes allocated</span></code></pre></td></tr></table></div></figure>


<p>Unfortunately, the program is now much harder to read. What we&rsquo;d
really like is to write our original program, but have Scheme execute
it as if we had written something like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">= </span><span class="mi">1</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'>    <span class="p">(</span><span class="nf">factorial</span> <span class="mi">5</span><span class="p">)</span>
</span><span class='line'>    <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">= </span><span class="mi">1</span> <span class="mi">1</span><span class="p">)</span>
</span><span class='line'>        <span class="p">(</span><span class="nb">+ </span><span class="mi">1</span> <span class="mi">2</span><span class="p">)</span>
</span><span class='line'>        <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">= </span><span class="mi">1</span> <span class="mi">2</span><span class="p">)</span>
</span><span class='line'>            <span class="p">(</span><span class="nf">fibonacci</span> <span class="mi">40</span><span class="p">))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>This is exactly what macros let us do. Macros let us define syntax
transformers, which take one part of your program as input and rewrite
it into a different program. As a macro, we could write <code>choose-arg</code>
like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='scheme'><span class='line'><span class="p">(</span><span class="k">define-syntax </span><span class="nv">choose-arg</span>
</span><span class='line'>  <span class="p">(</span><span class="k">syntax-rules </span><span class="p">()</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">i</span> <span class="nv">e</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">= </span><span class="nv">i</span> <span class="mi">0</span><span class="p">)</span> <span class="nv">e</span><span class="p">))</span>
</span><span class='line'>    <span class="p">((</span><span class="nf">_</span> <span class="nv">i</span> <span class="nv">e</span> <span class="nv">e*</span> <span class="o">...</span><span class="p">)</span>
</span><span class='line'>     <span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">= </span><span class="nv">i</span> <span class="mi">0</span><span class="p">)</span>
</span><span class='line'>         <span class="nv">e</span>
</span><span class='line'>         <span class="p">(</span><span class="nf">choose-arg</span> <span class="p">(</span><span class="nf">sub1</span> <span class="nv">i</span><span class="p">)</span> <span class="nv">e*</span> <span class="o">...</span><span class="p">)))))</span>
</span></code></pre></td></tr></table></div></figure>


<p>While a full explanation of <code>syntax-rules</code> will have to wait for
another post, let&rsquo;s take a minute to make sure we have some idea
what&rsquo;s going on. We define macros with <code>syntax-rules</code> by giving a set
of input and output patterns. Program fragments that match an input
pattern are replaced with the output pattern. Our <code>choose-arg</code> macro
has two sets of patterns. The first one matches forms like
<code>(choose-arg 5 (+ 1 2))</code> and replaces them with <code>(if (= 5 0) (+ 1 2))</code>.
Notice that the <code>i</code> and <code>e</code> in the output pattern are replaced
with the code fragment from the input pattern. The second pattern is
similar, except that the <code>e* ...</code> portion can match any number of
instances. The second one would take <code>(choose-arg 1 'a 'b)</code> and
transform it into <code>(if (= 1 0) 'a (choose-arg (sub1 1) 'b))</code>. Macros keep
expanding, so the next time around, the application of <code>choose-arg</code>
would match the first pattern.</p>
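
<p>The expansion can be traced by hand. The sketch below repeats the macro so that it stands alone and writes out each expansion step as a comment. Note that <code>sub1</code> is not part of standard Scheme; it is available in Chez Scheme and Racket, and <code>(- i 1)</code> would do the same job elsewhere.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>(define-syntax choose-arg
  (syntax-rules ()
    ((_ i e)
     (if (= i 0) e))
    ((_ i e e* ...)
     (if (= i 0)
         e
         (choose-arg (sub1 i) e* ...)))))

;; Original form:
;;   (choose-arg 1 'a 'b)
;;
;; Step 1, second pattern with i = 1, e = 'a, e* ... = 'b:
;;   (if (= 1 0) 'a (choose-arg (sub1 1) 'b))
;;
;; Step 2, first pattern with i = (sub1 1), e = 'b:
;;   (if (= 1 0) 'a (if (= (sub1 1) 0) 'b))

(choose-arg 1 'a 'b) ; evaluates to b
</code></pre></div></figure>

<p>The index expression is copied into the output as code, not as a value, which is why <code>(sub1 1)</code> appears unevaluated in the expanded program.</p>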

<p>This version has the performance we would like, while also letting us
write the code as we would like. Still, performance is not the most
compelling reason to use macros. For any given problem, some languages
are better at expressing the solution than others. Rather than having
to choose the best language among many imperfect choices, macros let
the programmer write their code in a language they have created for
the program at hand. Additionally, macros make things easier for the
language implementor because they can focus on high quality
implementations for a small set of core forms and express the other
forms as macros that expand to these core forms. For example, just
<code>lambda</code> and <code>if</code> are enough to implement many more forms like <code>let</code>,
<code>let*</code>, <code>or</code>, <code>and</code>, <code>cond</code>, etc.</p>
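
<p>As a rough sketch of that last point, here is how two of these forms might be written with <code>syntax-rules</code>. The names <code>my-let</code> and <code>my-or</code> are hypothetical, chosen to avoid clashing with the built-in forms, and a real implementation would handle more cases (named <code>let</code>, for instance).</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><pre><code class='scheme'>;; let is sugar for an immediately applied lambda.
(define-syntax my-let
  (syntax-rules ()
    ((_ ((x v) ...) body body* ...)
     ((lambda (x ...) body body* ...) v ...))))

;; or is built from if, binding each value to a temporary
;; so that it is evaluated only once.
(define-syntax my-or
  (syntax-rules ()
    ((_) #f)
    ((_ e) e)
    ((_ e e* ...)
     (my-let ((t e))
       (if t t (my-or e* ...))))))

(my-let ((x 1) (y 2)) (+ x y)) ; evaluates to 3
(my-or #f #f 7)                ; evaluates to 7
</code></pre></div></figure>

<p>The temporary <code>t</code> in <code>my-or</code> cannot accidentally capture a variable named <code>t</code> in the user&rsquo;s code, because <code>syntax-rules</code> macros are hygienic.</p>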
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[A look at GPU memory transfer]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/"/>
<updated>2012-11-29T13:25:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer</id>

      <content type="html"><![CDATA[<p>One of the trickier things in programming with multiple devices is
managing the transfer of data between devices. This applies whether
you&rsquo;re programming a cluster or a machine with a CPU and
GPU. Transferring data takes time, and you must be careful
that the transfer time doesn&rsquo;t outweigh any performance gains from
parallelizing your algorithm. When talking about transfer time, we
usually think of it as having two components: the time due to
<em>latency</em> and the time due to <em>bandwidth</em>. The total time to transfer
the data is then,</p>

<p>$$
T_\mathit{total} = T_L + T_B
$$</p>

<p>where \(T_L\) is the time due to latency and \(T_B\) is the time
due to bandwidth. Typically, the \(T_L\) term is a constant. For
example, when talking about two computers on the Internet, the latency
term might be something like 35ms. When talking about the latency
between main memory and the CPU, this term is on the order of hundreds
of nanoseconds.</p>

<p>The \(T_B\) term normally depends on the size of the data being transferred. So, if the size of the data is \(S\) and the bandwidth is \(B\), we&rsquo;d have,</p>

<p>$$
T_B = \frac{S}{B}
$$</p>
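
<p>As a worked example under assumed numbers, take the 35ms Internet latency mentioned above, and suppose, purely for illustration, a bandwidth of \(B = 10\) MB/s and a transfer size of \(S = 1\) MB. Then,</p>

<p>$$
T_\mathit{total} = T_L + \frac{S}{B} = 35\,\mathrm{ms} + \frac{1\,\mathrm{MB}}{10\,\mathrm{MB/s}} = 35\,\mathrm{ms} + 100\,\mathrm{ms} = 135\,\mathrm{ms}
$$</p>

<p>so the bandwidth term dominates. For a 1 KB transfer at the same assumed bandwidth, \(T_B\) would be only 0.1ms and the latency term would dominate instead.</p>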

<p>Sometimes there is a minimum amount of data that you can transfer. For
example, many hard drives have a 512 byte sector size. These hard
drives transfer data in units of 512 bytes, so even if you only need 4
bytes off of the disk you will still have to spend as much time as you
would to copy 512 bytes.</p>

<p>My research group had a hypothesis that there is a similar minimum
unit of data transfer for GPUs. Furthermore, we suspected this was a
fairly large amount. This would mean for GPU programs we&rsquo;d want to try
to combine transfers to pay the latency overhead as little as
possible. It would mean that in some cases we could get away with
transferring more data than necessary in order to minimize the number
of transfer operations.</p>
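
<p>To make the reasoning concrete, here is a sketch using the simple model above, assuming the latency \(T_L\) is paid once per transfer operation. The symbols \(n\) (the number of separate transfers) and \(\Delta S\) (extra, unneeded data included in a combined transfer) are introduced here for illustration. Sending a total of \(S\) bytes as \(n\) separate transfers, versus one combined transfer that also carries \(\Delta S\) extra bytes, would cost</p>

<p>$$
T_\mathit{separate} = n T_L + \frac{S}{B}, \qquad T_\mathit{combined} = T_L + \frac{S + \Delta S}{B}
$$</p>

<p>so under these assumptions combining pays off whenever \(\frac{\Delta S}{B} \lt (n - 1) T_L\).</p>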

<p>In order to test this hypothesis, I wrote a simple program that copies
data between the CPU and GPU in varying sizes. We expected to see a
line that was basically flat up to a certain threshold size and then
see the transfer time increase linearly. Here&rsquo;s an example of what we
saw in practice.</p>

<script type="text/javascript" src="http://blog.theincredibleholk.org//ajax.googleapis.com/ajax/static/modules/gviz/1.0/chart.js"> {"dataSourceUrl":"//docs.google.com/spreadsheet/tq?key=0AmDOavun7LTxdGR6THpCUXdXc3JDVGt0ODJObUFXdmc&transpose=0&headers=0&range=A1%3AB290&gid=7&pub=1","options":{"titleTextStyle":{"bold":true,"color":"#000","fontSize":16},"vAxes":[{"title":"Time (\u00b5s)","useFormatFromData":false,"formatOptions":{"source":"inline"},"minValue":null,"viewWindowMode":"pretty","format":"0.##","logScale":true,"viewWindow":{"min":null,"max":10},"maxValue":10},{"useFormatFromData":true,"minValue":null,"viewWindow":{"min":null,"max":null},"maxValue":null}],"series":{"0":{"pointSize":2}},"title":"Hivequeen","booleanRole":"certainty","animation":{"duration":0},"pointSize":7,"legend":"right","lineWidth":0,"hAxis":{"title":"Data Size","useFormatFromData":false,"formatOptions":{"source":"inline"},"minValue":null,"format":"0.##","viewWindow":{"min":null,"max":null},"logScale":true,"gridlines":{"count":"10"},"maxValue":null},"tooltip":{},"width":750,"height":441},"state":{},"view":{},"chartType":"ScatterChart","chartName":"Chart 1"} </script>


<p>This is on a Core i7-2600K with an NVIDIA GTX460 GPU.</p>

<p>We see the general shape we expected to see. Up until about 8K, all
transfers take around 6 or 7 microseconds. Afterwards, the transfer
time increases linearly.</p>
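
<p>One way to summarize this shape, as a rough model rather than a fitted result, is a piecewise function. Here \(S_0\) (the threshold size) and \(T_0\) (the flat transfer time) are names introduced for illustration, with \(S_0\) around 8K and \(T_0\) around 6 or 7 microseconds read off the measurements above. \(B\) is the bandwidth from the earlier formula, left symbolic because we did not report a value for it.</p>

<p>$$
T(S) \approx T_0 \quad \text{for } S \le S_0, \qquad T(S) \approx T_0 + \frac{S - S_0}{B} \quad \text{for } S \gt S_0
$$</p>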

<p>Though we saw the shape we expected, many of the implications we
expected to draw from it do not hold. We expected the threshold to be in the range
of several megabytes. Instead, the threshold was at 8K. It seems
unlikely that your code will benefit from running on the GPU if you
only have 8K of data. The second conclusion we expected to draw was
that it was okay to over-approximate the data to transfer. This is
also invalid, because the size of the data your program will typically be
working with is so far above the threshold that you actually want to
minimize the amount of data transferred, in order to minimize the contribution of
the bandwidth term to the total transfer time.</p>

<p>Another lesson from this is that it&rsquo;s important to test your
intuition before basing design decisions on it. This test was pretty
easy to write, and yet the decisions we would have made based on our
assumptions might have had expensive and long-lasting consequences.</p>
]]></content>
    </entry>
  
    <entry>
      




<title type="html"><![CDATA[Hello, World!]]></title>
<link href="http://blog.theincredibleholk.org/blog/2012/11/27/hello-world/"/>
<updated>2012-11-27T19:40:00-07:00</updated>
<id>http://blog.theincredibleholk.org/blog/2012/11/27/hello-world</id>

      <content type="html"><![CDATA[<p>I&rsquo;ve decided to try entering the brave new world of
<a href="http://octopress.org">Octopress</a>. My
<a href="http://theincredibleholk.wordpress.com">old blog</a> was hosted by
WordPress, which is a perfectly fine blogging framework. However, I
found that it seems to have a lot of features for large teams of
writers that I don&rsquo;t really need. More importantly, I found writing
about code snippets really tedious, since I had to do the HTML myself
and avoid the WYSIWYG editor. In reality, I ended up writing my posts
in Markdown and then pasting the generated HTML into WordPress. This
would require further tweaking to make sure everything would still
look nice after the import. Since I was using Markdown anyway, it
makes sense to try a blogging framework based around that.</p>

<p>This should let me do some nice new things as well. For example, I
could use <a href="http://d3js.org">D3</a> to generate nice inline graphs, or I
could use <a href="http://www.mathjax.org/">MathJax</a> to render nice inline
math when needed.</p>

<p>Anyway, here goes my experiment with Octopress.</p>
]]></content>
    </entry>
  
</feed>
