  • Optimizing Dot Product

    Lately I’ve seen quite a few papers on GPU programming languages that use dot product as a benchmark, including a paper I’ve written. As I’ve thought about it some more, it seems like this may not be the most useful benchmark. The reason is that dot product does very little actual computation but accesses a lot of data, so any decent implementation should be bound by memory bandwidth. Many algorithms are memory-intensive in this way, but most offer opportunities to exploit caches through data reuse. Because dot product reads each value only once, it gets no such benefit.
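    The memory-bound argument can be made concrete with a quick arithmetic-intensity estimate. The sketch below assumes float32 inputs (an assumption for illustration, not a detail from the post):

```python
# Rough arithmetic-intensity estimate for a dot product of two float32
# vectors, illustrating why the kernel is memory-bandwidth bound.

def dot_arithmetic_intensity(n, bytes_per_element=4):
    """FLOPs performed per byte of memory traffic for an n-element dot product."""
    flops = 2 * n                           # one multiply + one add per element
    bytes_read = 2 * n * bytes_per_element  # each input value is read exactly once
    return flops / bytes_read

intensity = dot_arithmetic_intensity(1_000_000)  # 0.25 FLOPs per byte
```

    At 0.25 FLOPs per byte, any GPU with even modest arithmetic throughput will spend nearly all of its time waiting on memory, which is why dot product mostly measures bandwidth rather than the quality of the generated code.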

  • Compiling Rust for GPUs

    A couple of days back, I tweeted that I had just run code written in Rust on the GPU. It’s about time I provided some more details. This is a project I worked on with Milinda Pathirage, a fellow student at IU. I should emphasize that this is very much in the proof-of-concept stage. I doubt it will work well enough to do anything useful, but it does work well enough to do something, and it would certainly be possible to extend this. That said, I will include links to our code so the valiant hackers out there can try it out if they wish. For posterity’s sake, here is, to my knowledge, the first fragment of Rust code to ever execute on a GPU:

  • A Look at Macros in Scheme

    One of the features that sets Scheme apart as a programming language is its powerful macro system. In the same way that procedures allow you to reuse bits of code, macros allow you to reuse syntax. Macros and procedures can express many of the same things, but macros are particularly useful when you want to be careful about control flow and effects. Consider the following program.

  • A Look at GPU Memory Transfer

    One of the trickier things in programming with multiple devices is managing the transfer of data between them. This applies whether you’re programming a cluster or a machine with a CPU and a GPU. Transferring data takes time, and the programmer must be careful that the transfer time doesn’t outweigh the performance gains from parallelizing the algorithm. When talking about transfer time, we usually think of it as having two components: the time due to latency and the time due to bandwidth. The total time to transfer the data is then,
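    A minimal sketch of that two-component model (the latency and bandwidth figures below are illustrative assumptions, not measurements from the post):

```python
# Transfer time as fixed latency plus size divided by bandwidth,
# the two components described above.

def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Total time = latency + num_bytes / bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s

# Example: 64 MB over a hypothetical link with 10 microseconds of
# latency and 8 GB/s of sustained bandwidth -- roughly 8.4 ms total.
t = transfer_time(64 * 2**20, 10e-6, 8e9)
```

    One consequence of the model: for small transfers the latency term dominates, so batching many small transfers into one large transfer is usually a win.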

  • Hello, World!

    I’ve decided to try entering the brave new world of Octopress. My old blog was hosted on WordPress, which is a perfectly fine blogging framework. However, it has a lot of features for large teams of writers that I don’t really need. More importantly, I found writing about code snippets really tedious, since I had to write the HTML myself and avoid the WYSIWYG editor. In reality, I ended up writing my posts in Markdown and pasting the generated HTML into WordPress, which required further tweaking to make sure everything still looked nice after the import. Since I was using Markdown anyway, it made sense to try a blogging framework based around it.