A couple of readers pointed out improvements and corrections to my last post on GPU access patterns. These were significant enough that I thought it was worth doing a follow-up post to see how they change things.

First of all, I meant to operate on both arrays, A and B, but through some sloppy coding I ended up only using A. Incidentally, I had done some back-of-the-envelope calculations to figure out the memory bandwidth I was getting, and I was surprised to see that I was getting close to twice the theoretical peak for the cards I was working with. It turns out that's because I was only reading half the data I thought I was. Here are the corrected figures (the experiment is the same apart from the small fix to my code):

Kernel                  Tesla C1060   GeForce GTX 460   ATI Radeon HD 6750M
MyAdd                   2.764 ms      4.524 ms          36.325 ms
MyAdd_2D                10.560 ms     0.763 ms          4.273 ms
MyAdd_2D_unweave        0.740 ms      0.100 ms          2.170 ms
MyAdd_col               2.777 ms      4.527 ms          26.686 ms
MyAdd_2D_col            10.391 ms     0.961 ms          7.723 ms
MyAdd_2D_unweave_col    12.398 ms     0.708 ms          3.413 ms

We’re slower across the board, but the overall shape of the data is about the same. Interestingly, the fastest kernels are not much slower than the fastest kernels from before.

Next, reddit user ser999 pointed out that we could forgo the branch entirely with some clever arithmetic. Instead of doing

if (i % 2 == 0)
    get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
else
    get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);


we can write

get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j)*(1 - ((i&1)<<1));

Here (i&1) is 0 for even i and 1 for odd i, so 1 - ((i&1)<<1) evaluates to +1 or -1 respectively, reproducing the branch.