# Optimizing Dot Product

Lately I’ve seen quite a few papers on GPU programming languages that use dot product as a benchmark, including a paper I’ve written. As I’ve thought about it some more, it seems like this may not be the most useful benchmark. The reason is that dot product does very little actual computation, but accesses a lot of data. Any decent dot product implementation should be bound by the memory bandwidth. This is true of many algorithms, but many offer opportunities to exploit caches due to data reuse. Because dot product only reads each value once, we do not have this benefit.

Theoretically, GPUs should still do quite well. According to this post, an NVIDIA GTX480 should be able to reach 177.4 GBps of global memory bandwidth. On the other hand, a CPU with PC17000 memory can only theoretically hit 17 GBps of main memory bandwidth. This means a GPU ought to be able to outperform the CPU by about a factor of 10. Unfortunately, unless you are computing on data that you have generated on the GPU, or data that is part of a larger computation, you must also consider the time to copy the data from CPU memory to the GPU. This copy goes over the PCI Express bus, which currently tops out at about 8 GBps.

This means that once you include the data transfer time in addition to the computation time, it should be almost impossible for a GPU implementation of dot product to beat a CPU implementation. To test this, I wrote several different variants of dot product and compared their speed. The results are below.

The code for these tests is available on my GitHub. I chose vectors of 33 million single precision floats so that the total amount of data for the two vectors would be about 256 MB. This is large enough to take a measurable amount of time, while being small enough to comfortably fit in most GPU memories. As usual, these tests were conducted on a Core i7 2600K with PC17000 memory and an NVIDIA GTX460 GPU.

The Simple version is the most obvious way to write a dot product:
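The listing from the repository isn't reproduced here, but in sketch form (my naming, not necessarily the repository's) the simple version looks like this:

```cpp
#include <cstddef>

// Straightforward scalar dot product: one multiply and one add per element.
float dot_simple(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```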

My hypothesis was that it’d actually be hard to do much better than this, given how fast CPUs are relative to main memory. I was surprised to find that you could actually do significantly better by making use of the CPU’s SIMD instructions.

The bars are not actually organized in the order I tried them. The next test I did was the SSE version. This was mostly handled by the compiler: I just had to add `__attribute__ ((vector_size (sizeof(float) * 4)))` to the types I was working on and add a few casts here and there. I checked the generated assembly to be sure the compiler was actually emitting SSE instructions.
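A minimal sketch of this approach with GCC's vector extension (assuming `n` is a multiple of 4 and the inputs are 16-byte aligned; names are mine):

```cpp
#include <cstddef>

// GCC vector extension: v4sf packs four floats, and arithmetic on it
// compiles down to SSE instructions.
typedef float v4sf __attribute__ ((vector_size (sizeof(float) * 4)));

float dot_sse(const float *a, const float *b, std::size_t n) {
    v4sf sum = {0.0f, 0.0f, 0.0f, 0.0f};
    const v4sf *va = (const v4sf *)a;
    const v4sf *vb = (const v4sf *)b;
    for (std::size_t i = 0; i < n / 4; ++i)
        sum += va[i] * vb[i];                  // four multiply-adds at once
    return sum[0] + sum[1] + sum[2] + sum[3];  // horizontal sum of the lanes
}
```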

Generating AVX was a little trickier. For the most part, I just needed to change the vector size from 4 to 8. However, I was using a slightly old version of GCC, which I needed to update, and I had to add the `-march=corei7-avx` and `-mtune=corei7-avx` compiler flags. The AVX version did perform slightly better, but we only gained about a millisecond. This suggests that we’re pretty close to maxing out the memory bandwidth. Dividing 256 MB by 16.2 ms gives us 15.43 GBps, which is approaching the 17 GBps theoretical peak.
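The change amounts to widening the vector type to eight lanes (again a sketch, assuming `n` is a multiple of 8 and 32-byte-aligned inputs):

```cpp
#include <cstddef>

// Same trick with eight lanes; with -march=corei7-avx (or -mavx) GCC
// lowers this to AVX instructions. Without AVX enabled, GCC falls back
// to pairs of SSE operations.
typedef float v8sf __attribute__ ((vector_size (sizeof(float) * 8)));

float dot_avx(const float *a, const float *b, std::size_t n) {
    v8sf sum = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    const v8sf *va = (const v8sf *)a;
    const v8sf *vb = (const v8sf *)b;
    for (std::size_t i = 0; i < n / 8; ++i)
        sum += va[i] * vb[i];      // eight multiply-adds per iteration
    float total = 0.0f;
    for (int lane = 0; lane < 8; ++lane)
        total += sum[lane];        // horizontal sum
    return total;
}
```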

At this point, I had exhausted most of what I could think of to improve the performance. I decided to compare with the performance of ATLAS. ATLAS is a collection of linear algebra routines. The package includes several variants of each algorithm, and at install time the build system measures the performance of each variant on the machine you are running. Based on this, it decides the best variant to use. The ATLAS version performed quite a bit better than the simple version, but not quite as well as the SSE or AVX versions.

I pulled out gdb and disassembled the core computational loop of the ATLAS version of dot product. If you’re interested, the assembly is here. I noticed two things. First, it was using the `prefetchnta` instruction to prefetch values into the cache before they were needed. Second, and more important, the loop had been unrolled four times. I had tried a similar thing using the `-funroll-loops` compiler flag, but this made only a marginal difference. This version of unrolling looks like this:
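In sketch form (assuming `n` is a multiple of 4 for brevity), the naive unrolling is:

```cpp
#include <cstddef>

// Unrolled four times, but with a single accumulator: each statement
// depends on the previous one through `sum`, so there is little
// instruction-level parallelism to exploit.
float dot_unrolled_seq(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }
    return sum;
}
```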

The problem is, this code must still run basically sequentially because each statement depends on the previous statement. What ATLAS did, and what I copied in the Unrolled version, is this:
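A sketch of that version (again assuming `n` is a multiple of 4):

```cpp
#include <cstddef>

// Four independent accumulators: the multiply-adds within an iteration
// no longer depend on each other, so the out-of-order engine can issue
// them in parallel as soon as the data arrives.
float dot_unrolled(const float *a, const float *b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```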

This meant there was no longer any dependency between statements, and the CPU’s out-of-order execution engine could run all four of these as soon as the data was available. In a sense, the CPU is vectorizing the loop on the fly.

Based on these observations, I tried to see if we could do better using this unrolling technique and prefetching on the AVX version. This got about another 0.1 milliseconds, but clearly we are at the point of diminishing returns.
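The combination could look something like the following sketch (two independent `v8sf` accumulators plus GCC's portable `__builtin_prefetch` hint; the unroll factor and prefetch distance here are illustrative guesses, not my tuned values):

```cpp
#include <cstddef>

typedef float v8sf __attribute__ ((vector_size (sizeof(float) * 8)));

// AVX vectors, two independent accumulators, and software prefetch
// combined. Assumes n is a multiple of 16 and 32-byte-aligned inputs.
float dot_avx_unrolled(const float *a, const float *b, std::size_t n) {
    v8sf s0 = {0.0f}, s1 = {0.0f};  // remaining lanes zero-initialized
    const v8sf *va = (const v8sf *)a;
    const v8sf *vb = (const v8sf *)b;
    for (std::size_t i = 0; i < n / 16; ++i) {
        __builtin_prefetch(&va[2 * i + 16]);  // hint only; safe past the end
        __builtin_prefetch(&vb[2 * i + 16]);
        s0 += va[2 * i]     * vb[2 * i];      // independent of the next line
        s1 += va[2 * i + 1] * vb[2 * i + 1];
    }
    v8sf s = s0 + s1;
    float total = 0.0f;
    for (int lane = 0; lane < 8; ++lane)
        total += s[lane];
    return total;
}
```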

Finally, I decided to see what a GPU implementation would be like. I used NVIDIA’s CUBLAS library, which should be a very highly tuned implementation for NVIDIA GPUs. Using the library was pretty easy. I just had to remember to allocate GPU buffers and copy data into them. As I predicted, however, the CUBLAS version was far slower than even the simple version.
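The CUBLAS calls aren't shown in the post, but the shape of the code is roughly this (a sketch with error checking omitted; it requires an NVIDIA GPU and the CUDA toolkit, and the names are mine):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

float dot_cublas(const float *a, const float *b, int n) {
    float *da, *db, result;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    // These copies over PCI Express are what dominate the total time.
    cudaMemcpy(da, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSdot(handle, n, da, 1, db, 1, &result);  // result returned to host
    cublasDestroy(handle);

    cudaFree(da);
    cudaFree(db);
    return result;
}
```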

This exercise proved interesting for me. It confirmed one of my hypotheses, that it should be very hard if not impossible to make a GPU implementation of dot product outperform the CPU when the data starts in the CPU memory. I was surprised, however, to find that you can do better than the simple version on the CPU. It was also interesting to see how minor changes in the C++ source code can cause the compiler to generate very different assembly language.