In my previous post, we discussed some of the data structures that support data parallel programming. Now we’ll turn our attention to the common operators that manipulate these data structures. I’ll discuss several of them: map, reduce, scan, permute, back-permute and filter.
Map takes as input a vector and an operation, and returns a new vector that is the result of applying the operation to each element of the vector. For example, we can map add1 over a vector to increment each element by 1.
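A minimal sketch in Python (the helper names vmap and add1 are illustrative; the post doesn't fix a concrete syntax):

```python
def add1(x):
    return x + 1

def vmap(f, xs):
    # Apply f to each element of the vector, producing a new vector.
    return [f(x) for x in xs]

vmap(add1, [1, 2, 3])  # returns [2, 3, 4]
```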
Related to this is zip-with, which takes some number of equal-length vectors and applies an operator to corresponding elements in each vector. For example, we can add two vectors by zipping them with the + operator.
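A sketch of zip-with, again with an illustrative name:

```python
def zip_with(f, *vectors):
    # All input vectors must have the same length.
    assert len({len(v) for v in vectors}) == 1
    return [f(*elems) for elems in zip(*vectors)]

zip_with(lambda a, b: a + b, [1, 2, 3], [10, 20, 30])  # returns [11, 22, 33]
```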
Map and zip-with can work on segmented vectors as well, and the result will have the same shape as the input.
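As a sketch, representing a segmented vector as a list of segments (one possible representation; the previous post discussed others):

```python
def segmented_map(f, segments):
    # Map over each segment independently; the segment structure
    # (the "shape") of the result matches the input.
    return [[f(x) for x in seg] for seg in segments]

segmented_map(lambda x: x + 1, [[1, 2], [], [3]])  # returns [[2, 3], [], [4]]
```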
However, if we try to apply zip-with to vectors with differing shapes, we will end up with an error.
Reduce takes a vector of elements and combines them into a single element using a given operation. The operator \(\oplus\) must be an associative binary operator. It helps if the operator has an identity element 0, that is, one for which \(a \oplus 0 = a\) and \(0 \oplus a = a\). We can do even better if we know the operator is commutative.
Below is an example of how to add all the elements in a vector.
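A sketch (vreduce is an illustrative name; the operator and its identity element are passed explicitly):

```python
def vreduce(op, zero, xs):
    # Combine all elements with an associative operator op,
    # starting from its identity element zero.
    acc = zero
    for x in xs:
        acc = op(acc, x)
    return acc

vreduce(lambda a, b: a + b, 0, [1, 2, 3, 4])  # returns 10
```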
We can similarly define a segmented variant of reduce, which results in a vector containing the reduction of each segment.
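For example, again representing a segmented vector as a list of segments:

```python
def segmented_reduce(op, zero, segments):
    # Reduce each segment independently, yielding one element per segment.
    result = []
    for seg in segments:
        acc = zero
        for x in seg:
            acc = op(acc, x)
        result.append(acc)
    return result

segmented_reduce(lambda a, b: a + b, 0, [[1, 2], [3, 4, 5], []])  # returns [3, 12, 0]
```

Note that an empty segment reduces to the identity element, which is one reason having an identity is helpful.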
The two operators map and reduce are very powerful on their own. We can define many linear algebra operations in terms of them, and together they form the basis for MapReduce. Still, there are cases where additional operators are extremely helpful.
Scan is like reduce, except that it returns a vector of all of the intermediate results. Here's an example.
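A sketch of an exclusive scan, which matches the description below (each output entry covers only the elements strictly to its left):

```python
def scan(op, zero, xs):
    # Exclusive scan: entry i is the reduction of xs[0..i-1].
    out, acc = [], zero
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

scan(lambda a, b: a + b, 0, [1, 2, 3, 4])  # returns [0, 1, 3, 6]
```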
Each entry in the resulting vector is the sum of all of the elements to the left of the corresponding element in the source vector.
Scan is often handy as an intermediate step in a larger computation. For example, we may have a vector of vectors where the lengths of the inner vectors vary wildly. In this case, scan can be used to help balance the work between some number of worker threads.
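A hypothetical sketch of that load-balancing idea: scanning the inner-vector lengths yields each segment's starting offset in the flattened data, and dividing by an even chunk size assigns segments to workers.

```python
def assign_to_workers(lengths, n_workers):
    # Exclusive prefix sum of the lengths gives each segment's
    # starting offset in the flattened vector.
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)
        total += n
    # Split the flattened work into n_workers equal chunks and assign
    # each segment to the chunk its offset falls into.
    chunk = max(1, -(-total // n_workers))  # ceiling division
    return [min(off // chunk, n_workers - 1) for off in offsets]

assign_to_workers([5, 1, 1, 5, 4], 2)  # returns [0, 0, 0, 0, 1]
```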
Permute is used to rearrange elements in a vector. It takes a vector, of course, and also a function that maps indices to indices. The element at each input index will be stored at the corresponding output index. Here's a basic example that reverses a vector.
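A sketch (permute is an illustrative name):

```python
def permute(xs, index_map):
    # The element at input index i is stored at output index index_map(i).
    out = [None] * len(xs)
    for i, x in enumerate(xs):
        out[index_map(i)] = x
    return out

xs = [1, 2, 3, 4]
permute(xs, lambda i: len(xs) - 1 - i)  # returns [4, 3, 2, 1]
```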
There are a couple of caveats. Suppose our mapping function sent every index to 0. If we apply this to the vector [1 2 3 4], all of the elements would be mapped to the first position and there would be nothing for the later positions. To solve the first problem (collisions), we can provide a combining function that is used when multiple elements map to the same location. For example, we may want the maximum value of all elements that map to a given location. For the second problem (gaps), we can provide a default element that is used when none of the inputs map to a certain output location. Of course, now our permutation function is more complicated to use and also more complicated to implement.
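A sketch of this more general permute, with a combining function and a default element (names illustrative):

```python
def permute_with(xs, index_map, combine, default):
    # combine merges elements that collide at the same output index;
    # default fills output positions that no input maps to.
    out = [default] * len(xs)
    hit = [False] * len(xs)
    for i, x in enumerate(xs):
        j = index_map(i)
        out[j] = combine(out[j], x) if hit[j] else x
        hit[j] = True
    return out

# Every index maps to 0; collisions resolved with max, gaps filled with 0.
permute_with([1, 2, 3, 4], lambda i: 0, max, 0)  # returns [4, 0, 0, 0]
```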
In many ways, back-permute is a simpler version of permute. Back-permute also takes an input vector and an index mapping function, but instead of mapping input indices to output indices, it maps output indices to input indices. Because of this, we don’t have the problem of multiple values mapping to the same location, as the mapping function can only produce one value. This also means we can avoid the combining function. That said, because this function is simpler, it is also less powerful than permute.
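A sketch of back-permute (the output length is passed explicitly, since it need not match the input's):

```python
def back_permute(xs, index_map, out_len):
    # Output index j reads from input index index_map(j); each output
    # position reads exactly one input, so no combining is needed.
    return [xs[index_map(j)] for j in range(out_len)]

back_permute([10, 20, 30], lambda j: j % 3, 6)  # returns [10, 20, 30, 10, 20, 30]
```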
All of our operators so far except for reduce have essentially mapped some number of inputs to the same number of outputs. Sometimes we need to remove elements that we are working with. An example is when writing quicksort: we'd need to split a vector into one vector containing those elements greater than the pivot and another containing those elements less than the pivot. Using filter, we could do this as follows.
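A sketch of the idea (vfilter is an illustrative name; this version also collects the elements equal to the pivot so duplicates are handled):

```python
def vfilter(pred, xs):
    # Keep only the elements satisfying the predicate.
    return [x for x in xs if pred(x)]

def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot = xs[0]
    less    = vfilter(lambda x: x < pivot, xs)
    equal   = vfilter(lambda x: x == pivot, xs)
    greater = vfilter(lambda x: x > pivot, xs)
    # In a data parallel setting, the two recursive calls can run in parallel.
    return quicksort(less) + equal + quicksort(greater)

quicksort([3, 1, 4, 1, 5, 9, 2, 6])  # returns [1, 1, 2, 3, 4, 5, 6, 9]
```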
Many of these operators can be written in terms of other ones. For example, you can get reduce from scan by combining the last element of the result with the last element of the input (since an exclusive scan leaves each element out of its own prefix). This version will probably not be efficient, as it may allocate far more storage than necessary. In designing a data parallel system, one important decision is which operators will be considered primitive, and which will be built in terms of others. Ideally, these decisions should consider the characteristics of the underlying hardware. In Guy Blelloch's book, Vector Models for Data-Parallel Computing, he argues in favor of a set of primitives that can execute in about the same amount of time it would take to read the input vector from memory. Similarly, when implementing a data parallel system for GPUs, it's important to consider what operators are easy to implement. Map and back-permute are pretty straightforward, but others are more difficult. Ideally, upon choosing a good set of primitives, the compiler and runtime can optimize the other operators into efficient code.
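As a sketch of the reduce-from-scan derivation (assuming the exclusive scan described earlier):

```python
def scan(op, zero, xs):
    # Exclusive scan: entry i is the reduction of xs[0..i-1].
    out, acc = [], zero
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

def reduce_via_scan(op, zero, xs):
    if not xs:
        return zero
    # The last scan entry excludes the final element, so fold it back in.
    return op(scan(op, zero, xs)[-1], xs[-1])

reduce_via_scan(lambda a, b: a + b, 0, [1, 2, 3, 4])  # returns 10
```

The inefficiency the text mentions is visible here: the intermediate scan result is a whole vector that is thrown away after one element is read.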