Given a problem, we want to think at a higher level instead of in low-level implementation details. These high-level structures work as building blocks for assembling solutions to complex problems. Parallel patterns tell us what to compute, not how to compute it. The idea is:
Focus on high-level structures and patterns
Higher-level patterns are built from more fundamental ones
Let the programmers worry about how to implement the patterns
Reduce
Many parallel threads need to generate a single result
order doesn’t matter to the result
associative operator (+, *, min/max, AND/OR, …)
Assume the size of the array is $N$
Reduction takes $\log(N)$ steps to complete. Step $i$ does $N/2^i$ independent operations. Since each step is done in parallel, the overall complexity is $O(\log(N))$.
The total number of operations is $\sum_{i=1}^{\log(N)} \frac{N}{2^i} = N-1 = O(N)$. Parallel reduction does no more operations than the sequential one, so it is work-efficient.
Typically, given a large array, each processor first reduces its own chunk sequentially, then the partial results are combined in parallel (see the sketch below). Given $p$ processors, the time complexity is $O(\frac{N}{p} + \log(p))$.
Optimization of parallel reduction on GPU [1].
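As a minimal sketch (plain Python rather than real GPU code; `tree_reduce` and the default `operator.add` are my own choices, not from the source), each pass of the loop below corresponds to one parallel step of the tree reduction:

```python
import operator

def tree_reduce(data, op=operator.add):
    """Tree-shaped reduction: log2(N) steps, where step i does N/2^i
    pairwise combinations that could all run in parallel."""
    values = list(data)
    while len(values) > 1:
        # One parallel step: combine disjoint pairs independently.
        combined = [op(a, b) for a, b in zip(values[0::2], values[1::2])]
        if len(values) % 2:            # odd leftover carries forward
            combined.append(values[-1])
        values = combined
    return values[0]

print(tree_reduce(range(1, 9)))        # 36, same as sum(range(1, 9))
```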
Scan (Prefix Sum)
Each thread outputs the sum of all elements up to its position, either including its own element (inclusive) or not (exclusive)
binary associative operator [2]
done in place
Assuming all data resides in one thread block, a single-phase scan (see the sketch below)
takes $O(\log(N))$ steps
in step $i$, each thread works with another thread that is $2^{i-1}$ offset away
$N-2^{i-1}$ threads participate
overall complexity $O(\log(N))$
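A sketch of this single-phase scan, assuming a Hillis-Steele-style step structure (the function simulates the parallel steps sequentially; names are illustrative):

```python
def inclusive_scan(data, op=lambda a, b: a + b):
    """Single-phase inclusive scan: log2(N) steps; in step i each
    element combines with the element 'offset' = 2^(i-1) positions
    to its left, all in parallel."""
    x = list(data)
    offset = 1
    while offset < len(x):
        # The comprehension reads only the old values of x, mimicking
        # the double-buffering that parallel threads would need.
        x = [x[j] if j < offset else op(x[j - offset], x[j])
             for j in range(len(x))]
        offset *= 2
    return x

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```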
If the data is too large to fit into one thread block, we take a two-phase approach (sketched after this list)
on GPU, using $b$ thread blocks, $O(\log(\frac{N}{b}) + \log(b)) = O(\log(N))$:
Partition the data into $b$ thread blocks of $N/b$ elements each; each block does a parallel scan locally
Perform another parallel scan using the last element from each block, which is the sum of all elements within that block
Add the corresponding result from the second scan to all elements within each thread block
on multi-core CPU with $p$ processors [3], $O(\frac{N}{p} + \log(p))$:
Partition the data into $p$ chunks of $N/p$ elements; each processor does a sequential scan locally
Perform a parallel scan using the last element from each processor, which is the sum of all elements in that processor's chunk
Add each processor's result from the parallel scan to each of its local prefix sums
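A sketch of the two-phase approach, under the assumption that chunks stand in for thread blocks or processors (`two_phase_scan` and the chunking scheme are my own):

```python
from itertools import accumulate

def two_phase_scan(data, num_chunks):
    """Two-phase inclusive scan over 'num_chunks' chunks
    (thread blocks on a GPU, processors on a CPU)."""
    n = len(data)
    size = (n + num_chunks - 1) // num_chunks
    chunks = [data[i:i + size] for i in range(0, n, size)]

    # Phase 1: each chunk scans locally (in parallel across chunks).
    local = [list(accumulate(c)) for c in chunks]

    # Phase 2: scan the per-chunk totals (last element of each local
    # scan), then add each chunk's offset to all of its elements.
    offsets = [0] + list(accumulate(s[-1] for s in local))[:-1]
    return [v + off for s, off in zip(local, offsets) for v in s]

data = [3, 1, 7, 0, 4, 1, 6, 3]
assert two_phase_scan(data, 4) == list(accumulate(data))
```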
Split, Compact, Expand
Split: re-order an array so that all elements whose key is 1 come first
Compact: remove null elements
Expand: each thread writes a variable number of outputs, so its write index has a variable stride
All three patterns must answer "where should each thread write its output?" The answer depends on where the other threads write, and the position where each thread writes can be efficiently calculated with a parallel scan [4], as sketched below.
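A sketch of compaction under this scan-based scheme (`compact` and its arguments are hypothetical names; on a GPU the flag scan would itself be a parallel scan):

```python
def compact(data, keep):
    """Stream compaction: keep[i] is 1 if data[i] survives.
    An exclusive scan of the flags gives each surviving element
    its output index, so all writes are independent."""
    # Exclusive scan of the flags.
    positions, total = [], 0
    for f in keep:
        positions.append(total)
        total += f

    out = [None] * total
    for i, f in enumerate(keep):       # each "thread" writes once
        if f:
            out[positions[i]] = data[i]
    return out

print(compact(['a', 'x', 'b', 'x', 'c'], [1, 0, 1, 0, 1]))
# ['a', 'b', 'c']
```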
Segmented Scan
Scan only within regions (separated by Key; sizes known only at run time)
Regions are separated by barriers (1s) in Key
Operations on Value don't propagate beyond barriers
Key and Value are scanned at once (separately), as in the sketch after this list
The Key scan normally uses the OR operator
A Value only performs the operation when its associated Key is 0 at the current stage
Doing lots of reductions of unpredictable size at the same time is the most common use
Doing sums/max/count/any over arbitrary sub-domains of your data
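A sketch of segmented inclusive scan following the Key/Value rules above, assuming flags of 1 mark segment starts (names are illustrative):

```python
def segmented_scan(values, flags, op=lambda a, b: a + b):
    """Segmented inclusive scan: flags[i] == 1 marks the start of a
    segment. Keys are scanned with OR while values are scanned with
    'op'; a value stops combining once its flag has become 1."""
    x, f = list(values), list(flags)
    offset = 1
    while offset < len(x):
        nx, nf = x[:], f[:]            # double-buffer old values
        for j in range(offset, len(x)):
            if f[j] == 0:              # barrier not reached yet
                nx[j] = op(x[j - offset], x[j])
            nf[j] = f[j] | f[j - offset]   # OR-scan on the keys
        x, f = nx, nf
        offset *= 2
    return x

print(segmented_scan([1, 2, 3, 4, 5], [1, 0, 0, 1, 0]))
# [1, 3, 6, 4, 9]  -- the scan restarts at each flag
```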
Common Usage Scenarios [5]:
Determine which region/tree/group/object class an element belongs to and assign that as its new ID
Sort based on that ID
Operate on all of the regions/trees/groups/objects in parallel, no matter what their size or number
Doing divide-and-conquer type algorithms, following a procedure similar to the above
Sort
Sorting is useful for almost everything; we may sort the dataset by Key and process it using segmented scan
Sorting is a standard approach but may not be the optimal solution
Sorting is helpful because it improves memory and execution coherence (it moves related data closer together)
Optimized GPU sorting implementations already exist
Radix sort (bucket sort) is faster than comparison-based sorts if you can generate fixed-size keys, as in the sketch below
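As an illustration (not an optimized GPU implementation), a least-significant-digit binary radix sort in which each pass is exactly the split pattern, so on a GPU it reduces to one scan per bit:

```python
def radix_sort(keys, num_bits):
    """LSD radix sort on fixed-size integer keys: one stable 'split'
    pass per bit, zeros first."""
    for bit in range(num_bits):
        zeros = [k for k in keys if not (k >> bit) & 1]
        ones = [k for k in keys if (k >> bit) & 1]
        keys = zeros + ones            # stable: preserves order
    return keys

print(radix_sort([170, 45, 75, 90, 2, 24, 66], 8))
# [2, 24, 45, 66, 75, 90, 170]
```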
Bitonic Split: compare element $i$ in the first half of a bitonic sequence of length $p$ with element $i + p/2$, collecting the smaller of each pair into $l_{\min}$ and the larger into $l_{\max}$; the resulting $l_{\min}$ and $l_{\max}$ are still bitonic, and $\max(l_{\min}) \leq \min(l_{\max})$
Bitonic Merge: turning a bitonic sequence into a sorted sequence using repeated bitonic split operations
$BM(p, p) = O(\log p)$
The arrows in the figure above refer to CompareExchange: compare two values and move the smaller one along the arrow
\[
\begin{split}
T(p, p) &= BM(2, 2) + BM(4, 4) + \dots + BM(p, p) \\
&= 1 + 2 + \dots + \log p \\
&= O(\log^2 p)
\end{split}
\]
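A sketch of the full bitonic sorting network, assuming the array length is a power of two (the outer loop builds bitonic sequences of doubling size; the inner loops perform the CompareExchange steps of each bitonic merge):

```python
def bitonic_sort(a):
    """Iterative bitonic sorting network; len(a) must be a power of
    two. k is the size of the bitonic sequences being merged; j is
    the CompareExchange stride inside each bitonic merge step."""
    n = len(a)
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):          # all CompareExchanges in a
                partner = i ^ j         # step are independent
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([5, 1, 4, 8, 7, 2, 3, 6]))
# [1, 2, 3, 4, 5, 6, 7, 8]
```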
When $n>p$
sort $n/p$ elements locally on each processor - $O(\frac{n}{p} \log \frac{n}{p})$
Run bitonic sort with CompareExchange replaced by MergeSplit - merge two sorted arrays and split the result back into two halves (sketched below)
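A sketch of a MergeSplit step, assuming each processor holds a sorted chunk; in practice the merge would be a linear-time merge of the two sorted arrays rather than a re-sort:

```python
def merge_split(low_chunk, high_chunk):
    """MergeSplit for bitonic sort when n > p: merge two sorted
    chunks, then the 'low' processor keeps the smaller half and
    the 'high' processor keeps the larger half."""
    merged = sorted(low_chunk + high_chunk)
    return merged[:len(low_chunk)], merged[len(low_chunk):]

lo, hi = merge_split([1, 4, 9], [2, 3, 7])
print(lo, hi)  # [1, 2, 3] [4, 7, 9]
```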
MapReduce [6] combines the sort and scan patterns described above.
Map
The user provides a function to apply to the input data: take one input, apply some operations, and write one output; there is no communication between different instances of Map
The function can be anything that produces a (key, value) pair
The value does not necessarily have to be a number; it can just be a pointer to an arbitrary data structure
Sort
(key, value) pairs are sorted based on their keys implicitly
Creates runs of (key, value) pairs with the same key
Reduce
A segmented scan with a function provided by the user, which could be a simple plus, max, etc.
The final result is a list of keys and final values (or arbitrary data structures), as in the sketch below
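A minimal sketch of the whole pipeline, using word count as the example; `map_reduce` and its arguments are illustrative, not a real framework API:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    """Minimal MapReduce: map each input to (key, value) pairs,
    sort by key to create runs, then reduce each run."""
    pairs = [kv for item in inputs for kv in map_fn(item)]
    pairs.sort(key=itemgetter(0))              # the implicit sort
    return [(key, reduce_fn([v for _, v in run]))
            for key, run in groupby(pairs, key=itemgetter(0))]

# Word count: map emits (word, 1); reduce sums each run.
docs = ["the quick fox", "the lazy dog"]
print(map_reduce(docs,
                 lambda doc: [(w, 1) for w in doc.split()],
                 sum))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```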