Parallel Patterns

Apr 22, 2019, 1175 words

Why Parallel with Patterns?

Given a problem, we want to think in terms of high-level structures instead of low-level implementation details. These high-level structures work as building blocks for assembling solutions to complex problems. Parallel patterns tell us what to compute, not how to compute it. The idea is

  • Focus on high-level structures and patterns
  • Higher-level patterns are built from more fundamental patterns
  • Let the programmers worry about how to implement the patterns

Reduce

Many parallel threads need to generate a single result

  • order doesn’t matter to the result
  • associative operator (+, *, min/max, AND/OR, …)

Assuming an array of size $N$:

  • Reduction takes $\log(N)$ steps to complete. Step $i$ does $N/2^i$ independent operations. Since each step runs in parallel, the overall complexity is $O(\log(N))$.
  • The total number of operations is $\sum_{i=1}^{\log(N)} \frac{N}{2^i} = N-1 = O(N)$. Parallel reduction does no more operations than the sequential one, so it is work-efficient.
  • Typically, given a large array and $p$ processors, each processor first reduces its $\frac{N}{p}$ elements sequentially, then the $p$ partial results are reduced in parallel. The time complexity is $O(\frac{N}{p} + \log(p))$.

See [1] for optimizations of parallel reduction on the GPU.
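
To make the step structure concrete, here is a minimal Python sketch of the tree-structured reduction (the function name is illustrative; a power-of-two $N$ is assumed, and each parallel step is simulated with a sequential loop):

```python
import operator

def parallel_reduce(data, op=operator.add):
    """Reduce `data` with an associative operator in log2(N) steps."""
    a = list(data)
    stride = len(a) // 2            # assumes len(a) is a power of two
    while stride > 0:
        # Step i: N / 2^i independent pairs, combined "in parallel".
        for tid in range(stride):   # each iteration = one logical thread
            a[tid] = op(a[tid], a[tid + stride])
        stride //= 2
    return a[0]

print(parallel_reduce([3, 1, 4, 1, 5, 9, 2, 6]))       # 31
print(parallel_reduce([3, 1, 4, 1, 5, 9, 2, 6], max))  # 9
```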

Scan (Prefix Sum)

Each thread outputs the sum of all elements up to its position, either inclusively or exclusively

  • binary associative operator [2]
  • done in place

Assuming all data resides in one thread block, a single-phase scan (sketched after this list)

  • takes $O(\log(N))$ steps
  • in step $i$, each thread works with another thread that is $2^{i-1}$ offset away
  • $N-2^{i-1}$ threads participate
  • overall complexity $O(\log(N))$
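
A minimal sketch of this single-phase scan, with addition as the operator; each pass of the inner loop stands in for one parallel step, and the double buffer stands in for the barrier a real GPU kernel would need:

```python
def inclusive_scan(data):
    """Single-phase inclusive scan (sum), one buffer swap per step."""
    a = list(data)
    n, offset = len(a), 1               # offset = 2^(i-1) in step i
    while offset < n:
        nxt = a[:]                      # double buffer in place of a barrier
        for tid in range(offset, n):    # n - 2^(i-1) threads participate
            nxt[tid] = a[tid - offset] + a[tid]
        a = nxt
        offset *= 2
    return a

print(inclusive_scan([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```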

If the data is too large to fit into one thread block, we take a two-phase approach (see the sketch after this list)

  • on GPU, using $b$ thread blocks, $O(\log(\frac{N}{b}) + \log(b)) = O(\log(N))$:
    • Partition the data into $b$ thread blocks ($N/b$ elements each); each block does a parallel scan locally
    • Perform another parallel scan using the last element from each block, which is the sum of all elements within that block
    • Add the corresponding result from the second scan to all elements within each thread block
  • on multi-core CPU with $p$ processors [3], $O(\frac{N}{p} + \log(p))$:
    • Partition the data across $p$ processors ($N/p$ elements each); do a sequential scan locally on each processor
    • Perform parallel scan using the last element from each processor, which is the sum of all elements within that processor
    • Add the result of parallel scan on each processor to each of its local prefix sum
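
The two-phase idea, sketched in Python under the assumption that blocks and per-block scans would run in parallel on real hardware (`itertools.accumulate` stands in for the local scans):

```python
from itertools import accumulate

def two_phase_scan(data, block_size):
    """Inclusive scan of `data` split across blocks of `block_size`."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Phase 1: local inclusive scan within each block (parallel per block).
    local = [list(accumulate(b)) for b in blocks]
    # Phase 2: scan the last element (= block sum) of each block, then
    # shift to get each block's offset (an exclusive scan of block sums).
    block_sums = list(accumulate(b[-1] for b in local))
    offsets = [0] + block_sums[:-1]
    # Add each block's offset to its local results (parallel per element).
    return [x + off for b, off in zip(local, offsets) for x in b]

print(two_phase_scan([1, 2, 3, 4, 5, 6, 7, 8], block_size=3))
# [1, 3, 6, 10, 15, 21, 28, 36]
```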

Split, Compact, Expand

  • Split: reorder an array so that all elements whose key is 1 come first
  • Compact: remove null elements
  • Expand: each thread writes a variable number of outputs, i.e. output indices have variable stride

All three of these patterns must answer the question “where should each thread write its output?” The answer depends on where the other threads write, and the position each thread writes to can be efficiently calculated with a parallel scan [4], as the compaction sketch below shows.
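
A sketch of Compact as an example, assuming a sum scan is available; Split and Expand compute their write positions the same way:

```python
from itertools import accumulate

def compact(data, keep):
    """Remove elements failing `keep`; a scan computes write positions."""
    flags = [1 if keep(x) else 0 for x in data]
    # Exclusive scan of flags = output index for each surviving element.
    write_idx = [0] + list(accumulate(flags))[:-1]
    out = [None] * sum(flags)
    for tid, x in enumerate(data):      # each iteration = one thread
        if flags[tid]:
            out[write_idx[tid]] = x
    return out

print(compact([5, 0, 3, 0, 0, 8], lambda x: x != 0))  # [5, 3, 8]
```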

Segmented Scan

Scan only within regions (separated by Key; region sizes are known only at run time)

  • Regions are separated by barriers (1s) in Key
  • Operations on Value don’t propagate beyond barriers
  • Key and Value are scanned at once (separately), as in the sketch after this list
    • Key is scanned normally using the OR operator
    • Value only combines with its partner when its associated Key is 0 at the current stage
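
A sketch of this segmented scan, assuming + as the Value operator and a 1 in `keys` at the start of each region:

```python
def segmented_scan(values, keys):
    """Segmented inclusive sum-scan; `keys` holds 1 at segment starts."""
    v, f = list(values), list(keys)
    n, offset = len(values), 1
    while offset < n:
        nv, nf = v[:], f[:]
        for tid in range(offset, n):
            if f[tid] == 0:             # no barrier absorbed yet
                nv[tid] = v[tid - offset] + v[tid]
            nf[tid] = f[tid] | f[tid - offset]  # OR-scan of the keys
        v, f = nv, nf
        offset *= 2
    return v

print(segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 0]))
# [1, 3, 6, 4, 9, 15]
```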

  • Doing lots of reductions of unpredictable size at the same time is the most common use
  • Doing sums/max/count/any over arbitrary sub-domains of your data
  • Common Usage Scenarios [5]:
    • Determine which region/tree/group/object class an element belongs to and assign that as its new ID
    • Sort based on that ID
    • Operate on all of the regions/trees/groups/objects in parallel, no matter what their size or number
  • Doing divide-and-conquer type algorithms, following a procedure similar to the above

Sort

  • Sorting is useful for almost everything: we can sort the dataset by Key and process it with a segmented scan
  • Sorting is a standard approach but may not be the optimal solution
  • Sorting is helpful because it improves memory and execution coherence (move related data closer)
  • Optimized GPU sorting implementations already exist
    • Radix sort (bucket sort) is faster than comparison-based sorts, if you can generate fixed-size keys

Example: Parallel Merge Sort

With $p$ processors, the two halves of the array are sorted in parallel on disjoint halves of the processors, and the $O(n)$ merge at each level is sequential:

\[\begin{equation} \begin{split} T(n, p) &= T(\frac{n}{2}, \frac{p}{2}) + O(n) \\ &= T(\frac{n}{4}, \frac{p}{4}) + O(\frac{n}{2}) + O(n) \\ &= ... \\ &= T(\frac{n}{p}, 1) + O(n + \frac{n}{2} + ... + \frac{n}{p/2}) \\ &= O(\frac{n}{p} \log \frac{n}{p} + n) \end{split} \end{equation}\]
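
A sketch mirroring this recurrence (the two recursive calls would run on disjoint halves of the processors; here they run one after the other):

```python
def parallel_merge_sort(a, p):
    """Mirror of the recurrence: each half gets half of the processors."""
    if p <= 1 or len(a) <= 1:
        return sorted(a)            # T(n/p, 1): local sequential sort
    mid = len(a) // 2
    left = parallel_merge_sort(a[:mid], p // 2)   # these two calls would
    right = parallel_merge_sort(a[mid:], p // 2)  # run in parallel
    out, i, j = [], 0, 0            # O(n) sequential merge per level
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(parallel_merge_sort([5, 2, 9, 1, 7, 3], p=4))  # [1, 2, 3, 5, 7, 9]
```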

Example: Bitonic Sort

  • Bitonic Sequence
    • increases first then decreases, OR
    • decreases first then increases, OR
    • or can be circularly shifted so that one of the first two cases holds
  • Bitonic Split: split a bitonic sequence $l$ of length $n$ into two halves, $l_{\min}$ and $l_{\max}$
    • $l_{\min} = \min(x_0, x_{n/2}), \min(x_1, x_{n/2+1}), …, \min(x_{n/2-1}, x_{n-1})$
    • $l_{\max} = \max(x_0, x_{n/2}), \max(x_1, x_{n/2+1}), …, \max(x_{n/2-1}, x_{n-1})$
    • the resulting $l_{\min}$ and $l_{\max}$ are still bitonic, and $\max(l_{\min}) \leq \min(l_{\max})$
  • Bitonic Merge: turning a bitonic sequence into a sorted sequence using repeated bitonic split operations
    • $BM(p, p) = O(\log p)$

Arrows in the standard bitonic network diagram denote CompareExchange - compare two values and move the smaller one along the arrow. Summing the merge costs over all stages, sorting $p$ elements with $p$ processors takes

\[\begin{equation} \begin{split} T(p, p) &= BM(2, 2) + BM(4, 4) + ... + BM(p, p) \\ &= 1 + 2 + ... + \log p \\ &= O(\log^2 p) \end{split} \end{equation}\]
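
A compact recursive sketch of the whole algorithm, assuming a power-of-two input length; all compare-exchanges inside one stage are independent, so each stage is one parallel step:

```python
def compare_exchange(a, i, j, ascending):
    """Swap so the smaller value ends up at index i when ascending."""
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]

def bitonic_merge(a, lo, n, ascending):
    """Sort a bitonic subsequence via repeated bitonic splits."""
    if n > 1:
        half = n // 2
        for i in range(lo, lo + half):      # one parallel stage
            compare_exchange(a, i, i + half, ascending)
        bitonic_merge(a, lo, half, ascending)
        bitonic_merge(a, lo + half, half, ascending)

def bitonic_sort(a, lo=0, n=None, ascending=True):
    """Build a bitonic sequence from two sorted halves, then merge."""
    if n is None:
        n = len(a)                  # must be a power of two
    if n > 1:
        half = n // 2
        bitonic_sort(a, lo, half, True)          # ascending half
        bitonic_sort(a, lo + half, half, False)  # descending half
        bitonic_merge(a, lo, n, ascending)

data = [3, 7, 4, 8, 6, 2, 1, 5]
bitonic_sort(data)
print(data)  # [1, 2, 3, 4, 5, 6, 7, 8]
```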

When $n>p$

  • Sort $n/p$ elements locally on each processor - $O(\frac{n}{p} \log \frac{n}{p})$
  • Run bitonic sort with CompareExchange replaced by MergeSplit - merge two sorted arrays and split the result back into two halves
  • Total: $O(\frac{n}{p} \log \frac{n}{p} + \frac{n}{p} \log^2 p)$

MapReduce

MapReduce [6] combines the sort and scan patterns described above; a minimal end-to-end sketch follows the list below.

  • Map
    • User provides a function to apply to the input data: it takes one input, applies some operations, and writes one output; there is no communication between different instances of Map
    • Function can be anything which produces a (key, value) pair
      • the value need not be a number; it can be a pointer to an arbitrary data structure
  • Sort
    • (key, value) pairs are sorted based on their keys implicitly
    • Creates runs of (key, value) pairs with the same key
  • Reduce
    • A segmented scan/reduction with a user-provided function, which could be a simple plus, max, etc.
    • Final result is a list of keys and final values (or arbitrary data structures)
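
A word-count sketch of the whole Map → Sort → Reduce pipeline; `map_reduce` and its arguments are illustrative names, not Google's actual API:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: each input independently emits (key, value) pairs.
    pairs = [kv for item in inputs for kv in map_fn(item)]
    # Sort: create runs of pairs that share the same key.
    pairs.sort(key=itemgetter(0))
    # Reduce: combine each run's values (a segmented reduction).
    return {key: reduce_fn([v for _, v in run])
            for key, run in groupby(pairs, key=itemgetter(0))}

docs = ["the quick fox", "the lazy dog", "the fox"]
print(map_reduce(docs,
                 map_fn=lambda doc: [(w, 1) for w in doc.split()],
                 reduce_fn=sum))
# {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```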

References

[1]. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

[2]. http://mathonline.wikidot.com/associativity-and-commutativity-of-binary-operations

[3]. CSE 6220 - High Performance Computing, Georgia Tech

[4]. https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/lectures/lecture_6/parallel_patterns_1.pdf

[5]. https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/lectures/lecture_7/parallel_patterns_2.pdf

[6]. https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0007.html