GPU Performance Measurements and Limiters
Jan 12, 2020, 1873 words

Example - Square Matrix-Matrix Multiplication
Each output element AB(i, j) is the dot product of row i from matrix A and column j from matrix B.

// Instruction/Bytes ratio is 0.25 (limited by memory bandwidth)
__global__ void mm_mul1(float *A, float *B, float *AB, int N) {
// get the global row and column indices
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
float local = 0; // accumulate local temp summation of AB(i, j)
// dot product
// 2*N FP32 operations: (1 mul + 1 add) * N
// 2*N FP32 loads: 2 * 4 Bytes * N = 8*N Bytes
// Instruction/Bytes ratio: (2*N)/(8*N) = 0.25
for (int i = 0; i < N; i++) local += A[row*N+i] * B[i*N+col];
AB[row*N+col] = local; // assign local result to AB(i, j)
}
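For reference, a minimal host-side launch sketch (the helper name and the 16x16 block shape are illustrative assumptions; dA, dB, and dAB are assumed to be device buffers of N*N floats already populated, with N a multiple of 16):

// Minimal launch sketch for mm_mul1 (assumption: N is a multiple of 16,
// and dA/dB/dAB are device allocations of N*N floats already populated).
void launch_mm_mul1(float *dA, float *dB, float *dAB, int N) {
    dim3 block(16, 16);                       // one thread per output element
    dim3 grid(N / 16, N / 16);                // cover the whole N x N output
    mm_mul1<<<grid, block>>>(dA, dB, dAB, N);
    cudaDeviceSynchronize();                  // wait for the kernel to finish
}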
Load tiles of A and B into __shared__ memory instead of loading single elements. Reuse the pre-loaded tiles to save memory bandwidth [1].

// Instruction/Bytes ratio is 4 (improved memory bandwidth usage via shared memory)
#define TILE_SIZE 16
__global__ void mm_mul2(float *A, float *B, float *AB, int N) {
// allocate shared memory for tiles of A and B
__shared__ float sharedA[TILE_SIZE][TILE_SIZE];
__shared__ float sharedB[TILE_SIZE][TILE_SIZE];
// get the global row and column indices
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
float local = 0; // accumulate local temp summation of AB(i, j)
// loop over matrices A and B tile by tile
// 2*N FP32 operations in total: (1 mul + 1 add) * TILE_SIZE * N/TILE_SIZE
// 2*N/TILE_SIZE FP32 loads in total: 2 * 4 Bytes * N/TILE_SIZE = 8*N/TILE_SIZE Bytes
// Instruction/Bytes ratio: (2*N)/(8*N/TILE_SIZE) = TILE_SIZE/4 = 4
for (int i = 0; i < N/TILE_SIZE; i++) {
// cooperate with the other threads in the block to load tiles into shared memory
sharedA[threadIdx.y][threadIdx.x] = A[row*N + (i*TILE_SIZE + threadIdx.x)];
sharedB[threadIdx.y][threadIdx.x] = B[(i*TILE_SIZE + threadIdx.y)*N + col];
__syncthreads();
// dot product purely using shared memory
for (int j = 0; j < TILE_SIZE; j++) local += sharedA[threadIdx.y][j] * sharedB[j][threadIdx.x];
__syncthreads();
}
AB[row*N+col] = local;
}
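A matching launch sketch (again an illustrative assumption, not part of the original): the block shape must be exactly TILE_SIZE x TILE_SIZE, because each thread loads one element of each tile, and N is assumed to be a multiple of TILE_SIZE.

// Minimal launch sketch for mm_mul2 (assumptions: N is a multiple of
// TILE_SIZE, and the block is exactly TILE_SIZE x TILE_SIZE so the
// shared-memory loads above cover each tile with no gaps).
void launch_mm_mul2(float *dA, float *dB, float *dAB, int N) {
    dim3 block(TILE_SIZE, TILE_SIZE);
    dim3 grid(N / TILE_SIZE, N / TILE_SIZE);
    mm_mul2<<<grid, block>>>(dA, dB, dAB, N);
    cudaDeviceSynchronize();
}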
The figure below shows the SIMT architecture of an Nvidia GPU: the left side (SI) runs in serial while the right side (MT) runs in parallel [2].
How to schedule blocks onto SMs
How to schedule warps
The hardware has fixed limits, e.g., FP32 ops/cycle and memory bandwidth. For any given limit, the hardware determines the “Supply” you paid for. Your “Demand” can fall below it, but you can never go faster than that number.
\[\begin{equation} \begin{split} Demand & \leq Supply \\ \rightarrow \lambda_{TB} \times N_t \times N_{op} & \leq \lambda_{op} \\ \rightarrow \lambda_{TB} & \leq \lambda_{op} / (N_t \times N_{op}) \end{split} \end{equation}\]where $\lambda_{TB}$ is the thread block launch rate, $N_t$ is the number of threads per block, $N_{op}$ is the number of operations of a given type per thread, and $\lambda_{op}$ is the hardware throughput of that operation type (operations per cycle).
For any type of operation unit, i.e., load unit, add unit, multiply unit, etc., you can write a limiter equation. You can never exceed the limit you get from this equation.
The most constraining limiter is called the critical limiter. Ideally, every limiter is the critical limiter, so no resource is wasted. You will want to adjust your algorithm to distribute the load so every unit is uniformly used.
There are three major types of limits on the performance of an SM.
I have a kernel with many thread blocks pending to run: how fast can I push these thread blocks to the machine?
// Compute vector sum C = A+B
// per thread: FP32 ADD: 1, INT ADD: 1, INT MUL: 1
// per thread: LOAD: 2, STORE: 1
__global__ void vecAdd(float* A, float* B, float* C) {
int i = threadIdx.x + blockDim.x * blockIdx.x; // global element index
C[i] = A[i] + B[i]; // 2 loads, 1 FP32 add, 1 store
}
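To make the limiter equation concrete, here is a worked sketch with made-up per-SM throughput numbers (128 FP32 adds/cycle and 32 loads/cycle are assumptions for illustration, not specs). With $N_t = 256$ threads per block, vecAdd does 1 FP32 add and 2 loads per thread, so

\[\begin{equation} \begin{split} \lambda_{TB} & \leq \lambda_{add} / (N_t \times N_{add}) = 128 / (256 \times 1) = 1/2 \ \text{blocks/cycle} \\ \lambda_{TB} & \leq \lambda_{ld} / (N_t \times N_{ld}) = 32 / (256 \times 2) = 1/16 \ \text{blocks/cycle} \end{split} \end{equation}\]

The load unit gives the smaller bound, so it is the critical limiter here: no matter how fast the FP32 adders are, this kernel cannot launch thread blocks faster than 1/16 per cycle on that hypothetical SM.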
The SM also has limited space resources, i.e., warp slots, register file, and shared memory size. These space resources are allocated when you launch thread blocks (based on what you ask for) and deallocated when they complete. The resulting limit is calculated using Little’s Law.
Little’s law (queueing theory)
The number of items in a queue, or “in flight”, equals the rate at which they arrive times the latency of the server pulling them out of the queue.
\[\begin{equation} \begin{split} & N = \lambda \times L \\ \rightarrow & \lambda \leq N / L \end{split} \end{equation}\]where $N$ is the number of items in flight, $\lambda$ is the rate at which they arrive (the throughput), and $L$ is the latency each one spends in the system.
Example: assume your device can hold a maximum of 48 warps at a given time, so $N$ is 48. If a warp takes 1000 cycles to execute, $L$ is 1000, so the warp throughput cannot exceed $48/1000$ warps per cycle.
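The same reasoning covers the other space resources. A sketch with assumed numbers (48 KB of shared memory per SM and 2000 cycles per block are illustrative assumptions): each thread block of the tiled kernel above allocates two 16x16 FP32 tiles, i.e., 2 x 16 x 16 x 4 B = 2 KB, so

\[\begin{equation} \begin{split} N & = 48\,\mathrm{KB} / 2\,\mathrm{KB} = 24 \\ \rightarrow \lambda & \leq N / L = 24 / 2000 = 0.012 \ \text{blocks/cycle} \end{split} \end{equation}\]

Whichever space resource (warp slots, registers, or shared memory) yields the smallest $N$ sets the occupancy, and with it the launch-rate ceiling.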
Register usage per thread can be checked in the compile log and limited with the -maxrregcount option, or with __launch_bounds__.
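For example, __launch_bounds__ lets a kernel declare its maximum block size (and, optionally, a desired minimum number of resident blocks per SM) so the compiler caps register usage accordingly; a minimal sketch (the kernel name is illustrative):

// Hint to the compiler: at most 256 threads per block, and aim for at
// least 4 resident blocks per SM, so cap registers per thread accordingly.
__global__ void __launch_bounds__(256, 4) vecAddBounded(float *A, float *B, float *C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x; // global element index
    C[i] = A[i] + B[i];
}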
[1] CUDA C Programming Guide, Shared Memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
[2] Programming Massively Parallel Processors with CUDA. https://wrf.ecse.rpi.edu/Teaching/parallel-s2019/stanford/lectures/lecture_11/the_fermi_architecture.pdf