AMD GPU Programming Primer
1. The execution hierarchy: grid → workgroup → wave → thread
A GPU kernel launch is a hierarchy of work units. Bigger units contain smaller ones.
| AMD term | NVIDIA term | What it is |
|---|---|---|
| Grid | Grid | The whole kernel launch — covers the entire problem. |
| Workgroup | Block / threadblock | A group of threads on one Compute Unit (CU). Shares LDS (shared memory). Can synchronize via __syncthreads(). |
| Wavefront (wave) | Warp | 64 threads (AMD CDNA) executing the same instruction simultaneously (SIMT). |
| Thread (work-item) | Thread | One lane in a wave. Has its own thread ID and register state. |
2. Lane vs thread
"Lane" and "thread" are two views of the same physical execution slot.
- Lane = a hardware ALU slot inside a SIMD unit. There are exactly 64 lanes per SIMD on AMD CDNA.
- Thread = the software view of one lane. Has its own thread ID and private registers.
One lane runs one thread at a time. They are 1:1 within an executing wave.
1 wave (= 64 threads) running on 1 SIMD:
Lane 0 ↔ Thread 0 (running my_function with tid=0)
Lane 1 ↔ Thread 1 (running my_function with tid=1)
Lane 2 ↔ Thread 2
...
Lane 63 ↔ Thread 63
All 64 lanes execute the same instruction at the same cycle.
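As a quick sanity check, here is a minimal HIP sketch (the kernel name lane_demo and its output buffers are illustrative) that recovers each thread's lane and wave position from the built-in IDs:

#include <hip/hip_runtime.h>

__global__ void lane_demo(int* lane_out, int* wave_out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    int lane = threadIdx.x % warpSize;                 // lane inside the wave (0..63 on CDNA)
    int wave = threadIdx.x / warpSize;                 // wave index inside the workgroup
    lane_out[tid] = lane;                              // assumes the launch exactly covers the buffers
    wave_out[tid] = wave;
}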
3. Hardware: GPU → CU → SIMD → lane
Below the software hierarchy is the physical hardware:
- GPU contains many CUs (Compute Units). Example: MI300X has 304 CUs.
- CU contains 4 SIMD units. The 4 SIMDs in a CU operate in parallel.
- SIMD contains 64 lanes (ALUs) and a register file that can hold up to 8 resident waves.
GPU
├─ CU 0
│   ├─ SIMD 0 (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 1 (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 2 (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 3 (64 lanes, ≤ 8 resident waves)
│   └─ LDS (shared memory, 64 KB)
├─ CU 1
├─ ...
└─ CU 303          ← total 304 CUs on MI300X
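These counts can be queried at runtime with the standard hipGetDeviceProperties call. A small host-side sketch (printed values vary by GPU; MI300X values shown in comments):

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);                           // device 0
    printf("CUs           : %d\n",   prop.multiProcessorCount); // 304 on MI300X
    printf("wave size     : %d\n",   prop.warpSize);            // 64 on CDNA
    printf("LDS per block : %zu B\n", prop.sharedMemPerBlock);  // 64 KB
    return 0;
}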
4. Registers, VGPRs & occupancy
Register types
| Type | Size | Scope | Notes |
|---|---|---|---|
| VGPR (vector GPR) | 32-bit (4 B) | Private per lane | Up to 256 per lane per wave. Each lane sees its own VGPR. |
| SGPR (scalar GPR) | 32-bit (4 B) | Shared by 64 lanes | Used for scalar values like loop counters, addresses. |
| AGPR (accumulator GPR) | 32-bit (4 B) | Private per lane | CDNA-only. Used as MFMA accumulators. |
How much register memory does one lane have?
1 lane × 256 VGPRs × 4 bytes = 1 KB per lane
1 wave (64 lanes) × 1 KB     = 64 KB total register file used by one wave
Occupancy
Occupancy = number of waves resident on a SIMD (1 to 8). Higher occupancy enables better latency hiding.
If a wave uses 256 VGPRs/lane → only 1 wave fits in the SIMD → occupancy 1
If a wave uses 128 VGPRs/lane → 2 waves fit                  → occupancy 2
If a wave uses 32 VGPRs/lane  → 8 waves fit                  → occupancy 8 (max)

More resident waves = the SIMD can switch to another wave when one stalls on memory.
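The same arithmetic as a tiny helper (this is not a ROCm API, just the numbers above in code; real occupancy also depends on SGPR and LDS usage):

constexpr int waves_per_simd(int vgprs_per_lane) {
    int by_vgpr = 256 / vgprs_per_lane;   // register-file budget: 256 VGPRs per lane per SIMD
    return by_vgpr < 8 ? by_vgpr : 8;     // hardware cap: at most 8 resident waves
}

static_assert(waves_per_simd(256) == 1, "256 VGPRs/lane -> occupancy 1");
static_assert(waves_per_simd(128) == 2, "128 VGPRs/lane -> occupancy 2");
static_assert(waves_per_simd(32)  == 8, "32 VGPRs/lane  -> occupancy 8 (max)");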
5. Memory hierarchy: registers → LDS → cache → HBM
GPU memory has multiple levels, similar to CPU cache hierarchy:
| Level | Size (per CU / total) | Latency (cycles) | Managed by | CPU analogue |
|---|---|---|---|---|
| Registers (VGPR/SGPR) | ~256 KB / CU | ~1 | Compiler | CPU registers |
| LDS (shared memory) | 64 KB / CU | ~10–30 | Software (explicit loads/stores) | Scratchpad / fast SRAM |
| L1 cache | 16 KB / CU | ~30 | Hardware (transparent) | L1 cache |
| L2 cache | ~16 MB total | ~150 | Hardware | L2 cache |
| Infinity / L3 | ~256 MB | ~300 | Hardware | L3 cache |
| HBM (global memory) | 192 GB | ~500–1000 | HW + software | DRAM |
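LDS is the one level here that software manages explicitly. A minimal HIP sketch (kernel name and sizes are illustrative; launch it with block = 256) that stages data in LDS, synchronizes the workgroup, then reads a neighbour's element:

#include <hip/hip_runtime.h>

__global__ void lds_demo(const float* in, float* out, int N) {
    __shared__ float tile[256];                        // allocated in the CU's 64 KB LDS
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (tid < N) ? in[tid] : 0.0f;    // HBM -> register -> LDS
    __syncthreads();                                   // whole workgroup waits; LDS is now filled
    int left = (threadIdx.x + blockDim.x - 1) % blockDim.x;
    if (tid < N) out[tid] = tile[left];                // fast LDS read of a neighbour's value
}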
6. Kernel launch <<< grid, block >>>
HIP/CUDA kernel launch syntax:
add_kernel<<< grid, block >>>(A, B, C, N);
| Parameter | Meaning | Example |
|---|---|---|
| block (a.k.a. blockSize) | Threads per workgroup | 256 → 4 waves per workgroup |
| grid | Number of workgroups | 4 → 4 workgroups total |
| (implicit) | Wave count = blockSize / 64 | 256 / 64 = 4 waves per workgroup |
You don't pick the wave count directly — it is derived from blockSize. The hardware always groups threads into waves of 64 on CDNA.
add_kernel<<< grid=4, block=256 >>>(...)

Total threads    = 4 × 256 = 1024
Total waves      = 1024 / 64 = 16
Total workgroups = 4
Each workgroup → one CU
Each wave      → one SIMD inside that CU
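A common host-side sizing pattern, as a small illustrative helper (grid_for is not a library function): pick a block size that is a multiple of 64, then round the grid up so every element is covered.

inline int grid_for(int N, int block) {
    return (N + block - 1) / block;       // ceiling division
}

// add_kernel<<< grid_for(1024, 256), 256 >>>(A, B, C, 1024);
//   -> 4 workgroups × 256 threads = 1024 threads = 16 waves of 64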
7. Vector loads & the 16-byte rule
A single load instruction can pull up to 16 bytes into a thread's registers. This is the hardware limit on AMD CDNA.
The number of elements per load (called vec size) depends on the data type:
| Data type | Size (B) | vec=1 | vec=2 | vec=4 | vec=8 | vec=16 |
|---|---|---|---|---|---|---|
| fp16 / bf16 | 2 | 2 B | 4 B | 8 B | 16 B (max) | — |
| fp32 | 4 | 4 B | 8 B | 16 B (max) | — | — |
| int8 | 1 | 1 B | 2 B | 4 B | 8 B | 16 B (max) |
vec × sizeof(dtype) ≤ 16 bytes. A larger vec means fewer load instructions to move the same amount of data — faster.
Example GPU instructions
fp16, vec=1 (2 B):   global_load_ushort  v0,     v[1:2]   ; load 2 bytes  (1 fp16)
fp16, vec=4 (8 B):   global_load_dwordx2 v[0:1], v[2:3]   ; load 8 bytes  (4 fp16)
fp16, vec=8 (16 B):  global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (8 fp16) - MAX
fp32, vec=4 (16 B):  global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (4 fp32) - MAX
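In HIP the usual way to get these wide loads is to read through a vector type. Below is a sketch of an add kernel rewritten for fp32 vec=4 (the kernel name is illustrative; it assumes N is a multiple of 4 and the pointers are 16-byte aligned), which typically compiles to one global_load_dwordx4 per operand:

#include <hip/hip_runtime.h>

__global__ void add_kernel_vec4(const float4* A, const float4* B, float4* C, int N4) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one thread handles one float4
    if (tid < N4) {                                    // N4 = N / 4
        float4 a = A[tid];                             // 16 B load  (global_load_dwordx4)
        float4 b = B[tid];                             // 16 B load  (global_load_dwordx4)
        float4 c;
        c.x = a.x + b.x;  c.y = a.y + b.y;             // 4 fp32 adds in registers
        c.z = a.z + b.z;  c.w = a.w + b.w;
        C[tid] = c;                                    // 16 B store (global_store_dwordx4)
    }
}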
8. Walking through a simple kernel
__global__ void add_kernel(float* A, float* B, float* C, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x; // global thread ID
if (tid < N) {
float a = A[tid]; // load (HBM → register)
float b = B[tid]; // load (HBM → register)
float c = a + b; // ALU (register-to-register, ~1 cycle)
C[tid] = c; // store (register → HBM)
}
}
add_kernel<<< 4, 256 >>>(A, B, C, 1024);
What happens per cycle (assuming occupancy 1, the worst case):
cycle:    1      2..99    100    101..199   200    201
        ------  -------  ------  --------  -----  ------
inst:   load A  (wait)   load B  (wait)     add    store

ALU busy only 2 cycles out of ~200.
Instructions doing the real work: 4 (two loads, one add, one store). Actual cycles: ~200. The reason: each HBM load takes ~100 cycles to complete, even though issuing it takes only 1 cycle. With occupancy 1, the SIMD has nothing else to do but wait.
9. MFMA: cooperative matrix multiply
MFMA (Matrix Fused Multiply-Add) instructions are wave-cooperative: all 64 lanes work together to compute a small matrix multiply (e.g., 16×16).
v_mfma_f32_16x16x16_f16 acc, a_frag, b_frag, c_frag
64 lanes cooperatively compute D = A × B + C
where A is 16×16 fp16, B is 16×16 fp16, D is 16×16 fp32
Each lane holds a small piece of A, B, and accumulates a small piece of D.
The hardware exchanges data between lanes during execution.
Latency: ~8–32 cycles (NOT 1 cycle), but throughput is enormous:
16×16×16 = 4,096 multiply-adds per instruction per wave
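On CDNA the instruction is exposed to HIP code as a clang builtin. A hedged sketch using __builtin_amdgcn_mfma_f32_16x16x16f16 (the kernel name and per-lane addressing are deliberately simplified; a real GEMM derives each lane's offsets from the MFMA fragment layout):

#include <hip/hip_runtime.h>

typedef _Float16 half4   __attribute__((ext_vector_type(4)));  // 4 fp16 per lane
typedef float    floatx4 __attribute__((ext_vector_type(4)));  // 4 fp32 per lane

__global__ void mfma_16x16x16_demo(const _Float16* A, const _Float16* B, float* D) {
    int lane = threadIdx.x % warpSize;                 // 0..63
    // Illustrative fragment loads: lane takes 4 consecutive fp16 from A and from B.
    half4 a = *reinterpret_cast<const half4*>(A + 4 * lane);
    half4 b = *reinterpret_cast<const half4*>(B + 4 * lane);
    floatx4 c = {0.f, 0.f, 0.f, 0.f};                  // accumulator starts at zero

    c = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, c, 0, 0, 0);  // one wave-cooperative MFMA

    *reinterpret_cast<floatx4*>(D + 4 * lane) = c;     // each lane writes its 4 fp32 results
}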
CDNA vs RDNA
- CDNA (data-center: gfx90a, gfx942, gfx950): MFMA available.
- RDNA (consumer: gfx11, gfx12): no MFMA. Has WMMA (Wave Matrix Multiply-Accumulate) instead with similar idea.
10. Tiles & how a wave fills a tile (X0, Y0, X1, Y1)
What is a tile?
A tile is a 2D chunk of a matrix that one workgroup (or one wave) processes. GPU kernels divide a big problem into many small tiles.
Big matrix (e.g., 1024 × 1024)
divided into tiles of 64 × 64:
X →
┌────────────────────────┐
Y │ t0 t1 t2 ... t15 │
↓ │ t16 t17 ... │ 16 × 16 = 256 tiles
│ ... │ each handled by one workgroup
│ t240 ... t255 │
└────────────────────────┘
How does a 64-thread wave fill a 64×64 tile?
One 64×64 tile = 4,096 elements. One wave = 64 threads. Each thread is responsible for 4096/64 = 64 elements.
Those 64 elements per thread are split between two axes:
| Symbol | Meaning |
|---|---|
| X1 (= vec) | Number of elements one thread loads in one instruction (X direction). |
| X0 | Number of threads placed along the X axis. |
| Y0 | Number of threads placed along the Y axis. |
| Y1 | Number of times each thread iterates along the Y axis. |
Constraints:
X0 × Y0 = 64              ← total threads (wave size)
X0 × X1 = XPerTile        ← X axis fully covered
Y0 × Y1 = YPerTile        ← Y axis fully covered (with iteration)
X1 × sizeof(dtype) ≤ 16 B ← hardware load limit
Why the X axis can use vec loads
Memory is 1D, but we view it as 2D (row-major):
memory: [a][b][c][d] [e][f][g][h] [i][j][k][l] [m][n][o][p]
─────────── ─────────── ─────────── ───────────
row 0 row 1 row 2 row 3
X direction: addresses +1 (contiguous) → one instruction can load 4/8/16 bytes
Y direction: addresses +width (strided) → needs separate instructions per row
Worked example: 64×64 tile, fp16, vec=4
X1 = 4 (vec)
X0 = XPerTile / X1 = 64 / 4  = 16
Y0 = wave / X0     = 64 / 16 = 4
Y1 = YPerTile / Y0 = 64 / 4  = 16

Per thread:  X1 × Y1 = 4 × 16 = 64 elements
Per wave:    64 threads × 64 elements = 4,096 elements ✓
Load count:  16 (per thread) → 1,024 vec loads total → 4,096 elements
X axis (16 threads × 4 vec = 64 cols)
┌─────────────────────────────────────┐
row 0..3 │ T0 T1 T2 T3 T4 T5 ... T14 T15 │ ← Y0=0, Y1=0..3
row 4..7 │ T0 T1 T2 T3 ... │ ← Y0=0, Y1=4..7
... Y1 iterates 16 times
row 16.. │ T16 T17 ... │ ← Y0=1
row 32.. │ T32 T33 ... │ ← Y0=2
row 48..63 │ T48 T49 ... T63 │ ← Y0=3
└─────────────────────────────────────┘
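The same mapping in code: a device-side sketch (the function name and the fragment array are illustrative; assumes the tile base pointer is at least 8-byte aligned) in which each lane derives its (x0, y0) position from its lane ID and performs its 16 vec=4 loads:

#include <hip/hip_runtime.h>

typedef _Float16 half4 __attribute__((ext_vector_type(4)));   // one 8-byte (vec=4 fp16) load

__device__ void load_tile_64x64_vec4(const _Float16* tile_base, int stride,
                                     half4 (&frag)[16]) {
    constexpr int X1 = 4, X0 = 16, Y1 = 16;            // worked-example values (Y0 = 64 / X0 = 4)
    int lane = threadIdx.x % warpSize;                 // 0..63
    int x0   = lane % X0;                              // thread position along X (0..15)
    int y0   = lane / X0;                              // thread position along Y (0..3)
    for (int y1 = 0; y1 < Y1; ++y1) {                  // 16 loads per thread
        int row = y0 * Y1 + y1;                        // each Y-thread owns 16 consecutive rows
        int col = x0 * X1;                             // 4 contiguous fp16 along X
        frag[y1] = *reinterpret_cast<const half4*>(tile_base + row * stride + col);
    }
}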
Choosing vec size
| vec | X0 | Y0 | Y1 | Loads / thread | Quality |
|---|---|---|---|---|---|
| 1 | 64 | 1 | 64 | 64 | worst (no vec) |
| 2 | 32 | 2 | 32 | 32 | poor |
| 4 | 16 | 4 | 16 | 16 | good |
| 8 | 8 | 8 | 8 | 8 | best (fp16 max) |
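The derivation behind this table as a small compile-time helper (illustrative names, not part of any library):

struct TileLayout { int X0, Y0, Y1; };

constexpr TileLayout layout_for(int XPerTile, int YPerTile, int vec, int wave = 64) {
    return { XPerTile / vec,                              // X0 = threads along X
             wave / (XPerTile / vec),                     // Y0 = remaining threads along Y
             YPerTile / (wave / (XPerTile / vec)) };      // Y1 = Y iterations per thread
}

static_assert(layout_for(64, 64, 4).X0 == 16 &&
              layout_for(64, 64, 4).Y0 == 4  &&
              layout_for(64, 64, 4).Y1 == 16, "matches the worked example above");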
11. Tile distribution patterns: thread / warp / block raked
The same tile can be distributed across threads in several ways. The choice depends on the algorithm, the data layout, and the matrix instruction (MFMA) shape.
| Pattern | Who covers one tile | Wave layout inside tile |
|---|---|---|
| thread_raked | 1 wave (64 threads) | n/a (single wave) |
| warp_raked | Multiple waves cooperate | 1D stack (waves stripe along one axis) |
| block_raked | All waves of the workgroup | 2D grid (waves arranged in a grid) |
warp_raked — 1D wave layout (4 waves stacked along Y)
X axis (XPerTile)
┌─────────────────────────────────┐
│ Wave 0 │ each wave covers full X width
├─────────────────────────────────┤
│ Wave 1 │ 1/4 of Y
├─────────────────────────────────┤
│ Wave 2 │
├─────────────────────────────────┤
│ Wave 3 │
└─────────────────────────────────┘
block_raked — 2D wave layout (4 waves in 2×2 grid)
X axis
┌───────────────┬─────────────┐
│ Wave 0 │ Wave 1 │
│ (X 0..63) │ (X 64..127) │
├───────────────┼─────────────┤
│ Wave 2 │ Wave 3 │
│ (X 0..63) │ (X 64..127) │
└───────────────┴─────────────┘
| | warp_raked (1D) | block_raked (2D) |
|---|---|---|
| Wave layout | 1 axis (Y) | 2 axes (X × Y) |
| Sub-tile per wave | XPerTile × (YPerTile/N) | (XPerTile/M) × (YPerTile/M) |
| X coverage by one wave | full | partial |
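As a sketch of the difference (illustrative helpers, not a library API; a 128×128 tile with 4 waves is assumed), here is where each wave's sub-tile starts under the two layouts:

struct Origin { int x, y; };

// warp_raked: waves stacked along Y, each covering the full X width.
constexpr Origin warp_raked_origin(int wave_id, int YPerTile, int num_waves) {
    return { 0, wave_id * (YPerTile / num_waves) };
}

// block_raked: waves in a 2D grid (waves_x × waves_y), each covering part of X and Y.
constexpr Origin block_raked_origin(int wave_id, int XPerTile, int YPerTile,
                                    int waves_x, int waves_y) {
    return { (wave_id % waves_x) * (XPerTile / waves_x),
             (wave_id / waves_x) * (YPerTile / waves_y) };
}

static_assert(warp_raked_origin(1, 128, 4).y == 32,          "wave 1 starts 1/4 of the way down");
static_assert(block_raked_origin(3, 128, 128, 2, 2).x == 64, "wave 3 covers the bottom-right quadrant");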
12. Cheat sheet
| Concept | Definition |
|---|---|
| Grid | The whole kernel launch. |
| Workgroup (block) | Group of threads on one CU. Shares LDS. |
| Wave (warp) | 64 threads (CDNA) executing in lockstep (SIMT). |
| Thread (work-item) | One software unit; runs on one lane. |
| Lane | One hardware ALU slot in a SIMD. 64 per SIMD. |
| SIMD | Hardware execution unit. 4 per CU. Can hold up to 8 resident waves. |
| CU | Compute Unit. Contains 4 SIMDs and 64 KB LDS. ~304 per MI300X. |
| VGPR | Vector register, private per lane. 32-bit. Up to 256 per lane. |
| SGPR | Scalar register, shared by all 64 lanes in a wave. |
| AGPR | Accumulator register (CDNA), used for MFMA output. |
| LDS | Local Data Share = software-managed shared memory. 64 KB per CU. |
| HBM | Global memory. Large but slow (~500+ cycle latency). |
| Occupancy | Number of resident waves on a SIMD (1–8). Higher = better latency hiding. |
| Latency hiding | SIMD switches to another resident wave while one waits on memory. |
| SIMT | Single Instruction Multiple Threads. All 64 lanes run the same instruction. |
| MFMA | Matrix Fused Multiply-Add. Wave-cooperative matrix multiply (CDNA). |
| WMMA | RDNA equivalent of MFMA. |
| Tile | 2D chunk of a matrix processed by one workgroup. |
| XPerTile / YPerTile | Tile dimensions in elements (algorithm-defined). |
| vec / X1 | Elements per thread per load. Constrained by 16-byte limit. |
| X0, Y0 | Number of threads placed along X / Y axes. |
| Y1 | Y-axis iteration count per thread. |
| Coalescing | Adjacent threads accessing adjacent memory → one wide HBM transaction. |
| Kernel launch | kernel<<< grid, block >>>(...) — grid = workgroups, block = threads/wg. |
| tile_distribution_pattern | How threads/waves are distributed across one tile (thread/warp/block raked). |
- Threads run in waves of 64; one instruction = one wave-step.
- Memory is the slow part. High occupancy hides memory latency.
- Vector loads (up to 16 B) reduce instruction count dramatically.
- Tile dimensions are chosen by the algorithm; thread layout is derived from them.
- MFMA is wave-cooperative — cannot be split across waves.