AMD GPU Programming Primer
1. The execution hierarchy: grid → workgroup → wave → thread
A GPU kernel launch is a hierarchy of work units. Bigger units contain smaller ones.
| AMD term | NVIDIA term | What it is |
|---|---|---|
| Grid | Grid | The whole kernel launch — covers the entire problem. |
| Workgroup | Block / threadblock | A group of threads on one Compute Unit (CU). Shares LDS (shared memory). Can synchronize via __syncthreads(). |
| Wavefront (wave) | Warp | 64 threads (AMD CDNA) executing the same instruction simultaneously (SIMT). |
| Thread (work-item) | Thread | One lane in a wave. Has its own thread ID and register state. |
2. Lane vs thread
"Lane" and "thread" are two views of the same physical execution slot.
- Lane = a hardware ALU slot inside a SIMD unit. There are exactly 64 lanes per SIMD on AMD CDNA.
- Thread = the software view of one lane. Has its own thread ID and private registers.
One lane runs one thread at a time. They are 1:1 within an executing wave.
1 wave (= 64 threads) running on 1 SIMD:
Lane 0 ↔ Thread 0 (running my_function with tid=0)
Lane 1 ↔ Thread 1 (running my_function with tid=1)
Lane 2 ↔ Thread 2
...
Lane 63 ↔ Thread 63
All 64 lanes execute the same instruction at the same cycle.
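As a quick sanity check, here is a minimal HIP sketch (the kernel name lane_demo and its output buffers are illustrative) that recovers each thread's lane and wave position from the built-in IDs:

#include <hip/hip_runtime.h>

__global__ void lane_demo(int* lane_out, int* wave_out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    int lane = threadIdx.x % warpSize;                 // lane inside the wave (0..63 on CDNA)
    int wave = threadIdx.x / warpSize;                 // wave index inside the workgroup
    lane_out[tid] = lane;                              // assumes the launch exactly covers the buffers
    wave_out[tid] = wave;
}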
3. Hardware: GPU → CU → SIMD → lane
Below the software hierarchy is the physical hardware:
- GPU contains many CUs (Compute Units). Example: MI300X has 304 CUs.
- CU contains 4 SIMD units. The 4 SIMDs in a CU operate in parallel.
- SIMD contains 64 lanes (ALUs) and a register file that can hold up to 8 resident waves.
GPU
├─ CU 0
│   ├─ SIMD 0 (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 1 (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 2 (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 3 (64 lanes, ≤ 8 resident waves)
│   └─ LDS (shared memory, 64 KB)
├─ CU 1
├─ ...
└─ CU 303          ← total 304 CUs on MI300X
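These counts can be queried at runtime with the standard hipGetDeviceProperties call. A small host-side sketch (printed values vary by GPU; MI300X values shown in comments):

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);                           // device 0
    printf("CUs           : %d\n",   prop.multiProcessorCount); // 304 on MI300X
    printf("wave size     : %d\n",   prop.warpSize);            // 64 on CDNA
    printf("LDS per block : %zu B\n", prop.sharedMemPerBlock);  // 64 KB
    return 0;
}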
4. Registers, VGPRs & occupancy
Register types
| Type | Size | Scope | Notes |
|---|---|---|---|
| VGPR (vector GPR) | 32-bit (4 B) | Private per lane | Up to 256 per lane per wave. Each lane sees its own VGPR. |
| SGPR (scalar GPR) | 32-bit (4 B) | Shared by 64 lanes | Used for scalar values like loop counters, addresses. |
| AGPR (accumulator GPR) | 32-bit (4 B) | Private per lane | CDNA-only. Used as MFMA accumulators. |
How much register memory does one lane have?
1 lane × 256 VGPRs × 4 bytes = 1 KB per lane
1 wave (64 lanes) × 1 KB     = 64 KB total register file used by one wave
Occupancy
Occupancy = number of waves resident on a SIMD (1 to 8). Higher occupancy enables better latency hiding.
If a wave uses 256 VGPRs/lane → only 1 wave fits in the SIMD → occupancy 1
If a wave uses 128 VGPRs/lane → 2 waves fit                  → occupancy 2
If a wave uses 32 VGPRs/lane  → 8 waves fit                  → occupancy 8 (max)

More resident waves = the SIMD can switch to another wave when one stalls on memory.
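The same arithmetic as a tiny helper (this is not a ROCm API, just the numbers above in code; real occupancy also depends on SGPR and LDS usage):

constexpr int waves_per_simd(int vgprs_per_lane) {
    int by_vgpr = 256 / vgprs_per_lane;   // register-file budget: 256 VGPRs per lane per SIMD
    return by_vgpr < 8 ? by_vgpr : 8;     // hardware cap: at most 8 resident waves
}

static_assert(waves_per_simd(256) == 1, "256 VGPRs/lane -> occupancy 1");
static_assert(waves_per_simd(128) == 2, "128 VGPRs/lane -> occupancy 2");
static_assert(waves_per_simd(32)  == 8, "32 VGPRs/lane  -> occupancy 8 (max)");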
5. Memory hierarchy: registers → LDS → cache → HBM
GPU memory has multiple levels, similar to CPU cache hierarchy:
| Level | Size (per CU / total) | Latency (cycles) | Managed by | CPU analogue |
|---|---|---|---|---|
| Registers (VGPR/SGPR) | ~256 KB / CU | ~1 | Compiler | CPU registers |
| LDS (shared memory) | 64 KB / CU | ~10–30 | Software (explicit loads/stores) | Scratchpad / fast SRAM |
| L1 cache | 16 KB / CU | ~30 | Hardware (transparent) | L1 cache |
| L2 cache | ~16 MB total | ~150 | Hardware | L2 cache |
| Infinity / L3 | ~256 MB | ~300 | Hardware | L3 cache |
| HBM (global memory) | 192 GB | ~500–1000 | HW + software | DRAM |
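LDS is the one level here that software manages explicitly. A minimal HIP sketch (kernel name and sizes are illustrative; launch it with block = 256) that stages data in LDS, synchronizes the workgroup, then reads a neighbour's element:

#include <hip/hip_runtime.h>

__global__ void lds_demo(const float* in, float* out, int N) {
    __shared__ float tile[256];                        // allocated in the CU's 64 KB LDS
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (tid < N) ? in[tid] : 0.0f;    // HBM -> register -> LDS
    __syncthreads();                                   // whole workgroup waits; LDS is now filled
    int left = (threadIdx.x + blockDim.x - 1) % blockDim.x;
    if (tid < N) out[tid] = tile[left];                // fast LDS read of a neighbour's value
}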
6. Kernel launch <<< grid, block >>>
HIP/CUDA kernel launch syntax:
add_kernel<<< grid, block >>>(A, B, C, N);
| Parameter | Meaning | Example |
|---|---|---|
| block (a.k.a. blockSize) | Threads per workgroup | 256 → 4 waves per workgroup |
| grid | Number of workgroups | 4 → 4 workgroups total |
| (implicit) | Wave count = blockSize / 64 | 256 / 64 = 4 waves per workgroup |
You don't pick the wave count directly — it is derived from blockSize. The hardware always groups threads into waves of 64 on CDNA.
add_kernel<<< grid=4, block=256 >>>(...)

Total threads    = 4 × 256 = 1024
Total waves      = 1024 / 64 = 16
Total workgroups = 4
Each workgroup → one CU
Each wave      → one SIMD inside that CU
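A common host-side sizing pattern, as a small illustrative helper (grid_for is not a library function): pick a block size that is a multiple of 64, then round the grid up so every element is covered.

inline int grid_for(int N, int block) {
    return (N + block - 1) / block;       // ceiling division
}

// add_kernel<<< grid_for(1024, 256), 256 >>>(A, B, C, 1024);
//   -> 4 workgroups × 256 threads = 1024 threads = 16 waves of 64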
7. Vector loads & the 16-byte rule
A single load instruction can pull up to 16 bytes into a thread's registers. This is the hardware limit on AMD CDNA.
The number of elements per load (called vec size) depends on the data type:
| Data type | Size (B) | vec=1 | vec=2 | vec=4 | vec=8 | vec=16 |
|---|---|---|---|---|---|---|
| fp16 / bf16 | 2 | 2 B | 4 B | 8 B | 16 B (max) | — |
| fp32 | 4 | 4 B | 8 B | 16 B (max) | — | — |
| int8 | 1 | 1 B | 2 B | 4 B | 8 B | 16 B (max) |
vec × sizeof(dtype) ≤ 16 bytes. A larger vec means fewer load instructions to move the same amount of data — faster.
Example GPU instructions
fp16, vec=1 (2 B):   global_load_ushort  v0,     v[1:2]   ; load 2 bytes  (1 fp16)
fp16, vec=4 (8 B):   global_load_dwordx2 v[0:1], v[2:3]   ; load 8 bytes  (4 fp16)
fp16, vec=8 (16 B):  global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (8 fp16) - MAX
fp32, vec=4 (16 B):  global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (4 fp32) - MAX
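In HIP the usual way to get these wide loads is to read through a vector type. Below is a sketch of an add kernel rewritten for fp32 vec=4 (the kernel name is illustrative; it assumes N is a multiple of 4 and the pointers are 16-byte aligned), which typically compiles to one global_load_dwordx4 per operand:

#include <hip/hip_runtime.h>

__global__ void add_kernel_vec4(const float4* A, const float4* B, float4* C, int N4) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one thread handles one float4
    if (tid < N4) {                                    // N4 = N / 4
        float4 a = A[tid];                             // 16 B load  (global_load_dwordx4)
        float4 b = B[tid];                             // 16 B load  (global_load_dwordx4)
        float4 c;
        c.x = a.x + b.x;  c.y = a.y + b.y;             // 4 fp32 adds in registers
        c.z = a.z + b.z;  c.w = a.w + b.w;
        C[tid] = c;                                    // 16 B store (global_store_dwordx4)
    }
}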
8. Walking through a simple kernel
__global__ void add_kernel(float* A, float* B, float* C, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x; // global thread ID
if (tid < N) {
float a = A[tid]; // load (HBM → register)
float b = B[tid]; // load (HBM → register)
float c = a + b; // ALU (register-to-register, ~1 cycle)
C[tid] = c; // store (register → HBM)
}
}
add_kernel<<< 4, 256 >>>(A, B, C, 1024);
What happens per cycle (assuming occupancy 1, the worst case):
cycle:    1      2..99    100    101..199   200    201
        ------  -------  ------  --------  -----  ------
inst:   load A  (wait)   load B  (wait)     add    store

ALU busy only 2 cycles out of ~200.
Instructions doing the real work: 4 (two loads, one add, one store). Actual cycles: ~200. The reason: each HBM load takes ~100 cycles to complete, even though issuing it takes only 1 cycle. With occupancy 1, the SIMD has nothing else to do but wait.
9. MFMA: cooperative matrix multiply
MFMA (Matrix Fused Multiply-Add) instructions are wave-cooperative: all 64 lanes work together to compute a small matrix multiply (e.g., 16×16).
v_mfma_f32_16x16x16_f16 acc, a_frag, b_frag, c_frag
64 lanes cooperatively compute D = A × B + C
where A is 16×16 fp16, B is 16×16 fp16, D is 16×16 fp32
Each lane holds a small piece of A, B, and accumulates a small piece of D.
The hardware exchanges data between lanes during execution.
Latency: ~8–32 cycles (NOT 1 cycle), but throughput is enormous:
16×16×16 = 4,096 multiply-adds per instruction per wave
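On CDNA the instruction is exposed to HIP code as a clang builtin. A hedged sketch using __builtin_amdgcn_mfma_f32_16x16x16f16 (the kernel name and per-lane addressing are deliberately simplified; a real GEMM derives each lane's offsets from the MFMA fragment layout):

#include <hip/hip_runtime.h>

typedef _Float16 half4   __attribute__((ext_vector_type(4)));  // 4 fp16 per lane
typedef float    floatx4 __attribute__((ext_vector_type(4)));  // 4 fp32 per lane

__global__ void mfma_16x16x16_demo(const _Float16* A, const _Float16* B, float* D) {
    int lane = threadIdx.x % warpSize;                 // 0..63
    // Illustrative fragment loads: lane takes 4 consecutive fp16 from A and from B.
    half4 a = *reinterpret_cast<const half4*>(A + 4 * lane);
    half4 b = *reinterpret_cast<const half4*>(B + 4 * lane);
    floatx4 c = {0.f, 0.f, 0.f, 0.f};                  // accumulator starts at zero

    c = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, c, 0, 0, 0);  // one wave-cooperative MFMA

    *reinterpret_cast<floatx4*>(D + 4 * lane) = c;     // each lane writes its 4 fp32 results
}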
CDNA vs RDNA
- CDNA (data-center: gfx90a, gfx942, gfx950): MFMA available.
- RDNA (consumer: gfx11, gfx12): no MFMA. Has WMMA (Wave Matrix Multiply-Accumulate) instead with similar idea.
10. Tiles & how a wave fills a tile (X0, Y0, X1, Y1)
What is a tile?
A tile is a 2D chunk of a matrix that one workgroup (or one wave) processes. GPU kernels divide a big problem into many small tiles.
Big matrix (e.g., 1024 × 1024)
divided into tiles of 64 × 64:
X →
┌────────────────────────┐
Y │ t0 t1 t2 ... t15 │
↓ │ t16 t17 ... │ 16 × 16 = 256 tiles
│ ... │ each handled by one workgroup
│ t240 ... t255 │
└────────────────────────┘
How does a 64-thread wave fill a 64×64 tile?
One 64×64 tile = 4,096 elements. One wave = 64 threads. Each thread is responsible for 4096/64 = 64 elements.
Those 64 elements per thread are split between two axes:
| Symbol | Meaning |
|---|---|
| X1 (= vec) | Number of elements one thread loads in one instruction (X direction). |
| X0 | Number of threads placed along the X axis. |
| Y0 | Number of threads placed along the Y axis. |
| Y1 | Number of times each thread iterates along the Y axis. |
Constraints:
X0 × Y0 = 64              ← total threads (wave size)
X0 × X1 = XPerTile        ← X axis fully covered
Y0 × Y1 = YPerTile        ← Y axis fully covered (with iteration)
X1 × sizeof(dtype) ≤ 16 B ← hardware load limit
Why the X axis can use vec loads
Memory is 1D, but we view it as 2D (row-major):
memory: [a][b][c][d] [e][f][g][h] [i][j][k][l] [m][n][o][p]
─────────── ─────────── ─────────── ───────────
row 0 row 1 row 2 row 3
X direction: addresses +1 (contiguous) → one instruction can load 4/8/16 bytes
Y direction: addresses +width (strided) → needs separate instructions per row
Worked example: 64×64 tile, fp16, vec=4
X1 = 4 (vec)
X0 = XPerTile / X1 = 64 / 4  = 16
Y0 = wave / X0     = 64 / 16 = 4
Y1 = YPerTile / Y0 = 64 / 4  = 16

Per thread:  X1 × Y1 = 4 × 16 = 64 elements
Per wave:    64 threads × 64 elements = 4,096 elements ✓
Load count:  16 (per thread) → 1,024 vec loads total → 4,096 elements
X axis (16 threads × 4 vec = 64 cols)
┌─────────────────────────────────────┐
row 0..3 │ T0 T1 T2 T3 T4 T5 ... T14 T15 │ ← Y0=0, Y1=0..3
row 4..7 │ T0 T1 T2 T3 ... │ ← Y0=0, Y1=4..7
... Y1 iterates 16 times
row 16.. │ T16 T17 ... │ ← Y0=1
row 32.. │ T32 T33 ... │ ← Y0=2
row 48..63 │ T48 T49 ... T63 │ ← Y0=3
└─────────────────────────────────────┘
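The same mapping in code: a device-side sketch (the function name and the fragment array are illustrative; assumes the tile base pointer is at least 8-byte aligned) in which each lane derives its (x0, y0) position from its lane ID and performs its 16 vec=4 loads:

#include <hip/hip_runtime.h>

typedef _Float16 half4 __attribute__((ext_vector_type(4)));   // one 8-byte (vec=4 fp16) load

__device__ void load_tile_64x64_vec4(const _Float16* tile_base, int stride,
                                     half4 (&frag)[16]) {
    constexpr int X1 = 4, X0 = 16, Y1 = 16;            // worked-example values (Y0 = 64 / X0 = 4)
    int lane = threadIdx.x % warpSize;                 // 0..63
    int x0   = lane % X0;                              // thread position along X (0..15)
    int y0   = lane / X0;                              // thread position along Y (0..3)
    for (int y1 = 0; y1 < Y1; ++y1) {                  // 16 loads per thread
        int row = y0 * Y1 + y1;                        // each Y-thread owns 16 consecutive rows
        int col = x0 * X1;                             // 4 contiguous fp16 along X
        frag[y1] = *reinterpret_cast<const half4*>(tile_base + row * stride + col);
    }
}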
Choosing vec size
| vec | X0 | Y0 | Y1 | Loads / thread | Quality |
|---|---|---|---|---|---|
| 1 | 64 | 1 | 64 | 64 | worst (no vec) |
| 2 | 32 | 2 | 32 | 32 | poor |
| 4 | 16 | 4 | 16 | 16 | good |
| 8 | 8 | 8 | 8 | 8 | best (fp16 max) |
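The derivation behind this table as a small compile-time helper (illustrative names, not part of any library):

struct TileLayout { int X0, Y0, Y1; };

constexpr TileLayout layout_for(int XPerTile, int YPerTile, int vec, int wave = 64) {
    return { XPerTile / vec,                              // X0 = threads along X
             wave / (XPerTile / vec),                     // Y0 = remaining threads along Y
             YPerTile / (wave / (XPerTile / vec)) };      // Y1 = Y iterations per thread
}

static_assert(layout_for(64, 64, 4).X0 == 16 &&
              layout_for(64, 64, 4).Y0 == 4  &&
              layout_for(64, 64, 4).Y1 == 16, "matches the worked example above");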
11. Tile distribution patterns: thread / warp / block raked
The same tile can be distributed across threads in several ways. The choice depends on the algorithm, the data layout, and the matrix instruction (MFMA) shape.
| Pattern | Who covers one tile | Wave layout inside tile |
|---|---|---|
| thread_raked | 1 wave (64 threads) | n/a (single wave) |
| warp_raked | Multiple waves cooperate | 1D stack (waves stripe along one axis) |
| block_raked | All waves of the workgroup | 2D grid (waves arranged in a grid) |
warp_raked — 1D wave layout (4 waves stacked along Y)
X axis (XPerTile)
┌─────────────────────────────────┐
│ Wave 0 │ each wave covers full X width
├─────────────────────────────────┤
│ Wave 1 │ 1/4 of Y
├─────────────────────────────────┤
│ Wave 2 │
├─────────────────────────────────┤
│ Wave 3 │
└─────────────────────────────────┘
block_raked — 2D wave layout (4 waves in 2×2 grid)
X axis
┌───────────────┬─────────────┐
│ Wave 0 │ Wave 1 │
│ (X 0..63) │ (X 64..127) │
├───────────────┼─────────────┤
│ Wave 2 │ Wave 3 │
│ (X 0..63) │ (X 64..127) │
└───────────────┴─────────────┘
| | warp_raked (1D) | block_raked (2D) |
|---|---|---|
| Wave layout | 1 axis (Y) | 2 axes (X × Y) |
| Sub-tile per wave | XPerTile × (YPerTile/N) | (XPerTile/M) × (YPerTile/M) |
| X coverage by one wave | full | partial |
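As a sketch of the difference (illustrative helpers, not a library API; a 128×128 tile with 4 waves is assumed), here is where each wave's sub-tile starts under the two layouts:

struct Origin { int x, y; };

// warp_raked: waves stacked along Y, each covering the full X width.
constexpr Origin warp_raked_origin(int wave_id, int YPerTile, int num_waves) {
    return { 0, wave_id * (YPerTile / num_waves) };
}

// block_raked: waves in a 2D grid (waves_x × waves_y), each covering part of X and Y.
constexpr Origin block_raked_origin(int wave_id, int XPerTile, int YPerTile,
                                    int waves_x, int waves_y) {
    return { (wave_id % waves_x) * (XPerTile / waves_x),
             (wave_id / waves_x) * (YPerTile / waves_y) };
}

static_assert(warp_raked_origin(1, 128, 4).y == 32,          "wave 1 starts 1/4 of the way down");
static_assert(block_raked_origin(3, 128, 128, 2, 2).x == 64, "wave 3 covers the bottom-right quadrant");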
12. Cheat sheet
| Concept | Definition |
|---|---|
| Grid | The whole kernel launch. |
| Workgroup (block) | Group of threads on one CU. Shares LDS. |
| Wave (warp) | 64 threads (CDNA) executing in lockstep (SIMT). |
| Thread (work-item) | One software unit; runs on one lane. |
| Lane | One hardware ALU slot in a SIMD. 64 per SIMD. |
| SIMD | Hardware execution unit. 4 per CU. Can hold up to 8 resident waves. |
| CU | Compute Unit. Contains 4 SIMDs and 64 KB LDS. ~304 per MI300X. |
| VGPR | Vector register, private per lane. 32-bit. Up to 256 per lane. |
| SGPR | Scalar register, shared by all 64 lanes in a wave. |
| AGPR | Accumulator register (CDNA), used for MFMA output. |
| LDS | Local Data Share = software-managed shared memory. 64 KB per CU. |
| HBM | Global memory. Large but slow (~500+ cycle latency). |
| Occupancy | Number of resident waves on a SIMD (1–8). Higher = better latency hiding. |
| Latency hiding | SIMD switches to another resident wave while one waits on memory. |
| SIMT | Single Instruction Multiple Threads. All 64 lanes run the same instruction. |
| MFMA | Matrix Fused Multiply-Add. Wave-cooperative matrix multiply (CDNA). |
| WMMA | RDNA equivalent of MFMA. |
| Tile | 2D chunk of a matrix processed by one workgroup. |
| XPerTile / YPerTile | Tile dimensions in elements (algorithm-defined). |
| vec / X1 | Elements per thread per load. Constrained by 16-byte limit. |
| X0, Y0 | Number of threads placed along X / Y axes. |
| Y1 | Y-axis iteration count per thread. |
| Coalescing | Adjacent threads accessing adjacent memory → one wide HBM transaction. |
| Kernel launch | kernel<<< grid, block >>>(...) — grid = workgroups, block = threads/wg. |
| tile_distribution_pattern | How threads/waves are distributed across one tile (thread/warp/block raked). |
- Threads run in waves of 64; one instruction = one wave-step.
- Memory is the slow part. High occupancy hides memory latency.
- Vector loads (up to 16 B) reduce instruction count dramatically.
- Tile dimensions are chosen by the algorithm; thread layout is derived from them.
- MFMA is wave-cooperative — cannot be split across waves.