5/06/2026

AMD GPU Programming Primer

Threads · Waves · Memory · Tile Distribution · Vector Loads · MFMA

1. The execution hierarchy: grid → workgroup → wave → thread

A GPU kernel launch is a hierarchy of work units. Bigger units contain smaller ones.

AMD term             NVIDIA term           What it is
Grid                 Grid                  The whole kernel launch — covers the entire problem.
Workgroup            Block / threadblock   A group of threads on one Compute Unit (CU). Shares LDS (shared memory). Can synchronize via __syncthreads().
Wavefront (wave)     Warp                  64 threads (AMD CDNA) executing the same instruction simultaneously (SIMT).
Thread (work-item)   Thread                One lane in a wave. Has its own thread ID and register state.
GRID (kernel launch — covers the whole problem)
├─ Workgroup 0 (256 threads)
│   ├─ Wave 0 (T0..T63)
│   ├─ Wave 1 (T64..T127)
│   ├─ Wave 2 (T128..T191)
│   └─ Wave 3 (T192..T255)
├─ Workgroup 1 (256 threads)
│   └─ Wave 0..3 (64 threads each)
└─ Workgroup N−1
    └─ Wave 0..3
Key: 64 threads in a wave always execute the same instruction in lock-step. That is the essence of SIMT (Single Instruction Multiple Threads).
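
In HIP source these levels show up as built-in index variables; everything else (wave ID, lane) is derived from them. A minimal sketch (the kernel and launch shape are illustrative, not from any library):

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void who_am_i() {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x; // position in the grid
    int wave_id   = threadIdx.x / 64;                      // wave within the workgroup (CDNA)
    int lane      = threadIdx.x % 64;                      // position within the wave
    if (global_id == 386)                                  // workgroup 1, thread 130
        printf("wg=%d wave=%d lane=%d\n", blockIdx.x, wave_id, lane);
}

// launch: 4 workgroups × 256 threads = 16 waves
// who_am_i<<<4, 256>>>();   prints wg=1 wave=2 lane=2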

2. Lane vs thread

"Lane" and "thread" are two views of the same physical execution slot.

  • Lane = a hardware ALU slot inside a SIMD unit. There are exactly 64 lanes per SIMD on AMD CDNA.
  • Thread = the software view of one lane. Has its own thread ID and private registers.

One lane runs one thread at a time. They are 1:1 within an executing wave.

1 wave (= 64 threads) running on 1 SIMD:

   Lane 0  ↔  Thread 0    (running my_function with tid=0)
   Lane 1  ↔  Thread 1    (running my_function with tid=1)
   Lane 2  ↔  Thread 2
    ...
   Lane 63 ↔  Thread 63

All 64 lanes execute the same instruction at the same cycle.
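
Lock-step execution is what makes cross-lane data exchange cheap. A tiny sketch using HIP's __shfl intrinsic (a real HIP device function; the kernel itself is illustrative):

__global__ void broadcast_lane0(float* data) {
    float x = data[threadIdx.x];
    // Every lane reads lane 0's register in one wave-wide step: no LDS and
    // no __syncthreads(), because all 64 lanes execute this instruction together.
    data[threadIdx.x] = __shfl(x, 0);
}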

3. Hardware: GPU → CU → SIMD → lane

Below the software hierarchy is the physical hardware:

  • GPU contains many CUs (Compute Units). Example: MI300X has 304 CUs.
  • CU contains 4 SIMD units. The 4 SIMDs in a CU operate in parallel.
  • SIMD contains 64 lanes (ALUs) and a register file that can hold up to 8 resident waves.
GPU
├─ CU 0
│   ├─ SIMD 0  (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 1  (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 2  (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 3  (64 lanes, ≤ 8 resident waves)
│   └─ LDS (shared memory, 64 KB)
├─ CU 1
├─ ...
└─ CU 303      ← total 304 CUs on MI300X
Resident ≠ executing. A SIMD holds up to 8 waves in its register file but issues instructions for only one wave per cycle. With multiple waves resident, the SIMD switches to another wave whenever the current one waits on memory. This is latency hiding.
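
These counts can be confirmed at runtime through the HIP device-properties API. A minimal sketch (error handling omitted):

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    printf("CUs: %d  wave size: %d  LDS per workgroup: %zu KB\n",
           prop.multiProcessorCount, prop.warpSize,
           prop.sharedMemPerBlock / 1024);
    return 0;   // an MI300X reports 304 CUs and wave size 64
}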

4. Registers, VGPRs & occupancy

Register types

Type                     Size           Scope                Notes
VGPR (vector GPR)        32-bit (4 B)   Private per lane     Up to 256 per lane per wave. Each lane sees its own VGPR.
SGPR (scalar GPR)        32-bit (4 B)   Shared by 64 lanes   Used for scalar values like loop counters, addresses.
AGPR (accumulator GPR)   32-bit (4 B)   Private per lane     CDNA-only. Used as MFMA accumulators.

How much register memory does one lane have?

1 lane × 256 VGPRs × 4 bytes = 1 KB per lane
1 wave (64 lanes) × 1 KB = 64 KB total register file used by one wave

Occupancy

Occupancy = number of waves resident on a SIMD (1 to 8). Higher occupancy enables better latency hiding.

If a wave uses 256 VGPRs/lane → only 1 wave fits in SIMD → occupancy 1
If a wave uses 128 VGPRs/lane → 2 waves fit                → occupancy 2
If a wave uses  32 VGPRs/lane → 8 waves fit                → occupancy 8 (max)

More resident waves = SIMD can switch when one wave stalls on memory.
Trade-off: using more VGPRs per thread means each thread can hold more data, but fewer waves can be resident, reducing latency hiding.
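
With the numbers used in this post (a 256-VGPR budget per lane per SIMD, an 8-wave hardware cap), the occupancy a kernel can reach is a one-line division. A sketch:

// Waves resident per SIMD, assuming a 256-VGPR register budget per lane
// and a hardware cap of 8 resident waves (the numbers used in this post).
int waves_per_simd(int vgprs_per_lane) {
    int by_registers = 256 / vgprs_per_lane;
    return by_registers < 8 ? by_registers : 8;
}
// waves_per_simd(256) == 1,  waves_per_simd(128) == 2,  waves_per_simd(32) == 8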

5. Memory hierarchy: registers → LDS → cache → HBM

GPU memory has multiple levels, similar to CPU cache hierarchy:

Level                   Size (per CU / total)   Latency (cycles)   Managed by                         CPU analogue
Registers (VGPR/SGPR)   ~256 KB / CU            ~1                 Compiler                           CPU registers
LDS (shared memory)     64 KB / CU              ~10–30             Software (explicit loads/stores)   Scratchpad / fast SRAM
L1 cache                16 KB / CU              ~30                Hardware (transparent)             L1 cache
L2 cache                ~16 MB total            ~150               Hardware                           L2 cache
Infinity Cache / L3     ~256 MB                 ~300               Hardware                           L3 cache
HBM (global memory)     192 GB                  ~500–1000          HW + software                      DRAM
Key insight: registers are basically free (~1 cycle), HBM is very expensive (~500+ cycles). Performance comes from staging data through LDS and registers, and from hiding HBM latency with high occupancy.
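
The practical consequence: data reused within a workgroup should be staged through LDS instead of re-read from HBM. A minimal sketch (a block-local reversal, chosen only to show the HBM → register → LDS → register → HBM path; assumes blockDim.x == 256 and N a multiple of 256):

__global__ void stage_through_lds(const float* in, float* out) {
    __shared__ float tile[256];                   // allocated in LDS (64 KB per CU)
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[g];                    // HBM -> register -> LDS
    __syncthreads();                              // whole workgroup sees the tile
    out[g] = tile[blockDim.x - 1 - threadIdx.x];  // LDS (~10-30 cycles) -> HBM
}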

6. Kernel launch <<< grid, block >>>

HIP/CUDA kernel launch syntax:

add_kernel<<< grid, block >>>(A, B, C, N);
Parameter                  Meaning                       Example
block (a.k.a. blockSize)   Threads per workgroup         256 → 4 waves per workgroup
grid                       Number of workgroups          4 → 4 workgroups total
(implicit)                 Wave count = blockSize / 64   256 / 64 = 4 waves per workgroup

You don't pick the wave count directly — it is derived from blockSize. The hardware always groups threads into waves of 64 on CDNA.

add_kernel<<< grid=4, block=256 >>>(...)

Total threads = 4 × 256 = 1024
Total waves   = 1024 / 64 = 16
Total workgroups = 4

Each workgroup → one CU
Each wave → one SIMD inside that CU
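
In the example above N divides evenly; in general the grid is rounded up and the if (tid < N) guard from section 8 masks the overhang. The usual pattern:

int block = 256;                       // threads per workgroup (4 waves on CDNA)
int grid  = (N + block - 1) / block;   // round up so grid * block >= N
add_kernel<<<grid, block>>>(A, B, C, N);
// N = 1000 -> grid = 4: 1024 threads launch, the last 24 fail the tid < N check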

7. Vector loads & the 16-byte rule

A single load instruction can pull up to 16 bytes into a thread's registers. This is the hardware limit on AMD CDNA.

The number of elements per load (called vec size) depends on the data type:

Data type     Size (B)   vec=1   vec=2   vec=4        vec=8        vec=16
fp16 / bf16   2          2 B     4 B     8 B          16 B (max)   n/a
fp32          4          4 B     8 B     16 B (max)   n/a          n/a
int8          1          1 B     2 B     4 B          8 B          16 B (max)
Rule: vec × sizeof(dtype) ≤ 16 bytes. Larger vec means fewer load instructions to move the same amount of data — faster.

Example GPU instructions

fp16, vec=1 (2 B):
   global_load_ushort  v0,    v[1:2]    ; load 2 bytes (1 fp16)
fp16, vec=4 (8 B):
   global_load_dwordx2 v[0:1], v[2:3]   ; load 8 bytes (4 fp16)
fp16, vec=8 (16 B):
   global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (8 fp16) - MAX
fp32, vec=4 (16 B):
   global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (4 fp32) - MAX
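
From HIP source, the usual way to get the 16-byte forms is to load through a 16-byte type and let the compiler pick the wide instruction. A minimal sketch (assumes both pointers are 16-byte aligned and n16 counts 16-byte chunks):

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>   // __half

__global__ void copy_fp16_vec8(const __half* in, __half* out, int n16) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n16) {
        // float4 is 16 B = 8 fp16 values: one load and one store per thread,
        // typically compiled to global_load_dwordx4 / global_store_dwordx4.
        reinterpret_cast<float4*>(out)[tid] =
            reinterpret_cast<const float4*>(in)[tid];
    }
}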

8. Walking through a simple kernel

__global__ void add_kernel(float* A, float* B, float* C, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (tid < N) {
        float a = A[tid];        // load (HBM → register)
        float b = B[tid];        // load (HBM → register)
        float c = a + b;         // ALU (register-to-register, ~1 cycle)
        C[tid] = c;              // store (register → HBM)
    }
}

add_kernel<<< 4, 256 >>>(A, B, C, 1024);

What happens per cycle (assuming occupancy 1, the worst case):

cycle:  1      2..99    100     101..199   200    201
        ----   ------   ----    --------   ----   -----
inst:   load   wait     load    wait       add    store
        A      (idle)   B       (idle)
                                           └─ ALU busy only 2 cycles out of ~200

Instructions in this kernel: 4. Actual cycles: ~200. The reason: each HBM load takes ~100 cycles to complete, even though issuing it takes 1 cycle. With occupancy 1, the SIMD has nothing else to do but wait.

If occupancy were 4, the SIMD would switch to other waves during the wait, keeping the ALU busy on every cycle. This is why high occupancy matters.

9. MFMA: cooperative matrix multiply

MFMA (Matrix Fused Multiply-Add) instructions are wave-cooperative: all 64 lanes work together to compute a small matrix multiply (e.g., 16×16).

v_mfma_f32_16x16x16_f16   acc, a_frag, b_frag, c_frag

  64 lanes cooperatively compute D = A × B + C
  where A is 16×16 fp16, B is 16×16 fp16, D is 16×16 fp32

  Each lane holds a small piece of A, B, and accumulates a small piece of D.
  The hardware exchanges data between lanes during execution.

  Latency: ~8–32 cycles (NOT 1 cycle), but throughput is enormous:
    16×16×16 = 4,096 multiply-adds per instruction per wave
Important: MFMA is a wave-level instruction. It cannot be split across waves — one wave executes one MFMA. To compute a larger matrix multiply, multiple waves issue multiple MFMAs (covering different tiles).
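
On CDNA targets this instruction is reachable from HIP through a compiler builtin. A minimal sketch (the fragment packing, i.e. which matrix elements land in which lane, is target-specific and glossed over here):

// CDNA only (e.g. gfx90a / gfx942). Per lane: 4 fp16 elements of A, 4 of B,
// and 4 fp32 accumulators of D. 64 lanes × 4 = 256 = 16×16 elements.
typedef __fp16 half4   __attribute__((ext_vector_type(4)));
typedef float  floatx4 __attribute__((ext_vector_type(4)));

__device__ floatx4 mfma_16x16x16(half4 a_frag, half4 b_frag, floatx4 acc) {
    // trailing args are the cbsz/abid/blgp modifiers; 0,0,0 = plain matmul
    return __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc, 0, 0, 0);
}

In practice most kernels reach MFMA through libraries such as rocWMMA or composable_kernel rather than raw builtins.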

CDNA vs RDNA

  • CDNA (data-center: gfx90a, gfx942, gfx950): MFMA available.
  • RDNA (consumer: gfx11, gfx12): no MFMA; WMMA (Wave Matrix Multiply-Accumulate) provides a similar wave-cooperative matrix multiply.

10. Tiles & how a wave fills a tile (X0, Y0, X1, Y1)

What is a tile?

A tile is a 2D chunk of a matrix that one workgroup (or one wave) processes. GPU kernels divide a big problem into many small tiles.

Big matrix (e.g., 1024 × 1024)
divided into tiles of 64 × 64:

       X →
       ┌────────────────────────┐
   Y   │ t0  t1  t2  ...  t15   │
   ↓   │ t16 t17 ...            │   16 × 16 = 256 tiles
       │ ...                    │   each handled by one workgroup
       │ t240 ...          t255 │
       └────────────────────────┘

How does a 64-thread wave fill a 64×64 tile?

One 64×64 tile = 4,096 elements. One wave = 64 threads. Each thread is responsible for 4096/64 = 64 elements.

Those 64 elements per thread are split between two axes:

Symbol       Meaning
X1 (= vec)   Number of elements one thread loads in one instruction (X direction).
X0           Number of threads placed along the X axis.
Y0           Number of threads placed along the Y axis.
Y1           Number of times each thread iterates along the Y axis.

Constraints:

X0 × Y0     = 64           ← total threads (wave size)
X0 × X1     = XPerTile     ← X axis fully covered
Y0 × Y1     = YPerTile     ← Y axis fully covered (with iteration)
X1 × sizeof(dtype) ≤ 16 B  ← hardware load limit

Why the X axis can use vec loads

Memory is 1D, but we view it as 2D (row-major):

memory:  [a][b][c][d] [e][f][g][h] [i][j][k][l] [m][n][o][p]
         ─────────── ─────────── ─────────── ───────────
            row 0       row 1       row 2       row 3

X direction: addresses +1 (contiguous)  → one instruction can load 4/8/16 bytes
Y direction: addresses +width (strided) → needs separate instructions per row

Worked example: 64×64 tile, fp16, vec=4

X1 = 4 (vec)
X0 = XPerTile / X1 = 64 / 4 = 16
Y0 = wave / X0     = 64 / 16 = 4
Y1 = YPerTile / Y0 = 64 / 4  = 16

Per thread:    X1 × Y1 = 4 × 16 = 64 elements
Per wave:      64 threads × 64 elements = 4,096 elements ✓
Load count:    16 (per thread) → 1024 total vec loads → 4,096 elements
                X axis (16 threads × 4 vec = 64 cols)
              ┌─────────────────────────────────────┐
   row 0..3   │ T0  T1  T2  T3  T4  T5 ... T14 T15  │   ← Y0=0, Y1=0..3
   row 4..7   │ T0  T1  T2  T3  ...                 │   ← Y0=0, Y1=4..7
   ...                                                  Y1 iterates 16 times
   row 16..   │ T16 T17 ...                         │   ← Y0=1
   row 32..   │ T32 T33 ...                         │   ← Y0=2
   row 48..63 │ T48 T49 ... T63                     │   ← Y0=3
              └─────────────────────────────────────┘

Choosing vec size

vec   X0   Y0   Y1   Loads / thread   Quality
1     64   1    64   64               worst (no vec)
2     32   2    32   32               poor
4     16   4    16   16               good
8     8    8    8    8                best (fp16 max)
Larger vec → fewer load instructions → faster, up to the 16-byte hardware limit.
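
The table rows all follow mechanically from the constraint equations. A sketch that derives them (names mirror this post, not any particular library; assumes all divisions are exact):

struct TileLayout { int X0, X1, Y0, Y1; };

// Thread layout for one 64-thread wave over an XPerTile × YPerTile tile,
// given a vector width X1 (= vec).
TileLayout derive_layout(int XPerTile, int YPerTile, int vec) {
    TileLayout t;
    t.X1 = vec;               // elements per load; X1 * sizeof(dtype) <= 16 B
    t.X0 = XPerTile / vec;    // threads along X
    t.Y0 = 64 / t.X0;         // remaining threads stack along Y
    t.Y1 = YPerTile / t.Y0;   // Y iterations (= loads) per thread
    return t;
}
// derive_layout(64, 64, 4) -> {X0=16, X1=4, Y0=4, Y1=16}: the worked example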

11. Tile distribution patterns: thread / warp / block raked

The same tile can be distributed across threads in several ways. The choice depends on the algorithm, the data layout, and the matrix instruction (MFMA) shape.

Pattern        Who covers one tile          Wave layout inside tile
thread_raked   1 wave (64 threads)          n/a (single wave)
warp_raked     Multiple waves cooperate     1D stack (waves stripe along one axis)
block_raked    All waves of the workgroup   2D grid (waves arranged in a grid)

warp_raked — 1D wave layout (4 waves stacked along Y)

                X axis (XPerTile)
              ┌─────────────────────────────────┐
              │          Wave 0                 │   each wave covers full X width
              ├─────────────────────────────────┤
              │          Wave 1                 │   1/4 of Y
              ├─────────────────────────────────┤
              │          Wave 2                 │
              ├─────────────────────────────────┤
              │          Wave 3                 │
              └─────────────────────────────────┘

block_raked — 2D wave layout (4 waves in 2×2 grid)

                X axis
              ┌───────────────┬─────────────┐
              │  Wave 0       │  Wave 1     │
              │ (X 0..63)     │ (X 64..127) │
              ├───────────────┼─────────────┤
              │  Wave 2       │  Wave 3     │
              │ (X 0..63)     │ (X 64..127) │
              └───────────────┴─────────────┘
                         warp_raked (1D)           block_raked (2D)
Wave layout              1 axis (Y)                2 axes (X × Y)
Sub-tile per wave        XPerTile × (YPerTile/N)   (XPerTile/M) × (YPerTile/M)
X coverage by one wave   full                      partial
The choice of pattern affects memory access patterns, MFMA fragment alignment, and register tile shapes. Each is best suited to different scenarios.
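
The two layouts differ only in how a wave index maps to a sub-tile origin. A sketch for the 4-wave examples above (warp_raked stacks along Y; block_raked uses a 2×2 grid):

struct Origin { int x, y; };

// warp_raked: wave w spans the full X width and a 1/4 slice of Y.
Origin warp_raked_origin(int w, int XPerTile, int YPerTile) {
    return { 0, w * (YPerTile / 4) };
}

// block_raked: 4 waves form a 2×2 grid, each owning a quarter tile.
Origin block_raked_origin(int w, int XPerTile, int YPerTile) {
    return { (w % 2) * (XPerTile / 2), (w / 2) * (YPerTile / 2) };
}
// For a 128×128 tile: warp_raked_origin(1) -> (0, 32),
// block_raked_origin(1) -> (64, 0), matching the diagrams above.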

12. Cheat sheet

Concept                     Definition
Grid                        The whole kernel launch.
Workgroup (block)           Group of threads on one CU. Shares LDS.
Wave (warp)                 64 threads (CDNA) executing in lockstep (SIMT).
Thread (work-item)          One software unit; runs on one lane.
Lane                        One hardware ALU slot in a SIMD. 64 per SIMD.
SIMD                        Hardware execution unit. 4 per CU. Can hold up to 8 resident waves.
CU                          Compute Unit. Contains 4 SIMDs and 64 KB LDS. 304 on MI300X.
VGPR                        Vector register, private per lane. 32-bit. Up to 256 per lane.
SGPR                        Scalar register, shared by all 64 lanes in a wave.
AGPR                        Accumulator register (CDNA), used for MFMA output.
LDS                         Local Data Share = software-managed shared memory. 64 KB per CU.
HBM                         Global memory. Large but slow (~500+ cycle latency).
Occupancy                   Number of resident waves on a SIMD (1–8). Higher = better latency hiding.
Latency hiding              SIMD switches to another resident wave while one waits on memory.
SIMT                        Single Instruction Multiple Threads. All 64 lanes run the same instruction.
MFMA                        Matrix Fused Multiply-Add. Wave-cooperative matrix multiply (CDNA).
WMMA                        RDNA equivalent of MFMA.
Tile                        2D chunk of a matrix processed by one workgroup.
XPerTile / YPerTile         Tile dimensions in elements (algorithm-defined).
vec / X1                    Elements per thread per load. Constrained by the 16-byte limit.
X0, Y0                      Number of threads placed along the X / Y axes.
Y1                          Y-axis iteration count per thread.
Coalescing                  Adjacent threads accessing adjacent memory → one wide HBM transaction.
Kernel launch               kernel<<< grid, block >>>(...) — grid = workgroups, block = threads/wg.
tile_distribution_pattern   How threads/waves are distributed across one tile (thread/warp/block raked).
Key takeaways
  • Threads run in waves of 64; one instruction = one wave-step.
  • Memory is the slow part. High occupancy hides memory latency.
  • Vector loads (up to 16 B) reduce instruction count dramatically.
  • Tile dimensions are chosen by the algorithm; thread layout is derived from them.
  • MFMA is wave-cooperative — cannot be split across waves.

5/03/2026

How to Access Korean-Only Websites from Overseas Using AWS EC2 (Seoul Region VPN)

Some Korean websites (government, financial, public institutions) block access from foreign IP addresses. This guide shows you how to create a quick VPN tunnel through AWS EC2 in Seoul to get a Korean IP address.

What you need: AWS account, AWS CLI installed, Terminal (macOS/Linux)


Step 1: Verify AWS CLI

Make sure AWS CLI is installed and configured:

aws --version
aws sts get-caller-identity --region ap-northeast-2

If you see your Account ID, you're good to go.


Step 2: Create a Key Pair

aws ec2 create-key-pair \
  --key-name kr-proxy-key \
  --region ap-northeast-2 \
  --query 'KeyMaterial' \
  --output text > ~/Desktop/kr-proxy-key.pem

chmod 400 ~/Desktop/kr-proxy-key.pem

Step 3: Create a Security Group

# Create security group
aws ec2 create-security-group \
  --group-name kr-proxy-sg \
  --description "SSH proxy for Korean IP access" \
  --region ap-northeast-2

# Allow SSH inbound (replace sg-xxxxx with your Group ID from above)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0 \
  --region ap-northeast-2

Step 4: Find the Latest AMI

aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=al2023-ami-2023*-x86_64" "Name=state,Values=available" \
  --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
  --output text \
  --region ap-northeast-2

Note the AMI ID (e.g. ami-09a64de684ce1ac0e).


Step 5: Launch EC2 Instance

aws ec2 run-instances \
  --image-id ami-09a64de684ce1ac0e \
  --instance-type t2.micro \
  --key-name kr-proxy-key \
  --security-group-ids sg-xxxxx \
  --associate-public-ip-address \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=kr-proxy}]' \
  --region ap-northeast-2 \
  --query 'Instances[0].InstanceId' \
  --output text

Note the Instance ID (e.g. i-0df42e446381ecc9a).

t2.micro is free-tier eligible (750 hours/month for 12 months).


Step 6: Get Public IP

Wait about 30 seconds, then:

aws ec2 describe-instances \
  --instance-ids i-xxxxx \
  --region ap-northeast-2 \
  --query 'Reservations[0].Instances[0].[State.Name, PublicIpAddress]' \
  --output text

You should see something like: running 13.125.x.x
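
Rather than guessing the timing, you can also let the CLI block until the instance is up (same i-xxxxx placeholder as above):

aws ec2 wait instance-running --instance-ids i-xxxxx --region ap-northeast-2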


Step 7: Open SSH SOCKS5 Tunnel

ssh -D 1080 -N -f \
  -o StrictHostKeyChecking=no \
  -i ~/Desktop/kr-proxy-key.pem \
  ec2-user@13.125.x.x

  • -D 1080 — Creates a SOCKS5 proxy on local port 1080
  • -N — No remote command (tunnel only)
  • -f — Run in background

Step 8: Verify Korean IP

curl --socks5-hostname localhost:1080 https://ifconfig.me

If it returns your EC2's Korean IP (e.g. 13.125.x.x), it's working!


Step 9: Configure Browser Proxy

Option A: macOS System Settings

  1. Open System Settings → Network → Wi-Fi
  2. Click Details... → Proxies
  3. Enable SOCKS Proxy
  4. Server: localhost / Port: 1080
  5. Click OK

Option B: Terminal (macOS)

sudo networksetup -setsocksfirewallproxy "Wi-Fi" localhost 1080
sudo networksetup -setsocksfirewallproxystate "Wi-Fi" on

Option C: curl only

curl --socks5-hostname localhost:1080 https://www.target-website.kr

Now open your browser and access the Korean website!


Clean Up (Important!)

When you're done, clean up everything to avoid charges:

1. Close the SSH Tunnel

pkill -f "ssh -D 1080"

2. Turn Off Browser Proxy

macOS GUI: System Settings → Network → Wi-Fi → Details → Proxies → SOCKS Proxy OFF

Terminal:

sudo networksetup -setsocksfirewallproxystate "Wi-Fi" off

# Verify
networksetup -getsocksfirewallproxy "Wi-Fi"
# Should show: Enabled: No

3. Terminate EC2 Instance

aws ec2 terminate-instances \
  --instance-ids i-xxxxx \
  --region ap-northeast-2

4. Delete Security Group

Wait about 30 seconds after termination, then:

aws ec2 delete-security-group \
  --group-id sg-xxxxx \
  --region ap-northeast-2
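
If the delete fails because the instance is still shutting down, block on termination first instead of counting seconds:

aws ec2 wait instance-terminated --instance-ids i-xxxxx --region ap-northeast-2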

5. Delete Key Pair

aws ec2 delete-key-pair \
  --key-name kr-proxy-key \
  --region ap-northeast-2

rm ~/Desktop/kr-proxy-key.pem

Cost Summary

Item                         Cost
t2.micro (free tier)         Free (750 hrs/month, 12 months)
t2.micro (after free tier)   ~$0.0116/hr (~$8.5/month)
Data transfer                Free up to 100 GB/month

Tip: If you want to keep the instance for later, Stop it instead of terminating. You only pay for EBS storage (~$0.10/GB/month) while stopped.


Quick Reference: Reconnect Later

If you stopped (not terminated) the instance:

# Start the instance
aws ec2 start-instances --instance-ids i-xxxxx --region ap-northeast-2

# Wait ~30 seconds, get new public IP
aws ec2 describe-instances \
  --instance-ids i-xxxxx \
  --region ap-northeast-2 \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text

# Open tunnel
ssh -D 1080 -N -f -i ~/Desktop/kr-proxy-key.pem ec2-user@NEW-IP

# Set proxy
sudo networksetup -setsocksfirewallproxy "Wi-Fi" localhost 1080
sudo networksetup -setsocksfirewallproxystate "Wi-Fi" on