5/06/2026

AMD GPU Programming Primer — Threads, Waves, Tiles & Vector Loads

AMD GPU Programming Primer

Threads · Waves · Memory · Tile Distribution · Vector Loads · MFMA

1. The execution hierarchy: grid → workgroup → wave → thread

A GPU kernel launch is a hierarchy of work units. Bigger units contain smaller ones.

AMD termNVIDIA termWhat it is
GridGridThe whole kernel launch — covers the entire problem.
WorkgroupBlock / threadblockA group of threads on one Compute Unit (CU). Shares LDS (shared memory). Can synchronize via __syncthreads().
Wavefront (wave)Warp64 threads (AMD CDNA) executing the same instruction simultaneously (SIMT).
Thread (work-item)ThreadOne lane in a wave. Has its own thread ID and register state.
GRID (kernel launch — covers the whole problem)
Workgroup 0 (256 threads)
Wave 0 (T0..T63)
Wave 1 (T64..T127)
Wave 2 (T128..T191)
Wave 3 (T192..T255)
Workgroup 1 (256 threads)
Wave 0..3 (64 threads each)
Workgroup N−1
Wave 0..3
Key: 64 threads in a wave always execute the same instruction in lock-step. That is the essence of SIMT (Single Instruction Multiple Threads).

2. Lane vs thread

"Lane" and "thread" are two views of the same physical execution slot.

  • Lane = a hardware ALU slot inside a SIMD unit. There are exactly 64 lanes per SIMD on AMD CDNA.
  • Thread = the software view of one lane. Has its own thread ID and private registers.

One lane runs one thread at a time. They are 1:1 within an executing wave.

1 wave (= 64 threads) running on 1 SIMD:

   Lane 0  ↔  Thread 0    (running my_function with tid=0)
   Lane 1  ↔  Thread 1    (running my_function with tid=1)
   Lane 2  ↔  Thread 2
    ...
   Lane 63 ↔  Thread 63

All 64 lanes execute the same instruction at the same cycle.

3. Hardware: GPU → CU → SIMD → lane

Below the software hierarchy is the physical hardware:

  • GPU contains many CUs (Compute Units). Example: MI300X has 304 CUs.
  • CU contains 4 SIMD units. The 4 SIMDs in a CU operate in parallel.
  • SIMD contains 64 lanes (ALUs) and a register file that can hold up to 8 resident waves.
GPU
├─ CU 0
│   ├─ SIMD 0  (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 1  (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 2  (64 lanes, ≤ 8 resident waves)
│   ├─ SIMD 3  (64 lanes, ≤ 8 resident waves)
│   └─ LDS (shared memory, 64 KB)
├─ CU 1
├─ ...
└─ CU 303      ← total 304 CUs on MI300X
SIMD ≠ instant execution. A SIMD holds up to 8 waves in its register file but executes only one wave per cycle. With multiple waves resident, when one wave waits for memory, the SIMD switches to another. This is latency hiding.

4. Registers, VGPRs & occupancy

Register types

TypeSizeScopeNotes
VGPR (vector GPR)32-bit (4 B)Private per laneUp to 256 per lane per wave. Each lane sees its own VGPR.
SGPR (scalar GPR)32-bit (4 B)Shared by 64 lanesUsed for scalar values like loop counters, addresses.
AGPR (accumulator GPR)32-bit (4 B)Private per laneCDNA-only. Used as MFMA accumulators.

How much register memory does one lane have?

1 lane × 256 VGPRs × 4 bytes = 1 KB per lane
1 wave (64 lanes) × 1 KB = 64 KB total register file used by one wave

Occupancy

Occupancy = number of waves resident on a SIMD (1 to 8). Higher occupancy enables better latency hiding.

If a wave uses 256 VGPRs/lane → only 1 wave fits in SIMD → occupancy 1
If a wave uses 128 VGPRs/lane → 2 waves fit                → occupancy 2
If a wave uses  32 VGPRs/lane → 8 waves fit                → occupancy 8 (max)

More resident waves = SIMD can switch when one wave stalls on memory.
Trade-off: using more VGPRs per thread means each thread can hold more data, but fewer waves can be resident, reducing latency hiding.

5. Memory hierarchy: registers → LDS → cache → HBM

GPU memory has multiple levels, similar to CPU cache hierarchy:

LevelSize (per CU / total)Latency (cycles)Managed byCPU analogue
Registers (VGPR/SGPR)~256 KB / CU~1CompilerCPU registers
LDS (shared memory)64 KB / CU~10–30Software (explicit loads/stores)Scratchpad / fast SRAM
L1 cache16 KB / CU~30Hardware (transparent)L1 cache
L2 cache~16 MB total~150HardwareL2 cache
Infinity / L3~256 MB~300HardwareL3 cache
HBM (global memory)192 GB~500–1000HW + softwareDRAM
Key insight: registers are basically free (~1 cycle), HBM is very expensive (~500+ cycles). Performance comes from staging data through LDS and registers, and from hiding HBM latency with high occupancy.

6. Kernel launch <<< grid, block >>>

HIP/CUDA kernel launch syntax:

add_kernel<<< grid, block >>>(A, B, C, N);
ParameterMeaningExample
block (a.k.a. blockSize)Threads per workgroup256 → 4 waves per workgroup
gridNumber of workgroups4 → 4 workgroups total
(implicit)Wave count = blockSize / 64256 / 64 = 4 waves per workgroup

You don't pick the wave count directly — it is derived from blockSize. The hardware always groups threads into waves of 64 on CDNA.

add_kernel<<< grid=4, block=256 >>>(...)

Total threads = 4 × 256 = 1024
Total waves   = 1024 / 64 = 16
Total workgroups = 4

Each workgroup → one CU
Each wave → one SIMD inside that CU

7. Vector loads & the 16-byte rule

A single load instruction can pull up to 16 bytes into a thread's registers. This is the hardware limit on AMD CDNA.

The number of elements per load (called vec size) depends on the data type:

Data typeSize (B)vec=1vec=2vec=4vec=8vec=16
fp16 / bf1622 B4 B8 B16 B (max)
fp3244 B8 B16 B (max)
int811 B2 B4 B8 B16 B (max)
Rule: vec × sizeof(dtype) ≤ 16 byte. Larger vec means fewer load instructions to move the same amount of data — faster.

Example GPU instructions

fp16, vec=1 (2 B):
   global_load_ushort  v0,    v[1:2]    ; load 2 bytes (1 fp16)
fp16, vec=4 (8 B):
   global_load_dwordx2 v[0:1], v[2:3]   ; load 8 bytes (4 fp16)
fp16, vec=8 (16 B):
   global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (8 fp16) - MAX
fp32, vec=4 (16 B):
   global_load_dwordx4 v[0:3], v[4:5]   ; load 16 bytes (4 fp32) - MAX

8. Walking through a simple kernel

__global__ void add_kernel(float* A, float* B, float* C, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (tid < N) {
        float a = A[tid];        // load (HBM → register)
        float b = B[tid];        // load (HBM → register)
        float c = a + b;         // ALU (register-to-register, ~1 cycle)
        C[tid] = c;              // store (register → HBM)
    }
}

add_kernel<<< 4, 256 >>>(A, B, C, 1024);

What happens per cycle (assuming occupancy 1, the worst case):

cycle:  1     2..99    100   101..199   200    201
        ----  ------   ----  --------   ----   -----
inst:   load   wait    load    wait     add    store
        A     (idle)   B      (idle)
                                                  ↓
                                          ALU only busy 2 cycles out of 200.

Instructions in this kernel: 4. Actual cycles: ~200. The reason: each HBM load takes ~100 cycles to complete, even though issuing it takes 1 cycle. With occupancy 1, the SIMD has nothing else to do but wait.

If occupancy were 4, the SIMD would switch to other waves during the wait, keeping the ALU busy on every cycle. This is why high occupancy matters.

9. MFMA: cooperative matrix multiply

MFMA (Matrix Fused Multiply-Add) instructions are wave-cooperative: all 64 lanes work together to compute a small matrix multiply (e.g., 16×16).

v_mfma_f32_16x16x16_f16   acc, a_frag, b_frag, c_frag

  64 lanes cooperatively compute D = A × B + C
  where A is 16×16 fp16, B is 16×16 fp16, D is 16×16 fp32

  Each lane holds a small piece of A, B, and accumulates a small piece of D.
  The hardware exchanges data between lanes during execution.

  Latency: ~8–32 cycles (NOT 1 cycle), but throughput is enormous:
    16×16×16 = 4,096 multiply-adds per instruction per wave
Important: MFMA is a wave-level instruction. It cannot be split across waves — one wave executes one MFMA. To compute a larger matrix multiply, multiple waves issue multiple MFMAs (covering different tiles).

CDNA vs RDNA

  • CDNA (data-center: gfx90a, gfx942, gfx950): MFMA available.
  • RDNA (consumer: gfx11, gfx12): no MFMA. Has WMMA (Wave Matrix Multiply-Accumulate) instead with similar idea.

10. Tiles & how a wave fills a tile (X0, Y0, X1, Y1)

What is a tile?

A tile is a 2D chunk of a matrix that one workgroup (or one wave) processes. GPU kernels divide a big problem into many small tiles.

Big matrix (e.g., 1024 × 1024)
divided into tiles of 64 × 64:

       X →
       ┌────────────────────────┐
   Y   │ t0  t1  t2  ...  t15   │
   ↓   │ t16 t17 ...            │   16 × 16 = 256 tiles
       │ ...                    │   each handled by one workgroup
       │ t240 ...          t255 │
       └────────────────────────┘

How does a 64-thread wave fill a 64×64 tile?

One 64×64 tile = 4,096 elements. One wave = 64 threads. Each thread is responsible for 4096/64 = 64 elements.

Those 64 elements per thread are split between two axes:

SymbolMeaning
X1 (= vec)Number of elements one thread loads in one instruction (X direction).
X0Number of threads placed along the X axis.
Y0Number of threads placed along the Y axis.
Y1Number of times each thread iterates along the Y axis.

Constraints:

X0 × Y0     = 64           ← total threads (wave size)
X0 × X1     = XPerTile     ← X axis fully covered
Y0 × Y1     = YPerTile     ← Y axis fully covered (with iteration)
X1 × sizeof ≤ 16 byte      ← hardware load limit

Why the X axis can use vec loads

Memory is 1D, but we view it as 2D (row-major):

memory:  [a][b][c][d] [e][f][g][h] [i][j][k][l] [m][n][o][p]
         ─────────── ─────────── ─────────── ───────────
            row 0       row 1       row 2       row 3

X direction: addresses +1 (contiguous)  → one instruction can load 4/8/16 bytes
Y direction: addresses +width (strided) → needs separate instructions per row

Worked example: 64×64 tile, fp16, vec=4

X1 = 4 (vec)
X0 = XPerTile / X1 = 64 / 4 = 16
Y0 = wave / X0     = 64 / 16 = 4
Y1 = YPerTile / Y0 = 64 / 4  = 16

Per thread:    X1 × Y1 = 4 × 16 = 64 elements
Per wave:      64 threads × 64 elements = 4,096 elements ✓
Load count:    16 (per thread) → 1024 total vec loads → 4,096 elements
                X axis (16 threads × 4 vec = 64 cols)
              ┌─────────────────────────────────────┐
   row 0..3   │ T0  T1  T2  T3  T4  T5 ... T14 T15  │   ← Y0=0, Y1=0..3
   row 4..7   │ T0  T1  T2  T3  ...                 │   ← Y0=0, Y1=4..7
   ...                                                  Y1 iterates 16 times
   row 16..   │ T16 T17 ...                         │   ← Y0=1
   row 32..   │ T32 T33 ...                         │   ← Y0=2
   row 48..63 │ T48 T49 ... T63                     │   ← Y0=3
              └─────────────────────────────────────┘

Choosing vec size

vecX0Y0Y1Loads / threadQuality
16416464worst (no vec)
23223232poor
41641616good
88888best (fp16 max)
Larger vec → fewer load instructions → faster, up to the 16-byte hardware limit.

11. Tile distribution patterns: thread / warp / block raked

The same tile can be distributed across threads in several ways. The choice depends on the algorithm, the data layout, and the matrix instruction (MFMA) shape.

PatternWho covers one tileWave layout inside tile
thread_raked1 wave (64 threads)n/a (single wave)
warp_rakedMultiple waves cooperate1D stack (waves stripe along one axis)
block_rakedAll waves of the workgroup2D grid (waves arranged in a grid)

warp_raked — 1D wave layout (4 waves stacked along Y)

                X axis (XPerTile)
              ┌─────────────────────────────────┐
              │          Wave 0                 │   each wave covers full X width
              ├─────────────────────────────────┤
              │          Wave 1                 │   1/4 of Y
              ├─────────────────────────────────┤
              │          Wave 2                 │
              ├─────────────────────────────────┤
              │          Wave 3                 │
              └─────────────────────────────────┘

block_raked — 2D wave layout (4 waves in 2×2 grid)

                X axis
              ┌───────────────┬─────────────┐
              │  Wave 0       │  Wave 1     │
              │ (X 0..63)     │ (X 64..127) │
              ├───────────────┼─────────────┤
              │  Wave 2       │  Wave 3     │
              │ (X 0..63)     │ (X 64..127) │
              └───────────────┴─────────────┘
warp_raked (1D)block_raked (2D)
Wave layout1 axis (Y)2 axes (X × Y)
Sub-tile per waveXPerTile × (YPerTile/N)(XPerTile/M) × (YPerTile/M)
X coverage by one wavefullpartial
The choice of pattern affects memory access patterns, MFMA fragment alignment, and register tile shapes. Each is best suited to different scenarios.

12. Cheat sheet

ConceptDefinition
GridThe whole kernel launch.
Workgroup (block)Group of threads on one CU. Shares LDS.
Wave (warp)64 threads (CDNA) executing in lockstep (SIMT).
Thread (work-item)One software unit; runs on one lane.
LaneOne hardware ALU slot in a SIMD. 64 per SIMD.
SIMDHardware execution unit. 4 per CU. Can hold up to 8 resident waves.
CUCompute Unit. Contains 4 SIMDs and 64 KB LDS. ~304 per MI300X.
VGPRVector register, private per lane. 32-bit. Up to 256 per lane.
SGPRScalar register, shared by all 64 lanes in a wave.
AGPRAccumulator register (CDNA), used for MFMA output.
LDSLocal Data Share = software-managed shared memory. 64 KB per CU.
HBMGlobal memory. Large but slow (~500+ cycle latency).
OccupancyNumber of resident waves on a SIMD (1–8). Higher = better latency hiding.
Latency hidingSIMD switches to another resident wave while one waits on memory.
SIMTSingle Instruction Multiple Threads. All 64 lanes run the same instruction.
MFMAMatrix Fused Multiply-Add. Wave-cooperative matrix multiply (CDNA).
WMMARDNA equivalent of MFMA.
Tile2D chunk of a matrix processed by one workgroup.
XPerTile / YPerTileTile dimensions in elements (algorithm-defined).
vec / X1Elements per thread per load. Constrained by 16-byte limit.
X0, Y0Number of threads placed along X / Y axes.
Y1Y-axis iteration count per thread.
CoalescingAdjacent threads accessing adjacent memory → one wide HBM transaction.
Kernel launchkernel<<< grid, block >>>(...) — grid = workgroups, block = threads/wg.
tile_distribution_patternHow threads/waves are distributed across one tile (thread/warp/block raked).
Key takeaways
  • Threads run in waves of 64; one instruction = one wave-step.
  • Memory is the slow part. High occupancy hides memory latency.
  • Vector loads (up to 16 B) reduce instruction count dramatically.
  • Tile dimensions are chosen by the algorithm; thread layout is derived from them.
  • MFMA is wave-cooperative — cannot be split across waves.

5/03/2026

How to Access Korean-Only Websites from Overseas Using AWS EC2 (Seoul Region VPN)

How to Access Korean-Only Websites from Overseas Using AWS EC2 (Seoul Region VPN)

Some Korean websites (government, financial, public institutions) block access from foreign IP addresses. This guide shows you how to create a quick VPN tunnel through AWS EC2 in Seoul to get a Korean IP address.

What you need: AWS account, AWS CLI installed, Terminal (macOS/Linux)


Step 1: Verify AWS CLI

Make sure AWS CLI is installed and configured:

aws --version
aws sts get-caller-identity --region ap-northeast-2

If you see your Account ID, you're good to go.


Step 2: Create a Key Pair

aws ec2 create-key-pair \
  --key-name kr-proxy-key \
  --region ap-northeast-2 \
  --query 'KeyMaterial' \
  --output text > ~/Desktop/kr-proxy-key.pem

chmod 400 ~/Desktop/kr-proxy-key.pem

Step 3: Create a Security Group

# Create security group
aws ec2 create-security-group \
  --group-name kr-proxy-sg \
  --description "SSH proxy for Korean IP access" \
  --region ap-northeast-2

# Allow SSH inbound (replace sg-xxxxx with your Group ID from above)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0 \
  --region ap-northeast-2

Step 4: Find the Latest AMI

aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=al2023-ami-2023*-x86_64" "Name=state,Values=available" \
  --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
  --output text \
  --region ap-northeast-2

Note the AMI ID (e.g. ami-09a64de684ce1ac0e).


Step 5: Launch EC2 Instance

aws ec2 run-instances \
  --image-id ami-09a64de684ce1ac0e \
  --instance-type t2.micro \
  --key-name kr-proxy-key \
  --security-group-ids sg-xxxxx \
  --associate-public-ip-address \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=kr-proxy}]' \
  --region ap-northeast-2 \
  --query 'Instances[0].InstanceId' \
  --output text

Note the Instance ID (e.g. i-0df42e446381ecc9a).

t2.micro is free-tier eligible (750 hours/month for 12 months).


Step 6: Get Public IP

Wait about 30 seconds, then:

aws ec2 describe-instances \
  --instance-ids i-xxxxx \
  --region ap-northeast-2 \
  --query 'Reservations[0].Instances[0].[State.Name, PublicIpAddress]' \
  --output text

You should see something like: running 13.125.x.x


Step 7: Open SSH SOCKS5 Tunnel

ssh -D 1080 -N -f \
  -o StrictHostKeyChecking=no \
  -i ~/Desktop/kr-proxy-key.pem \
  ec2-user@13.125.x.x
  • -D 1080 — Creates a SOCKS5 proxy on local port 1080
  • -N — No remote command (tunnel only)
  • -f — Run in background

Step 8: Verify Korean IP

curl --socks5-hostname localhost:1080 https://ifconfig.me

If it returns your EC2's Korean IP (e.g. 13.125.x.x), it's working!


Step 9: Configure Browser Proxy

Option A: macOS System Settings

  1. Open System SettingsNetworkWi-Fi
  2. Click Details...Proxies
  3. Enable SOCKS Proxy
  4. Server: localhost / Port: 1080
  5. Click OK

Option B: Terminal (macOS)

sudo networksetup -setsocksfirewallproxy "Wi-Fi" localhost 1080
sudo networksetup -setsocksfirewallproxystate "Wi-Fi" on

Option C: curl only

curl --socks5-hostname localhost:1080 https://www.target-website.kr

Now open your browser and access the Korean website!


Clean Up (Important!)

When you're done, clean up everything to avoid charges:

1. Close the SSH Tunnel

pkill -f "ssh -D 1080"

2. Turn Off Browser Proxy

macOS GUI: System Settings → Network → Wi-Fi → Details → Proxies → SOCKS Proxy OFF

Terminal:

sudo networksetup -setsocksfirewallproxystate "Wi-Fi" off

# Verify
networksetup -getsocksfirewallproxy "Wi-Fi"
# Should show: Enabled: No

3. Terminate EC2 Instance

aws ec2 terminate-instances \
  --instance-ids i-xxxxx \
  --region ap-northeast-2

4. Delete Security Group

Wait about 30 seconds after termination, then:

aws ec2 delete-security-group \
  --group-id sg-xxxxx \
  --region ap-northeast-2

5. Delete Key Pair

aws ec2 delete-key-pair \
  --key-name kr-proxy-key \
  --region ap-northeast-2

rm ~/Desktop/kr-proxy-key.pem

Cost Summary

Item Cost
t2.micro (free tier) Free (750 hrs/month, 12 months)
t2.micro (after free tier) ~$0.0116/hr (~$8.5/month)
Data transfer Free up to 100GB/month

Tip: If you want to keep the instance for later, Stop it instead of terminating. You only pay for EBS storage (~$0.10/GB/month) while stopped.


Quick Reference: Reconnect Later

If you stopped (not terminated) the instance:

# Start the instance
aws ec2 start-instances --instance-ids i-xxxxx --region ap-northeast-2

# Wait ~30 seconds, get new public IP
aws ec2 describe-instances \
  --instance-ids i-xxxxx \
  --region ap-northeast-2 \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text

# Open tunnel
ssh -D 1080 -N -f -i ~/Desktop/kr-proxy-key.pem ec2-user@NEW-IP

# Set proxy
sudo networksetup -setsocksfirewallproxy "Wi-Fi" localhost 1080
sudo networksetup -setsocksfirewallproxystate "Wi-Fi" on

3/08/2026

How to Apply for a D-U-N-S Number

1. What is a D-U-N-S Number?

A D-U-N-S Number (Data Universal Numbering System) is a unique 9-digit business identifier issued by Dun & Bradstreet (D&B). It is used globally to identify and verify businesses for purposes such as:

  • Enrolling in the Apple Developer Program as an organization
  • Applying for government contracts and grants
  • Establishing business credit profiles
  • Vendor and supplier registration
💡 Apple requires a D-U-N-S Number to verify your organization's legal identity before issuing an Apple Developer team account. Individual accounts do not need one.

2. Prerequisites — Documents to Prepare

Before starting the application, gather the following information and documents:

ItemDescription
Business NameFull legal name exactly as registered
Registration NumberTax ID or company registration number
Business AddressFull address including postal code and country
Phone NumberInternational format: +[country code][number]
CEO / Owner InfoFull legal name and title (e.g., Owner, CEO)
WebsiteCompany website URL (if available)
Date FoundedOfficial business establishment date
Employee CountTotal headcount including yourself
Registration CertificateEnglish version — PDF format preferred
⚠️ Non-English documents must be translated. Check your country's official tax authority or business registration portal — many offer an official English-language certificate export option directly.

3. Go to the D&B Application Portal

Navigate to the appropriate portal based on your use case:

Apple Developer Program

👉 https://support.dnb.com/?CUST=APPLEDEV

Use this URL if you are applying specifically to enroll in the Apple Developer Program as an organization.

General D-U-N-S Application

👉 https://www.dnb.com/duns-number/get-a-duns.html

Use this for general business registration, government contracts, or other purposes.

4. Select Your Request Type

On the Apple Developer D&B portal, you will be asked two questions:

Step 4a

Under "What are you?" → select Developer Program

Step 4b

Under "What do you need?" → select Create New DUNS

5. Search for an Existing D-U-N-S

The portal checks whether your business already has a D-U-N-S number before creating a new one.

Step 5a

Enter your business name and full address in the search fields.

Step 5b

Click "Lookup by Name / Address".

Step 5c — Expected Result

For new businesses: you will see "No Match found" — this is completely normal and expected.

Step 5d

Click "click here to submit a request to create a new D-U-N-S" to open the full application form.

✅ If a match is found, your business already has a D-U-N-S. You can claim or update it instead of creating a new one — contact D&B support directly.

6. Fill Out the Application Form

Complete all fields in the Create D-U-N-S Number form. All fields must exactly match your official registration documents.

FieldWhat to Enter
Full Legal Business NameExact name as on registration documents
Business Registration NumberYour tax ID / company ID number
Company PhoneInternational format (e.g., +358501234567)
Street AddressFull street address
CityCity name
State / ProvinceState or province (if applicable)
Postal CodeZIP / postal code
CountryYour country
Business Structuree.g., Sole Proprietorship, LLC, Corporation
CEO / Owner NameFull legal name
CEO / Owner Titlee.g., Owner, CEO, Director
WebsiteCompany website URL
Home-Based BusinessYes / No
Number of EmployeesTotal headcount (including yourself)
Date FoundedOfficial establishment date (MM/DD/YYYY)
⚠️ Exact match required. The business name, address, and registration number must match your official documents precisely. Even minor differences (abbreviations, punctuation) can cause rejection.

Click "Next" to advance to the document attachment page.

7. Attach Supporting Documents

Upload a document to verify your business registration. This is required — submissions without a document attachment are typically rejected.

Accepted Documents

✅ English Business Registration Certificate (strongly preferred)

✅ Official government-issued business license

✅ Any official document showing your business name, registration number, and address

📄 PDF format is recommended. File size is typically not an issue, but keep it under 5MB to be safe. Make sure the document is clearly legible and all key information is visible.

8. Add Notes & Submit

There is usually a free-text Additional Details / Notes field. Always fill this in — it helps the D&B reviewer understand your submission and can speed up approval.

Example — First-time application

Sample note text
This is a new D-U-N-S number request for [Your Business Name], a sole proprietorship registered in [Country] since [Year]. Business is engaged in [industry, e.g. software development]. English Business Registration Certificate is attached.

Example — Resubmission after rejection

Sample note text
Resubmitting after previous rejection (Case #XXXXXXXX) which was closed due to missing English Business Registration Certificate. The English certificate is now attached. Business registration number: [Your Number]. Business started: [Date].

After reviewing all information, click Submit.

💾 Save your case number immediately after submission — you will need it to track your application status or to reference in any resubmission.

9. What Happens Next?

After submission, D&B will review your request. Here is the typical timeline:

ImmediatelyOn-screen confirmation + case number displayed. Check your email for a confirmation from D&B (check spam).
1–2 business daysD&B reviews your submitted information and documents.
3–5 business daysD-U-N-S Number issued and emailed to you.
Up to 30 daysIn some cases D&B may request additional documentation.
📬 D&B will send your D-U-N-S Number by email. Once received, the number is also searchable via the D&B portal. Apple recommends waiting 24–48 hours after receiving your number before using it for developer enrollment.

10. Troubleshooting: Common Rejection Reasons

IssueSolution
Missing English documentObtain an official English translation of your registration certificate and resubmit
Business name mismatchEnsure the name on the form exactly matches your official documents (character for character)
Address not recognizedUse the exact address format from your registration documents
Duplicate D-U-N-S foundContact D&B to claim or update the existing record instead of creating a new one
Case closed, no responseResubmit the form and reference the previous case number in the notes field
Document unreadableRe-export or rescan the document as a clear, high-resolution PDF

11. Using Your D-U-N-S for Apple Developer Enrollment

Once you have your D-U-N-S Number:

Step 2

Select "Enroll as an Organization" (not Individual).

Step 3

Enter your D-U-N-S Number when prompted. Apple will verify it against D&B's database.

Step 4

Allow 24–48 hours after receiving your D-U-N-S before enrolling — the number needs time to propagate in D&B's system.

✅ The Apple verification process typically takes a few days after enrollment. Apple may call your registered business phone number to verify your identity as the organization's legal representative.

Final Checklist

  • Prepared English business registration certificate (PDF)
  • Navigated to the D&B portal (APPLEDEV or general)
  • Selected correct request type (Developer Program → Create New DUNS)
  • Searched for existing D-U-N-S — confirmed no match
  • Filled out all form fields matching official documents exactly
  • Uploaded supporting document (English certificate)
  • Added descriptive notes in the Additional Details field
  • Submitted the form and saved the case number
  • Waiting for D&B email confirmation (3–5 business days)

Last updated: March 2026 · For general informational purposes

Main image by Syawish Rehman on Unsplash

1/09/2026

🚙🚙 AUTOMATIC NUMBER PLATE RECOGNITION (ANPR, LPR, ALPR) solution

Step-by-Step: Building MIOpen and hipDNN with Ninja

 

Step-by-Step: Building MIOpen and hipDNN with Ninja

Step 1: Navigate to the monorepo

git clone https://github.com/ROCm/rocm-libraries.git
>cd rocm-libraries

Step 2: Create a build directory

Step 3: Configure with CMake using Ninja

What this does:
  • -GNinja - Use Ninja build system (faster than Make)
  • -DCMAKE_BUILD_TYPE=Release - Optimized build
  • -DCMAKE_INSTALL_PREFIX=/opt/rocm - Install location
  • -DCMAKE_PREFIX_PATH=/opt/rocm - Where to find ROCm dependencies
  • -DROCM_LIBS_ENABLE_COMPONENTS="miopen;hipdnn" - Build only these two projects + their dependencies

Step 4: Build both projects

Or if you want to limit parallel jobs (e.g., for memory constraints):

Step 5: Run tests (optional)

Step 6: Install (optional)


Alternative: Build with specific options

If you need more control, you can add these flags:

Useful Ninja Commands


Expected Build Time

  • Configuration (CMake)~30 seconds to 2 minutes
  • MIOpen build: ~20-40 minutes (first time)
  • hipDNN build: ~5-10 minutes
  • Total~30-50 minutes on MI250X system

What Gets Built

After successful build, you'll have:

If You Encounter Issues

  1. Missing dependencies error:
  1. Out of memory during build:
  1. See what would be built: