Ceiling Division for Tile Counts:
Add the maximum possible remainder (TILE-1) before dividing!
Simple Example with TILE=10:
Let's walk through it with TILE=10 so the arithmetic is easy to follow:
Without ceiling (wrong):
11 / 10 = 1 (need 2!) ❌
19 / 10 = 1 (need 2!) ❌
With ceiling technique:
(11 + 9) / 10 = 20 / 10 = 2 ✓
(19 + 9) / 10 = 28 / 10 = 2 ✓
(20 + 9) / 10 = 29 / 10 = 2 ✓ (still 2, correct!)
(21 + 9) / 10 = 30 / 10 = 3 ✓
Why Add (TILE-1)?
Think of it like this:
If remainder = 0 (perfectly divisible):
20 / 10 = 2.0
(20 + 9) / 10 = 29 / 10 = 2.9 → 2 (same!)
If remainder > 0 (needs extra block):
21 / 10 = 2.1 (remainder 1)
(21 + 9) / 10 = 30 / 10 = 3.0 → 3 (pushed to next!)
The Magic:
Adding (TILE-1) to the numerator adds (TILE-1)/TILE, just under 1, to the quotient:
- Remainder 0: the quotient rises by less than 1, so the integer result stays the same
- Remainder ≥ 1: the quotient is pushed to at least the next integer, so the result rounds up
Visual Pattern:
Value | +9 | /10 | Result
------|----|-----|-------
  10  | 19 | 1.9 | 1 ✓
  11  | 20 | 2.0 | 2 ✓ (jumps up!)
  19  | 28 | 2.8 | 2 ✓
  20  | 29 | 2.9 | 2 ✓
  21  | 30 | 3.0 | 3 ✓ (jumps up!)
Formula Summary:
// Ceiling division formula:
ceil(A/B) = (A + B - 1) / B
// For our GEMM tiles:
num_blocks = (matrix_size + tile_size - 1) / tile_size
It's simple: "Add almost one tile, then divide" - this guarantees rounding up!
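You can verify the formula directly. Below is a minimal Python sketch (TILE=10 to match the examples above; the helper name ceil_div is my own):

import math

TILE = 10

def ceil_div(a, b):
    # "Add almost one tile, then divide": integer ceiling division
    return (a + b - 1) // b

# Check against the true ceiling for a range of sizes
for n in range(1, 101):
    assert ceil_div(n, TILE) == math.ceil(n / TILE)

print(ceil_div(11, TILE))  # 2
print(ceil_div(20, TILE))  # 2
print(ceil_div(21, TILE))  # 3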
Concepts Added:
Key Pattern:
// Each thread computes one output element C[row][col]
float sum = 0.0f;
for (int k = 0; k < K; k++) {
    sum += A[row][k] * B[k][col];
}
C[row][col] = sum;
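For intuition, here is the same pattern in plain Python/NumPy (a reference sketch, not CK Tile code): each (row, col) pair plays the role of one thread, accumulating a K-length dot product.

import numpy as np

M, N, K = 4, 4, 8
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

# One (row, col) pair per "thread": accumulate over the shared dimension K
for row in range(M):
    for col in range(N):
        acc = np.float32(0.0)
        for k in range(K):
            acc += A[row, k] * B[k, col]
        C[row, col] = acc

assert np.allclose(C, A @ B, atol=1e-4)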
Concepts:
Key Code:
struct VectorAddKernel {
    __device__ void operator()(const float* a, const float* b, float* c, int n) const {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)  // bounds check: the last block may overshoot n
            c[idx] = a[idx] + b[idx];
    }
};
Result:
CK Tile Tutorial - Step 1: Vector Addition
==========================================
=== Step 1: Vector Addition ===
Adding two vectors of size 1024
Launching with grid(4), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3
=== Step 1: Vector Addition ===
Adding two vectors of size 10000
Launching with grid(40), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3
Key Concepts Demonstrated:
1. Kernel structure: Args → Kernel → operator()
2. Grid/Block configuration
3. Each thread processes one element
4. Bounds checking for safety
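The grid sizes in the output above come straight from the ceiling-division formula. A quick check in Python (block size 256, as in the launch messages):

BLOCK = 256

def grid_size(n):
    # Same (A + B - 1) / B ceiling division as the GEMM tile count
    return (n + BLOCK - 1) // BLOCK

print(grid_size(1024))   # 4  -> grid(4), block(256)
print(grid_size(10000))  # 40 -> grid(40), block(256)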
Installing ONNX Runtime for ROCm:
# 1. Remove existing ONNX Runtime (if any)
pip uninstall -y onnxruntime onnxruntime-gpu
# 2. Install from AMD ROCm repository
# For ROCm 6.4
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4/
# For ROCm 6.2
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2/
# For ROCm 6.0
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0/
import onnxruntime as ort
# Check available providers
print("Available providers:", ort.get_available_providers())
# Should show: ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']
import onnxruntime as ort
import numpy as np
# Load ONNX model with ROCm
providers = ['ROCMExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)
# Check which provider is being used
print(f"Using: {session.get_providers()[0]}")
# Prepare input (example: batch_size=1, 3 channels, 640x640 image)
input_data = np.random.randn(1, 3, 640, 640).astype(np.float32)
# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
print(f"Output shape: {output[0].shape}")
# MIGraphX is AMD's optimized graph execution provider
# It can be faster than ROCMExecutionProvider for some models
providers = [
'MIGraphXExecutionProvider', # Fastest on AMD
'ROCMExecutionProvider', # Standard ROCm
'CPUExecutionProvider' # Fallback
]
session = ort.InferenceSession("model.onnx", providers=providers)
import onnxruntime as ort
import numpy as np
import cv2

def load_model(model_path, use_gpu=True):
    """Load ONNX model with ROCm support"""
    if use_gpu:
        providers = ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']
    else:
        providers = ['CPUExecutionProvider']
    session = ort.InferenceSession(model_path, providers=providers)
    print(f"Model loaded with: {session.get_providers()[0]}")
    return session

def preprocess_image(image_path, size=640):
    """Preprocess image for inference"""
    image = cv2.imread(image_path)
    resized = cv2.resize(image, (size, size))
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    normalized = rgb.astype(np.float32) / 255.0
    transposed = normalized.transpose(2, 0, 1)  # HWC to CHW
    batched = np.expand_dims(transposed, axis=0)  # Add batch dimension
    return batched, image

def run_inference(session, input_data):
    """Run model inference"""
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})
    return outputs

# Usage
model = load_model("rtdetr_fp32.onnx", use_gpu=True)
input_data, original_image = preprocess_image("test.jpg")
outputs = run_inference(model, input_data)
print(f"Detection output shape: {outputs[0].shape}")
Troubleshooting:
# Check ROCm installation
import subprocess
result = subprocess.run(['rocm-smi'], capture_output=True, text=True)
print(result.stdout)
If ONNX Runtime falls back to CPU despite having ROCm:
- Check that the GPU is visible: rocm-smi
- Pin a specific device: export HIP_VISIBLE_DEVICES=0
- Prefer MIGraphXExecutionProvider for best performance on AMD GPUs
Useful environment variables:
# Select specific GPU
export HIP_VISIBLE_DEVICES=0
# Enable verbose logging
export ORT_ROCM_VERBOSE_LEVEL=1
# Set memory limit (in MB)
export ORT_ROCM_MEM_LIMIT=4096
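To catch a silent CPU fallback from code rather than by eye, you can assert on the active provider after creating the session (a small sketch; "model.onnx" is a placeholder path):

import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider'],
)
active = session.get_providers()[0]
if active == 'CPUExecutionProvider':
    raise RuntimeError("Fell back to CPU: check rocm-smi and HIP_VISIBLE_DEVICES")
print(f"Running on: {active}")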
Provider Comparison:
Provider | Relative Speed | Use Case
---|---|---
MIGraphXExecutionProvider | Fastest | Production, optimized models
ROCMExecutionProvider | Fast | General purpose
CPUExecutionProvider | Slowest | Fallback, debugging
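To see where a particular model lands in this table, a rough timing loop can compare providers directly (a sketch under assumptions: placeholder model path, the 1x3x640x640 input from earlier, and one untimed warm-up run):

import time
import numpy as np
import onnxruntime as ort

def benchmark(provider, model_path="model.onnx", runs=50):
    session = ort.InferenceSession(model_path, providers=[provider])
    name = session.get_inputs()[0].name
    x = np.random.randn(1, 3, 640, 640).astype(np.float32)
    session.run(None, {name: x})  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {name: x})
    return (time.perf_counter() - start) / runs

for p in ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']:
    try:
        print(f"{p}: {benchmark(p) * 1000:.2f} ms/run")
    except Exception as e:
        print(f"{p}: unavailable ({e})")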