MareArts Computer Vision Study.
Computer Vision & Machine Learning Research Laboratory
8/25/2025
Ceiling technique
Add the maximum possible remainder (TILE-1) before dividing!
Simple Example with 10:
Let's use TILE=10 for easier understanding:
Without ceiling (wrong):
11 / 10 = 1 (need 2!) ❌
19 / 10 = 1 (need 2!) ❌
With ceiling technique:
(11 + 9) / 10 = 20 / 10 = 2 ✓
(19 + 9) / 10 = 28 / 10 = 2 ✓
(20 + 9) / 10 = 29 / 10 = 2 ✓ (still 2, correct!)
(21 + 9) / 10 = 30 / 10 = 3 ✓
Why Add (TILE-1)?
Think of it like this:
If remainder = 0 (perfectly divisible):
20 / 10 = 2.0
(20 + 9) / 10 = 29 / 10 = 2.9 → 2 (same!)
If remainder > 0 (needs extra block):
21 / 10 = 2.1 (remainder 1)
(21 + 9) / 10 = 30 / 10 = 3.0 → 3 (pushed to next!)
The Magic:
Adding (TILE-1):
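- Remainder 0: adds (TILE-1)/TILE, which is less than 1 (0.9 for TILE=10) → integer division truncates back to the same integer
- Remainder ≥1: adds enough to reach or pass the next multiple of TILE → result goes up by one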
Visual Pattern:
Value | +9 | /10 | Result
------|----|-----|-------
10    | 19 | 1.9 | 1 ✓
11    | 20 | 2.0 | 2 ✓ (jumps up!)
19    | 28 | 2.8 | 2 ✓
20    | 29 | 2.9 | 2 ✓
21    | 30 | 3.0 | 3 ✓ (jumps up!)
Formula Summary:
// Ceiling division formula (integer division, A >= 0, B > 0):
ceil(A/B) = (A + B - 1) / B
// For our GEMM tiles:
num_blocks = (matrix_size + tile_size - 1) / tile_size
It's simple: "Add almost one tile, then divide" - this guarantees rounding up!
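The formula can be checked with a few lines of Python. This is a minimal sketch, not code from the tutorial: the helper name `ceil_div` is made up for illustration, and it assumes a non-negative size and a positive tile. Note that Python needs `//` for integer division, whereas `/` on C/C++ integers already truncates.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b using only integer arithmetic (assumes a >= 0, b > 0)."""
    if a < 0 or b <= 0:
        raise ValueError("ceil_div expects a >= 0 and b > 0")
    return (a + b - 1) // b


TILE = 10
for value in (10, 11, 19, 20, 21):
    # Same values as the table above: 10 -> 1, 11 -> 2, 19 -> 2, 20 -> 2, 21 -> 3
    print(value, "->", ceil_div(value, TILE), "tiles")
```

The guard on negative inputs is there because the add-then-divide trick only matches the mathematical ceiling for non-negative sizes, which is always the case for matrix dimensions.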
CK Tile Tutorial Day 2 (AMD HIP programming) - Simple GEMM.
Concepts Added:
- 2D grid/block configuration
- Matrix multiplication basics
- Each thread computes one output element
Key Pattern:
// Each thread computes C[row][col]
float sum = 0.0f;
for (int k = 0; k < K; k++) {
    sum += A[row][k] * B[k][col];
}
C[row][col] = sum;
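To see how the 2D grid maps onto output elements, the same pattern can be simulated on the CPU. This is only a sketch in plain Python, not the CK Tile kernel: the function name `gemm_simulated`, the `tile` parameter, and the sample matrices are all assumptions made for illustration. Each iteration of the two innermost block/thread loops plays the role of one GPU thread.

```python
def gemm_simulated(A, B, M, N, K, tile=2):
    """Compute C = A @ B by looping over a 2D grid of tile x tile blocks."""
    C = [[0.0] * N for _ in range(M)]
    grid_y = (M + tile - 1) // tile  # ceiling division: blocks along rows
    grid_x = (N + tile - 1) // tile  # ceiling division: blocks along columns
    for block_y in range(grid_y):
        for block_x in range(grid_x):
            for thread_y in range(tile):
                for thread_x in range(tile):
                    row = block_y * tile + thread_y
                    col = block_x * tile + thread_x
                    if row >= M or col >= N:
                        continue  # this "thread" falls outside the matrix
                    total = 0.0
                    for k in range(K):
                        total += A[row][k] * B[k][col]
                    C[row][col] = total
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # 3 x 3
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # 3 x 2
print(gemm_simulated(A, B, M=3, N=2, K=3))
```

With M=3 and tile=2 the grid has two blocks along the rows, so the second block contains one row of out-of-range threads; the bounds check is what keeps them from writing past the matrix.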
8/24/2025
CK Tile Tutorial Day 1 (AMD HIP programming) - Vector add.
Concepts:
- Basic kernel structure: Args → Kernel → operator()
- Grid/Block configuration
- One thread per element processing
Key Code:
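struct VectorAddKernel {
    // args (pseudocode) carries the device pointers a, b, c and the element count n
    __device__ void operator()(args) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {  // bounds check: the last block may run past the end
            c[idx] = a[idx] + b[idx];
        }
    }
};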
Result
CK Tile Tutorial - Step 1: Vector Addition
==========================================
=== Step 1: Vector Addition ===
Adding two vectors of size 1024
Launching with grid(4), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3
=== Step 1: Vector Addition ===
Adding two vectors of size 10000
Launching with grid(40), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3
Key Concepts Demonstrated:
1. Kernel structure: Args → Kernel → operator()
2. Grid/Block configuration
3. Each thread processes one element
4. Bounds checking for safety
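The launch numbers in the output above (grid(40) for 10000 elements with block(256)) and the need for a bounds check can be reproduced with a small CPU simulation. This is a sketch under assumptions, not the tutorial's code: `vector_add_simulated` is a hypothetical helper, and the loops stand in for GPU blocks and threads.

```python
def vector_add_simulated(a, b, block_size=256):
    """Add two equal-length lists one 'thread' at a time, counting idle threads."""
    n = len(a)
    grid_size = (n + block_size - 1) // block_size  # ceiling division
    c = [0.0] * n
    skipped = 0
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            idx = block_idx * block_size + thread_idx
            if idx < n:  # bounds check
                c[idx] = a[idx] + b[idx]
            else:
                skipped += 1  # thread launched but has no element to process
    return c, grid_size, skipped


c, grid_size, skipped = vector_add_simulated([1.0] * 10000, [2.0] * 10000)
print(grid_size, skipped, c[:5])
```

For 10000 elements the grid launches 40 × 256 = 10240 threads, so 240 of them fall past the end of the vectors; without the `idx < n` check they would write out of bounds.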
8/22/2025
ONNX Runtime with ROCm (AMD GPU) Setup Guide
Installation
Prerequisites
- ROCm installed (6.0+ recommended)
- Python 3.8-3.10
Install ONNX Runtime with ROCm Support
# 1. Remove existing ONNX Runtime (if any)
pip uninstall -y onnxruntime onnxruntime-gpu
# 2. Install from AMD ROCm repository
# For ROCm 6.4
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4/
# For ROCm 6.2
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2/
# For ROCm 6.0
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0/
Verify Installation
import onnxruntime as ort
# Check available providers
print("Available providers:", ort.get_available_providers())
# Should show: ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']
Simple Usage Example
import onnxruntime as ort
import numpy as np
# Load ONNX model with ROCm
providers = ['ROCMExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)
# Check which provider is being used
print(f"Using: {session.get_providers()[0]}")
# Prepare input (example: batch_size=1, 3 channels, 640x640 image)
input_data = np.random.randn(1, 3, 640, 640).astype(np.float32)
# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
print(f"Output shape: {output[0].shape}")
Advanced: Using MIGraphX (AMD Optimized)
# MIGraphX is AMD's optimized graph execution provider
# It can be faster than ROCMExecutionProvider for some models
providers = [
    'MIGraphXExecutionProvider',  # Often fastest on AMD
    'ROCMExecutionProvider',      # Standard ROCm
    'CPUExecutionProvider'        # Fallback
]
session = ort.InferenceSession("model.onnx", providers=providers)
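If a preferred provider is missing from the installed build, it can help to filter the list before creating the session. The helper below is a hypothetical sketch, not part of ONNX Runtime: `pick_providers` is an invented name, and the `available` list is hard-coded here so the example runs without a GPU. In a real script it would come from `ort.get_available_providers()`.

```python
def pick_providers(preferred, available):
    """Keep preferred providers that are available, in order; always end with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


preferred = [
    "MIGraphXExecutionProvider",
    "ROCMExecutionProvider",
    "CPUExecutionProvider",
]
# Hypothetical machine whose build has ROCm but not MIGraphX
available = ["ROCMExecutionProvider", "CPUExecutionProvider"]
print(pick_providers(preferred, available))
```

The returned list can be passed as the `providers` argument of `ort.InferenceSession`, keeping the priority order while skipping anything the build does not offer.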
Complete Example: Image Detection
import onnxruntime as ort
import numpy as np
import cv2
def load_model(model_path, use_gpu=True):
    """Load ONNX model with ROCm support"""
    if use_gpu:
        providers = ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']
    else:
        providers = ['CPUExecutionProvider']
    session = ort.InferenceSession(model_path, providers=providers)
    print(f"Model loaded with: {session.get_providers()[0]}")
    return session

def preprocess_image(image_path, size=640):
    """Preprocess image for inference"""
    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(f"Could not read image: {image_path}")
    resized = cv2.resize(image, (size, size))
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    normalized = rgb.astype(np.float32) / 255.0
    transposed = normalized.transpose(2, 0, 1)  # HWC to CHW
    batched = np.expand_dims(transposed, axis=0)  # Add batch dimension
    return batched, image

def run_inference(session, input_data):
    """Run model inference"""
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})
    return outputs
# Usage
model = load_model("rtdetr_fp32.onnx", use_gpu=True)
input_data, original_image = preprocess_image("test.jpg")
outputs = run_inference(model, input_data)
print(f"Detection output shape: {outputs[0].shape}")
Troubleshooting
1. ROCMExecutionProvider not available
# Check ROCm installation
import subprocess
result = subprocess.run(['rocm-smi'], capture_output=True, text=True)
print(result.stdout)
2. Fallback to CPU
If ONNX Runtime falls back to CPU despite having ROCm:
- Check ROCm version compatibility
- Verify GPU is visible:
rocm-smi
- Set environment variable:
export HIP_VISIBLE_DEVICES=0
3. Performance Tips
- Use MIGraphXExecutionProvider for best performance on AMD GPUs
- FP16 models can be faster but may have slight accuracy loss
- Batch processing improves throughput
Environment Variables
# Select specific GPU
export HIP_VISIBLE_DEVICES=0
# Enable verbose logging
export ORT_ROCM_VERBOSE_LEVEL=1
# Set memory limit (in MB)
export ORT_ROCM_MEM_LIMIT=4096
Performance Comparison
Provider | Relative Speed | Use Case |
---|---|---|
MIGraphXExecutionProvider | Fastest | Production, optimized models |
ROCMExecutionProvider | Fast | General purpose |
CPUExecutionProvider | Slowest | Fallback, debugging |
Notes
- ONNX Runtime ROCm version should match your ROCm installation
- Not all ONNX operators are supported on ROCm - unsupported ops fall back to CPU
- For best performance, export models with static shapes
7/25/2025
Understanding C++ Templates, (class, member function and this)
C++ tuple vs vector, Tutorial