
8/24/2025

CK Tile Tutorial Day 1 (AMD HIP programming) - Vector Add


Concepts:

  • Basic kernel structure: Args → Kernel → operator()
  • Grid/Block configuration
  • One thread per element processing

Key Code:

struct VectorAddKernel {
    __device__ void operator()(const VectorAddKernelArgs& args) const {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < args.n) args.c_ptr[idx] = args.a_ptr[idx] + args.b_ptr[idx];
    }
};


Full code:

// Step 1: Simplest CK Tile Kernel - Vector Addition
// This demonstrates the absolute basics of CK Tile

#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>
#include <algorithm>  // for std::min

// ============================================
// PART 1: Kernel Arguments (Host → Device)
// ============================================
struct VectorAddKernelArgs {
    const float* a_ptr;
    const float* b_ptr;
    float* c_ptr;
    int n;

    // Constructor from host arguments
    VectorAddKernelArgs(const float* a, const float* b, float* c, int size)
        : a_ptr(a), b_ptr(b), c_ptr(c), n(size) {}
};

// ============================================
// PART 2: The Kernel
// ============================================
struct VectorAddKernel {
    // Static method to get grid size (how many blocks)
    static dim3 GridSize(const VectorAddKernelArgs& args) {
        // 256 threads per block, divide work
        int blocks = (args.n + 255) / 256;
        return dim3(blocks, 1, 1);
    }

    // Static method to get block size (threads per block)
    static dim3 BlockSize() {
        return dim3(256, 1, 1);
    }

    // The actual kernel function - called by each thread
    __device__ void operator()(const VectorAddKernelArgs& args) const {
        // Calculate global thread index
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // Check bounds
        if (idx < args.n) {
            // Each thread does one element
            args.c_ptr[idx] = args.a_ptr[idx] + args.b_ptr[idx];
        }
    }
};

// ============================================
// PART 3: Host Launch Function
// ============================================
__global__ void vector_add_kernel(VectorAddKernelArgs args) {
    VectorAddKernel kernel;
    kernel(args);
}

void run_vector_add(int n) {
std::cout << "\n=== Step 1: Vector Addition ===\n";
std::cout << "Adding two vectors of size " << n << "\n";
// Allocate host memory
std::vector<float> h_a(n, 1.0f);
std::vector<float> h_b(n, 2.0f);
std::vector<float> h_c(n, 0.0f);
// Allocate device memory
float *d_a, *d_b, *d_c;
hipMalloc(&d_a, n * sizeof(float));
hipMalloc(&d_b, n * sizeof(float));
hipMalloc(&d_c, n * sizeof(float));
// Copy to device
hipMemcpy(d_a, h_a.data(), n * sizeof(float), hipMemcpyHostToDevice);
hipMemcpy(d_b, h_b.data(), n * sizeof(float), hipMemcpyHostToDevice);
// Create kernel arguments
VectorAddKernelArgs args(d_a, d_b, d_c, n);
// Get launch configuration
dim3 grid = VectorAddKernel::GridSize(args);
dim3 block = VectorAddKernel::BlockSize();
std::cout << "Launching with grid(" << grid.x << "), block(" << block.x << ")\n";
// Launch kernel
vector_add_kernel<<<grid, block>>>(args);
// Copy result back
hipMemcpy(h_c.data(), d_c, n * sizeof(float), hipMemcpyDeviceToHost);
// Verify
bool correct = true;
for (int i = 0; i < std::min(10, n); i++) {
if (h_c[i] != 3.0f) {
correct = false;
break;
}
}
std::cout << "Result: " << (correct ? "CORRECT" : "WRONG") << "\n";
std::cout << "First 5 elements: ";
for (int i = 0; i < std::min(5, n); i++) {
std::cout << h_c[i] << " ";
}
std::cout << "\n";
// Cleanup
hipFree(d_a);
hipFree(d_b);
hipFree(d_c);
}

// ============================================
// PART 4: Main
// ============================================
int main() {
std::cout << "MareArts CK Tile Tutorial - Step 1: Vector Addition\n";
std::cout << "==========================================\n";
// Run with different sizes
run_vector_add(1024);
run_vector_add(10000);
std::cout << "\nKey Concepts Demonstrated:\n";
std::cout << "1. Kernel structure: Args → Kernel → operator()\n";
std::cout << "2. Grid/Block configuration\n";
std::cout << "3. Each thread processes one element\n";
std::cout << "4. Bounds checking for safety\n";
return 0;
}
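
To build and run the example (assuming ROCm's hipcc is on your PATH; the source file name here is arbitrary):

hipcc -O2 -o step1_vector_add step1_vector_add.cpp
./step1_vector_add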



Result:

CK Tile Tutorial - Step 1: Vector Addition
==========================================

=== Step 1: Vector Addition ===
Adding two vectors of size 1024
Launching with grid(4), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3

=== Step 1: Vector Addition ===
Adding two vectors of size 10000
Launching with grid(40), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3

Key Concepts Demonstrated:
1. Kernel structure: Args → Kernel → operator()
2. Grid/Block configuration
3. Each thread processes one element
4. Bounds checking for safety



8/22/2025

ONNX Runtime with ROCm (AMD GPU) Setup Guide


Installation

Prerequisites

  • ROCm installed (6.0+ recommended)
  • Python 3.8-3.10

Install ONNX Runtime with ROCm Support

# 1. Remove existing ONNX Runtime (if any)
pip uninstall -y onnxruntime onnxruntime-gpu

# 2. Install from AMD ROCm repository
# For ROCm 6.4
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4/

# For ROCm 6.2
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2/

# For ROCm 6.0
pip install onnxruntime-rocm -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0/

Verify Installation

import onnxruntime as ort

# Check available providers
print("Available providers:", ort.get_available_providers())

# Should show: ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']

Simple Usage Example

import onnxruntime as ort
import numpy as np

# Load ONNX model with ROCm
providers = ['ROCMExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)

# Check which provider is being used
print(f"Using: {session.get_providers()[0]}")

# Prepare input (example: batch_size=1, 3 channels, 640x640 image)
input_data = np.random.randn(1, 3, 640, 640).astype(np.float32)

# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})

print(f"Output shape: {output[0].shape}")

Advanced: Using MIGraphX (AMD Optimized)

# MIGraphX is AMD's optimized graph execution provider
# It can be faster than ROCMExecutionProvider for some models

providers = [
    'MIGraphXExecutionProvider',  # Fastest on AMD
    'ROCMExecutionProvider',       # Standard ROCm
    'CPUExecutionProvider'         # Fallback
]

session = ort.InferenceSession("model.onnx", providers=providers)

Complete Example: Image Detection

import onnxruntime as ort
import numpy as np
import cv2

def load_model(model_path, use_gpu=True):
    """Load ONNX model with ROCm support"""
    if use_gpu:
        providers = ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']
    else:
        providers = ['CPUExecutionProvider']
    
    session = ort.InferenceSession(model_path, providers=providers)
    print(f"Model loaded with: {session.get_providers()[0]}")
    return session

def preprocess_image(image_path, size=640):
    """Preprocess image for inference"""
    image = cv2.imread(image_path)
    resized = cv2.resize(image, (size, size))
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    normalized = rgb.astype(np.float32) / 255.0
    transposed = normalized.transpose(2, 0, 1)  # HWC to CHW
    batched = np.expand_dims(transposed, axis=0)  # Add batch dimension
    return batched, image

def run_inference(session, input_data):
    """Run model inference"""
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})
    return outputs

# Usage
model = load_model("rtdetr_fp32.onnx", use_gpu=True)
input_data, original_image = preprocess_image("test.jpg")
outputs = run_inference(model, input_data)

print(f"Detection output shape: {outputs[0].shape}")

Troubleshooting

1. ROCMExecutionProvider not available

# Check ROCm installation
import subprocess
result = subprocess.run(['rocm-smi'], capture_output=True, text=True)
print(result.stdout)

2. Fallback to CPU

If ONNX Runtime falls back to CPU despite having ROCm:

  • Check ROCm version compatibility
  • Verify GPU is visible: rocm-smi
  • Set environment variable: export HIP_VISIBLE_DEVICES=0
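
A quick way to fail loudly instead of silently falling back to CPU (a minimal sketch; the model file name is a placeholder):

import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["ROCMExecutionProvider", "CPUExecutionProvider"])
assert "ROCMExecutionProvider" in session.get_providers(), "ROCm provider not active; running on CPU"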

3. Performance Tips

  • Use MIGraphXExecutionProvider for best performance on AMD GPUs
  • FP16 models can be faster but may have slight accuracy loss (a conversion sketch follows this list)
  • Batch processing improves throughput
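
One common route for the FP16 tip above is the onnxconverter-common package; a minimal sketch (the file names are placeholders):

import onnx
from onnxconverter_common import float16

# Load the FP32 model and convert float tensors to float16
model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")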

Environment Variables

# Select specific GPU
export HIP_VISIBLE_DEVICES=0

# Enable verbose logging
export ORT_ROCM_VERBOSE_LEVEL=1

# Set memory limit (in MB)
export ORT_ROCM_MEM_LIMIT=4096

Performance Comparison

Provider                     Relative Speed   Use Case
MIGraphXExecutionProvider    Fastest          Production, optimized models
ROCMExecutionProvider        Fast             General purpose
CPUExecutionProvider         Slowest          Fallback, debugging

Notes

  • ONNX Runtime ROCm version should match your ROCm installation
  • Not all ONNX operators are supported on ROCm - unsupported ops fall back to CPU
  • For best performance, export models with static shapes (see the export sketch below)
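
As an example of exporting with static shapes from PyTorch (a minimal sketch; the toy model and file name are placeholders, and omitting dynamic_axes is what keeps every shape static):

import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy = torch.randn(1, 3, 640, 640)  # fixed batch size and resolution
torch.onnx.export(
    model, dummy, "model_static.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=17,
)  # no dynamic_axes argument, so all shapes are baked in as static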

9/18/2024

AMD Distributed Training Overview


AMD's approach to distributed training leverages its high-performance CPUs and GPUs, along with software frameworks, to enable efficient scaling of machine learning workloads across multiple nodes. Key aspects include:

1. **Hardware Solutions:**
   - AMD EPYC CPUs: Provide high core counts and memory bandwidth.
   - AMD Instinct GPUs: Accelerators designed for HPC and AI workloads.
   - AMD Infinity Fabric: High-speed interconnect for multi-GPU and multi-node systems.

2. **Software Framework:**
   - ROCm (Radeon Open Compute): Open-source software stack for GPU computing.
   - HIP (Heterogeneous-Compute Interface for Portability): C++ runtime API for GPU programming.
   - AMD's optimized libraries for deep learning frameworks like TensorFlow and PyTorch.

3. **Distributed Training Techniques:**
   - Data Parallelism: Distributing batches of training data across multiple GPUs or nodes (a minimal PyTorch sketch follows this list).
   - Model Parallelism: Splitting large models across multiple devices.
   - Pipeline Parallelism: Dividing model layers across devices and processing in a pipelined fashion.

4. **Communication Optimization:**
   - RCCL (ROCm Communication Collectives Library): Optimized multi-GPU and multi-node collective communications.
   - Support for high-speed networking technologies like InfiniBand.

5. **Scalability:**
   - Support for scaling from single-node multi-GPU systems to large clusters.
   - Integration with job schedulers and resource managers for cluster environments.

6. **Ecosystem Integration:**
   - Compatibility with popular ML frameworks and distributed training tools.
   - Support for containers and orchestration platforms like Docker and Kubernetes.

7. **Performance Optimization:**
   - Mixed-precision training support.
   - Memory management techniques for large model training.
   - Automatic performance tuning tools.
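
A minimal data-parallel sketch with PyTorch DistributedDataParallel, assuming a ROCm build of PyTorch (the "nccl" backend name maps to RCCL on AMD GPUs, and HIP devices are driven through the torch.cuda API); the file name and toy model are placeholders:

```python
# train_ddp.py - launch with e.g.: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # backed by RCCL on ROCm
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                          # toy loop with random data
        x = torch.randn(32, 1024, device=local_rank)
        loss = ddp_model(x).square().mean()
        opt.zero_grad()
        loss.backward()                          # gradients are all-reduced via RCCL
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```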

AMD's distributed training solutions aim to provide high performance, scalability, and ease of use for researchers and organizations working on large-scale machine learning projects.

9/17/2024

HIP kernel for matrix multiplication that can leverage Matrix Cores

Here's an example of a custom HIP kernel for matrix multiplication that can leverage Matrix Cores:



```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <iostream>

// Define matrix dimensions
#define M 16
#define N 16
#define K 16

// HIP kernel for matrix multiplication
__global__ void matrixMulKernel(half* A, half* B, float* C) {
    // Shared memory for tile of A and B
    __shared__ half As[M][K];
    __shared__ half Bs[K][N];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = K * M * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + K - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = M;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = N * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = K * N;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Load the matrices from device memory to shared memory
        As[ty][tx] = A[a + K * ty + tx];
        Bs[ty][tx] = B[b + N * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices
        #pragma unroll
        for (int k = 0; k < K; ++k) {
            Csub += __half2float(As[ty][k]) * __half2float(Bs[k][tx]);
        }

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = N * M * by + M * bx;
    C[c + N * ty + tx] = Csub;
}

// Host function to launch the kernel
void launchMatrixMulKernel(half* A, half* B, float* C, int numBlocks) {
    dim3 threadsPerBlock(M, N);
    dim3 blocksPerGrid(numBlocks, numBlocks);
    hipLaunchKernelGGL(matrixMulKernel, blocksPerGrid, threadsPerBlock, 0, 0, A, B, C);
}

// Error checking macro
#define HIP_CHECK(call)                                                            \
    {                                                                              \
        hipError_t err = call;                                                     \
        if (err != hipSuccess) {                                                   \
            std::cerr << "HIP error: " << hipGetErrorString(err) << std::endl;     \
            exit(1);                                                               \
        }                                                                          \
    }

int main() {
    // Allocate memory
    half *A, *B;
    float *C;
    HIP_CHECK(hipMalloc(&A, M * K * sizeof(half)));
    HIP_CHECK(hipMalloc(&B, K * N * sizeof(half)));
    HIP_CHECK(hipMalloc(&C, M * N * sizeof(float)));

    // Initialize matrices (you would typically do this on the GPU)
    // ...

    // Launch kernel
    launchMatrixMulKernel(A, B, C, 1); // Assuming 1 block for simplicity

    // Clean up
    HIP_CHECK(hipFree(A));
    HIP_CHECK(hipFree(B));
    HIP_CHECK(hipFree(C));

    return 0;
}

```


Key points about this example:

1. It uses `half` precision for input matrices A and B, which can potentially benefit from Matrix Core acceleration.

2. The kernel is designed for 16x16 matrices, which is a common size for Matrix Core operations.

3. Shared memory is used to improve performance by reducing global memory accesses.

4. The main computation loop uses `__half2float` conversions. On GPUs with native FP16 support, these conversions might be optimized out.

5. The kernel uses a tiled approach, which is generally efficient for matrix multiplication.

6. Error checking is included for HIP calls.

Important considerations:

1. This kernel doesn't guarantee the use of Matrix Cores. The actual use of Matrix Cores depends on the GPU architecture and the HIP compiler's optimizations.

2. For larger matrices, you'd need to implement a more sophisticated tiling strategy.

3. Performance tuning is crucial. You might need to experiment with different block sizes and memory access patterns for optimal performance.

4. The HIP runtime and compiler will attempt to optimize this code for the target GPU, potentially leveraging Matrix Cores if available.

5. For production use, you should implement proper error handling and potentially use more sophisticated synchronization methods.
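
For point 5, a minimal sketch of surfacing launch and execution errors with the `HIP_CHECK` macro already defined in the example (replacing the bare launch in `main`):

```cpp
// Launch the kernel, then check both the launch itself and its execution
launchMatrixMulKernel(A, B, C, 1);
HIP_CHECK(hipGetLastError());       // reports invalid launch configurations
HIP_CHECK(hipDeviceSynchronize());  // reports errors raised while the kernel ran
```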

To fully leverage Matrix Cores, you might need to use specific intrinsics or rely on compiler optimizations. The exact method can vary depending on the GPU architecture and HIP version. Always profile your code to ensure you're getting the expected performance benefits.

8/22/2024

hipcc install on CUDA system, version 2

Please follow this process:

1. First, add the ROCm repository (if you haven't already):
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

2. Update your package list:
sudo apt update

3. Install only the HIP compiler and development tools:
sudo apt install hip-base hip-doc
This should install the basic HIP tools without the full runtime that caused issues before.

4. After installation, add the HIP binaries to your PATH. Add this line to your ~/.bashrc file:
export PATH=$PATH:/opt/rocm/bin

5. Then, apply the changes:
source ~/.bashrc

6. Verify the installation:
hipcc --version
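
7. Optionally, compile a tiny HIP program to confirm the toolchain works end to end (a minimal sketch; hello_hip.cpp is an arbitrary file name, and on a CUDA-only machine hipcc targets the NVIDIA backend, so CUDA must also be installed):

// hello_hip.cpp - hipcc sanity check
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello() {
    printf("Hello from HIP thread %d\n", (int)threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();
    hipDeviceSynchronize();
    return 0;
}

Build and run with: hipcc hello_hip.cpp -o hello_hip && ./hello_hip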