8/25/2025
CK Tile Tutorial Day 2 (AMD HIP programming) - Simple GEMM.
Concepts Added:
- 2D grid/block configuration
- Matrix multiplication basics
- Each thread computes one output element
Key Pattern:
// Each thread computes C[row][col]
for (int k = 0; k < K; k++) {
    sum += A[row][k] * B[k][col];
}
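A fuller sketch of where this pattern sits, assuming a plain HIP/CUDA kernel over row-major float matrices (the kernel name, parameters, and launch below are illustrative, not the CK Tile tutorial code):
// Naive GEMM sketch: one thread per output element C[row][col] (illustrative only)
__global__ void simple_gemm(const float* A, const float* B, float* C,
                            int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // 2D grid/block
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {                          // bounds check
        float sum = 0.0f;
        for (int k = 0; k < K; k++)
            sum += A[row * K + k] * B[k * N + col];    // row-major indexing
        C[row * N + col] = sum;
    }
}
// Possible launch: dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16);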
8/24/2025
CK Tile Tutorial Day 1 (AMD HIP programming) - Vector add.
Concepts:
- Basic kernel structure: Args → Kernel → operator()
- Grid/Block configuration
- One thread per element processing
Key Code:
struct VectorAddKernel {
    // Args fields filled in here for illustration; the tutorial's actual Args may differ
    struct Args { const float* a; const float* b; float* c; int n; };
    __device__ void operator()(Args args) const {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < args.n)                          // bounds check for safety
            args.c[idx] = args.a[idx] + args.b[idx];
    }
};
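How this might be launched from the host (the run_vector_add wrapper below is my own illustration, not CK Tile's launch API); grid(4), block(256) in the log below is just ceil(1024 / 256) blocks of 256 threads:
// Illustrative host launch for the kernel functor above
__global__ void run_vector_add(VectorAddKernel::Args args)
{
    VectorAddKernel{}(args);                       // invoke the functor on the device
}

void vector_add(const float* d_a, const float* d_b, float* d_c, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;           // e.g. n = 1024 -> grid(4), block(256)
    run_vector_add<<<grid, block>>>(VectorAddKernel::Args{d_a, d_b, d_c, n});
}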
Result
CK Tile Tutorial - Step 1: Vector Addition
==========================================
=== Step 1: Vector Addition ===
Adding two vectors of size 1024
Launching with grid(4), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3
=== Step 1: Vector Addition ===
Adding two vectors of size 10000
Launching with grid(40), block(256)
Result: CORRECT
First 5 elements: 3 3 3 3 3
Key Concepts Demonstrated:
1. Kernel structure: Args → Kernel → operator()
2. Grid/Block configuration
3. Each thread processes one element
4. Bounds checking for safety
3/07/2025
Check whether my PyTorch install can use the GPU
checkgpu.py
🙏
Thank you!
9/17/2024
What is IREE-Turbine?
IREE-Turbine is a toolset that combines PyTorch, Torch-MLIR, IREE, and additional tools to provide a comprehensive solution for compiling, optimizing, and executing PyTorch models using IREE's infrastructure. It offers the following key features:
1. AOT Export: This allows for Ahead-Of-Time compilation of PyTorch modules (nn.Modules) into deployment-ready artifacts. These compiled artifacts can then take full advantage of IREE's runtime features.
2. Eager Execution: It provides a torch.compile backend and a Turbine Tensor/Device for interactive PyTorch sessions. This enables users to work with PyTorch in a familiar environment while leveraging IREE's optimization capabilities.
3. Custom Ops: IREE-Turbine offers integration for defining custom PyTorch operations and implementing them using either IREE's backend IR or the Pythonic kernel language. This allows for extending PyTorch's functionality while maintaining compatibility with IREE's optimization pipeline.
In essence, IREE-Turbine acts as a bridge between PyTorch and IREE, allowing PyTorch users to benefit from IREE's advanced compilation and runtime features while maintaining a familiar PyTorch-based workflow. It aims to provide a seamless experience for compiling PyTorch models to run efficiently on various hardware targets supported by IREE.
HIP kernel for matrix multiplication that can leverage Matrix Cores
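The original kernel source isn't preserved in these notes; below is a minimal sketch consistent with the key points that follow (16x16 matrices, half-precision A and B, shared-memory staging, __half2float in the inner loop). It is an illustrative reconstruction, not the original code, and by itself it does not guarantee Matrix Core usage.
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

#define TILE 16

// C (float, 16x16) = A (half, 16x16) * B (half, 16x16); one 16x16 thread block
__global__ void matmul_fp16(const __half* A, const __half* B, float* C)
{
    __shared__ __half sA[TILE][TILE];
    __shared__ __half sB[TILE][TILE];

    int row = threadIdx.y;
    int col = threadIdx.x;

    // Stage the operands in shared memory to reduce global memory accesses
    sA[row][col] = A[row * TILE + col];
    sB[row][col] = B[row * TILE + col];
    __syncthreads();

    float sum = 0.0f;
    for (int k = 0; k < TILE; ++k)
        sum += __half2float(sA[row][k]) * __half2float(sB[k][col]);

    C[row * TILE + col] = sum;
}
// Possible launch: hipLaunchKernelGGL(matmul_fp16, dim3(1), dim3(TILE, TILE), 0, 0, dA, dB, dC);
// followed by an error check such as hipGetLastError() (full error handling omitted here)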
Key points about this example:
1. It uses `half` precision for input matrices A and B, which can potentially benefit from Matrix Core acceleration.
2. The kernel is designed for 16x16 matrices, which is a common size for Matrix Core operations.
3. Shared memory is used to improve performance by reducing global memory accesses.
4. The main computation loop uses `__half2float` conversions. On GPUs with native FP16 support, these conversions might be optimized out.
5. The kernel uses a tiled approach, which is generally efficient for matrix multiplication.
6. Error checking is included for HIP calls.
Important considerations:
1. This kernel doesn't guarantee the use of Matrix Cores. The actual use of Matrix Cores depends on the GPU architecture and the HIP compiler's optimizations.
2. For larger matrices, you'd need to implement a more sophisticated tiling strategy.
3. Performance tuning is crucial. You might need to experiment with different block sizes and memory access patterns for optimal performance.
4. The HIP runtime and compiler will attempt to optimize this code for the target GPU, potentially leveraging Matrix Cores if available.
5. For production use, you should implement proper error handling and potentially use more sophisticated synchronization methods.
To fully leverage Matrix Cores, you might need to use specific intrinsics or rely on compiler optimizations. The exact method can vary depending on the GPU architecture and HIP version. Always profile your code to ensure you're getting the expected performance benefits.
Creating a custom CUDA kernel that directly utilizes tensor cores
Creating a custom CUDA kernel that directly utilizes tensor cores is an advanced topic, as tensor cores are typically accessed through higher-level libraries like cuBLAS or cuDNN. However, NVIDIA does provide a way to use tensor cores in custom kernels through the Warp Matrix Multiply-Accumulate (WMMA) API, exposed in CUDA C++ via the nvcuda::wmma namespace. Here's an overview of how to create a kernel that works on tensor cores:
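The original example code isn't preserved in these notes; a minimal sketch consistent with the key points below might look like this. The launch_wmma_kernel wrapper is named to match the usage notes further down, but its body here is an assumption.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C (float) = A (half) * B (half)
__global__ void wmma_16x16_kernel(const half* A, const half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);      // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Host wrapper (hypothetical): one warp is enough for a single 16x16 tile
void launch_wmma_kernel(const half* dA, const half* dB, float* dC)
{
    wmma_16x16_kernel<<<1, 32>>>(dA, dB, dC);
}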
Key points about this example:
1. We're using 16x16 matrices as this is a common size for tensor core operations.
2. The kernel uses `nvcuda::wmma::fragment` to define matrix fragments.
3. `load_matrix_sync`, `mma_sync`, and `store_matrix_sync` are used to load data, perform matrix multiplication, and store results using tensor cores.
4. The kernel operates on half-precision input (`half`) and produces single-precision output (`float`).
To use this kernel:
1. Compile with a CUDA compiler that supports tensor cores (CUDA 9.0 or later).
2. Use appropriate GPU architecture flags (e.g., `-arch=sm_70` for Volta, `-arch=sm_75` for Turing).
3. Allocate memory and copy data to the GPU before calling `launch_wmma_kernel`.
Important considerations:
1. Error checking is omitted for brevity but should be included in production code.
2. This is a basic example. Real-world usage often involves tiling and more complex memory access patterns for larger matrices.
3. Performance tuning is crucial. The exact dimensions and data types should be chosen based on your specific use case and target GPU architecture.
4. Not all operations can be efficiently mapped to tensor cores. They're most beneficial for large matrix multiplications common in deep learning workloads.
Remember, while this approach gives you direct control over tensor core usage, in many cases, using higher-level libraries like cuBLAS or cuDNN is more practical and can automatically leverage tensor cores when appropriate.
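For comparison, a hedged sketch of the cuBLAS route: cublasGemmEx with FP16 inputs and FP32 accumulation lets the library select tensor-core kernels when the hardware supports them (the sizes and column-major layout here are illustrative, and status checking is omitted).
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (m x n, float) = A (m x k, half) * B (k x n, half), column-major
void gemm_fp16_cublas(cublasHandle_t handle,
                      const half* dA, const half* dB, float* dC,
                      int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,    // lda = m
                 dB, CUDA_R_16F, k,    // ldb = k
                 &beta,
                 dC, CUDA_R_32F, m,    // ldc = m
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}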
2/09/2023
AWS EC2 GPU instance comparison
Architecture | NVIDIA GPU | Instance type | Instance name | Number of GPUs | GPU Memory (per GPU) | GPU Interconnect (NVLink / PCIe) | Thermal Design Power (TDP) from nvidia-smi | Tensor Cores (mixed-precision) | Precision Support | CPU Type | Nitro based
Ampere | A100 | P4 | p4d.24xlarge | 8 | 40 GB | NVLink gen 3 (600 GB/s) | 400W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | Intel Xeon Scalable (Cascade Lake) | Yes
Ampere | A10G | G5 | g5.xlarge | 1 | 24 GB | NA (single GPU) | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.2xlarge | 1 | 24 GB | NA (single GPU) | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.4xlarge | 1 | 24 GB | NA (single GPU) | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.8xlarge | 1 | 24 GB | NA (single GPU) | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.16xlarge | 1 | 24 GB | NA (single GPU) | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.12xlarge | 4 | 24 GB | PCIe | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.24xlarge | 4 | 24 GB | PCIe | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Ampere | A10G | G5 | g5.48xlarge | 8 | 24 GB | PCIe | 300W | Tensor Cores (Gen 3) | FP64, FP32, FP16, INT8, BF16, TF32 | AMD EPYC | Yes
Turing | T4G | G5 | g5g.xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | AWS Graviton2 | Yes
Turing | T4G | G5 | g5g.2xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | AWS Graviton2 | Yes
Turing | T4G | G5 | g5g.4xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | AWS Graviton2 | Yes
Turing | T4G | G5 | g5g.8xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | AWS Graviton2 | Yes
Turing | T4G | G5 | g5g.16xlarge | 2 | 16 GB | PCIe | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | AWS Graviton2 | Yes
Turing | T4G | G5 | g5g.metal | 2 | 16 GB | PCIe | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | AWS Graviton2 | Yes
Turing | T4 | G4 | g4dn.xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Turing | T4 | G4 | g4dn.2xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Turing | T4 | G4 | g4dn.4xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Turing | T4 | G4 | g4dn.8xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Turing | T4 | G4 | g4dn.16xlarge | 1 | 16 GB | NA (single GPU) | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Turing | T4 | G4 | g4dn.12xlarge | 4 | 16 GB | PCIe | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Turing | T4 | G4 | g4dn.metal | 8 | 16 GB | PCIe | 70W | Tensor Cores (Gen 2) | FP32, FP16, INT8 | Intel Xeon Scalable (Cascade Lake) | Yes
Volta | V100 | P3 | p3.2xlarge | 1 | 16 GB | NA (single GPU) | 300W | Tensor Cores (Gen 1) | FP64, FP32, FP16 | Intel Xeon (Broadwell) | No
Volta | V100 | P3 | p3.8xlarge | 4 | 16 GB | NVLink gen 2 (300 GB/s) | 300W | Tensor Cores (Gen 1) | FP64, FP32, FP16 | Intel Xeon (Broadwell) | No
Volta | V100 | P3 | p3.16xlarge | 8 | 16 GB | NVLink gen 2 (300 GB/s) | 300W | Tensor Cores (Gen 1) | FP64, FP32, FP16 | Intel Xeon (Broadwell) | No
Volta | V100* | P3 | p3dn.24xlarge | 8 | 32 GB | NVLink gen 2 (300 GB/s) | 300W | Tensor Cores (Gen 1) | FP64, FP32, FP16 | Intel Xeon (Skylake) | Yes
Kepler | K80 | P2 | p2.xlarge | 1 | 12 GB | NA (single GPU) | 149W | No | FP64, FP32 | Intel Xeon (Broadwell) | No
Kepler | K80 | P2 | p2.8xlarge | 8 | 12 GB | PCIe | 149W | No | FP64, FP32 | Intel Xeon (Broadwell) | No
Kepler | K80 | P2 | p2.16xlarge | 16 | 12 GB | PCIe | 149W | No | FP64, FP32 | Intel Xeon (Broadwell) | No
Maxwell | M60 | G3 | g3s.xlarge | 1 | 8 GB | PCIe | 150W | No | FP32 | Intel Xeon (Broadwell) | No
Maxwell | M60 | G3 | g3.4xlarge | 1 | 8 GB | PCIe | 150W | No | FP32 | Intel Xeon (Broadwell) | No
Maxwell | M60 | G3 | g3.8xlarge | 2 | 8 GB | PCIe | 150W | No | FP32 | Intel Xeon (Broadwell) | No
Maxwell | M60 | G3 | g3.16xlarge | 4 | 8 GB | PCIe | 150W | No | FP32 | Intel Xeon (Broadwell) | No