9/18/2024

AMD Distributed Training Overview

# AMD Distributed Training Overview

AMD's approach to distributed training leverages its high-performance CPUs and GPUs, along with software frameworks, to enable efficient scaling of machine learning workloads across multiple nodes. Key aspects include:

1. **Hardware Solutions:**
   - AMD EPYC CPUs: Provide high core counts and memory bandwidth.
   - AMD Instinct GPUs: Accelerators designed for HPC and AI workloads.
   - AMD Infinity Fabric: High-speed interconnect for multi-GPU and multi-node systems.

2. **Software Framework:**
   - ROCm (Radeon Open Compute): Open-source software stack for GPU computing.
   - HIP (Heterogeneous-Compute Interface for Portability): C++ runtime API for GPU programming.
   - AMD's optimized libraries for deep learning frameworks like TensorFlow and PyTorch.

3. **Distributed Training Techniques:**
   - Data Parallelism: Distributing batches of training data across multiple GPUs or nodes.
   - Model Parallelism: Splitting large models across multiple devices.
   - Pipeline Parallelism: Dividing model layers across devices and processing in a pipelined fashion.

4. **Communication Optimization:**
   - RCCL (ROCm Communication Collectives Library): Optimized multi-GPU and multi-node collective communications.
   - Support for high-speed networking technologies like InfiniBand.

5. **Scalability:**
   - Support for scaling from single-node multi-GPU systems to large clusters.
   - Integration with job schedulers and resource managers for cluster environments.

6. **Ecosystem Integration:**
   - Compatibility with popular ML frameworks and distributed training tools.
   - Support for containers and orchestration platforms like Docker and Kubernetes.

7. **Performance Optimization:**
   - Mixed-precision training support.
   - Memory management techniques for large model training.
   - Automatic performance tuning tools.

AMD's distributed training solutions aim to provide high performance, scalability, and ease of use for researchers and organizations working on large-scale machine learning projects.

9/17/2024

What is IREE turbine

 IREE-Turbine is a package or toolset that combines PyTorch, Torch-MLIR, IREE, and additional tools to provide a comprehensive solution for compiling, optimizing, and executing PyTorch models using IREE's infrastructure. Based on the information in the image, IREE-Turbine offers the following key features:


1. AOT Export: This allows for Ahead-Of-Time compilation of PyTorch modules (nn.Modules) into deployment-ready artifacts. These compiled artifacts can then take full advantage of IREE's runtime features.


2. Eager Execution: It provides a torch.compile backend and a Turbine Tensor/Device for interactive PyTorch sessions. This enables users to work with PyTorch in a familiar environment while leveraging IREE's optimization capabilities.


3. Custom Ops: IREE-Turbine offers integration for defining custom PyTorch operations and implementing them using either IREE's backend IR or the Pythonic kernel language. This allows for extending PyTorch's functionality while maintaining compatibility with IREE's optimization pipeline.


In essence, IREE-Turbine acts as a bridge between PyTorch and IREE, allowing PyTorch users to benefit from IREE's advanced compilation and runtime features while maintaining a familiar PyTorch-based workflow. It aims to provide a seamless experience for compiling PyTorch models to run efficiently on various hardware targets supported by IREE.


HIP kernel for matrix multiplication that can leverage Matrix Cores

Here's an example of a custom HIP kernel for matrix multiplication that can leverage Matrix Cores:



```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <iostream>

// Define matrix dimensions
#define M 16
#define N 16
#define K 16

// HIP kernel for matrix multiplication
__global__ void matrixMulKernel(half* A, half* B, float* C) {
// Shared memory for tile of A and B
__shared__ half As[M][K];
__shared__ half Bs[K][N];

int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;

// Index of the first sub-matrix of A processed by the block
int aBegin = K * M * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + K - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = M;

// Index of the first sub-matrix of B processed by the block
int bBegin = N * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = K * N;

// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;

// Loop over all the sub-matrices of A and B
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
// Load the matrices from device memory to shared memory
As[ty][tx] = A[a + K * ty + tx];
Bs[ty][tx] = B[b + N * ty + tx];

// Synchronize to make sure the matrices are loaded
__syncthreads();

// Multiply the two matrices
#pragma unroll
for (int k = 0; k < K; ++k) {
Csub += __half2float(As[ty][k]) * __half2float(Bs[k][tx]);
}

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}

// Write the block sub-matrix to device memory
// each thread writes one element
int c = N * M * by + M * bx;
C[c + N * ty + tx] = Csub;
}

// Host function to launch the kernel
void launchMatrixMulKernel(half* A, half* B, float* C, int numBlocks) {
dim3 threadsPerBlock(M, N);
dim3 blocksPerGrid(numBlocks, numBlocks);
hipLaunchKernelGGL(matrixMulKernel, blocksPerGrid, threadsPerBlock, 0, 0, A, B, C);
}

// Error checking macro
#define HIP_CHECK(call) { hipError_t err = call; if (err != hipSuccess) { std::cerr << "HIP error: " << hipGetErrorString(err) << std::endl; exit(1); } }

int main() {
// Allocate memory
half *A, *B;
float *C;
HIP_CHECK(hipMalloc(&A, M * K * sizeof(half)));
HIP_CHECK(hipMalloc(&B, K * N * sizeof(half)));
HIP_CHECK(hipMalloc(&C, M * N * sizeof(float)));

// Initialize matrices (you would typically do this on the GPU)
// ...

// Launch kernel
launchMatrixMulKernel(A, B, C, 1); // Assuming 1 block for simplicity

// Clean up
HIP_CHECK(hipFree(A));
HIP_CHECK(hipFree(B));
HIP_CHECK(hipFree(C));

return 0;
}

```


Key points about this example:

1. It uses `half` precision for input matrices A and B, which can potentially benefit from Matrix Core acceleration.

2. The kernel is designed for 16x16 matrices, which is a common size for Matrix Core operations.

3. Shared memory is used to improve performance by reducing global memory accesses.

4. The main computation loop uses `__half2float` conversions. On GPUs with native FP16 support, these conversions might be optimized out.

5. The kernel uses a tiled approach, which is generally efficient for matrix multiplication.

6. Error checking is included for HIP calls.

Important considerations:

1. This kernel doesn't guarantee the use of Matrix Cores. The actual use of Matrix Cores depends on the GPU architecture and the HIP compiler's optimizations.

2. For larger matrices, you'd need to implement a more sophisticated tiling strategy.

3. Performance tuning is crucial. You might need to experiment with different block sizes and memory access patterns for optimal performance.

4. The HIP runtime and compiler will attempt to optimize this code for the target GPU, potentially leveraging Matrix Cores if available.

5. For production use, you should implement proper error handling and potentially use more sophisticated synchronization methods.

To fully leverage Matrix Cores, you might need to use specific intrinsics or rely on compiler optimizations. The exact method can vary depending on the GPU architecture and HIP version. Always profile your code to ensure you're getting the expected performance benefits.

Creating a custom CUDA kernel that directly utilizes tensor cores



Creating a custom CUDA kernel that directly utilizes tensor cores is an advanced topic, as tensor cores are typically accessed through higher-level libraries like cuBLAS or cuDNN. However, NVIDIA does provide a way to use tensor cores in custom kernels through their CUDA Core library, specifically with Warp Matrix Multiply-Accumulate (WMMA) API. Here's an overview of how to create a kernel that works on tensor cores:

1. Use CUDA Core WMMA API:
The WMMA API allows you to program tensor cores directly in your CUDA kernels.

2. Include necessary headers:
```cpp
#include <mma.h>
#include <cuda_fp16.h>
```

3. Use appropriate data types:
Tensor cores work with specific data types like half precision floating point (`__half`).

4. Define matrix fragments:
Use `nvcuda::wmma::fragment` to define matrix fragments that will be processed by tensor cores.

5. Load, compute, and store operations:
Use WMMA load, multiply-accumulate, and store operations.

Here's an example kernel that uses tensor cores via the WMMA API:



```cuda
#include <mma.h>
#include <cuda_fp16.h>

// Define matrix dimensions
const int M = 16;
const int N = 16;
const int K = 16;

__global__ void wmma_example(half *a, half *b, float *c) {
// Declare the fragments
nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, M, N, K, half, nvcuda::wmma::col_major> a_frag;
nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, M, N, K, half, nvcuda::wmma::col_major> b_frag;
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, M, N, K, float> c_frag;

// Initialize the output to zero
nvcuda::wmma::fill_fragment(c_frag, 0.0f);

// Load the inputs
nvcuda::wmma::load_matrix_sync(a_frag, a, K);
nvcuda::wmma::load_matrix_sync(b_frag, b, K);

// Perform the matrix multiplication
nvcuda::wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

// Store the output
nvcuda::wmma::store_matrix_sync(c, c_frag, N, nvcuda::wmma::mem_row_major);
}

// Host function to launch the kernel
void launch_wmma_kernel(half *a, half *b, float *c) {
dim3 gridDim(1);
dim3 blockDim(32); // One warp
wmma_example<<<gridDim, blockDim>>>(a, b, c);
}

```

Key points about this example:

1. We're using 16x16 matrices as this is a common size for tensor core operations.
2. The kernel uses `nvcuda::wmma::fragment` to define matrix fragments.
3. `load_matrix_sync`, `mma_sync`, and `store_matrix_sync` are used to load data, perform matrix multiplication, and store results using tensor cores.
4. The kernel operates on half-precision input (`half`) and produces single-precision output (`float`).

To use this kernel:

1. Compile with a CUDA compiler that supports tensor cores (CUDA 9.0 or later).
2. Use appropriate GPU architecture flags (e.g., `-arch=sm_70` for Volta, `-arch=sm_75` for Turing).
3. Allocate memory and copy data to the GPU before calling `launch_wmma_kernel`.

Important considerations:

1. Error checking is omitted for brevity but should be included in production code.
2. This is a basic example. Real-world usage often involves tiling and more complex memory access patterns for larger matrices.
3. Performance tuning is crucial. The exact dimensions and data types should be chosen based on your specific use case and target GPU architecture.
4. Not all operations can be efficiently mapped to tensor cores. They're most beneficial for large matrix multiplications common in deep learning workloads.

Remember, while this approach gives you direct control over tensor core usage, in many cases, using higher-level libraries like cuBLAS or cuDNN is more practical and can automatically leverage tensor cores when appropriate.

9/16/2024

Pytorch model to mlir -> llvm -> executable file on Mac book m1


# Step 1: Define and train a simple PyTorch CNN model



import torch

import torch.nn as nn

import torch.optim as optim

import torchvision

import torchvision.transforms as transforms



# Define a simple CNN

class SimpleCNN(nn.Module):

def __init__(self):

super(SimpleCNN, self).__init__()

self.conv1 = nn.Conv2d(1, 32, 3, 1)

self.conv2 = nn.Conv2d(32, 64, 3, 1)

self.dropout1 = nn.Dropout2d(0.25)

self.dropout2 = nn.Dropout2d(0.5)

self.fc1 = nn.Linear(9216, 128)

self.fc2 = nn.Linear(128, 10)



def forward(self, x):

x = self.conv1(x)

x = nn.functional.relu(x)

x = self.conv2(x)

x = nn.functional.relu(x)

x = nn.functional.max_pool2d(x, 2)

x = self.dropout1(x)

x = torch.flatten(x, 1)

x = self.fc1(x)

x = nn.functional.relu(x)

x = self.dropout2(x)

x = self.fc2(x)

output = nn.functional.log_softmax(x, dim=1)

return output



# Train the model (simplified for brevity)

model = SimpleCNN()

criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(model.parameters())



# Assume we've trained the model...



# Save the trained model

torch.save(model.state_dict(), "simple_cnn.pth")



# Step 2: Compile the model with torch-mlir



import torch_mlir



# Load the trained model

model = SimpleCNN()

model.load_state_dict(torch.load("simple_cnn.pth"))

model.eval()



# Create an example input tensor

example_input = torch.randn(1, 1, 28, 28)



# Compile the model to MLIR

mlir_module = torch_mlir.compile(model, example_input, output_type="linalg-on-tensors")



# Save the MLIR module to a file

with open("simple_cnn.mlir", "w") as f:

f.write(str(mlir_module))



# Step 3: Lower MLIR to LLVM IR

# This step typically requires using the MLIR tools from the command line



# mlir-opt simple_cnn.mlir --convert-linalg-to-loops --convert-scf-to-cf --convert-vector-to-llvm --convert-memref-to-llvm --convert-func-to-llvm --reconcile-unrealized-casts | mlir-translate --mlir-to-llvmir > simple_cnn.ll



# Step 4: Compile LLVM IR to machine code

# Use Clang to compile for M1 Mac (arm64 architecture)



# clang -O3 -march=arm64 simple_cnn.ll -o simple_cnn_exec



# The result is an executable file named 'simple_cnn_exec'



# Step 5 (optional): Create a C++ wrapper to use the compiled model



#include <iostream>

#include <vector>



// Declare the function generated from our PyTorch model

extern "C" void simple_cnn(float* input, float* output);



int main() {

// Prepare input (28x28 image flattened to 1D array)

std::vector<float> input(784, 0.0f); // Initialize with zeros for simplicity


// Prepare output (10 classes for MNIST)

std::vector<float> output(10, 0.0f);


// Call the compiled model

simple_cnn(input.data(), output.data());


// Print the output (class probabilities)

for (int i = 0; i < 10; ++i) {

std::cout << "Class " << i << " probability: " << output[i] << std::endl;

}


return 0;

}



# Compile the C++ wrapper with the compiled model

# clang++ -O3 wrapper.cpp simple_cnn_exec -o final_executable