5 different cases of FSDP and TP usage.
10/11/2024
10/10/2024
FSDP and TP explanation for 2 layer model
FSDP and TP are complementary parallelism techniques:
- FSDP (Fully Sharded Data Parallelism):
  - Shards model parameters across GPUs
  - Each GPU holds a portion of each layer's parameters
  - During the forward/backward pass, it all-gathers parameters as needed and reduce-scatters gradients
  - Reduces memory usage per GPU, allowing larger models
- TP (Tensor Parallelism):
  - Splits individual tensors (layers) across GPUs
  - Each GPU computes a portion of a layer's operations
  - Useful for very large layers that don't fit on a single GPU
When combined:
- FSDP handles overall model distribution
- TP handles distribution of large individual layers
- This allows for even larger models and better GPU utilization
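The two kinds of splitting can be sketched with plain Python lists. This is only a toy illustration under assumed settings (a 2x4 weight, 2 TP ranks, 2 FSDP ranks arranged as a 2x2 grid); the helper names are hypothetical and no real GPUs or PyTorch APIs are involved.

```python
# Toy illustration only: plain lists stand in for GPU tensors.

def split_columns(weight, tp_size):
    """Tensor parallelism: each TP rank keeps a contiguous block of columns."""
    step = len(weight[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in weight]
            for r in range(tp_size)]

def shard_flat(block, fsdp_size):
    """FSDP: flatten a block and give each FSDP rank 1/fsdp_size of it."""
    flat = [v for row in block for v in row]
    step = len(flat) // fsdp_size
    return [flat[r * step:(r + 1) * step] for r in range(fsdp_size)]

def all_gather(shards):
    """What FSDP does before a layer runs: rebuild the full flat parameter."""
    return [v for shard in shards for v in shard]

weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]  # a 2x4 layer weight (assumed example values)

tp_blocks = split_columns(weight, tp_size=2)

# Each (TP rank, FSDP rank) pair plays the role of one GPU.
gpu_shards = {(tp, dp): shard
              for tp, block in enumerate(tp_blocks)
              for dp, shard in enumerate(shard_flat(block, fsdp_size=2))}

for (tp, dp), shard in sorted(gpu_shards.items()):
    print(f"TP rank {tp}, FSDP rank {dp}: stores {shard}")

# TP rank 0 temporarily rebuilds its column block from its two FSDP shards.
restored = all_gather([gpu_shards[(0, 0)], gpu_shards[(0, 1)]])
print("TP rank 0 after all-gather:", restored)
```

Each simulated GPU stores only a quarter of the weight, and the all-gather rebuilds one TP rank's column block, never the whole layer.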
Textual Representation:
  GPU 1       GPU 2       GPU 3       GPU 4
+--------+  +--------+  +--------+  +--------+
| L1 P1  |  | L1 P2  |  | L2 P1  |  | L2 P2  |
|  TP1   |  |  TP2   |  |  TP1   |  |  TP2   |
+--------+  +--------+  +--------+  +--------+
    |           |           |           |
    +-----------+           +-----------+
       Layer 1                 Layer 2

L1, L2: Layers 1 and 2
P1, P2: Parameter shards (FSDP)
TP1, TP2: Tensor Parallel splits
9/30/2024
How gradients are calculated for a batch.
Let's use a simplified example with just 2 data points and walk through the process with actual numbers. This will help illustrate how gradients are calculated and accumulated for a batch.
Let's assume we have a very simple model with one parameter w, currently set to 1.0. Our loss function is the squared error, and we're using basic gradient descent with a learning rate of 0.1.
Data points:
- x1 = 2, y1 = 4
- x2 = 3, y2 = 5
Batch size = 2 (both data points in one batch)
Step 1: Forward pass
- For x1: prediction = w * x1 = 1.0 * 2 = 2
- For x2: prediction = w * x2 = 1.0 * 3 = 3
Step 2: Calculate losses
- Loss1 = (prediction1 - y1)^2 = (2 - 4)^2 = 4
- Loss2 = (prediction2 - y2)^2 = (3 - 5)^2 = 4
- Total batch loss = (Loss1 + Loss2) / 2 = (4 + 4) / 2 = 4
Step 3: Backward pass (calculate gradients)
- Gradient1 = 2 * (prediction1 - y1) * x1 = 2 * (2 - 4) * 2 = -8
- Gradient2 = 2 * (prediction2 - y2) * x2 = 2 * (3 - 5) * 3 = -12
Step 4: Accumulate gradients
- Total gradient = (Gradient1 + Gradient2) / 2 = (-8 + -12) / 2 = -10
Step 5: Update weight (once for the batch)
- New w = old w - learning_rate * total gradient
- New w = 1.0 - 0.1 * (-10) = 2.0
So, after processing this batch of 2 data points:
- We calculated 2 individual gradients (-8 and -12)
- We accumulated these into one total gradient (-10)
- We performed one weight update, changing w from 1.0 to 2.0
This process would then repeat for the next batch. In this case, we've processed all our data, so this completes one epoch.
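The five steps above can be reproduced in a few lines of plain Python. This is a minimal sketch of the same arithmetic; the function name batch_gradient_step is made up for illustration.

```python
def batch_gradient_step(w, xs, ys, lr):
    """One gradient-descent update on a batch for the model pred = w * x."""
    preds = [w * x for x in xs]                                    # Step 1
    losses = [(p - y) ** 2 for p, y in zip(preds, ys)]             # Step 2
    grads = [2 * (p - y) * x for p, y, x in zip(preds, ys, xs)]    # Step 3
    batch_loss = sum(losses) / len(losses)
    batch_grad = sum(grads) / len(grads)                           # Step 4
    new_w = w - lr * batch_grad                                    # Step 5
    return batch_loss, grads, batch_grad, new_w

loss, grads, batch_grad, new_w = batch_gradient_step(
    w=1.0, xs=[2, 3], ys=[4, 5], lr=0.1
)
print("batch loss:", loss)
print("per-sample gradients:", grads)
print("averaged gradient:", batch_grad)
print("updated w:", new_w)
```

The per-sample gradients are averaged before the single weight update, which is why the batch produces one step rather than two.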
9/28/2024
How many GPUs do I need to train a LLM?
9/22/2024
What is TorchOps.cpp.inc in torch-mlir
What is TorchOps.cpp.inc?
- TorchOps.cpp.inc: This file contains implementations of the operations for the torch-mlir dialect. It is typically generated from .td (TableGen) files that define the dialect and its operations.
- The .td (TableGen) files describe MLIR operations in a high-level, declarative form, and the CMake build process automatically generates .cpp.inc files (like TorchOps.cpp.inc) from these .td files.
How it gets generated:
- TableGen: The TableGen tool processes .td files that define the operations and attributes for the torch dialect.
- CMake Build: During the CMake build process, the mlir-tblgen tool is invoked to generate various .inc files, including TorchOps.cpp.inc.
Where It Is Generated:
The TorchOps.cpp.inc file is usually generated in the build directory under the subdirectories for the torch-mlir project. The exact location depends on how the build is configured. For example:
build/tools/torch-mlir/lib/Dialect/Torch/IR/TorchOps.cpp.inc
This file gets included in the compiled source code to provide the implementation of the Torch dialect operations.
How to Ensure It Is Generated:
If the file is missing, it's likely because there was an issue in the build process. Here’s how to ensure it’s generated:
- Ensure CMake and Ninja Build: Make sure the CMake configure step and the Ninja build both complete without errors. You can check that the TorchOps.cpp.inc file is generated by looking in the build directory:
ls build/tools/torch-mlir/lib/Dialect/Torch/IR/
- Check for TableGen Files: Make sure that the .td files (such as TorchOps.td) are present in the source directory. These are used by mlir-tblgen to generate the .cpp.inc files.
Debugging if Not Generated:
If TorchOps.cpp.inc or similar files are not generated, ensure:
- You are running the full build using ninja or make.
- mlir-tblgen is being invoked during the build process (you should see log messages referencing mlir-tblgen).
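If the file does not show up at the expected path, searching the whole build tree is a quick check. The sketch below uses only the Python standard library; the helper name find_generated and the default build directory name "build" are assumptions for illustration.

```python
from pathlib import Path

def find_generated(build_dir, name="TorchOps.cpp.inc"):
    """Return every file called `name` anywhere below build_dir (sorted)."""
    root = Path(build_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(name) if p.is_file())

# Assumes the build tree lives in ./build; adjust to your own layout.
hits = find_generated("build")
if hits:
    for path in hits:
        print("found:", path)
else:
    print("TorchOps.cpp.inc not found; check that mlir-tblgen ran during the build")
```

Passing a pattern such as "*.cpp.inc" as name lists every generated implementation file, which helps tell "nothing was generated" apart from "generated somewhere else".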
IREE test code and explanation
.
..
To run this code:
- Save it to a file, e.g., test_iree.py.
- Make sure you have IREE and its Python bindings installed and properly set up in your environment.
- Run the script using Python:
python test_iree.py
This script will:
- Define a simple MLIR function that adds two 4-element float32 tensors.
- Compile this MLIR code to an IREE module.
- Set up the IREE runtime environment.
- Create input data as NumPy arrays.
- Execute the compiled function with the input data.
- Print the result.
The output should show each step of the process and finally print the result, which should be [ 6. 8. 10. 12.].
This example demonstrates the basic workflow for testing MLIR code with IREE using Python. You can modify the MLIR code string and input data to test different functions and operations as needed.
9/20/2024
mlir build and test
To build and run your toy1.cpp code with MLIR, follow these steps. This assumes you are using the Toy language tutorial from MLIR as a base.
1. Setup MLIR Development Environment
If you haven’t done this already, you’ll need to clone and build the LLVM project with MLIR enabled. Here are the steps:
a. Clone LLVM with MLIR
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
b. Build MLIR
mkdir build
cd build
cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS=mlir \
-DLLVM_BUILD_EXAMPLES=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_ASSERTIONS=ON
cmake --build . --target check-mlir
You can also follow the full guide for building MLIR from the official MLIR Getting Started guide.
2. Implementing the Toy Language (toy1.cpp)
You are using a simplified example of the Toy Language from the MLIR tutorial. For this code to work, you need to create a proper Toy dialect and Toy compiler.
a. Writing toy1.cpp
Save your example code as toy1.cpp inside your MLIR directory.
#include "toy/Dialect.h"
#include "toy/Parser.h"
#include "toy/Passes.h"
// Note: the upstream Toy tutorial has no "toy/Lowering.h"; its lowering
// passes are declared in "toy/Passes.h".
#include <mlir/IR/MLIRContext.h>
#include <mlir/Pass/PassManager.h>
#include <mlir/ExecutionEngine/ExecutionEngine.h>
#include <mlir/IR/Verifier.h>
#include <mlir/Parser/Parser.h>
#include <mlir/Support/FileUtilities.h>
#include <mlir/Support/LogicalResult.h>
#include <mlir/Support/ToolUtilities.h>
#include <mlir/Support/LLVM.h>
#include <mlir/Target/LLVMIR/ModuleTranslation.h>

int main(int argc, char **argv) {
  // The context owns the loaded dialects, types, and attributes.
  mlir::MLIRContext context;
  // The pass manager will hold the optimization and lowering passes.
  mlir::PassManager pm(&context);

  // Define your toy program in MLIR (using Toy dialect)
  // "var a = [[1, 2, 3], [4, 5, 6]]; var b<2, 3> = ..."
  // Parse it, verify, and run it
  // Example: Create a pass that optimizes or lowers the Toy language IR into MLIR
  return 0;
}
You will need to modify this template to use the Toy language's parser and lower the Toy code into MLIR.
3. Integrating with the MLIR Pass Pipeline
You’ll need to define and register your passes. This step lowers Toy language constructs (like variable assignments, matrix multiplication, and transposing) into the MLIR representation.
a. Register Toy Passes and Dialect
You can define passes to lower your Toy language to MLIR:
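// In your main, define the following steps.
// In the upstream tutorial the Toy passes live in the mlir::toy namespace
// (declared in "toy/Passes.h"); CSE and the canonicalizer come from
// "mlir/Transforms/Passes.h". Upstream also nests shape inference under
// toy::FuncOp via pm.nest<...>() rather than adding it at module level.
pm.addPass(mlir::toy::createShapeInferencePass());
pm.addPass(mlir::createCSEPass());
pm.addPass(mlir::createCanonicalizerPass());
pm.addPass(mlir::toy::createLowerToAffinePass());
pm.addPass(mlir::toy::createLowerToLLVMPass());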
4. Running Your Toy Code in MLIR
Once you've written the Toy language logic and set up the passes, you can now run and test it using the MLIR tools.
a. Compile toy1.cpp
After you set up your CMakeLists.txt file (using the MLIR Toy Tutorial) and ensure that the Toy dialect is registered, you can compile the Toy language.
cd build
cmake --build . --target toy-compiler
b. Run Toy Compiler
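To run your Toy compiler on a Toy source file and emit MLIR (the input is a Toy program, not toy1.cpp itself; example.toy below is a placeholder file name):
./toy-compiler example.toy -o output.mlir
This will generate MLIR code for your Toy program. The toy-compiler name and the -o flag depend on how you wrote your driver; the upstream tutorial binaries are named toyc-ch1 through toyc-ch7 and take -emit=mlir.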
5. Testing and Debugging
Once you've compiled your Toy language code to MLIR, you can use MLIR’s optimization and debugging tools:
mlir-opt output.mlir --canonicalize --cse
mlir-translate --mlir-to-llvmir output.mlir | llc -filetype=obj -o output.o
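This optimizes the MLIR and, once the module has been fully lowered to the LLVM dialect, translates it to LLVM IR and then to an object file (output.o) that can be linked into an executable. mlir-opt and mlir-translate only understand registered dialects, so any Toy dialect operations must already have been lowered by your own compiler.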
References:
- MLIR Getting Started
- MLIR Toy Tutorial
This setup will help you compile and run Toy language code through MLIR!
9/18/2024
AMD Distributed Training Overview
9/17/2024
What is IREE turbine
IREE-Turbine is a package that combines PyTorch, Torch-MLIR, IREE, and additional tools to provide a comprehensive solution for compiling, optimizing, and executing PyTorch models using IREE's infrastructure. IREE-Turbine offers the following key features:
1. AOT Export: This allows for Ahead-Of-Time compilation of PyTorch modules (nn.Modules) into deployment-ready artifacts. These compiled artifacts can then take full advantage of IREE's runtime features.
2. Eager Execution: It provides a torch.compile backend and a Turbine Tensor/Device for interactive PyTorch sessions. This enables users to work with PyTorch in a familiar environment while leveraging IREE's optimization capabilities.
3. Custom Ops: IREE-Turbine offers integration for defining custom PyTorch operations and implementing them using either IREE's backend IR or the Pythonic kernel language. This allows for extending PyTorch's functionality while maintaining compatibility with IREE's optimization pipeline.
In essence, IREE-Turbine acts as a bridge between PyTorch and IREE, allowing PyTorch users to benefit from IREE's advanced compilation and runtime features while maintaining a familiar PyTorch-based workflow. It aims to provide a seamless experience for compiling PyTorch models to run efficiently on various hardware targets supported by IREE.
HIP kernel for matrix multiplication that can leverage Matrix Cores
Key points about this example:
1. It uses `half` precision for input matrices A and B, which can potentially benefit from Matrix Core acceleration.
2. The kernel is designed for 16x16 matrices, which is a common size for Matrix Core operations.
3. Shared memory is used to improve performance by reducing global memory accesses.
4. The main computation loop uses `__half2float` conversions. On GPUs with native FP16 support, these conversions might be optimized out.
5. The kernel uses a tiled approach, which is generally efficient for matrix multiplication.
6. Error checking is included for HIP calls.
Important considerations:
1. This kernel doesn't guarantee the use of Matrix Cores. The actual use of Matrix Cores depends on the GPU architecture and the HIP compiler's optimizations.
2. For larger matrices, you'd need to implement a more sophisticated tiling strategy.
3. Performance tuning is crucial. You might need to experiment with different block sizes and memory access patterns for optimal performance.
4. The HIP runtime and compiler will attempt to optimize this code for the target GPU, potentially leveraging Matrix Cores if available.
5. For production use, you should implement proper error handling and potentially use more sophisticated synchronization methods.
To fully leverage Matrix Cores, you might need to use specific intrinsics or rely on compiler optimizations. The exact method can vary depending on the GPU architecture and HIP version. Always profile your code to ensure you're getting the expected performance benefits.
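The tiled approach mentioned in the key points can be sketched in plain Python to show the loop structure. This is not HIP code and says nothing about Matrix Core usage; the function name tiled_matmul is hypothetical, and the local a_tile and b_tile lists only play the role that shared memory plays in a GPU kernel.

```python
def tiled_matmul(a, b, tile=16):
    """Multiply a (n x k) by b (k x m) one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for k0 in range(0, k, tile):  # walk along the shared dimension
                # "Load" one tile of A and one tile of B (shared-memory analog).
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Accumulate this tile pair's partial products into C.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[kk] * b_tile[kk][j]
                            for kk in range(len(b_tile))
                        )
    return c

# Tiny example with tile=1 so every loop level is exercised.
print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))
```

Each tile of A and B is read once per tile pair and reused for every output element in that tile, which is where the reduction in global memory traffic comes from on a GPU.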
Creating a custom CUDA kernel that directly utilizes tensor cores
Creating a custom CUDA kernel that directly utilizes tensor cores is an advanced topic, as tensor cores are typically accessed through higher-level libraries like cuBLAS or cuDNN. However, NVIDIA does provide a way to use tensor cores in custom kernels through the CUDA C++ Warp Matrix Multiply-Accumulate (WMMA) API, exposed in the mma.h header under the nvcuda::wmma namespace. Here's an overview of how to create a kernel that works on tensor cores:
Key points about this example:
1. We're using 16x16 matrices as this is a common size for tensor core operations.
2. The kernel uses `nvcuda::wmma::fragment` to define matrix fragments.
3. `load_matrix_sync`, `mma_sync`, and `store_matrix_sync` are used to load data, perform matrix multiplication, and store results using tensor cores.
4. The kernel operates on half-precision input (`half`) and produces single-precision output (`float`).
To use this kernel:
1. Compile with a CUDA compiler that supports tensor cores (CUDA 9.0 or later).
2. Use appropriate GPU architecture flags (e.g., `-arch=sm_70` for Volta, `-arch=sm_75` for Turing).
3. Allocate memory and copy data to the GPU before calling `launch_wmma_kernel`.
Important considerations:
1. Error checking is omitted for brevity but should be included in production code.
2. This is a basic example. Real-world usage often involves tiling and more complex memory access patterns for larger matrices.
3. Performance tuning is crucial. The exact dimensions and data types should be chosen based on your specific use case and target GPU architecture.
4. Not all operations can be efficiently mapped to tensor cores. They're most beneficial for large matrix multiplications common in deep learning workloads.
Remember, while this approach gives you direct control over tensor core usage, in many cases, using higher-level libraries like cuBLAS or cuDNN is more practical and can automatically leverage tensor cores when appropriate.
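The numeric behaviour described above (half-precision inputs, higher-precision accumulation, D = A * B + C on a 16x16 tile) can be emulated on the CPU. The sketch below is plain Python, not CUDA, and does not touch tensor cores; the names to_half and mma_tile are made up, and Python's double-precision floats stand in for the float accumulator.

```python
import struct

TILE = 16  # tile edge used in the example above

def to_half(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mma_tile(a, b, c):
    """Return a * b + c for TILE x TILE lists, rounding a and b to half first.

    The sum is accumulated in Python floats, a stand-in for the float
    accumulator fragment of the real API.
    """
    a_h = [[to_half(v) for v in row] for row in a]
    b_h = [[to_half(v) for v in row] for row in b]
    return [[c[i][j] + sum(a_h[i][k] * b_h[k][j] for k in range(TILE))
             for j in range(TILE)]
            for i in range(TILE)]

identity = [[1.0 if i == j else 0.0 for j in range(TILE)] for i in range(TILE)]
b = [[0.1] * TILE for _ in range(TILE)]
zeros = [[0.0] * TILE for _ in range(TILE)]

d = mma_tile(identity, b, zeros)
print(d[0][0])  # 0.1 as stored in half precision, slightly below 0.1
```

Multiplying by the identity makes the input rounding visible on its own: the result differs from 0.1 only because 0.1 is not exactly representable in half precision.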
9/16/2024
PyTorch model to MLIR -> LLVM -> executable file on MacBook M1