5 different cases of FSDP and TP usage.
10/11/2024
10/10/2024
FSDP and TP explanation for a 2-layer model
FSDP and TP are complementary parallelism techniques:
- FSDP (Fully Sharded Data Parallelism):
  - Shards model parameters (along with their gradients and optimizer states) across GPUs
  - Each GPU holds a portion of each layer's parameters
  - During the forward and backward passes, it all-gathers a layer's full parameters just before they are needed and frees them afterward; gradients are reduce-scattered back to the shards
  - Reduces memory usage per GPU, allowing larger models
- TP (Tensor Parallelism):
  - Splits the weight tensors of an individual layer across GPUs
  - Each GPU computes a portion of a layer's operations
  - Useful for very large layers that don't fit on a single GPU
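The two ideas above can be sketched in a toy, single-process simulation. This is only an illustration under stated assumptions: plain Python lists stand in for GPU tensors, the helper names (`shard`, `all_gather`, `column_split`, `concat_columns`) are hypothetical and not any library's API, and the TP split shown is a column-wise split, which is one common choice among several.

```python
# Toy simulation: lists stand in for GPU tensors, no real communication happens.

def shard(flat_params, world_size):
    """FSDP-style: split a flat parameter list into equal per-rank shards."""
    assert len(flat_params) % world_size == 0  # assumption: no padding needed
    n = len(flat_params) // world_size
    return [flat_params[r * n:(r + 1) * n] for r in range(world_size)]

def all_gather(shards):
    """Simulated all-gather: rebuild the full parameter list from all shards."""
    return [v for s in shards for v in s]

def matmul(x, w):
    """x: [batch][in], w: [in][out] -> [batch][out]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def column_split(w, tp_size):
    """TP-style: give each rank a contiguous block of output columns."""
    assert len(w[0]) % tp_size == 0
    n = len(w[0]) // tp_size
    return [[row[r * n:(r + 1) * n] for row in w] for r in range(tp_size)]

def concat_columns(parts):
    """Stitch per-rank partial outputs back together along the column axis."""
    return [sum(rows, []) for rows in zip(*parts)]

w = [[1, 2, 3, 4], [5, 6, 7, 8]]  # one layer: 2 inputs -> 4 outputs
x = [[1, 1], [2, 0]]              # batch of 2 inputs

# FSDP: each of 2 ranks stores half of the flattened weight and gathers before use.
flat = [v for row in w for v in row]
fsdp_shards = shard(flat, 2)
gathered = all_gather(fsdp_shards)
w_full = [gathered[0:4], gathered[4:8]]

# TP: each of 2 ranks multiplies by its own column block; outputs are concatenated.
tp_parts = [matmul(x, w_r) for w_r in column_split(w, 2)]
tp_out = concat_columns(tp_parts)
```

The difference in this sketch is where the full weight exists: with FSDP every rank temporarily rebuilds the whole weight before computing, while with TP no rank ever holds more than its own column block and only the outputs are combined.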
When combined:
- FSDP handles overall model distribution
- TP handles distribution of large individual layers
- This allows for even larger models and better GPU utilization
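As a rough back-of-the-envelope check on why the combination allows larger models, the per-GPU parameter count can be estimated as below. This is a sketch under a simplifying assumption: every weight is first split across the TP group and each split is then sharded across the FSDP group (real models usually keep some small layers unsplit). The function name and the example sizes are hypothetical.

```python
def params_per_gpu(total_params, fsdp_size, tp_size):
    """Parameters each GPU stores at rest, assuming every weight is split
    across tp_size ranks and each split is sharded across fsdp_size ranks."""
    world_size = fsdp_size * tp_size
    return -(-total_params // world_size)  # ceiling division

total = 8_000_000_000  # hypothetical 8B-parameter model

fsdp_only = params_per_gpu(total, fsdp_size=4, tp_size=1)
fsdp_and_tp = params_per_gpu(total, fsdp_size=4, tp_size=2)
```

In this estimate, adding a TP group of size 2 on top of 4-way FSDP halves the parameters stored per GPU, at the cost of using twice as many GPUs.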
Textual Representation:
  GPU 1        GPU 2        GPU 3        GPU 4
+--------+   +--------+   +--------+   +--------+
| L1 P1  |   | L1 P2  |   | L2 P1  |   | L2 P2  |
|  TP1   |   |  TP2   |   |  TP1   |   |  TP2   |
+--------+   +--------+   +--------+   +--------+
     |            |            |            |
     +------------+            +------------+
        Layer 1                   Layer 2

L1, L2: Layers 1 and 2
P1, P2: Parameter shards (FSDP)
TP1, TP2: Tensor Parallel splits