FSDP and TP are complementary parallelism techniques:
- FSDP (Fully Sharded Data Parallelism):
  - Shards model parameters across GPUs
  - Each GPU holds only a portion of each layer's parameters
  - During the forward/backward pass, the full parameters for a layer are all-gathered just before they are needed and freed afterwards; gradients are reduce-scattered back to their owning shards
  - Reduces memory usage per GPU, allowing larger models to be trained (see the FSDP sketch after this list)
- TP (Tensor Parallelism):
  - Splits individual tensors (layers) across GPUs
  - Each GPU computes a portion of a layer's operations
  - Useful for very large layers that don't fit on a single GPU (see the tensor-parallel sketch after this list)
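A minimal sketch of the FSDP side in PyTorch, assuming torch >= 2.0, one GPU per rank, and a launch via torchrun (the layer sizes and optimizer settings are made up for illustration):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes a launch like: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).cuda()

# FSDP shards the wrapped parameters across all ranks. The full parameters of
# a wrapped unit are all-gathered only while its forward/backward runs, then
# freed; gradients are reduce-scattered back to the owning shards.
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
```

In practice an auto_wrap_policy is usually passed so that each transformer block becomes its own FSDP unit; wrapping the whole model as a single unit, as above, gathers everything at once and gives up most of the memory savings.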
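A matching sketch of the TP idea, written with plain collectives rather than any specific TP library; ColumnParallelLinear is just an illustrative name, and the gather is forward-only (real training would use an autograd-aware all-gather):

```python
import torch
import torch.distributed as dist
import torch.nn as nn


class ColumnParallelLinear(nn.Module):
    """Splits the output dimension of a linear layer across ranks."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each rank stores and multiplies only its slice of the weight matrix.
        self.local = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)  # this rank's slice of the output features
        # All-gather the slices so every rank ends up with the full activation.
        # (Forward-only illustration; training needs an autograd-aware gather.)
        chunks = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(chunks, local_out)
        return torch.cat(chunks, dim=-1)
```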
When combined:
- FSDP handles overall model distribution
- TP handles distribution of large individual layers
- This allows for even larger models and better GPU utilization, as sketched below
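A hedged sketch of combining the two on the 4-GPU layout drawn below, following the device-mesh pattern of recent PyTorch 2.x releases (init_device_mesh, parallelize_module, and FSDP's device_mesh argument vary across versions, so treat this as illustrative rather than definitive; it assumes a torchrun launch on 4 GPUs):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# 4 GPUs arranged as a 2 x 2 mesh: 2-way FSDP ("dp") x 2-way TP ("tp").
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(
    nn.Linear(4096, 16384),  # large layer: split column-wise by TP
    nn.ReLU(),
    nn.Linear(16384, 4096),  # large layer: split row-wise by TP
).cuda()

# TP first: split the big linear layers across the "tp" mesh dimension.
model = parallelize_module(
    model,
    mesh["tp"],
    {"0": ColwiseParallel(), "2": RowwiseParallel()},
)

# FSDP second: shard whatever each rank still holds across the "dp" dimension.
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
```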
Textual Representation:
```
  GPU 1        GPU 2        GPU 3        GPU 4
+--------+   +--------+   +--------+   +--------+
| L1 P1  |   | L1 P2  |   | L2 P1  |   | L2 P2  |
|  TP1   |   |  TP2   |   |  TP1   |   |  TP2   |
+--------+   +--------+   +--------+   +--------+
     |            |            |            |
     +------------+            +------------+
        Layer 1                    Layer 2
```

L1, L2: Layers 1 and 2
P1, P2: Parameter shards (FSDP)
TP1, TP2: Tensor Parallel splits