10/10/2024

FSDP and TP explanation for a 2-layer model

FSDP and TP are complementary parallelism techniques:

  1. FSDP (Fully Sharded Data Parallel):
    • Shards model parameters across GPUs
    • Each GPU holds a portion of each layer's parameters
    • During the forward/backward pass, it all-gathers parameters as needed and reduce-scatters gradients
    • Reduces per-GPU memory usage, allowing larger models (see the FSDP sketch after this list)
  2. TP (Tensor Parallelism):
    • Splits individual tensors (layers) across GPUs
    • Each GPU computes a portion of a layer's operations
    • Useful for very large layers that don't fit on a single GPU (see the TP sketch after this list)
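
To make the FSDP side concrete, here is a minimal sketch using PyTorch's FullyShardedDataParallel wrapper; the 2-layer model and its sizes are hypothetical, and a torchrun launch is assumed:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=4 this_script.py`
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Hypothetical 2-layer model; the sizes are arbitrary
model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024)).cuda()

# FSDP shards every parameter: each rank keeps only its 1/world_size
# slice, all-gathers the full parameters just before a module's
# forward/backward pass, and frees them again afterward
model = FSDP(model)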
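
For the TP side, a minimal sketch assuming PyTorch >= 2.2's torch.distributed.tensor.parallel API; the TwoLayerMLP class and its layer names are hypothetical stand-ins for the 2-layer model:

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Hypothetical 2-layer MLP
class TwoLayerMLP(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.layer1 = nn.Linear(dim, 4 * dim)
        self.layer2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

# 1-D mesh: 2-way tensor parallelism (assumes `torchrun --nproc_per_node=2`)
tp_mesh = init_device_mesh("cuda", (2,))

# layer1's weight is split column-wise (by output features) and layer2's
# row-wise (by input features), so the intermediate activation stays
# sharded between the two matmuls and only layer2's output is reduced
model = parallelize_module(
    TwoLayerMLP().cuda(),
    tp_mesh,
    {"layer1": ColwiseParallel(), "layer2": RowwiseParallel()},
)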

When combined:

  • FSDP shards parameters across the data-parallel dimension of the GPU mesh
  • TP splits large individual layers across the tensor-parallel dimension
  • Together they allow even larger models and better GPU utilization (see the combined sketch below)
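
A minimal sketch of the combined 2-D setup, again assuming PyTorch >= 2.2; the 2x2 mesh matches the 4-GPU diagram below, and TwoLayerMLP is the same hypothetical model defined in the TP sketch above:

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# 2x2 mesh over 4 GPUs: outer "dp" dimension for FSDP, inner "tp" for TP
# (assumes launch via `torchrun --nproc_per_node=4 this_script.py`)
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

# TP first: split each layer across the 2 GPUs of the "tp" dimension
model = parallelize_module(
    TwoLayerMLP().cuda(),  # hypothetical model from the TP sketch above
    mesh["tp"],
    {"layer1": ColwiseParallel(), "layer2": RowwiseParallel()},
)

# Then FSDP shards the TP-split parameters across the "dp" dimension;
# use_orig_params=True is needed when composing FSDP with DTensor-based TP
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)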

Textual Representation:

  GPU 1      GPU 2      GPU 3      GPU 4
+--------+ +--------+ +--------+ +--------+
| L1 P1  | | L1 P2  | | L2 P1  | | L2 P2  |
|  TP1   | |  TP2   | |  TP1   | |  TP2   |
+--------+ +--------+ +--------+ +--------+
     |          |          |          |
     +----------+          +----------+
        Layer 1               Layer 2

L1, L2:   Layers 1 and 2
P1, P2:   Parameter shards (FSDP)
TP1, TP2: Tensor Parallel splits
