FSDP and TP are complementary parallelism techniques:
- FSDP (Fully Sharded Data Parallelism):
  - Shards model parameters across GPUs
  - Each GPU holds only a portion of each layer's parameters
  - During the forward/backward pass, the full parameters for a layer are all-gathered just before they are needed and freed afterwards; gradients are reduce-scattered back to their owning shards
  - Reduces memory usage per GPU, allowing larger models to be trained (see the FSDP sketch after this list)
- TP (Tensor Parallelism):
  - Splits individual tensors (layers) across GPUs
  - Each GPU computes a portion of a layer's operations
  - Useful for very large layers that don't fit on a single GPU (see the tensor-parallel sketch after this list)
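A minimal sketch of the FSDP side in PyTorch, assuming torch >= 2.0, one GPU per rank, and a launch via torchrun (the layer sizes and optimizer settings are made up for illustration):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes a launch like: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).cuda()

# FSDP shards the wrapped parameters across all ranks. The full parameters of
# a wrapped unit are all-gathered only while its forward/backward runs, then
# freed; gradients are reduce-scattered back to the owning shards.
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
```

In practice an auto_wrap_policy is usually passed so that each transformer block becomes its own FSDP unit; wrapping the whole model as a single unit, as above, gathers everything at once and gives up most of the memory savings.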
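A matching sketch of the TP idea, written with plain collectives rather than any specific TP library; ColumnParallelLinear is just an illustrative name, and the gather is forward-only (real training would use an autograd-aware all-gather):

```python
import torch
import torch.distributed as dist
import torch.nn as nn


class ColumnParallelLinear(nn.Module):
    """Splits the output dimension of a linear layer across ranks."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each rank stores and multiplies only its slice of the weight matrix.
        self.local = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)  # this rank's slice of the output features
        # All-gather the slices so every rank ends up with the full activation.
        # (Forward-only illustration; training needs an autograd-aware gather.)
        chunks = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(chunks, local_out)
        return torch.cat(chunks, dim=-1)
```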
When combined:
- FSDP handles overall model distribution
- TP handles distribution of large individual layers
- This allows for even larger models and better GPU utilization, as sketched below
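A hedged sketch of combining the two on the 4-GPU layout drawn below, following the device-mesh pattern of recent PyTorch 2.x releases (init_device_mesh, parallelize_module, and FSDP's device_mesh argument vary across versions, so treat this as illustrative rather than definitive; it assumes a torchrun launch on 4 GPUs):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# 4 GPUs arranged as a 2 x 2 mesh: 2-way FSDP ("dp") x 2-way TP ("tp").
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(
    nn.Linear(4096, 16384),  # large layer: split column-wise by TP
    nn.ReLU(),
    nn.Linear(16384, 4096),  # large layer: split row-wise by TP
).cuda()

# TP first: split the big linear layers across the "tp" mesh dimension.
model = parallelize_module(
    model,
    mesh["tp"],
    {"0": ColwiseParallel(), "2": RowwiseParallel()},
)

# FSDP second: shard whatever each rank still holds across the "dp" dimension.
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
```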
Textual Representation:
```
  GPU 1        GPU 2        GPU 3        GPU 4
+--------+   +--------+   +--------+   +--------+
| L1 P1  |   | L1 P2  |   | L2 P1  |   | L2 P2  |
|  TP1   |   |  TP2   |   |  TP1   |   |  TP2   |
+--------+   +--------+   +--------+   +--------+
     |            |            |            |
     +------------+            +------------+
        Layer 1                    Layer 2
```

L1, L2: Layers 1 and 2
P1, P2: Parameter shards (FSDP)
TP1, TP2: Tensor Parallel splits