Yes, `fsdp_config.activation_checkpointing` does come with a compute overhead, since activations are recomputed during the backward pass, but it's generally the most efficient option for large models like LLaMA 70B, for several reasons:
1. The alternative memory-saving options usually have a bigger throughput impact (see the config sketch at the end for how these map onto flags):
- `activation_cpu_offload=true`: Much slower, because activations are shuttled between CPU and GPU memory on every step
- `fp32_cpu_offload=true`: Also relies on slow CPU-GPU transfers
- Reducing the batch size: Directly reduces throughput
- Reducing the model size: Changes the model itself rather than how it's trained
2. Activation checkpointing's overhead is typically around 20-30% extra compute, but:
- It only affects the backward pass
- Modern GPUs often have compute headroom, since large-model training tends to be memory-bound
- The activation-memory savings (often 50-70%) usually outweigh the extra computation
3. The recompute strategy itself is efficient (see the sketch after this list):
- Only the activations of checkpointed segments are recomputed, and only when the backward pass reaches them
- Recomputation can overlap with other GPU work, such as communication during the backward pass
- Modern implementations are tuned to minimize how much is recomputed
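To make point 3 concrete, here is a minimal PyTorch sketch, using plain `torch.utils.checkpoint` rather than the FSDP wrapper itself, of what checkpointing a block does: the block's intermediate activations are dropped after the forward pass and recomputed when backward reaches it. The module and dimensions here are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Wraps a feed-forward block so its activations are recomputed in the
    backward pass instead of being stored during the forward pass."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the non-reentrant variant; only this block's
        # forward is re-run during backward, so the extra compute is bounded
        # by one additional forward pass over the block.
        return checkpoint(self.ff, x, use_reentrant=False)


if __name__ == "__main__":
    block = CheckpointedBlock()
    x = torch.randn(8, 1024, requires_grad=True)
    y = block(x).sum()
    y.backward()           # the forward over self.ff is recomputed here
    print(x.grad.shape)    # torch.Size([8, 1024])
```

The extra work is bounded by roughly one additional forward pass over each checkpointed block, which is where figures like the 20-30% above come from.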
So while it isn't free, it's usually the lowest-impact way to make a model this size fit in memory while keeping throughput reasonable.
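For reference, here is a hedged sketch of how these trade-offs show up as configuration flags, written as a Python dict in the shape of an LLM Foundry / Composer-style `fsdp_config`. The key names follow the options discussed above, but the exact schema and defaults depend on the library version you're running, so treat this as illustrative only.

```python
# Hypothetical fsdp_config sketch; key names mirror the options discussed
# above, but the exact schema is version-dependent, not authoritative.
fsdp_config = {
    "sharding_strategy": "FULL_SHARD",   # shard params, grads, optimizer state
    "mixed_precision": "PURE",           # low-precision compute where safe
    # Preferred memory lever: recompute activations in the backward pass.
    "activation_checkpointing": True,
    # Alternatives left off: they also save memory, but pay for it with
    # slow host<->device transfers on every step.
    "activation_cpu_offload": False,
    # "fp32_cpu_offload": False,         # likewise trades step time for memory
}
```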