12/31/2024

Insight about the `fsdp_config.activation_checkpointing` option

The `fsdp_config.activation_checkpointing` option does come with a computational overhead, since it recomputes activations during the backward pass, but it's generally the most efficient memory-saving option for large models like LLaMA 70B, for several reasons:
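
As a rough illustration of the mechanism (not the exact code path any particular framework uses), wrapping a block in PyTorch's `torch.utils.checkpoint.checkpoint` drops the block's intermediate activations after the forward pass and recomputes them during backward; the module and sizes below are made up for the sketch.

```python
# Minimal sketch of what activation checkpointing does at the module level.
# The block and sizes are illustrative; real frameworks wrap transformer
# blocks for you when the config flag is set.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
x = torch.randn(8, 4096, requires_grad=True)

# Normal forward: intermediate activations inside `block` are kept for backward.
y_full = block(x)

# Checkpointed forward: only the block's input/output are kept; the
# intermediates are recomputed when .backward() reaches this block.
y_ckpt = checkpoint(block, x, use_reentrant=False)
y_ckpt.sum().backward()
```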


1. Alternative memory-saving options usually have a bigger throughput impact (the config sketch after this list shows how they appear side by side):

- `activation_cpu_offload=true`: Much slower due to CPU-GPU data transfer

- `fp32_cpu_offload=true`: Also involves slow CPU-GPU transfers

- Reducing batch size: Directly reduces throughput

- Reducing model size: Changes model behavior
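
For reference, here is roughly how those knobs appear in an LLM Foundry / Composer-style `fsdp_config`, expressed as a Python dict (in practice it usually lives in YAML). The exact key names and accepted values depend on the framework version, so treat this as an assumed sketch rather than a canonical config.

```python
# Assumed LLM Foundry / Composer-style FSDP config; key names follow the flags
# discussed above, but check your framework version for the exact schema.
fsdp_config = {
    "sharding_strategy": "FULL_SHARD",
    # Recompute activations in backward: moderate compute overhead, big memory win.
    "activation_checkpointing": True,
    # Alternatives traded off above -- both add CPU<->GPU transfer latency:
    "activation_cpu_offload": False,  # offload activations to host RAM
    "fp32_cpu_offload": False,        # keep fp32 state on the CPU
}
```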


2. Activation checkpointing's overhead is typically around 20-30% of compute time (a rough estimate follows this list), but:

- Only affects backward pass

- Modern GPUs often have compute headroom

- Memory savings (often 50-70% of activation memory) usually outweigh the computation cost
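
To make the trade-off concrete, here is a back-of-envelope estimate under assumed LLaMA-70B-like dimensions (80 layers, hidden size 8192, sequence length 4096, micro-batch 1). The constants are rough guesses, not measurements; the point is only the order of magnitude of the memory saved versus the extra compute.

```python
# Rough, assumed numbers for a LLaMA-70B-like model; not measurements.
layers, hidden, seq, micro_batch = 80, 8192, 4096, 1
bytes_per_elem = 2        # bf16
acts_per_layer = 16       # assumed count of stored tensors per block (attn + MLP)

per_layer = acts_per_layer * micro_batch * seq * hidden * bytes_per_elem
no_ckpt_gib = layers * per_layer / 2**30
# Full checkpointing keeps ~1 boundary activation per layer, plus one layer's
# worth of recomputed intermediates at the peak of the backward pass.
ckpt_gib = (layers * micro_batch * seq * hidden * bytes_per_elem + per_layer) / 2**30

print(f"without checkpointing: ~{no_ckpt_gib:.0f} GiB of activations")
print(f"with checkpointing:    ~{ckpt_gib:.0f} GiB of activations")
# Extra compute is roughly one additional forward pass per block, which is on
# the order of the 20-30% figure quoted above.
```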


3. The recompute strategy is selective (a wrapping sketch follows this list):

- Only recomputes necessary activations

- Recomputation can overlap with other GPU work (e.g., communication) in some implementations

- Modern implementations are optimized to minimize recomputation
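
One common way this selectivity shows up in recent PyTorch FSDP training code is `apply_activation_checkpointing` with a `check_fn` that wraps only the transformer blocks. The `TransformerBlock` class and toy model below are stand-ins I made up for the sketch; frameworks that expose `fsdp_config.activation_checkpointing` do equivalent wrapping for you.

```python
# Sketch: checkpoint only the transformer blocks, leaving other modules alone.
import functools
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

class TransformerBlock(nn.Module):  # stand-in for the model's real block class
    def __init__(self, d=1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)

model = nn.Sequential(*[TransformerBlock() for _ in range(4)])  # toy model

# Wrap only modules matching check_fn: their intermediates are discarded in the
# forward pass and recomputed during backward; everything else is untouched.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, TransformerBlock),
)
```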


So while it's not zero-cost, it's usually the lowest-impact option for fitting large models in memory while maintaining reasonable throughput.

