
1/11/2025

fsdp difference between fsdp_config.activation_checkpointing and fsdp_config.activation_checkpointing_reentrant

The key differences between these two FSDP (Fully Sharded Data Parallel) configuration parameters:

`fsdp_config.activation_checkpointing`:

- This is the main switch that enables/disables activation checkpointing

- When set to `true`, it saves memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass

- In your command, it's set to `false`, meaning no activation checkpointing will be performed


`fsdp_config.activation_checkpointing_reentrant`:

- This is a more specific setting that controls HOW activation checkpointing is implemented

- When set to `true` (as in your command), it uses PyTorch's original reentrant checkpointing implementation (`use_reentrant=True` in `torch.utils.checkpoint`)

- When set to `false`, the newer non-reentrant implementation is used instead; that variant is the one that supports nested activation checkpointing and handles complex model architectures better, and it is what PyTorch now recommends

- This setting only has an effect if `activation_checkpointing` is enabled


In your specific case, since `activation_checkpointing=false`, the `activation_checkpointing_reentrant=true` setting won't have any actual effect on the training process.


A typical memory-optimized configuration would be:

```yaml
fsdp_config:
  activation_checkpointing: true
  activation_checkpointing_reentrant: true
```


This would give you maximum memory efficiency at the cost of some computation overhead. However, your configuration seems to be optimized for speed rather than memory usage, which makes sense for a performance-focused training setup (as suggested by your YAML filename containing "performance").
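
For reference, here's a rough sketch of what these two flags correspond to in plain PyTorch, using the standard `checkpoint_wrapper` utilities. The exact wiring inside your training framework may differ, and the `check_fn` targeting `nn.TransformerEncoderLayer` is just an illustrative assumption:

```python
from functools import partial

import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)


def enable_activation_checkpointing(model: nn.Module, reentrant: bool = True) -> None:
    # activation_checkpointing=true roughly corresponds to wrapping selected
    # submodules in a checkpoint wrapper; activation_checkpointing_reentrant
    # picks which CheckpointImpl variant that wrapper uses.
    wrapper = partial(
        checkpoint_wrapper,
        checkpoint_impl=CheckpointImpl.REENTRANT if reentrant else CheckpointImpl.NO_REENTRANT,
    )
    # check_fn decides which submodules get checkpointed; targeting
    # nn.TransformerEncoderLayer here is purely illustrative.
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=wrapper,
        check_fn=lambda module: isinstance(module, nn.TransformerEncoderLayer),
    )
```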

1/08/2025

fsdp mixed precision pure vs default

The difference between `mixed_precision: PURE` and `mixed_precision: DEFAULT` in FSDP:


`mixed_precision: DEFAULT` (what you saw in logs):

- Parameters are stored in bfloat16

- Gradients are computed in bfloat16 but reduced (communicated across GPUs) in float32

- Buffers (like batch norm stats) are in bfloat16

- Results in log: "param_dtype=torch.bfloat16, reduce_dtype=torch.float32, buffer_dtype=torch.bfloat16"


`mixed_precision: PURE`:

- Parameters are stored in bfloat16

- Gradients are computed and reduced in bfloat16 (this is the key difference)

- Buffers are in bfloat16

- Would show in logs: "param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16"
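
As a reference point, here's a rough sketch (illustrative, not the framework's actual code) of the underlying PyTorch `MixedPrecision` settings these two modes correspond to, matching the dtypes shown in the logs:

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Roughly what DEFAULT resolves to: bf16 params/buffers, fp32 gradient reduction.
default_mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

# Roughly what PURE resolves to: bf16 everywhere, including gradient reduction.
pure_mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Either one would then be passed to the FSDP wrapper, e.g.
# FSDP(model, mixed_precision=pure_mp, ...)
```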


Performance comparison:

1. Memory Usage:

- PURE uses less memory and communication bandwidth because gradients stay in bfloat16

- DEFAULT uses more because gradients are upcast to float32 for reduction


2. Speed:

- PURE is typically faster because:

  - Less memory bandwidth used during gradient communication

  - Faster gradient reduction operations

  - Particularly beneficial for distributed training

- However, training might be less stable


3. Training Stability:

- DEFAULT is more numerically stable because gradient reduction happens in float32

- PURE might require more careful tuning of learning rate and other hyperparameters


From your logs showing throughput around 191 tokens/sec/device, you might get better performance with PURE mode, possibly 5-15% faster due to reduced communication overhead. However, if you experience training instability (very high loss values or NaNs), you should switch back to DEFAULT.


Recommendation:

1. Start with PURE for better performance

2. Monitor training metrics closely

3. If you see instability, fall back to DEFAULT
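
If you go with PURE, a trivial guard like the following can help automate the "watch for instability" part (the function name and threshold are made up for illustration, not from any framework):

```python
import math


def loss_is_unstable(loss: float, spike_threshold: float = 1e4) -> bool:
    # Treat NaN/Inf or an absurd spike as a signal to fall back to DEFAULT
    # (the threshold is arbitrary and should be tuned to your loss scale).
    return math.isnan(loss) or math.isinf(loss) or loss > spike_threshold
```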


12/31/2024

Insight about "fsdp_config.activation_checkpointing" option

The `fsdp_config.activation_checkpointing` option does come with a computational overhead, since it recomputes activations during the backward pass, but it's generally the most efficient memory-saving option for large models like LLaMA 70B, for several reasons:


1. Alternative memory saving options usually have bigger throughput impacts:

- `activation_cpu_offload=true`: Much slower due to CPU-GPU data transfer

- `fp32_cpu_offload=true`: Also involves slow CPU-GPU transfers

- Reducing batch size: Directly reduces throughput

- Reducing model size: Changes model behavior


2. Activation checkpointing typically adds around 20-30% to compute time, but:

- Only affects backward pass

- Modern GPUs often have compute headroom

- Memory savings (50-70%) usually outweigh the computation cost


3. The recompute strategy is smart:

- Only recomputes necessary activations

- Can utilize GPU compute while other operations are happening

- Modern implementations are optimized to minimize recomputation


So while it's not zero-cost, it's usually the minimum-impact option that allows large models to fit in memory while maintaining reasonable throughput.
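
For comparison, here's what the parameter CPU-offload alternative looks like at the raw PyTorch FSDP level. This is a minimal sketch; it assumes the framework's offload flags ultimately wrap similar machinery, and `torch.distributed` must already be initialized (e.g. via `torchrun`):

```python
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP


def wrap_with_cpu_offload(model: nn.Module) -> FSDP:
    # Parameter CPU offload trades GPU memory for host<->device transfers on
    # every step, which is why it usually hurts throughput more than
    # activation checkpointing does.
    return FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```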


10/10/2024

FSDP and TP explanation for 2 layer model

FSDP and TP are complementary parallelism techniques:

  1. FSDP (Fully Sharded Data Parallelism):
    • Shards model parameters across GPUs
    • Each GPU holds a portion of each layer's parameters
    • During forward/backward pass, it gathers/scatters parameters as needed
    • Reduces memory usage per GPU, allowing larger models
  2. TP (Tensor Parallelism):
    • Splits individual tensors (layers) across GPUs
    • Each GPU computes a portion of a layer's operations
    • Useful for very large layers that don't fit on a single GPU

When combined:

  • FSDP handles overall model distribution
  • TP handles distribution of large individual layers
  • This allows for even larger models and better GPU utilization

Textual Representation:

```
 GPU 1      GPU 2      GPU 3      GPU 4
+--------+ +--------+ +--------+ +--------+
| L1 P1  | | L1 P2  | | L2 P1  | | L2 P2  |
|  TP1   | |  TP2   | |  TP1   | |  TP2   |
+--------+ +--------+ +--------+ +--------+
    |          |          |          |
    +----------+          +----------+
       Layer 1                Layer 2

L1, L2:   Layers 1 and 2
P1, P2:   Parameter shards (FSDP)
TP1, TP2: Tensor Parallel splits
```
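
A minimal sketch of how this 2-D layout can be expressed with recent PyTorch (roughly 2.2+) device meshes and the tensor-parallel API. The mesh shape, layer sizes, and module names are assumptions for this toy 2-layer model, and the script is meant to be launched with `torchrun` on 4 GPUs:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Toy 2-layer model matching the diagram above.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# 4 GPUs arranged as a 2x2 mesh: one dimension for FSDP (data parallel),
# one for tensor parallelism.
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

# TP: split each Linear's weight across the "tp" mesh dimension.
parallelize_module(
    model,
    mesh["tp"],
    {"0": ColwiseParallel(), "1": RowwiseParallel()},
)

# FSDP: shard the (already TP-split) parameters across the "dp" dimension.
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
```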