How many GPUs do I need to train an LLM?
This is a complicated question in general, but if you are using FSDP with
FULL_SHARD, activation checkpointing, and the DecoupledLionW optimizer, a good rule of thumb is:
Your total cluster memory in GB should be larger than 12 * N, where N is the number of parameters in billions.
E.g., to train a GPT-13B model, which has ~13 billion params,
you should have at least 12 * 13 = 156 GB of total memory across your GPUs.
You can accomplish this with 4xA100-40GB (160 GB), 2xA100-80GB (160 GB), etc.
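As a quick illustration, here is a small Python sketch of this rule of thumb; the helper names are hypothetical and just encode the arithmetic above:

```python
def min_cluster_memory_gb(n_params_billion: float) -> float:
    """Approximate total GPU memory (GB) needed under the rule of thumb:
    FSDP FULL_SHARD + activation checkpointing + DecoupledLionW."""
    return 12 * n_params_billion

def enough_memory(n_params_billion: float, num_gpus: int, gb_per_gpu: int) -> bool:
    """Check whether a cluster's total memory meets the rule of thumb."""
    return num_gpus * gb_per_gpu >= min_cluster_memory_gb(n_params_billion)

print(min_cluster_memory_gb(13))    # 156.0 GB for a ~13B-param model
print(enough_memory(13, 4, 40))     # True  (4xA100-40GB = 160 GB)
print(enough_memory(13, 2, 80))     # True  (2xA100-80GB = 160 GB)
print(enough_memory(13, 1, 80))     # False (80 GB < 156 GB)
```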
If you run into OOM errors when using small device counts,
reduce device_train_microbatch_size until training succeeds.
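To see why a smaller microbatch helps, here is a minimal sketch of gradient accumulation over microbatches; this is not the actual trainer code, but it shows the trade-off: peak activation memory scales with the microbatch size rather than the full per-device batch size, at the cost of extra forward/backward passes.

```python
import torch

def train_step(model, optimizer, batch_x, batch_y, device_train_microbatch_size):
    """Split the per-device batch into microbatches and accumulate gradients."""
    optimizer.zero_grad()
    num_micro = batch_x.shape[0] // device_train_microbatch_size
    for x, y in zip(batch_x.chunk(num_micro), batch_y.chunk(num_micro)):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        (loss / num_micro).backward()  # scale so the accumulated gradient matches the full batch
    optimizer.step()
```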
Keep in mind: even though training will work in these minimalist settings,
you will get much better throughput_per_device
if you use a larger cluster or devices with higher memory capacity,
because this will enable you to use larger microbatch sizes.