9/28/2024

How many GPUs do I need to train an LLM?

This is a complicated question in general, but if you are using FSDP with 
FULL_SHARD, activation checkpointing, and the DecoupledLionW optimizer, then a good rule of thumb is:

Your total cluster GPU memory in GB should be larger than 12 * N, where N is the number of model parameters in billions.

For example, to train a GPT-13B model, which has ~13 billion params, 
you should have at least 12 * 13 = 156 GB of total memory across your GPUs. 
You can accomplish this with 4xA100-40GB or 2xA100-80GB (160 GB in either case), etc.
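
As a quick back-of-the-envelope check, the arithmetic above fits in a few lines of plain Python. This is only a sketch of the rule of thumb; the function names (min_cluster_memory_gb, min_gpu_count) are illustrative and not from any library.

import math

def min_cluster_memory_gb(params_in_billions: float) -> float:
    """Total GPU memory (GB) the rule of thumb asks for: 12 * N."""
    return 12 * params_in_billions

def min_gpu_count(params_in_billions: float, gpu_memory_gb: float) -> int:
    """Smallest number of identical GPUs whose combined memory meets the rule."""
    return math.ceil(min_cluster_memory_gb(params_in_billions) / gpu_memory_gb)

if __name__ == "__main__":
    n = 13  # ~13 billion params, e.g. a GPT-13B model
    print(min_cluster_memory_gb(n))   # 156.0 GB total
    print(min_gpu_count(n, 40))       # 4  -> 4xA100-40GB
    print(min_gpu_count(n, 80))       # 2  -> 2xA100-80GB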

If you run into out-of-memory (OOM) errors at small device counts, 
reduce device_train_microbatch_size until training succeeds.
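
One way to think about that search is sketched below in plain Python: start from a candidate microbatch size and halve it whenever an attempt hits an OOM. Here run_training is a hypothetical placeholder for however you launch a training attempt (it is not part of any library mentioned in this post); in a real multi-GPU job you would usually just relaunch with the smaller value rather than catching the error in-process.

import torch

def find_working_microbatch_size(run_training, start: int = 16) -> int:
    """Halve the microbatch size until a training attempt no longer OOMs.

    run_training is a hypothetical callable that launches one training attempt
    with the given device_train_microbatch_size and raises on OOM.
    """
    microbatch_size = start
    while microbatch_size >= 1:
        try:
            run_training(device_train_microbatch_size=microbatch_size)
            return microbatch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release cached blocks before the next attempt
            microbatch_size //= 2      # halve and retry
    raise RuntimeError("Still OOM at device_train_microbatch_size=1; you need more total GPU memory.")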

Keep in mind: even though training will work in these minimal settings, 
you will get much better throughput_per_device 
if you use a larger cluster or devices with more memory, 
because that enables you to use larger microbatch sizes.
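
To make the throughput point concrete, here is a back-of-the-envelope sketch in plain Python (no library APIs): for a fixed global batch size, fewer devices and smaller microbatches force more gradient-accumulation steps per optimizer step, i.e. more sequential forward/backward passes on each device.

import math

def grad_accum_steps(global_train_batch_size: int,
                     n_devices: int,
                     device_train_microbatch_size: int) -> int:
    """Sequential forward/backward passes each device runs per optimizer step."""
    per_device_batch = global_train_batch_size // n_devices
    return math.ceil(per_device_batch / device_train_microbatch_size)

# Same global batch of 512 sequences:
print(grad_accum_steps(512, n_devices=4,  device_train_microbatch_size=2))   # 64 sequential passes
print(grad_accum_steps(512, n_devices=16, device_train_microbatch_size=8))   # 4 sequential passes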
