AI in Multiple GPUs: ZeRO & FSDP
https://towardsdatascience.com/ai-in-multiple-gpus-zero-fsdp/ (towardsdatascience.com)

Training large AI models across multiple GPUs creates a significant memory bottleneck because traditional data parallelism replicates the entire model, its gradients, and its optimizer states on every device. The Zero Redundancy Optimizer (ZeRO) is a memory optimization strategy that solves this by partitioning those components across all available GPUs. ZeRO comes in three progressive stages: partitioning the optimizer states (ZeRO-1), then also the gradients (ZeRO-2), and finally the model parameters themselves (ZeRO-3). In its most advanced stage, ZeRO-3, each GPU holds only a slice of the model and dynamically gathers the full parameters for each layer just-in-time for computation, discarding them immediately afterward. PyTorch's Fully Sharded Data Parallel (FSDP) is a native implementation of this ZeRO-3-style sharding. The technique dramatically reduces the per-GPU memory footprint, making it possible to train models that would otherwise be too large to fit.
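The ZeRO-3 gather-compute-discard cycle described above can be sketched in plain Python. This is an illustration only, not a real distributed API: the functions `shard`, `all_gather`, and `forward` are hypothetical names, and Python lists stand in for per-device GPU memory.

```python
# Toy sketch of ZeRO-3-style parameter partitioning (hypothetical helper
# names; lists stand in for per-device memory). Each "device" stores only
# its shard of a layer's weights; the full weight vector is reassembled
# just-in-time for the layer's computation and discarded right after.

def shard(params, world_size):
    """Split a flat parameter list into world_size contiguous shards."""
    size = (len(params) + world_size - 1) // world_size
    return [params[i * size:(i + 1) * size] for i in range(world_size)]

def all_gather(shards):
    """Reassemble the full parameter list from every device's shard."""
    full = []
    for s in shards:
        full.extend(s)
    return full

def forward(x, shards):
    # Just-in-time gather: the full weight vector exists only inside
    # this call, mirroring ZeRO-3's per-layer materialization.
    w = all_gather(shards)
    y = sum(wi * x for wi in w)   # toy "layer": weighted sum of the input
    del w                         # discard the full copy, freeing memory
    return y

world_size = 4
layer_weights = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
shards = shard(layer_weights, world_size)

# Each device persistently stores only 1/world_size of the parameters...
assert all(len(s) == len(layer_weights) // world_size for s in shards)
# ...yet the layer still computes with the full weight vector.
print(forward(1.0, shards))  # → 18.0
```

The same sharding idea applies to ZeRO-1 and ZeRO-2, except that only the optimizer states (and gradients) are partitioned while every device keeps a full parameter copy.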
0 points•by ogg•1 hour ago