Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels

https://towardsdatascience.com/cutting-llm-memory-by-84-a-deep-dive-into-fused-kernels/ (towardsdatascience.com)
Training large language models often hits a memory wall at the final cross-entropy loss calculation, a problem known as the "logit bottleneck." It occurs when the model's hidden states are projected into the full vocabulary space, producing an intermediate logit tensor that can consume tens of gigabytes of VRAM and trigger out-of-memory errors. A powerful solution is a custom "fused kernel" that combines the linear projection and the cross-entropy loss into a single, highly efficient operation. By processing the computation in small, tiled chunks, the technique never materializes the full logit tensor, reducing peak VRAM usage by as much as 84%.
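The article implements this as a custom GPU kernel; as a rough illustration of the underlying chunking idea, here is a minimal pure-PyTorch sketch. All names and the default chunk size are illustrative, and activation checkpointing stands in for the fused kernel's recompute-in-backward trick:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def _chunk_loss(hidden_chunk, weight, target_chunk):
    # Project one chunk of hidden states into vocabulary space and
    # compute its summed cross-entropy; only a [chunk, vocab] logit
    # slice exists at any one time.
    logits = hidden_chunk @ weight.T
    return F.cross_entropy(logits, target_chunk, reduction="sum")


def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """Mean cross-entropy of (hidden @ weight.T) against targets, computed
    chunk by chunk so the full [num_tokens, vocab] logit tensor is never
    materialized.

    hidden:  [num_tokens, hidden_dim] final hidden states
    weight:  [vocab_size, hidden_dim] LM-head weight matrix
    targets: [num_tokens] token indices
    """
    num_tokens = hidden.shape[0]
    total = hidden.new_zeros(())
    for start in range(0, num_tokens, chunk_size):
        end = min(start + chunk_size, num_tokens)
        # Checkpointing discards each chunk's logits after the forward
        # pass and recomputes them during backward, so training-time peak
        # memory stays near chunk_size * vocab rather than
        # num_tokens * vocab.
        total = total + checkpoint(
            _chunk_loss,
            hidden[start:end],
            weight,
            targets[start:end],
            use_reentrant=False,
        )
    return total / num_tokens
```

As a back-of-the-envelope example (numbers assumed, not from the article): with 16,384 tokens and a 128K-entry vocabulary in bf16, the full logit tensor would occupy roughly 4 GB, while chunks of 1,024 rows cap the live logit slice at around 260 MB.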
0 points by ogg 8 hours ago

Comments (0)
