Why We’ve Been Optimizing the Wrong Thing in LLMs for Years
https://towardsdatascience.com/why-weve-been-optimizing-the-wrong-thing-in-llms-for-years/ (towardsdatascience.com)

Standard Large Language Models are trained with Next-Token Prediction (NTP), an inefficient objective that dedicates equal compute to important and filler words alike. A newer approach, Multi-Token Prediction (MTP), explicitly trains the model to predict multiple future tokens at each step, exploiting the model's latent ability to plan ahead. The MTP architecture uses a shared trunk to process the context and multiple independent heads that predict several future tokens in parallel. This method yields stronger performance on reasoning benchmarks and achieves up to a three-fold increase in inference speed. The performance gains from MTP are most significant in larger models, demonstrating a new scaling law for model training.
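
The shared-trunk-plus-parallel-heads design described above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed names (MTPModel, mtp_loss, n_future), not the article's or any paper's actual code: the trunk is assumed to map token ids to hidden states, and each head k predicts the token k+1 positions ahead from the same hidden state.

```python
# Minimal sketch of Multi-Token Prediction: one shared trunk, n_future independent heads.
# All module and argument names here are illustrative assumptions, not a reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPModel(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk  # shared transformer trunk: (batch, seq) token ids -> (batch, seq, d_model)
        # One independent linear output head per future offset.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )
        self.n_future = n_future

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.trunk(input_ids)  # (batch, seq, d_model), computed once
        # Head k predicts the token at position t + k + 1 from the hidden state at position t.
        logits = torch.stack([head(h) for head in self.heads], dim=2)
        return logits  # (batch, seq, n_future, vocab_size)

def mtp_loss(logits: torch.Tensor, targets: torch.Tensor, n_future: int) -> torch.Tensor:
    """Sum of cross-entropy losses over the n_future prediction offsets."""
    _, seq, _, vocab = logits.shape
    loss = logits.new_zeros(())
    for k in range(n_future):
        valid = seq - (k + 1)  # positions that have a target k+1 steps ahead
        if valid <= 0:
            continue
        pred = logits[:, :valid, k, :].reshape(-1, vocab)
        tgt = targets[:, k + 1 : k + 1 + valid].reshape(-1)
        loss = loss + F.cross_entropy(pred, tgt)
    return loss
```

The point of the design is that the extra heads are cheap: the expensive trunk runs once per step, and the parallel heads add only a few linear layers, which is what makes the multi-token predictions nearly free at training time and usable for faster (e.g. speculative-style) decoding at inference.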
0 points•by ogg•8 days ago