Why MoE models get more from speculative decoding
https://cohere.com/blog/mixture-of-experts-models-get-more-from-speculative-decoding

Speculative decoding can significantly speed up inference for large Mixture-of-Experts (MoE) models. A novel approach uses a smaller, distilled MoE model as the drafter, which is more effective than a standard dense model because its architecture aligns better with the larger target model. This alignment is further improved by a new technique called "agreeable routing," which trains the drafter to predict and use the same experts as the target model. The result is higher-quality draft tokens, leading to greater acceptance rates and a substantial increase in overall inference speed.
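For readers unfamiliar with the draft-and-verify loop that makes acceptance rates matter, here is a minimal greedy sketch. It is not the Cohere method or a rejection-sampling variant: `draft_model`, `target_model`, and `k` are hypothetical placeholders, and the verification pass is written as a loop rather than the single batched forward a real implementation would use. Agreeable routing would affect how often the drafter's tokens survive the verification step.

```python
# Minimal sketch of greedy speculative decoding. `draft_model` and
# `target_model` are hypothetical callables mapping a token sequence to
# next-token logits; they stand in for the drafter and target described above.
import numpy as np

def greedy_next(logits):
    """Pick the highest-probability next token (greedy decoding)."""
    return int(np.argmax(logits))

def speculative_step(target_model, draft_model, tokens, k=4):
    """Draft k tokens with the small model, then verify them with the target.

    The accepted prefix length drives the speedup: the target can score all
    k draft positions in one batched call instead of k sequential calls.
    """
    # 1) Drafter proposes k tokens autoregressively (cheap).
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        t = greedy_next(draft_model(draft))
        proposed.append(t)
        draft.append(t)

    # 2) Target verifies the proposals (simulated sequentially here;
    #    a real implementation batches these positions in one forward pass).
    accepted = []
    ctx = list(tokens)
    for t in proposed:
        target_choice = greedy_next(target_model(ctx))
        if target_choice != t:
            # First disagreement: keep the target's token and stop.
            accepted.append(target_choice)
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All k drafts accepted; append one bonus token from the target.
        accepted.append(greedy_next(target_model(ctx)))
    return accepted
```

The better the drafter mimics the target (including, for MoE drafters, routing to the same experts), the longer the accepted prefix per step and the fewer expensive target calls per generated token.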
0 points•by chrisf•2 hours ago