Differential Transformer V2

https://huggingface.co/blog/microsoft/diff-attn-v2
Differential Transformer V2 (DIFF V2) is a revised architecture aimed at better inference efficiency, training stability, and design simplicity. It achieves faster decoding and compatibility with standard kernels such as FlashAttention by increasing the number of query heads without adding key-value heads. Training stability improves because V2 removes the per-head RMSNorm, which previously produced large gradients and instability during large-scale pretraining. Preliminary experiments indicate that DIFF V2 reaches lower language-modeling loss and fewer training instabilities than a standard Transformer baseline, especially at large learning rates.
0 points by hdt 8 days ago
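
For context, the core operator behind the DIFF family computes attention as the difference of two softmax attention maps, so noise common to both maps cancels out. Below is a minimal PyTorch sketch of that operator under the head layout the summary describes (extra query projections sharing one set of key/value heads); the helper name diff_attention, treating λ as a plain scalar, and its default value are illustrative assumptions, not the official V2 implementation.

```python
# Minimal sketch of differential attention (hypothetical helper, not the
# official Microsoft implementation). Assumes two query projections per
# head that share one set of key/value heads, and a scalar lambda.
import math
import torch
import torch.nn.functional as F

def diff_attention(q1, q2, k, v, lam=0.8):
    # q1, q2, k, v: (batch, heads, seq_len, head_dim)
    scale = 1.0 / math.sqrt(q1.size(-1))
    a1 = F.softmax(q1 @ k.transpose(-2, -1) * scale, dim=-1)
    a2 = F.softmax(q2 @ k.transpose(-2, -1) * scale, dim=-1)
    # Subtracting the second map cancels attention "noise" shared by both.
    return (a1 - lam * a2) @ v

# Tiny smoke test with random tensors.
b, h, n, d = 1, 2, 16, 32
q1, q2, k, v = (torch.randn(b, h, n, d) for _ in range(4))
out = diff_attention(q1, q2, k, v)
print(out.shape)  # torch.Size([1, 2, 16, 32])
```

In the original DIFF Transformer, λ is a learned, reparameterized quantity rather than a fixed scalar, and per-head normalization followed the subtraction; the blog post's V2 changes drop that per-head RMSNorm and reshape the heads so the operation maps onto standard attention kernels.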
