mmBERT: ModernBERT goes Multilingual
https://huggingface.co/blog/mmbert

mmBERT is a state-of-the-art, massively multilingual encoder model trained on over 3 trillion tokens across more than 1800 languages. It builds on the ModernBERT architecture and introduces novel training techniques that improve performance and efficiency, especially for low-resource languages. Training follows a three-phase approach that progressively adds languages, anneals the data-sampling distribution, and reduces the masking rate over time. This strategy lets the model learn from high-resource languages first before incorporating low-resource ones and improving performance on them.
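To make the annealing idea concrete, here is a minimal sketch of temperature-scaled language sampling, where per-language probabilities are proportional to token counts raised to a temperature exponent. The specific phase names, temperatures, masking rates, and token counts below are illustrative assumptions, not values from the blog post; the point is that lowering the temperature flattens the distribution and boosts low-resource languages in later phases.

```python
def sampling_probs(token_counts, tau):
    """Temperature-scaled sampling: p_i proportional to c_i ** tau.

    tau = 1.0 samples proportionally to corpus size; lower tau
    flattens the distribution, upweighting low-resource languages.
    """
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical per-language token counts (not mmBERT's actual data mix).
counts = {"en": 1_000_000, "sw": 10_000, "yo": 1_000}

# Illustrative three-phase schedule: each phase lowers the sampling
# temperature (flattening the language distribution) and the masking rate.
phases = [
    ("phase-1", 0.7, 0.30),
    ("phase-2", 0.5, 0.15),
    ("phase-3", 0.3, 0.05),
]

for name, tau, mask_rate in phases:
    probs = sampling_probs(counts, tau)
    print(name, f"mask={mask_rate:.2f}",
          {lang: round(p, 4) for lang, p in probs.items()})
```

Running this shows the low-resource languages' sampling share growing across phases while the masking rate shrinks, mirroring the schedule the post describes.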
0 points•by ogg•1 month ago