mmBERT: ModernBERT goes Multilingual
https://huggingface.co/blog/mmbert

mmBERT is a state-of-the-art, massively multilingual encoder model trained on over 3 trillion tokens across more than 1800 languages. It builds on the ModernBERT architecture and introduces novel training techniques that improve performance and efficiency, especially for low-resource languages. Training follows a three-phase approach that progressively adds languages, anneals the data-sampling distribution, and reduces the masking rate over time. This strategy lets the model learn from high-resource languages first before incorporating low-resource ones and improving performance on them.
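To make the annealing idea concrete, here is a minimal sketch of temperature-scaled language sampling, where per-language probabilities are proportional to token counts raised to a temperature exponent. The specific phase names, temperatures, masking rates, and token counts below are illustrative assumptions, not values from the blog post; the point is that lowering the temperature flattens the distribution and boosts low-resource languages in later phases.

```python
def sampling_probs(token_counts, tau):
    """Temperature-scaled sampling: p_i proportional to c_i ** tau.

    tau = 1.0 samples proportionally to corpus size; lower tau
    flattens the distribution, upweighting low-resource languages.
    """
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical per-language token counts (not mmBERT's actual data mix).
counts = {"en": 1_000_000, "sw": 10_000, "yo": 1_000}

# Illustrative three-phase schedule: each phase lowers the sampling
# temperature (flattening the language distribution) and the masking rate.
phases = [
    ("phase-1", 0.7, 0.30),
    ("phase-2", 0.5, 0.15),
    ("phase-3", 0.3, 0.05),
]

for name, tau, mask_rate in phases:
    probs = sampling_probs(counts, tau)
    print(name, f"mask={mask_rate:.2f}",
          {lang: round(p, 4) for lang, p in probs.items()})
```

Running this shows the low-resource languages' sampling share growing across phases while the masking rate shrinks, mirroring the schedule the post describes.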
0 points•by ogg•1 month ago