BamiBERT: A New BERT-based Language Model for Vietnamese
In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnam
Authors (as listed on arXiv): Dat Quoc Nguyen, Thinh Pham, Chi Tran, Linh The Nguyen
