paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 1h ago

BamiBERT: A New BERT-based Language Model for Vietnamese

In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnam

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Dat Quoc Nguyen → BamiBERT: A New BERT-based Language Model for Vietnamese
Linked via arXiv author Thinh Pham → BamiBERT: A New BERT-based Language Model for Vietnamese
Linked via arXiv author Chi Tran → BamiBERT: A New BERT-based Language Model for Vietnamese
Linked via arXiv author Linh The Nguyen → BamiBERT: A New BERT-based Language Model for Vietnamese

Implements

repo: thu-pacman/chitu · repo: chrisliu298/awesome-llm-unlearning

authored (incoming)

person: Dat Quoc Nguyen · person: Thinh Pham · person: Chi Tran · person: Linh The Nguyen

Related across the graph

person: Dat Quoc Nguyen · repo: chrisliu298/awesome-llm-unlearning · repo: thu-pacman/chitu · person: Thinh Pham · person: Linh The Nguyen · person: Chi Tran

Topics

cs.CL