paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 1h ago

BamiBERT: A New BERT-based Language Model for Vietnamese

In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT, the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnam

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author: Dat Quoc Nguyen

    BamiBERT: A New BERT-based Language Model for Vietnamese

  • Linked via arXiv author: Thinh Pham

    BamiBERT: A New BERT-based Language Model for Vietnamese

  • Linked via arXiv author: Chi Tran

    BamiBERT: A New BERT-based Language Model for Vietnamese

  • Linked via arXiv author: Linh The Nguyen

    BamiBERT: A New BERT-based Language Model for Vietnamese
