Read original ↗
newsReddit r/MachineLearningTrust 52 · CommunityPublished 3h agoLive · 1m ago

H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch [P]

Hi everyone, I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch. Instead of relying on high-level training frameworks, I implemented the core components myself attention, MoE routing, normalization, and the training loop. Features 249M-parameter Transformer Grouped Query Attention (GQA) Sparse Mixture-of-Experts (8 experts, Top-2 routi

Covers

Covers (incoming)

Related across the graph