Reddit r/MachineLearning · Community · Published 3h ago

H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch [P]

Hi everyone, I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch. Instead of relying on high-level training frameworks, I implemented the core components myself: attention, MoE routing, normalization, and the training loop.

Features

- 249M-parameter Transformer
- Grouped Query Attention (GQA)
- Sparse Mixture-of-Experts (8 experts, Top-2 routi
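For readers unfamiliar with sparse MoE, here is a minimal sketch of what top-2 routing over 8 experts can look like for a single token. It is not taken from the H64LM code: it uses plain Python instead of PyTorch so it runs anywhere, the names `top2_route` and `moe_forward` are hypothetical, the toy experts are placeholders, and renormalizing the two selected gate weights so they sum to 1 is an assumption (some implementations skip that step).

```python
import math

NUM_EXPERTS = 8  # from the post
TOP_K = 2        # from the post


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Pick the TOP_K experts for one token.

    Returns [(expert_index, weight), ...]. Weights are renormalized over
    the selected experts (an assumption, not confirmed by the post).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts (hypothetical): expert i scales its input by i + 1.
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(NUM_EXPERTS)]

# Router scores for one token; experts 1 and 4 score highest.
logits = [0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
routes = top2_route(logits)
output = moe_forward([1.0, -1.0], logits, experts)
```

The point of the sparsity is visible in `moe_forward`: only 2 of the 8 expert functions are called per token, so compute per token stays close to that of a much smaller dense model while total parameter count grows with the number of experts.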

Covers

repo: luziyao1995/vllm · repo: alibaba/rtp-llm · repo: peremartra/Rearchitecting-LLMs · repo: quant-kit · paper: Understanding Large Language Models

Covers (incoming)

repo: lucidrains/torch-einops-utils
