news · Reddit r/MachineLearning · Trust 52 · Community · Published 21h ago · Live · 11h ago

Looking for feedback on a small test SLM I built completely from scratch [P]

Architecture:
- Parameter count: 216.5M
- Layers: 10
- Attention / no attention: Attention. 12-head multi-head self-attention, RoPE positional encoding, SDPA. Decoder-only, pre-norm, RMSNorm + SwiGLU, tied input/output embeddings (hidden 1032, head_dim 86, FFN 4416).
- Tokenizer: Custom 36k SentencePiece unigram, case-preserving, byte-fallback, with atomic chat/role + memory special tokens (`<|user|>`,
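As a sanity check on the numbers above, here is a rough back-of-the-envelope parameter count. It is only a sketch under assumptions that are not stated in the post: a vocabulary of exactly 36,000 tokens (the post only says "36k"), no bias terms in any linear layer, full-width Q/K/V/O projections (12 heads × head_dim 86 = 1032), a three-matrix SwiGLU FFN, two RMSNorm weight vectors per layer plus one final norm, and no learned positional parameters since RoPE has none. The function and argument names are made up for illustration.

```python
def estimate_params(vocab=36_000, hidden=1032, ffn=4416, layers=10):
    """Rough parameter estimate for a decoder-only, pre-norm transformer.

    Assumptions (not confirmed by the post): exact vocab of 36,000,
    bias-free linears, tied input/output embeddings counted once.
    """
    embedding = vocab * hidden        # shared with the output head (tied)
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    swiglu = 3 * hidden * ffn         # gate, up and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attention + swiglu + norms
    final_norm = hidden               # RMSNorm before the output head
    return embedding + layers * per_layer + final_norm


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under those assumptions the estimate lands within rounding of the stated 216.5M, which suggests the listed dimensions are internally consistent. A different exact vocabulary size or any bias terms would shift the total slightly.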
