news · Reddit r/MachineLearning · Trust 52 · Community · Published 21h ago · Live · 11h ago

Looking for feedback on a small test SLM I built completely from scratch [P]

Architecture:

- Parameter count: 216.5M
- Layers: 10
- Attention / no attention: Attention. 12-head multi-head self-attention, RoPE positional encoding, SDPA. Decoder-only, pre-norm, RMSNorm + SwiGLU, tied input/output embeddings (hidden 1032, head_dim 86, FFN 4416).
- Tokenizer: Custom 36k SentencePiece unigram, case-preserving, byte-fallback, with atomic chat/role + memory special tokens (`<|user|>`,
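As a rough sanity check on the numbers above, the stated dimensions can be plugged into a back-of-the-envelope parameter count. This is only a sketch under assumptions the post does not confirm: the vocabulary is taken as exactly 36,000 entries ("36k"), linear layers are assumed bias-free, attention is assumed to use four square projections (Q, K, V, output), SwiGLU is assumed to use three matrices (gate, up, down), and there is assumed to be one final RMSNorm. The function name `count_params` is hypothetical.

```python
def count_params(vocab=36_000, hidden=1032, layers=10,
                 heads=12, head_dim=86, ffn=4416):
    """Estimate parameters of a decoder-only transformer with tied embeddings.

    Assumptions (not confirmed by the post): no biases, 4 square attention
    projections, 3-matrix SwiGLU, 2 RMSNorms per layer plus 1 final RMSNorm.
    """
    # Heads must tile the hidden dimension exactly: 12 * 86 = 1032.
    assert heads * head_dim == hidden

    embed = vocab * hidden        # counted once because input/output are tied
    attn = 4 * hidden * hidden    # Q, K, V and output projections
    mlp = 3 * hidden * ffn        # gate, up and down projections (SwiGLU)
    norms = 2 * hidden            # pre-attention and pre-MLP RMSNorm weights
    per_layer = attn + mlp + norms
    final_norm = hidden

    return embed + layers * per_layer + final_norm


total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under those assumptions the estimate rounds to the 216.5M figure quoted in the post, which suggests the listed dimensions are internally consistent. RoPE adds no learned parameters, so it does not appear in the count.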

Covers

- paper: Attend, Transform, or Silence: Operator-Level Visual Skipping for Efficient Multimodal LLM Inference
- repo: manojmallick/sigmap
- paper: Sparse attention at million-token context
- paper: Understanding Large Language Models
- paper: NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

Covers (incoming)

- paper: Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
- paper: On the Role of Directionality in Structural Generalization
