news · Reddit r/MachineLearning · Trust 52 · Community · Published 21h ago · Live · 11h ago
Looking for feedback on a small test SLM I built completely from scratch [P]
Architecture:
- Parameter count: 216.5M
- Layers: 10
- Attention / no attention: Attention. 12-head multi-head self-attention, RoPE positional encoding, SDPA. Decoder-only, pre-norm, RMSNorm + SwiGLU, tied input/output embeddings (hidden 1032, head_dim 86, FFN 4416).
- Tokenizer: Custom 36k SentencePiece unigram, case-preserving, byte-fallback, with atomic chat/role + memory special tokens (`<|user|>`,
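
As a sanity check on the numbers above, here is a rough parameter tally. It is only a sketch under assumptions the post does not state: the vocabulary is taken as exactly 36,000 entries, the linear layers have no bias terms, attention is full multi-head (no grouped-query sharing), and each block has two RMSNorm weight vectors plus one final RMSNorm. The `Config` and `count_params` names are hypothetical and not from the author's code.

```python
from dataclasses import dataclass


@dataclass
class Config:
    vocab_size: int = 36_000  # assumption: "36k" read as exactly 36,000
    hidden: int = 1032
    n_layers: int = 10
    n_heads: int = 12
    head_dim: int = 86
    ffn: int = 4416


def count_params(cfg: Config) -> dict:
    """Tally parameters for a decoder-only model under the stated assumptions."""
    # Tied input/output embeddings: a single matrix, counted once.
    embed = cfg.vocab_size * cfg.hidden

    # Q, K, V and output projections (assumed bias-free, full multi-head).
    attn_width = cfg.n_heads * cfg.head_dim
    attn = 4 * cfg.hidden * attn_width

    # SwiGLU feed-forward uses three matrices: gate, up and down.
    mlp = 3 * cfg.hidden * cfg.ffn

    # Pre-norm block: two RMSNorm weight vectors per layer (assumed).
    norms = 2 * cfg.hidden

    per_layer = attn + mlp + norms
    final_norm = cfg.hidden  # assumed final RMSNorm before the output head
    total = embed + cfg.n_layers * per_layer + final_norm

    return {
        "embed": embed,
        "per_layer": per_layer,
        "final_norm": final_norm,
        "total": total,
    }


if __name__ == "__main__":
    counts = count_params(Config())
    for name, value in counts.items():
        print(f"{name:>10}: {value:,}")
    print(f"total is about {counts['total'] / 1e6:.1f}M")
```

Under these assumptions the total rounds to 216.5M, which is consistent with the stated parameter count. The check also confirms that 12 heads of dimension 86 give the hidden size of 1032, and that head_dim is even, which RoPE's pairwise rotation requires.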
