paper · arXiv · Trust 82 · Primary · Published 5d ago · Live · 3d ago
Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models
Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using 11 models spanning Qwen 2.5, Gemma 2, and Llama 3.2, we find a systematic size-dependent shift in representational depth: in both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones. This suggests that scale changes not only the strength of evaluation-aware
