Paper · arXiv · Trust 82 · Primary · Published 5d ago · Live · 3d ago

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using 11 models spanning Qwen 2.5, Gemma 2, and Llama 3.2, we find a systematic size-dependent shift in representational depth: in both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones. This suggests that scale changes not only the strength of evaluation-aware
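To make "the layer at which evaluation-awareness is most linearly recoverable" concrete, the following is a minimal sketch of a layer-wise linear probe. It is not the authors' pipeline: the function names (`probe_accuracy`, `best_probe_layer`, `synthetic_activations`), the ridge least-squares probe, the 70/30 split, and the relative-depth normalisation are all assumptions made for illustration, and the activations are synthetic stand-ins for real hidden states.

```python
import numpy as np


def probe_accuracy(X_train, y_train, X_test, y_test, ridge=1e-3):
    """Fit a ridge-regularised least-squares linear probe; return test accuracy."""
    # Append a constant column so the probe can learn a bias term.
    Xtr = np.hstack([X_train, np.ones((len(X_train), 1))])
    Xte = np.hstack([X_test, np.ones((len(X_test), 1))])
    targets = 2.0 * y_train - 1.0  # map labels {0, 1} to {-1, +1}
    A = Xtr.T @ Xtr + ridge * np.eye(Xtr.shape[1])
    w = np.linalg.solve(A, Xtr.T @ targets)
    preds = (Xte @ w > 0).astype(int)
    return float((preds == y_test).mean())


def best_probe_layer(acts_by_layer, labels, train_frac=0.7, seed=0):
    """Probe every layer on the same split; return (best index, relative depth, accuracies)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    tr, te = order[:cut], order[cut:]
    accs = [
        probe_accuracy(a[tr], labels[tr], a[te], labels[te])
        for a in acts_by_layer
    ]
    best = int(np.argmax(accs))
    # Relative depth in [0, 1]: 0 is the first layer, 1 is the last (assumed convention).
    rel_depth = best / max(len(accs) - 1, 1)
    return best, rel_depth, accs


def synthetic_activations(n=400, dim=16, n_layers=6, signal_layer=4, seed=0):
    """Hypothetical data: Gaussian noise everywhere, a label direction in one layer only."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)  # 1 = "evaluation" prompt, 0 = "deployment" prompt
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    layers = []
    for layer in range(n_layers):
        strength = 3.0 if layer == signal_layer else 0.0
        noise = rng.normal(size=(n, dim))
        layers.append(noise + strength * np.outer(2 * labels - 1, direction))
    return layers, labels


if __name__ == "__main__":
    layers, labels = synthetic_activations()
    best, rel_depth, accs = best_probe_layer(layers, labels)
    print("best layer:", best, "relative depth:", rel_depth)
```

With real models, `acts_by_layer` would hold one activation matrix per layer (for example, a pooled hidden state per prompt). Comparing relative depth rather than the raw layer index is one way to compare models with different layer counts, under the assumption stated above.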
