Open Source · Reddit r/LocalLLaMA · Community · Published 6/28/2026

I had 55 LLMs blind-grade each other (22k judgments, all open). Every model family with enough data is biased toward its own siblings. Qwen judges favor Qwen by ~0.9 points. Mistral penalizes its own by ~1.0.



Why it matters

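This Reddit r/LocalLLaMA post falls under the Open Source branch of the AI ecosystem. It reports that LLM judges score models from their own developer family differently from other models, which matters for anyone who relies on LLM-as-judge evaluations.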

Technical breakdown

I have been running an open evaluation setup where N models answer the same prompt, then blind-grade each other in an N x N matrix with self-judgments excluded. No single privileged judge. So far: 286 evaluations, 198 hand-written questions, 22,254 valid judgments across 55 models from 11 developer families. Code, dataset, and all prompts are MIT licensed. The finding I did not expect: same-family
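To make the setup concrete, here is a minimal Python sketch of one way a same-family gap could be measured from a cross-grading matrix. It is not the post's actual code or metric. The function name `sibling_gap`, the `(judge, target, score)` tuple layout, the model and family names, and the toy scores are all assumptions for illustration. For each model it compares the mean score given by sibling judges (same family, self excluded) with the mean score given by judges from other families, then averages that difference per family.

```python
from collections import defaultdict
from statistics import mean


def sibling_gap(judgments, family_of):
    """Per family: mean over its models of (sibling-judge mean - outsider-judge mean).

    judgments: iterable of (judge, target, score) tuples (hypothetical layout).
    family_of: dict mapping model name -> developer family name.
    """
    sibling = defaultdict(list)   # target -> scores from same-family judges
    outsider = defaultdict(list)  # target -> scores from other-family judges
    for judge, target, score in judgments:
        if judge == target:
            continue  # self-judgments are excluded, as described in the post
        same_family = family_of[judge] == family_of[target]
        (sibling if same_family else outsider)[target].append(score)

    gaps = defaultdict(list)
    for target in sibling:
        if target in outsider:  # need both kinds of judge to compare
            gaps[family_of[target]].append(
                mean(sibling[target]) - mean(outsider[target])
            )
    return {fam: mean(vals) for fam, vals in gaps.items()}


# Toy data (invented): family A rates siblings one point higher, family B does not.
family_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
judgments = [
    ("a1", "a1", 10),                   # self-judgment, ignored
    ("a1", "a2", 8), ("a2", "a1", 8),   # A judging A
    ("a1", "b1", 7), ("a1", "b2", 7), ("a2", "b1", 7), ("a2", "b2", 7),
    ("b1", "a1", 7), ("b1", "a2", 7), ("b2", "a1", 7), ("b2", "a2", 7),
    ("b1", "b2", 7), ("b2", "b1", 7),   # B judging B
]
gaps = sibling_gap(judgments, family_of)
print(gaps)
```

A positive gap means siblings score a family's models higher than outsiders do, and a negative gap means they score them lower. This simple difference does not correct for judges that are generous or harsh across the board, so a real analysis would likely normalize per judge first.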

Business impact

Watch for product launches, funding moves, or policy shifts tied to this headline.