EnrichedOpen SourceReddit r/LocalLLaMACommunityLive · 5d agoPublished 6/28/2026

I had 55 LLMs blind-grade each other (22k judgments, all open). Every model family with enough data is biased toward its own siblings. Qwen judges favor Qwen by ~0.9 points. Mistral penalizes its own by ~1.0.

I have been running an open evaluation setup where N models answer the same prompt, then blind-grade each other in an N x N matrix with self-judgments excluded. No single privileged judge. So far: 286 evaluations, 198 hand-written questions, 22,254 valid judgments across 55 model

View in news graph →

Open Source Reddit r/LocalLLaMA

Why it matters

This story from Reddit r/LocalLLaMA is relevant to the Open Source branch of the AI ecosystem and may affect models, products, or research direction.

Technical breakdown

Business impact

Watch for product launches, funding moves, or policy shifts tied to this headline.