Paper · arXiv · Trust 82 · Primary · Published 8d ago · Live 7d ago

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-
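The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the BINEVAL implementation: the abstract does not specify the aggregation rule, so the per-dimension fraction of "yes" answers used here is an assumption, and the function name `aggregate_verdicts`, the dimension labels, and the example questions are all hypothetical.

```python
from collections import defaultdict


def aggregate_verdicts(verdicts):
    """Turn binary verdicts into one score per evaluation dimension.

    `verdicts` is a list of (dimension, question, answer) tuples, where
    `answer` is True for "yes" and False for "no". The score is the
    fraction of "yes" answers per dimension (an assumed rule, not
    necessarily the one used in the paper).
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [yes_count, total]
    for dimension, _question, answer in verdicts:
        counts[dimension][0] += int(answer)
        counts[dimension][1] += 1
    return {dim: yes / total for dim, (yes, total) in counts.items()}


# Hypothetical verdicts for a single summarization output.
example = [
    ("faithfulness", "Does the summary avoid claims absent from the source?", True),
    ("faithfulness", "Are all named entities spelled as in the source?", False),
    ("coverage", "Does the summary mention the main finding?", True),
]
scores = aggregate_verdicts(example)
```

Because each question is stored alongside its verdict, a low dimension score can be traced back to the specific questions answered "no", which is the kind of transparency the abstract contrasts with opaque holistic scores.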
