Paper · arXiv · Trust 82 · Primary · Published 8d ago · Live · 7d ago
Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement
Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-
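As a rough illustration of the decompose-then-aggregate idea described above, the following sketch answers each binary question independently and averages the verdicts within each dimension. It is not the BINEVAL implementation: the names `score_output`, `toy_judge`, `Question`, and `Judge` are hypothetical, the unweighted mean is an assumed aggregation rule (the text here does not specify one), and the keyword-matching judge merely stands in for an LLM call.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a question is (dimension, question text);
# a judge maps (question text, output) to a yes/no verdict.
# In practice the judge would wrap an LLM call.
Question = Tuple[str, str]
Judge = Callable[[str, str], bool]


def score_output(output: str, questions: List[Question], judge: Judge) -> Dict[str, float]:
    """Answer each binary question independently, then average per dimension.

    Assumption: an unweighted mean of yes-verdicts; the real aggregation
    rule may differ.
    """
    verdicts = defaultdict(list)
    for dimension, text in questions:
        verdicts[dimension].append(judge(text, output))
    # True counts as 1 and False as 0, so the mean is the fraction of "yes" answers.
    return {dim: sum(v) / len(v) for dim, v in verdicts.items()}


def toy_judge(question: str, output: str) -> bool:
    """Toy stand-in for an LLM judge: checks whether the quoted keyword appears."""
    keyword = question.split("'")[1]
    return keyword.lower() in output.lower()


questions = [
    ("coverage", "Does the output mention 'latency'?"),
    ("coverage", "Does the output mention 'throughput'?"),
    ("clarity", "Does the output mention 'summary'?"),
]
scores = score_output("Summary: latency dropped by half.", questions, toy_judge)
print(scores)
```

Because every score is a fraction of individually recorded yes/no verdicts, a low value on one dimension can be traced back to the specific questions that failed, which is the interpretability property the abstract emphasizes.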
