paper · arXiv

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and



cs.CL