company profile

Northwind AI

An applied-research lab building reliable reasoning models.

8 Connections
3 Papers
1 Models
0 Repos
4 News

Papers · 3

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and …

CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs

Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain o…

Self-rewarding agents that retrace failures

Agents that attribute their own errors and retrace to repair multi-step reasoning.