Northwind AI
An applied-research lab building reliable reasoning models.
News · 4
Introducing LifeSciBench
Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.
Learning to lead in a hybrid human-AI enterprise
As adoption of AI agents looks set to surge by as much as 300% in the next two years, leadership teams are carefully considering the implications of a hybrid human-AI workforce. Unlike existing enterprise-level automation that relies on manual input, AI agents are capable of autonomously coordinating complex tasks, interacting with multiple tools and environments across…
Using AI to help physicians diagnose rare genetic diseases affecting children
Researchers used an OpenAI reasoning model to help diagnose rare diseases, identifying 18 new diagnoses in previously unsolved cases.
New benchmark exposes reasoning gaps in top models
A harder evaluation suite shows even leading models struggle on multi-hop tasks.
Papers · 3
NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models
Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and
CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs
Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain o
Self-rewarding agents that retrace failures
Agents that attribute their own errors and retrace to repair multi-step reasoning.