Company profile

Northwind AI

An applied-research lab building reliable reasoning models.

8 Connections

3 Papers

1 Models

0 Repos

4 News

News · 4

Introducing LifeSciBench

Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.

Learning to lead in a hybrid human-AI enterprise

As adoption of AI agents looks set to surge by as much as 300% in the next two years, leadership teams are carefully considering the implications of a hybrid human-AI workforce. Unlike existing enterprise-level automation that relies on manual input, AI agents are capable of autonomously coordinating complex tasks, interacting with multiple tools and environments across…

Using AI to help physicians diagnose rare genetic diseases affecting children

Researchers used an OpenAI reasoning model to help diagnose rare diseases, identifying 18 new diagnoses in previously unsolved cases.

New benchmark exposes reasoning gaps in top models

A harder evaluation suite shows even leading models struggle on multi-hop tasks.

Papers · 3

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and…

CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs

Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain…

Self-rewarding agents that retrace failures

Agents that attribute their own errors and retrace to repair multi-step reasoning.

Models · 1

Retrace-1.5B

A small reasoning model tuned to self-correct via failure traces.