Paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

Calibration, Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models

Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be statistically wrong: a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along
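To make one of the checks named in the abstract concrete, here is a minimal sketch of a posterior predictive check that flags a Poisson likelihood fit to over-dispersed counts. It is an illustration under stated assumptions, not the paper's method: it uses only the Python standard library rather than NumPyro, Stan, or Pyro, it assumes a conjugate Gamma prior on the Poisson rate so the posterior is available in closed form, and every name in it (`sample_poisson`, `dispersion_index`, `ppc_pvalue`, the prior values, the synthetic data) is hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


def ppc_pvalue(observed, n_draws=500, prior_shape=1.0, prior_rate=1.0, seed=0):
    """Posterior predictive p-value for the dispersion index under a Poisson model.

    Assumes a Gamma(prior_shape, prior_rate) prior on the rate, so the posterior
    is Gamma(prior_shape + sum(y), prior_rate + n) by conjugacy.
    """
    rng = random.Random(seed)
    n = len(observed)
    post_shape = prior_shape + sum(observed)
    post_rate = prior_rate + n
    t_observed = dispersion_index(observed)
    exceed = 0
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        rate = rng.gammavariate(post_shape, 1.0 / post_rate)
        replicated = [sample_poisson(rate, rng) for _ in range(n)]
        if dispersion_index(replicated) >= t_observed:
            exceed += 1
    return exceed / n_draws


# Synthetic data (hypothetical): a Gamma-Poisson mixture is over-dispersed
# relative to a plain Poisson with the same mean of 4.
data_rng = random.Random(42)
n_obs = 200
overdispersed = [
    sample_poisson(data_rng.gammavariate(2.0, 2.0), data_rng) for _ in range(n_obs)
]
poisson_like = [sample_poisson(4.0, data_rng) for _ in range(n_obs)]

p_overdispersed = ppc_pvalue(overdispersed)
p_poisson = ppc_pvalue(poisson_like)
print("PPC p-value, over-dispersed data:", p_overdispersed)
print("PPC p-value, Poisson data:", p_poisson)
```

A p-value near 0 means the fitted Poisson model almost never reproduces the spread seen in the data, which is the kind of statistical misspecification that a unit test would not catch. The dispersion index is only one possible test statistic; a fuller workflow would combine it with the other diagnostics the abstract lists.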
