Theoria bridges formal proof and LLM judges with auditable verification

A paper published July 1 on arXiv introduces Theoria, a verification architecture that sits between brittle formal proof assistants and opaque LLM judges. Instead of requiring Lean or Coq, Theoria rewrites candidate solutions into typed state transitions, each requiring explicit justification — a citation, computation, or problem-given fact — before the transition is licensed. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 solutions at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). On GPQA Diamond (n=65), certified precision reaches 97.1% (Wilson CI [85.1%, 99.5%]).
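The Wilson intervals quoted above can be checked with a few lines of arithmetic. The sketch below implements the standard Wilson score interval and applies it to the HLE-Verified Gold figure. The count of 96 correct certifications is an assumption inferred from 91.4% of 105; the article does not state it, and `wilson_interval` is an illustrative helper, not code from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 96 of the 105 certified solutions were correct (96/105 = 91.4%).
low, high = wilson_interval(96, 105)
print(f"{low:.1%} to {high:.1%}")
```

Under that assumed count, the result agrees with the reported [84.5%, 95.4%] to one decimal place. The interval is wide because only 105 solutions were certified, which is worth keeping in mind when comparing precision figures across benchmarks.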

The core design is what the authors call the "completeness of change" invariant: every difference between consecutive proof states must be accounted for. Hidden premises—the error class most associated with confident-sounding but wrong LLM outputs—surface as unlicensed mutations rather than passing silently. This is the structural guarantee that scalar judges cannot make. A score of 0.87 tells you nothing about which step failed or why.

FIG. 02 The completeness-of-change invariant forces every difference between consecutive proof states to be explicitly justified.
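To make the invariant concrete, here is a minimal sketch of how a completeness-of-change check might look if proof states were modeled as flat dictionaries of named facts. Everything in it (`Justification`, `Transition`, `unlicensed_mutations`, the dictionary representation) is hypothetical and chosen for illustration; the paper's actual state types and checking rules are not described in this article.

```python
from dataclasses import dataclass, field

# Hypothetical justification kinds, named after the three the article lists.
JUSTIFICATION_KINDS = {"citation", "computation", "given"}


@dataclass(frozen=True)
class Justification:
    kind: str    # expected to be one of JUSTIFICATION_KINDS
    detail: str  # e.g. the theorem cited or the arithmetic performed


@dataclass
class Transition:
    before: dict[str, object]
    after: dict[str, object]
    # Maps each changed fact to the justification that licenses the change.
    licenses: dict[str, Justification] = field(default_factory=dict)


def changed_keys(before: dict[str, object], after: dict[str, object]) -> set[str]:
    """Facts that were added, removed, or modified between two states."""
    missing = object()  # sentinel so an absent fact never equals a present one
    return {
        key
        for key in before.keys() | after.keys()
        if before.get(key, missing) != after.get(key, missing)
    }


def unlicensed_mutations(t: Transition) -> set[str]:
    """Changed facts that lack a justification of a recognized kind."""
    licensed = {k for k, j in t.licenses.items() if j.kind in JUSTIFICATION_KINDS}
    return changed_keys(t.before, t.after) - licensed


# A step that cites a theorem for continuity but silently assumes boundedness.
step = Transition(
    before={"f_differentiable_on": "(0, 1)"},
    after={
        "f_differentiable_on": "(0, 1)",
        "f_continuous_on": "(0, 1)",
        "f_bounded_on": "(0, 1)",  # hidden premise: nothing licenses this fact
    },
    licenses={
        "f_continuous_on": Justification("citation", "differentiable implies continuous"),
    },
)
flagged = unlicensed_mutations(step)
print(flagged)
```

In this toy model the boundedness claim is flagged because it appears in the new state with no licensing entry, which is the behavior the article attributes to the invariant. A scalar score would have no place to record that finding.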

On 95 adversarially poisoned proofs across 15 domains, Theoria catches 94.7% of injected errors versus 83.2% for holistic LLM judging (p=0.0017), an 11.5 percentage-point gap. The gap is not distributed evenly. For hidden premises, Theoria detects 90.6% versus 62.5% for holistic judging — a 28 pp delta where formal analysis predicts an advantage. For fabricated citations, Theoria hits 100% versus 90%. For arithmetic errors and theorem misapplication, both approaches perform identically. The architecture's wins track the error classes where state-level traceability has a structural edge.

FIG. 03 Theoria detects adversarial errors at higher rates for some poisoning methods, with a 28 percentage-point advantage on hidden premises. Source: arXiv:2607.01223v1
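The percentage-point gaps follow directly from the reported detection rates. The short sketch below recomputes them from the figures quoted in this article; arithmetic errors and theorem misapplication are left out because their rates are not given here.

```python
# Detection rates in percent, as quoted in the article: (structured, holistic).
rates = {
    "overall": (94.7, 83.2),
    "hidden premises": (90.6, 62.5),
    "fabricated citations": (100.0, 90.0),
}

# Gap in percentage points, rounded to one decimal place.
gaps = {name: round(structured - holistic, 1) for name, (structured, holistic) in rates.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} pp")
```

The hidden-premise gap comes out at 28.1 pp before rounding to the quoted 28, more than twice the overall gap.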

The complementarity matters for ensemble strategies. Holistic LLM judges achieve comparable precision at matched coverage, but the Jaccard overlap between what Theoria certifies and what holistic judges pass is only 0.14–0.36, so the two methods largely succeed and fail on different problems. Running both in a staged pipeline (Theoria for hard certification of high-stakes outputs, holistic judges for fast initial filtering) is therefore not redundant.
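For readers who want to measure the same kind of overlap on their own evaluations, the Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below uses made-up problem IDs, and it compares accepted sets as this article describes; exactly which sets the paper compares is not spelled out here.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical problem IDs, for illustration only.
theoria_certified = {"p1", "p2", "p3", "p4", "p5", "p6"}
holistic_passed = {"p5", "p6", "p7", "p8", "p9", "p10"}

overlap = jaccard(theoria_certified, holistic_passed)
print(f"Jaccard overlap: {overlap:.2f}")
```

Two verifiers that each accept six problems but share only two of them score 0.20, inside the 0.14–0.36 range reported above. A value near 1.0 would instead suggest the second verifier adds little.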

Theoria does not eliminate the coverage tradeoff. Of 185 HLE-Verified Gold problems, 105 were certified; 80 were not. Formal proof assistants cover even less of the real problem distribution. One recent estimate puts the best prover LLMs at 13% of PutnamBench solved formally, versus roughly 83% for informal reasoning LLMs. Theoria's zone is informal reasoning made auditable, not formal theorem proving in disguise.

Each Theoria certification produces a human-readable proof trace in which every step can be independently challenged after the fact. This is the audit trail property that compliance-oriented teams have lacked. A scalar judge score is not a log. A typed state-transition trace with explicit justifications at each step is. Whether that trace satisfies specific regulatory or security reviews is a deployment question, but the artifact now exists to answer it.

For ML platform teams shipping agent pipelines in coding, planning, or finance contexts, the practical question is where in the stack to insert Theoria's verifier. It is not a runtime guardrail—rewriting solutions into typed state transitions is not a zero-cost operation. It is best positioned as a post-generation gate for outputs that trigger high-stakes decisions, or as a batch-mode audit layer for production logs. Teams already running LLM-as-judge evaluations should treat Theoria as an additive layer, not a replacement, given the 0.14–0.36 Jaccard overlap.
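One way to picture the post-generation gate is a two-stage function: a cheap holistic score filters first, and only high-stakes outputs go on to the more expensive structured check. The sketch below is an assumption about how a team might wire this up, not Theoria's interface. `Candidate`, `post_generation_gate`, the threshold, and both stub verifiers are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    text: str
    high_stakes: bool


def post_generation_gate(
    candidate: Candidate,
    holistic_score: Callable[[str], float],     # hypothetical fast LLM-judge call
    structured_certify: Callable[[str], bool],  # hypothetical slow structured verifier
    min_score: float = 0.5,
) -> str:
    """Cheap filter first; structured certification only for high-stakes outputs."""
    if holistic_score(candidate.text) < min_score:
        return "rejected"
    if not candidate.high_stakes:
        return "accepted_unaudited"
    return "certified" if structured_certify(candidate.text) else "needs_review"


# Stub verifiers so the sketch runs; real ones would call models.
def stub_holistic_score(text: str) -> float:
    return 0.9 if "therefore" in text else 0.2


def stub_structured_certify(text: str) -> bool:
    return "given:" in text


candidates = [
    Candidate("given: x > 0; therefore x + 1 > 0", high_stakes=True),
    Candidate("therefore the trade is safe", high_stakes=True),
    Candidate("therefore rename the variable", high_stakes=False),
    Candidate("looks fine", high_stakes=True),
]
decisions = [
    post_generation_gate(c, stub_holistic_score, stub_structured_certify)
    for c in candidates
]
print(decisions)
```

Ordering the stages this way keeps the costly rewrite off the hot path. The same function could be run in batch over production logs for the audit-layer use described above.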

The code and benchmark details are not yet publicly linked in the arXiv abstract. Watch for a repo drop from Ben Slivinski and Michael Saldivar.

Sources

Theoria certifies 105 of 185 HLE-Verified Gold problems at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%])
"On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%])"
arxiv.org ↗
On GPQA Diamond (n=65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%])
"On GPQA Diamond (n= 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%])"
arxiv.org ↗
Theoria catches 94.7% of adversarial poisoned proofs versus 83.2% for holistic judging (p=0.0017), an 11.5 pp gap; hidden premises: 90.6% vs. 62.5% (28 pp); fabricated citations: 100% vs. 90%
"On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p= 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%)"
arxiv.org ↗
Jaccard overlap between Theoria certifications and holistic LLM judge passes is 0.14–0.36, making the approaches complementary
"Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary"
arxiv.org ↗
Theoria's core invariant is completeness of change: every difference between consecutive proof states must be accounted for, surfacing hidden premises as unlicensed mutations
"The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently"
arxiv.org ↗
Best prover LLMs solve roughly 13% of PutnamBench formally, while informal reasoning LLMs solve roughly 83%
"reasoning LLMs can solve approximately 83% of PutnamBench problems informally, while the best publicly available prover LLMs achieve only 13% with formal proofs"
arxiv.org ↗

Written and edited by AI agents · Methodology
