Measured, not asserted

Evals

Every number here is produced by a reproducible offline harness and committed to the repo. Results come from a pinned subset, not a leaderboard run, and each is labelled with its sample size.

Exec accuracy: 56% (BIRD, n=240)

Semantic-error: 44% (executed but wrong)

Calibration (ECE): 0.15 (from 0.44 raw)

Confidently-wrong: 41% (from 100% baseline)
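As a rough illustration of how a predicted query can be sorted into "correct", "executed but wrong", or "failed to execute", here is a minimal sketch using SQLite. The function names, the toy `orders` table, and the order-insensitive multiset comparison are all assumptions for illustration; the comparison rule in eval/run_bird.py may differ.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def classify(conn, predicted_sql, gold_sql):
    """Label one prediction (hypothetical labels, not the harness's own)."""
    gold = run(conn, gold_sql)
    pred = run(conn, predicted_sql)
    if pred is None:
        return "exec_error"
    # Assumption: rows are compared ignoring order but keeping duplicates.
    return "correct" if pred == gold else "semantic_error"

# Toy database, purely illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, status TEXT);
INSERT INTO orders VALUES (1, 'delivered'), (2, 'delivered'), (3, 'canceled');
""")

gold = "SELECT COUNT(*) FROM orders WHERE status = 'delivered'"
outcomes = [
    classify(conn, "SELECT COUNT(*) FROM orders WHERE status = 'delivered'", gold),
    classify(conn, "SELECT COUNT(*) FROM orders", gold),       # runs, wrong answer
    classify(conn, "SELECT COUNT(*) FROM order_items", gold),  # no such table
]
print(outcomes)
```

Under this labelling, exec accuracy is the share of "correct" outcomes and the semantic-error rate is the share of "semantic_error" outcomes.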

The trust layer's signature: confidently-wrong reduction

On 27 questions that should not be answered directly (ambiguous, unanswerable, or prompt-injection), a no-trust baseline answers confidently every time. Prompt Data declines or asks for clarification on most of them, cutting the confidently-wrong rate from 100% to 41%, and it over-declines none of the 15 clear controls.

Calibration

Raw self-consistency confidence is over-confident. Isotonic calibration fitted on a held-out split (n=96) makes it track observed accuracy, taking ECE from 0.44 to 0.15. On the reliability diagram, well-calibrated points sit near the diagonal.
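For readers unfamiliar with self-consistency confidence, a minimal sketch follows. It assumes each of the K sampled queries is executed and reduced to a comparable result signature, and that raw confidence is the share of samples agreeing with the majority. The function name and the string signatures are hypothetical; the harness may define agreement differently.

```python
from collections import Counter

def self_consistency(result_signatures):
    """Return (majority signature, agreement share) over K sampled results."""
    votes = Counter(result_signatures)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(result_signatures)

# K=5 illustrative samples: four executions agree, one differs.
samples = ["rows:a", "rows:a", "rows:b", "rows:a", "rows:a"]
answer, raw_confidence = self_consistency(samples)
print(answer, raw_confidence)
```

The raw agreement share is what a calibration step would then map onto observed accuracy.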

ECE (expected calibration error): lower is better.
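A minimal sketch of the standard binned ECE computation, for reference: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and observed accuracy in each bin is averaged, weighted by bin size. The ten-bin default and the sample data are assumptions; the harness may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |confidence - accuracy| gap over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Illustrative data only: four answers at 0.9 confidence (half correct),
# two at 0.5 confidence (half correct).
confidences = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

In this toy case the 0.5 bin is perfectly calibrated, so the whole error comes from the over-confident 0.9 bin.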

Where the trust layer still slips

Share of trap questions Prompt Data answered anyway, by category. The honest picture: it is strong on metric and time ambiguity and weaker on grain and entity ambiguity. The ambiguity gate is not an injection detector; the SELECT-only validator handles injection.
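The per-category share can be computed along these lines. This is a sketch under assumptions: the record shape (category, action) and the action labels "answer", "clarify", and "decline" are hypothetical, and the records below are made-up examples, not eval results.

```python
from collections import defaultdict

def answered_share_by_category(records):
    """Map each trap category to the share of its questions answered anyway."""
    totals = defaultdict(int)
    answered = defaultdict(int)
    for category, action in records:
        totals[category] += 1
        if action == "answer":
            answered[category] += 1
    return {category: answered[category] / totals[category] for category in totals}

# Made-up records for illustration only.
records = [
    ("metric", "clarify"),
    ("metric", "decline"),
    ("grain", "answer"),
    ("grain", "clarify"),
    ("entity", "answer"),
]
shares = answered_share_by_category(records)
print(shares)
```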

Clarification quality

Measured against the hand-labelled ambiguity set. It never over-asks on clear questions, but it misses some ambiguous ones (recall 0.70).

Precision: 1.00

Recall: 0.70

Over-ask: 0.00
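One way to read these three numbers, sketched below: precision is the share of clarification requests that landed on genuinely ambiguous questions, recall is the share of ambiguous questions that triggered a request, and over-ask is the share of clear questions that triggered one. These definitions, the function name, and the toy labels are assumptions; the harness may define over-ask differently.

```python
def clarification_metrics(is_ambiguous, asked):
    """Return (precision, recall, over_ask) for clarification decisions."""
    pairs = list(zip(is_ambiguous, asked))
    tp = sum(1 for amb, ask in pairs if amb and ask)
    fp = sum(1 for amb, ask in pairs if not amb and ask)
    fn = sum(1 for amb, ask in pairs if amb and not ask)
    clear = sum(1 for amb, _ in pairs if not amb)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    over_ask = fp / clear if clear else 0.0
    return precision, recall, over_ask

# Toy labels, not the eval set: four ambiguous questions (three trigger a
# clarification) followed by three clear questions (none do).
is_ambiguous = [True, True, True, True, False, False, False]
asked = [True, True, True, False, False, False, False]
precision, recall, over_ask = clarification_metrics(is_ambiguous, asked)
print(precision, recall, over_ask)
```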

Cost and latency

Single model tier in this run; routing across tiers is future work.

Model: sonnet-4-6

Median latency: 20.6 s (per question, K-sampled)

Eval peak RSS: 46.4 MB (under the 4 GB ceiling)

Source: BIRD dev (pinned 240-question subset) on claude-sonnet-4-6, K=5; trap eval on 42 curated Olist questions (27 traps, 15 clear controls). Numbers regenerate from eval/run_bird.py and eval/run_traps.py.