Evals
Every number here is produced by a reproducible offline harness and committed to the repo. Results come from a pinned subset, not a leaderboard run, and each is labelled with its sample size.
The trust layer's signature: confidently-wrong reduction
On the 27 trap questions that should not be answered (ambiguous, unanswerable, or prompt-injection), a no-trust baseline answers confidently every time. Prompt Data declines or asks for clarification on most of them, and it does not over-decline any of the 15 clear control questions.
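The two rates described above can be tallied as in the minimal sketch below. The outcome labels and function names are assumptions made for illustration and are not taken from the harness, and the toy data is not the reported result.

```python
from collections import Counter

# Hypothetical outcome labels; the real harness may name these differently.
ANSWERED, DECLINED, CLARIFIED = "answered", "declined", "clarified"

def confidently_wrong_rate(trap_outcomes):
    """Share of trap questions that were answered instead of declined or clarified."""
    counts = Counter(trap_outcomes)
    return counts[ANSWERED] / len(trap_outcomes)

def over_decline_rate(control_outcomes):
    """Share of clear control questions that were not answered."""
    counts = Counter(control_outcomes)
    return 1 - counts[ANSWERED] / len(control_outcomes)

# Toy data for illustration only.
traps = [ANSWERED, DECLINED, CLARIFIED, DECLINED]
controls = [ANSWERED, ANSWERED, ANSWERED]

print(confidently_wrong_rate(traps))   # 0.25
print(over_decline_rate(controls))     # 0.0
```

A baseline that answers every trap question scores 1.0 on the first rate, which is the behaviour the no-trust baseline shows here.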
Calibration
Raw self-consistency confidence is over-confident. Isotonic calibration fitted on a held-out split (n=96) makes it track observed accuracy, so calibrated points should sit near the diagonal.
Lower is better.
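As a rough illustration of the calibration step, the sketch below computes a self-consistency confidence (the share of K samples that agree with the majority answer) and fits an isotonic map with the pool-adjacent-violators algorithm. It is a pure-Python sketch under assumptions: the function names are hypothetical, the toy data is invented, and the harness may well rely on a library implementation instead.

```python
from collections import Counter

def self_consistency(answers):
    """Majority answer among K samples and the share of samples that agree with it."""
    top, votes = Counter(answers).most_common(1)[0]
    return top, votes / len(answers)

def fit_isotonic(confidences, correct):
    """Pool-adjacent-violators fit of a non-decreasing map from confidence to accuracy.

    Returns (upper_confidence, calibrated_value) pairs sorted by confidence.
    """
    # Aggregate ties first: with K samples, confidence takes only a few values.
    totals = {}
    for conf, label in zip(confidences, correct):
        s, n = totals.get(conf, (0.0, 0))
        totals[conf] = (s + label, n + 1)

    blocks = []  # each block is [sum_of_labels, count, upper_confidence]
    for conf in sorted(totals):
        s, n = totals[conf]
        blocks.append([s, n, conf])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s2, n2, c2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][2] = c2
    return [(c, s / n) for s, n, c in blocks]

def calibrate(model, conf):
    """Step-function lookup: value of the first block whose upper edge covers conf."""
    for upper, value in model:
        if conf <= upper:
            return value
    return model[-1][1]

# Toy held-out split (illustrative only): raw confidence and whether the answer was right.
raw = [0.6, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]
hit = [1, 0, 1, 1, 1, 1, 0]
model = fit_isotonic(raw, hit)

print(self_consistency(["42", "42", "41", "42", "40"]))  # ('42', 0.6)
print(calibrate(model, 1.0))                             # 0.75
```

In the toy data a raw confidence of 1.0 corresponds to 75% observed accuracy, so the calibrated value is pulled down to 0.75. That is the kind of over-confidence correction described above.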
Where the trust layer still slips
Share of trap questions Prompt Data answered anyway, by category. It is strong on metric and time ambiguity and weaker on grain and entity ambiguity. The ambiguity gate is not an injection detector; the SELECT-only validator handles injection.
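The page does not describe how the SELECT-only validator works, so the following is only a conservative sketch of what such a check could look like. The function name, the deny-list, and the single-statement rule are all assumptions; a production validator would more likely parse the SQL than pattern-match it.

```python
import re

# Hypothetical deny-list of write and DDL keywords (an assumption, not the project's list).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|attach|pragma)\b",
    re.IGNORECASE,
)

def is_select_only(sql):
    """Accept a single statement that starts with SELECT or WITH and has no write keywords."""
    body = sql.strip().rstrip(";").strip()
    if not body or ";" in body:
        return False  # empty input, or more than one statement
    if not re.match(r"(select|with)\b", body, re.IGNORECASE):
        return False
    return FORBIDDEN.search(body) is None

print(is_select_only("SELECT count(*) FROM orders"))   # True
print(is_select_only("SELECT 1; DROP TABLE orders"))   # False
```

The check errs on the side of rejection: a harmless query with a deny-listed word inside a string literal is refused as well. For a gate whose job is to stop injected writes, that trade-off may be acceptable.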
Clarification quality
Measured against the hand-labelled ambiguity set, Prompt Data never asks for clarification on a clear question.
Cost and latency
This run uses a single model tier; routing across tiers is future work.
Source: BIRD dev (pinned 240-question subset) on claude-sonnet-4-6, K=5; trap eval on 42 curated Olist questions. Numbers regenerate from eval/run_bird.py and eval/run_traps.py.