PROMPT DATA
Measured, not asserted

Evals

Every number here is produced by a reproducible offline harness and committed to the repo. Results come from a pinned subset, not a leaderboard run, and each is labelled with its sample size.

Exec accuracy: 56% (BIRD, n=240)
Semantic-error: 44% (executed but wrong)
Calibration (ECE): 0.15 (from 0.44 raw)
Confidently-wrong: 41% (from 100% baseline)

The trust layer's signature: confidently-wrong reduction

On 27 trap questions that should not be answered as asked (ambiguous, unanswerable, or prompt-injection), a no-trust baseline answers confidently every time. Prompt Data declines or asks for clarification on most of them (it still answers 41%), and it over-declines none of the 15 clear controls.

Baseline (no trust): 100%
Prompt Data: 41%
Over-decline (clear): 0%
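The two rates above can be sketched as simple ratios over labelled outcomes. This is a minimal illustration, not the code in eval/run_traps.py: the `TrapResult` record, its field names, and the outcome labels are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical outcome labels; the real harness may name these differently.
ANSWERED, DECLINED, CLARIFIED = "answered", "declined", "clarified"


@dataclass(frozen=True)
class TrapResult:
    question_id: str
    is_trap: bool  # True: should not be answered; False: clear control
    outcome: str   # ANSWERED, DECLINED, or CLARIFIED


def confidently_wrong_rate(results):
    """Share of trap questions that were answered anyway."""
    traps = [r for r in results if r.is_trap]
    if not traps:
        return 0.0
    return sum(r.outcome == ANSWERED for r in traps) / len(traps)


def over_decline_rate(results):
    """Share of clear controls that were declined or met with a clarifying question."""
    controls = [r for r in results if not r.is_trap]
    if not controls:
        return 0.0
    return sum(r.outcome != ANSWERED for r in controls) / len(controls)
```

Keeping the two denominators separate (traps versus clear controls) is what lets a system score well on one rate without hiding a regression in the other.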

Calibration

Raw self-consistency confidence is over-confident; isotonic calibration fitted on a held-out split (n=96) makes it track observed accuracy. Points should sit near the diagonal.
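For readers unfamiliar with isotonic calibration, here is a minimal pure-Python sketch using the pool-adjacent-violators algorithm. It is an illustration under stated assumptions, not the harness's implementation: the function names are hypothetical, and the fitted map is a step function, whereas library implementations such as scikit-learn's interpolate between points.

```python
from bisect import bisect_right
from collections import defaultdict


def fit_isotonic(confidences, correct):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    confidences: raw confidence per held-out question, in [0, 1]
    correct:     1 if the answer was right, else 0
    Returns (thresholds, values): block lower bounds and calibrated values.
    """
    # Pool ties first: K-sampled confidences take only a few distinct values.
    grouped = defaultdict(lambda: [0.0, 0])
    for conf, label in zip(confidences, correct):
        grouped[conf][0] += label
        grouped[conf][1] += 1

    blocks = []  # each block: [sum of labels, count, smallest confidence]
    for conf in sorted(grouped):
        total, count = grouped[conf]
        blocks.append([total, count, conf])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(conf, thresholds, values):
    """Map a raw confidence to the value of the block it falls into."""
    idx = bisect_right(thresholds, conf) - 1
    return values[max(idx, 0)]
```

The map is fitted on the held-out split only and then applied to confidences from the evaluation questions, so the calibrated scores are not tuned on the questions they are judged against.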

[Reliability plot: predicted confidence vs. observed accuracy]

ECE (raw): 0.44
ECE (calibrated): 0.15
Brier (raw): 0.42
Brier (calibrated): 0.25

Lower is better.
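As a reference for how these two scores are usually defined, the sketch below computes ECE with equal-width bins and the Brier score as mean squared error against the 0/1 outcome. The bin count of 10 is an assumption; the harness may bin differently, which would change the ECE value.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(l for _, l in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    if not confidences:
        return 0.0
    return sum((c - l) ** 2 for c, l in zip(confidences, correct)) / len(confidences)
```

ECE only measures whether stated confidence matches observed accuracy; the Brier score also rewards separating right answers from wrong ones, which is why the two are reported together.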

Where the trust layer still slips

Share of trap questions Prompt Data answered anyway, by category. The honest picture: it is strong on metric and time ambiguity and weaker on grain and entity. The ambiguity gate is not an injection detector; the SELECT-only validator handles prompt injection.

entity: 100%
prompt_injection: 100%
grain: 67%
unanswerable: 33%
ambiguous_metric: 0%
ambiguous_time: 0%
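The SELECT-only validator is not shown on this page. As one illustration of how such a guard could work, the sketch below assumes a SQLite backend and uses the authorizer hook in Python's sqlite3 module to deny every action except reads. The function name and the allow-list are assumptions for the example; the project's validator may work at the parsing level instead.

```python
import sqlite3

# Action codes a read-only query may legitimately trigger.
_ALLOWED_ACTIONS = {
    sqlite3.SQLITE_SELECT,    # a SELECT statement
    sqlite3.SQLITE_READ,      # reading a column
    sqlite3.SQLITE_FUNCTION,  # calling a SQL function such as COUNT()
}


def _select_only(action, arg1, arg2, db_name, source):
    """Authorizer callback: allow reads, deny everything else."""
    return sqlite3.SQLITE_OK if action in _ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


def restrict_to_select(conn):
    """Install the authorizer on a connection used for generated SQL."""
    conn.set_authorizer(_select_only)
    return conn
```

With the authorizer installed, a statement such as DELETE or DROP fails when it is prepared and raises a `sqlite3.Error` subclass, so an injected instruction that reaches the SQL stage cannot modify data even though the ambiguity gate let the question through.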

Clarification quality

Measured against the hand-labelled ambiguity set, it never over-asks on clear questions, but it misses some ambiguous ones (recall 0.70).

Precision: 1.00
Recall: 0.70
Over-ask: 0.00
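A minimal sketch of how these three numbers could be computed from hand labels follows. It assumes "over-ask" means the share of clear questions that still triggered a clarifying question; the function name and input shape are hypothetical.

```python
def clarification_metrics(is_ambiguous, asked):
    """Score clarifying questions against hand labels.

    is_ambiguous[i]: True if question i is hand-labelled ambiguous
    asked[i]:        True if the system asked a clarifying question
    """
    pairs = [(bool(l), bool(a)) for l, a in zip(is_ambiguous, asked)]
    tp = sum(l and a for l, a in pairs)          # asked, and it was ambiguous
    fp = sum((not l) and a for l, a in pairs)    # asked, but it was clear
    fn = sum(l and (not a) for l, a in pairs)    # ambiguous, but did not ask
    clear = sum(not l for l, _ in pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "over_ask": fp / clear if clear else 0.0,
    }
```

Under this definition, precision of 1.00 and over-ask of 0.00 both follow from having no false positives, while recall below 1.00 means some ambiguous questions were answered without asking.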

Cost and latency

Single model tier in this run; routing across tiers is future work.

Model: sonnet-4-6
Median latency: 20.6 s (per question, K-sampled)
Eval peak RSS: 46.4 MB (under the 4 GB ceiling)
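The latency above is per question with K samples. As a rough sketch of how K sampled query results might be turned into a self-consistency confidence, the example below takes a majority vote over result sets and reports the winning share. The function name, the use of unordered row sets as the voting key, and counting failed samples in the denominator are all assumptions, not a description of the harness.

```python
from collections import Counter


def self_consistency(result_sets):
    """Vote over K sampled query results.

    result_sets: list of K results, each a list of row tuples,
                 or None when that sample failed to execute.
    Returns (winning rows as a frozenset or None, confidence in [0, 1]).
    """
    if not result_sets:
        return None, 0.0
    # Compare results as unordered sets of rows so row order does not split votes.
    keys = [frozenset(rows) for rows in result_sets if rows is not None]
    if not keys:
        return None, 0.0
    winner, votes = Counter(keys).most_common(1)[0]
    # Failed samples count against confidence: divide by K, not by successes.
    return winner, votes / len(result_sets)
```

With K=5 the raw confidence can only take the values 0.2, 0.4, 0.6, 0.8, and 1.0, which is one reason a calibration step on top of it is useful.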

Source: BIRD dev (pinned 240-question subset) on claude-sonnet-4-6, K=5; trap eval on 42 curated Olist questions. Numbers regenerate from eval/run_bird.py and eval/run_traps.py.
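For context on the headline metrics, execution accuracy is commonly scored by running the predicted and gold SQL against the same database and comparing result sets. The sketch below illustrates that idea with sqlite3; it is not the code in eval/run_bird.py, and the function names and the three outcome labels are assumptions.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows as a set, or None if the query fails to execute."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def classify(conn, predicted_sql, gold_sql):
    """Return 'correct', 'semantic_error' (executed but wrong), or 'exec_error'."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    if pred is None:
        return "exec_error"
    return "correct" if pred == gold else "semantic_error"


def execution_accuracy(conn, pairs):
    """pairs: iterable of (predicted_sql, gold_sql). Returns the share scored correct."""
    outcomes = [classify(conn, pred, gold) for pred, gold in pairs]
    return outcomes.count("correct") / len(outcomes) if outcomes else 0.0
```

Comparing sets of rows ignores row order and duplicate rows, so two queries that differ only in ORDER BY score as equal under this sketch.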