The honest page

Limitations

A trust product that hides its weaknesses is not trustworthy. Here is what Prompt Data does not do well, stated plainly.

It can still be confidently wrong

Execution accuracy on the benchmark subset is about 56%, so a substantial share of answered questions return the wrong result. The trust layer reduces how often those wrong results are shown with high confidence; it does not make them correct. A query that runs cleanly, and on which the model's sampled queries agree, can still answer the wrong question.

Confidence is well-calibrated but coarse

Confidence comes from how often several sampled queries agree. With a strong model they usually agree, so the signal is blunt: it sorts answers into roughly “likely right” and “shaky” rather than finely ranking each one. Calibration makes the number trustworthy as a probability (it tracks accuracy), but it cannot manufacture resolution that the underlying signal lacks.
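
To make the coarseness concrete, here is a minimal sketch of an agreement-based score. It is not Prompt Data's implementation: the function names, the way result sets are compared, and the choice of five samples are all assumptions made for illustration. It shows why the raw signal can only take a handful of values before calibration.

```python
from collections import Counter

def canonical(rows):
    # Order-insensitive fingerprint of a result set (hypothetical helper).
    return tuple(sorted(repr(tuple(row)) for row in rows))

def agreement_confidence(result_sets):
    """Return the most common result set and the fraction of samples that match it."""
    if not result_sets:
        raise ValueError("need at least one sampled result")
    counts = Counter(canonical(rows) for rows in result_sets)
    top_key, top_count = counts.most_common(1)[0]
    winner = next(rows for rows in result_sets if canonical(rows) == top_key)
    return winner, top_count / len(result_sets)

# Five sampled queries (an assumed sample count): four return 3, one returns 5.
samples = [[(3,)], [(3,)], [(3,)], [(3,)], [(5,)]]
winner, raw_confidence = agreement_confidence(samples)
print(winner, raw_confidence)
```

With k samples the raw score can take at most k distinct values, and when the model usually agrees with itself most answers pile up at the top one. Calibration can map those values to honest probabilities, but it cannot separate answers that received the same raw score.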

The numbers are a pinned subset, not a leaderboard

Accuracy and calibration are measured on a fixed 240-question slice of BIRD with one model, and calibration is evaluated on a small held-out split (n=96). They are honest and reproducible, but they carry real sampling noise and should not be read as full-benchmark or state-of-the-art figures.
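
As a rough guide to how much noise that is, the sketch below computes a normal-approximation (Wald) 95% interval half-width. It assumes the roughly 56% accuracy figure was measured on the 240-question slice and that questions are independent; neither assumption is stated on this page, and the function name is illustrative.

```python
import math

def wald_half_width(p, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Assumed inputs: accuracy of about 0.56 on n=240 questions.
accuracy_hw = wald_half_width(0.56, 240)

# Worst case (p=0.5) for any proportion estimated on the n=96 held-out split.
calibration_hw = wald_half_width(0.5, 96)

print(f"accuracy: +/- {accuracy_hw:.3f}")
print(f"held-out split, worst case: +/- {calibration_hw:.3f}")
```

Under those assumptions the accuracy figure is uncertain by roughly six percentage points either way, and a proportion measured on the held-out split by up to about ten. Differences smaller than that should not be read as meaningful.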

Ambiguity detection misses some cases

It is strong on undefined metrics and vague time windows, but weaker on aggregation grain and entity ambiguity; on the trap set it still answered 41% of the questions that arguably called for a clarifying question instead. Improving the depth of ambiguity detection is future work.

Read-only is guaranteed; semantic appropriateness is not

A SELECT-only validator parses every query and guarantees no write or schema change ever reaches the database, which is also the backstop against prompt injection. What it cannot judge is whether a read is the right read. The ambiguity gate is not an injection detector — the validator is — so an adversarial prompt that still produces a valid SELECT will execute (safely), and the trust layer rather than the sandbox is what flags it.
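
The sketch below illustrates the split between "safe to run" and "the right thing to run". It is not the product's validator: it swaps in SQLite's authorizer hook from Python's standard `sqlite3` module, assumes a SQLite database, and uses a made-up `orders` table and hypothetical function names. Only read-related actions are allowed; everything else is denied when the statement is compiled.

```python
import sqlite3

# Authorizer action codes a plain read needs; every other action is denied.
_READ_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}

def _read_only_authorizer(action, arg1, arg2, db_name, source):
    return sqlite3.SQLITE_OK if action in _READ_ACTIONS else sqlite3.SQLITE_DENY

def lock_read_only(conn):
    # From here on, writes and schema changes fail before they execute.
    conn.set_authorizer(_read_only_authorizer)

def run_read_only(conn, sql):
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError as exc:
        # Covers both denied statements and SQL that does not parse.
        raise PermissionError(f"rejected: {exc}") from exc

# Illustrative data, created before the lock is applied.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'delivered'), (2, 'shipped')")
conn.commit()
lock_read_only(conn)

print(run_read_only(conn, "SELECT count(*) FROM orders"))
for attempt in ("DELETE FROM orders", "DROP TABLE orders"):
    try:
        run_read_only(conn, attempt)
    except PermissionError as err:
        print(attempt, "->", err)
```

A guard like this would happily run `SELECT * FROM orders` for a question that asked about revenue. It only establishes that nothing was modified; whether the read answers the question is a separate judgment.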

One model, one database, English only

The live demo runs against the Olist e-commerce database with a single model tier and English questions. Model routing across tiers, bring-your-own-database, and broader language coverage are out of scope for this version.