How it works

Methodology

Generating SQL is the easy part. The hard part is knowing when to trust the answer. Prompt Data is built around four behaviors, and every claim it makes about its own quality is measured, never asserted.

It asks instead of guessing

Before answering, Prompt Data classifies a question for ambiguity: an undefined metric (“top” by what?), a vague time window (“last quarter”), an unclear entity or grain, or a request for data that simply is not in the schema. When a question is genuinely underspecified, it returns a short clarifying question with tappable options instead of guessing.
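The ask-or-answer decision can be pictured as a small data structure. The sketch below is illustrative only: the `Ambiguity` enum mirrors the four categories named above, while `Clarification`, `next_step`, and the returned field names are hypothetical, and the classifier that actually detects ambiguity is assumed to live elsewhere.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Ambiguity(Enum):
    """The four ambiguity categories described in the text."""
    UNDEFINED_METRIC = "undefined_metric"            # "top" by what?
    VAGUE_TIME_WINDOW = "vague_time_window"          # "last quarter"
    UNCLEAR_ENTITY_OR_GRAIN = "unclear_entity_or_grain"
    NOT_IN_SCHEMA = "not_in_schema"                  # data that does not exist


@dataclass
class Clarification:
    """A short clarifying question with tappable options (hypothetical shape)."""
    kind: Ambiguity
    prompt: str
    options: list[str] = field(default_factory=list)


def next_step(question: str, detected: Optional[Clarification]) -> dict:
    """Ask when an ambiguity was detected; otherwise proceed to SQL generation."""
    if detected is not None:
        return {
            "action": "clarify",
            "kind": detected.kind.value,
            "prompt": detected.prompt,
            "options": detected.options,
        }
    return {"action": "generate_sql", "question": question}
```

Keeping the category on the response is a design choice in this sketch, not a confirmed detail: it would let an evaluation harness score each kind of ambiguity separately.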

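The trade-off is real: ask too often and it is annoying; ask too rarely and it guesses wrong. On a hand-labelled set, the ambiguity check hits 1.00 precision (every clarifying question it asked was warranted) and 0.70 recall (it caught 70% of the genuinely ambiguous questions), with zero over-asking on clear questions.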
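For readers who want the definitions behind those two numbers, here is a minimal sketch of how precision and recall could be computed for the ask/answer decision. The function name and the toy labels are assumptions for illustration; they are not the real hand-labelled set, whose size is not stated here.

```python
def precision_recall(should_ask, did_ask):
    """Score the ask/answer decision against hand labels.

    should_ask[i] is True when question i is labelled ambiguous;
    did_ask[i] is True when the system asked a clarifying question.
    """
    pairs = list(zip(should_ask, did_ask))
    tp = sum(1 for s, d in pairs if s and d)      # asked, and it was needed
    fp = sum(1 for s, d in pairs if not s and d)  # asked on a clear question (over-asking)
    fn = sum(1 for s, d in pairs if s and not d)  # guessed when it should have asked
    # Convention in this sketch: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (NOT the real evaluation set): 10 ambiguous questions, 10 clear ones.
should_ask = [True] * 10 + [False] * 10
did_ask = [True] * 7 + [False] * 3 + [False] * 10
precision, recall = precision_recall(should_ask, did_ask)
print(precision, recall)  # 1.0 0.7
```

In this toy case the system never asks on a clear question (no false positives, so precision is 1.0) but misses 3 of the 10 ambiguous ones (recall 0.7).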

It shows its work

When it proceeds, Prompt Data surfaces the concrete assumptions the query embodies, extracted directly from the SQL it actually ran: which tables and joins, which filters, the aggregation grain, the row cap, and which business terms it resolved (for example, “revenue” to the sum of item prices, or “a customer” to the unique customer id rather than the per-order id). These cannot drift from the query because they are read back out of it.
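To make "read back out of the query" concrete, here is a deliberately simple sketch that pulls tables, filters, grain, and row cap from a SQL string with regular expressions. It is an assumption-laden toy, not Prompt Data's extractor: a production version would presumably walk a real SQL parse tree, and the table and column names in the example query are hypothetical.

```python
import re


def extract_assumptions(sql: str) -> dict:
    """Toy extractor for simple, single-level SELECT statements only."""
    flat = " ".join(sql.split())  # collapse newlines and repeated spaces
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", flat, flags=re.I)
    where = re.search(
        r"\bWHERE\s+(.*?)(?=\s+GROUP BY\b|\s+ORDER BY\b|\s+LIMIT\b|$)", flat, flags=re.I
    )
    group = re.search(
        r"\bGROUP BY\s+(.*?)(?=\s+HAVING\b|\s+ORDER BY\b|\s+LIMIT\b|$)", flat, flags=re.I
    )
    limit = re.search(r"\bLIMIT\s+(\d+)", flat, flags=re.I)
    return {
        "tables": list(dict.fromkeys(tables)),  # de-duplicate, keep order
        "filters": where.group(1) if where else None,
        "grain": [c.strip() for c in group.group(1).split(",")] if group else [],
        "row_cap": int(limit.group(1)) if limit else None,
    }


# Hypothetical query: revenue per unique customer, capped at 100 rows.
example_sql = """
SELECT c.customer_unique_id, SUM(i.price) AS revenue
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.status = 'delivered'
GROUP BY c.customer_unique_id
ORDER BY revenue DESC
LIMIT 100
"""
assumptions = extract_assumptions(example_sql)
```

Because the summary is derived from the executed SQL string rather than from the model's description of it, the two cannot disagree, which is the property the paragraph above relies on.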

Its confidence is calibrated

Prompt Data samples several candidate queries and measures how often they agree. On its own this signal is over-confident, so a calibration map (isotonic regression fit on a held-out split) rescales it to match observed accuracy. That took calibration error from 0.44 down to 0.15, so when Prompt Data says it is 70% sure, it is right roughly 70% of the time. The map is fit offline and loaded statically; nothing is fit at request time.
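The request-time half of this can be sketched in a few lines: compute an agreement rate across sampled candidates, then pass it through a static, monotone lookup table. Everything named here is an assumption for illustration. In particular, `CALIBRATION_MAP` holds made-up placeholder breakpoints, not Prompt Data's fitted map, and candidates are assumed to be compared by some hashable fingerprint of their results.

```python
from bisect import bisect_right
from collections import Counter


def agreement_confidence(fingerprints):
    """Return (majority fingerprint, share of candidates that agree with it)."""
    if not fingerprints:
        raise ValueError("need at least one candidate")
    answer, votes = Counter(fingerprints).most_common(1)[0]
    return answer, votes / len(fingerprints)


# Placeholder breakpoints (raw agreement, calibrated confidence).
# Illustrative values only; a real map would come from the offline fit.
CALIBRATION_MAP = [(0.0, 0.05), (0.4, 0.20), (0.6, 0.35), (0.8, 0.55), (1.0, 0.80)]


def calibrate(raw, table=CALIBRATION_MAP):
    """Apply a static monotone map with linear interpolation between breakpoints."""
    xs = [x for x, _ in table]
    ys = [y for _, y in table]
    if raw <= xs[0]:
        return ys[0]
    if raw >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, raw)  # xs[i-1] <= raw < xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (raw - x0) / (x1 - x0)


# Five sampled candidates, three of which produce the same result.
answer, raw = agreement_confidence(["a", "a", "b", "a", "c"])
confidence = calibrate(raw)
```

Note that the placeholder map pulls raw agreement downward, which is the direction one would expect when the raw signal is over-confident. No fitting happens in this code path; it only reads a table.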

Its quality is measured

An offline harness runs the eval suite against a subset of BIRD (a hard public text-to-SQL benchmark) and a curated trap set. The headline result: on questions that should not be confidently answered, a no-trust baseline answers 100% of them. Prompt Data answers only 41%, clarifying or declining the rest, and still answers every clear control. Execution accuracy on the BIRD subset (n=240) is 56%.
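Execution accuracy scores a predicted query by running it and comparing its result to the result of a reference query. The sketch below shows one plausible way to compute it with Python's built-in `sqlite3` module. The comparison rule (order-insensitive, as a multiset of rows), the function names, and the tiny `items` table are all assumptions for illustration; the actual harness and BIRD's official scorer may differ.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute a query and return its rows as a multiset, or None on failure."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except (sqlite3.Error, sqlite3.Warning):
        return None


def execution_accuracy(conn, pairs):
    """Share of (predicted, gold) pairs whose result sets match."""
    correct = 0
    for predicted, gold in pairs:
        got, want = run(conn, predicted), run(conn, gold)
        if got is not None and got == want:
            correct += 1
    return correct / len(pairs)


# Hypothetical three-row table standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (order_id INTEGER, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 5.0), (1, 20.0), (2, 30.0)])

pairs = [
    # Same rows, different order: counts as correct under this rule.
    ("SELECT order_id FROM items WHERE price > 10 ORDER BY order_id DESC",
     "SELECT order_id FROM items WHERE price > 10"),
    # Counts rows instead of distinct orders: wrong.
    ("SELECT COUNT(*) FROM items", "SELECT COUNT(DISTINCT order_id) FROM items"),
    # References a column that does not exist: wrong.
    ("SELECT nope FROM items", "SELECT price FROM items"),
    # Different SQL, same answer: correct.
    ("SELECT MAX(price) FROM items",
     "SELECT price FROM items ORDER BY price DESC LIMIT 1"),
]
accuracy = execution_accuracy(conn, pairs)
print(accuracy)  # 0.5
```

Scoring on results rather than on SQL text means two differently written queries can both be right, as the last pair shows.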

Under the hood

A read-only, SELECT-only sandbox parses every generated query and rejects anything that is not a single read-only statement before it reaches the database. That check is the real backstop against prompt injection. The server runs LLM inference via the Anthropic API and stays well under a 4 GB memory budget. The numbers on this site are produced by a reproducible batch job and committed to the repository.
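As a rough illustration of the "single read-only statement" rule, the sketch below does a token-level check in pure Python. It is a stand-in and not a real SQL parser: the function name, the keyword list, and the decision to reject anything suspicious (even at the cost of false rejections, such as a query calling a `REPLACE()` function) are all assumptions. A production guard would more likely rely on a proper parser plus read-only database credentials.

```python
import re

# Comments and quoted spans are blanked out so their contents cannot
# hide or fake keywords and semicolons.
_SKIP = re.compile(
    r"--[^\n]*"          # line comment
    r"|/\*.*?\*/"        # block comment
    r"|'(?:[^']|'')*'"   # single-quoted string literal
    r'|"(?:[^"]|"")*"',  # double-quoted identifier
    re.DOTALL,
)

# Conservative deny-list (an assumption, not an exhaustive set).
_FORBIDDEN = {
    "insert", "update", "delete", "drop", "alter", "create", "replace",
    "attach", "detach", "pragma", "vacuum", "truncate", "grant", "revoke",
    "merge", "call", "copy",
}


def is_single_select(sql: str) -> bool:
    """Return True only for one statement that starts with SELECT or WITH
    and contains no write or admin keyword outside strings and comments."""
    cleaned = _SKIP.sub(" ", sql).strip()
    if cleaned.endswith(";"):
        cleaned = cleaned[:-1]  # allow one trailing semicolon
    if ";" in cleaned:
        return False            # more than one statement
    tokens = re.findall(r"[A-Za-z_]+", cleaned.lower())
    if not tokens or tokens[0] not in ("select", "with"):
        return False
    return not any(token in _FORBIDDEN for token in tokens)
```

For example, `is_single_select("SELECT id FROM orders LIMIT 10")` passes, while `"SELECT 1; DROP TABLE orders"` and `"WITH t AS (SELECT 1) DELETE FROM orders"` are both rejected. The point of a check like this is that it runs on the generated SQL itself, so it holds no matter what text the model was persuaded to produce.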