(S)uper (H)ard r(EL)ati(O)nal (B)enchmark
In search of the context layer: an enterprise-difficulty relational data benchmark for AI.
- Why this benchmark exists
- How this benchmark compares to BIRD
- Five dimensions of enterprise data complexity
- How prompts are constructed and validated
- How warehouses and gold SQL are generated
- Twenty mutations that reflect real world noise
- Every evaluation uses a unique warehouse
- How we help ensure scores are comparable across runs
- Participants submit results, not IP
- Register and get started
- Who built this
Section 01: Why this benchmark exists
The mission. We set out to help researchers and AI systems make progress on enterprise relational data, which meant building a benchmark that looks like the environments we work in: messy, undocumented, and large, but with coherent underlying structure and validated gold SQL throughout. Each warehouse is procedurally generated from a unique seed, with twenty mutation types applied across joins, values, and tables, and entities fragmented across multiple source systems. No two evaluations share the same schema.
That level of realism currently makes this the hardest benchmark in this domain, and perhaps the one most indicative of real-world performance.
The problem. Industry-standard text-to-SQL benchmarks provide the semantic information needed to write queries: foreign keys, table descriptions, column descriptions, and evidence hints. None tests whether a system can discover these relationships from raw data. In production enterprise environments, semantic discovery is the problem. Writing SQL is the easy part once you know what the data means.
No published benchmark currently provides any of the following, all of which this benchmark is designed to provide: procedurally generated enterprise-scale warehouses with known ground truth, cross-system entity resolution through bridge tables of variable quality, bronze-to-silver noise inheritance, derived concepts encoded only in synthetic query history, source-system naming personas that obfuscate column semantics, three-axis difficulty parameterization, and an IP-safe submission system that evaluates output without requiring codebase disclosure.
Section 02: How this benchmark compares to BIRD
We honor the groundbreaking BIRD benchmark but see an opportunity to support researchers further with a new one. The table below compares the two on key dimensions; independent research has also identified areas where current evaluation methodology could be strengthened.
| Dimension | BIRD | This benchmark |
|---|---|---|
| Tables per database | Avg. 7 (max 13) | 50–95 tables, millions of rows |
| Semantic metadata provided | FK, descriptions, evidence hints | None |
| Data exposure risk | High — train data public since 2023 | Zero — generated per run |
| Cross-system entity resolution | No | Yes |
| Bridge tables (variable quality) | No | Yes |
| Concepts in query history only | No | Yes |
| Bronze→silver noise inheritance | No | Yes |
| Scoring method | Execution accuracy (EX) | Accuracy (competition) + F1 (self-score in train) |
| SQL dialect | SQLite only | Any — scoring compares result sets, not SQL syntax |
| Codebase disclosure required | In certain circumstances | Never |
Independent evaluation has identified quality issues in BIRD's gold SQL. Wretblad et al. (ACL 2024) documented how noise in gold queries distorts text-to-SQL benchmark scores, sometimes flipping which system appears stronger. A separate analysis of 151 BIRD training questions found 49 where the gold SQL is demonstrably incorrect (Matson, MotherDuck, Feb 2026) — systems may be reproducing the benchmark's errors rather than learning correct SQL.
Section 03: Five dimensions of enterprise data complexity
Each dimension isolates an aspect of data complexity absent from existing relational data benchmarks.
Section 04: How prompts are constructed and validated
The difficulty dimensions above are reflected in how prompts are constructed.
Prompt difficulty is parameterized along three independent axes, enabling stratified analysis of where systems fail.
| Axis | Easy | Moderate | Hard |
|---|---|---|---|
| SQL complexity | Single-table, counts, filters | Multi-join, aggregations, subqueries | Complex SQL operations, many tables, nested logic |
| Concept complexity | Direct column references | Derived metrics from query history | Multi-step derived concepts, cross-system joins required |
| Prompt clarity | Clean, precise question | Minor ambiguity, extraneous context | Grammatical errors, verbose persona, irrelevant detail |
Prompt personas range from a senior analyst who asks precise questions to a new hire who adds extraneous context and misspells terms. Difficulty distribution is reported alongside accuracy so that systems are evaluated on where they fail, not collapsed into a single number.
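As an illustration of what stratified reporting along the three axes could look like, here is a minimal sketch. The record layout, the axis keys (`sql`, `concept`, `clarity`), and the sample results are all hypothetical; the benchmark's actual report format is not described here.

```python
from collections import defaultdict

# Hypothetical per-prompt results: one tier per axis, plus correctness.
results = [
    {"sql": "easy", "concept": "easy", "clarity": "hard", "correct": True},
    {"sql": "hard", "concept": "moderate", "clarity": "easy", "correct": False},
    {"sql": "hard", "concept": "hard", "clarity": "hard", "correct": False},
    {"sql": "easy", "concept": "moderate", "clarity": "easy", "correct": True},
]

def accuracy_by_axis(records, axis):
    """Return {tier: accuracy} for a single difficulty axis."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec[axis]] += 1
        hits[rec[axis]] += int(rec["correct"])
    return {tier: hits[tier] / totals[tier] for tier in totals}

for axis in ("sql", "concept", "clarity"):
    print(axis, accuracy_by_axis(results, axis))
```

Reporting one accuracy per tier per axis keeps a failure on hard concept prompts visible even when the overall score is dominated by easier ones.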
Example: same query pattern, two prompt tiers
Revenue breakdown by payment method, highest first. Exclude any line items missing a total. Need this for the board deck.
Hi there! I'm new to the team (just started last week actually, still figuring out where the coffee machine is haha) and my manager asked me to pull together some shipping data but I want to make sure I'm describing this correctly before I waste anyone's time. So basically what I think we need is a breakdown by each shipping carrier we work with. For each of those carriers, we want to know two things: how many times shipments were actually dispatched where we have a weight recorded, and the average weight of those shipments in kilograms. Oh and I think it would make the most sense to sort the whole thing so the carrier with the highest number of dispatches shows up at the top. Sorry if this is overly detailed!
Both prompts target the same SQL pattern: a grouped aggregation with a filter and an ORDER BY. The second (Tier 3) prompt, however, buries the requirements in conversational noise, hedging, and irrelevant detail. Systems must extract the analytical intent from both.
Section 05: How warehouses and gold SQL are generated
The warehouse generator was initially built for internal use. Schemantic's Semantic Acquisition Model (SAM) needed to be tested against warehouses that look like production enterprise environments. That meant we needed a dataset with ground truth semantics, something that is vanishingly rare in the data world. A static test set would allow SAM to overfit. The generator exists so that every internal test run is against a warehouse SAM has never seen.
Each train and test warehouse is built through a seven-stage pipeline. The generator selects an industry vertical, creates entities and source systems, layers in realistic noise, and produces gold SQL with natural-language prompts — all from a unique random seed.
Each generated warehouse contains 50–95 tables with thousands of columns and millions of rows, working but undeclared join relationships, entities spread across multiple source systems with variant identifiers, temporal event data, controlled data quality issues, and derived tables with traceable lineage.
Source systems are assigned naming "personas", such as legacy financial systems with truncated column names (cst_acq_dt), modern SaaS platforms with transparent naming, and systems in between. Beyond the industry-specific systems, each warehouse also includes the cross-functional systems that real enterprises run, such as CRMs, marketing automation tools, HR systems, and support ticketing portals.
These functional systems share entities with the vertical systems (a CRM Contact is the same person as a POS Customer), creating additional bridge tables and cross-system joins that participants must discover.
One real-world entity may exist across dozens of tables under different names. This mirrors production environments where the same customer appears in 40 tables with 6 different column name conventions.
Gold SQL validation
During generation, every gold SQL query (the reference-correct SQL query for each prompt) is executed against the generated warehouse. Validation confirms: the query executes without error, returns a non-empty result set (unless empty is correct), and the result set is plausible. Queries that fail validation are regenerated. This addresses the accumulation of gold standard errors documented in fixed-dataset benchmarks.
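The first two validation checks can be sketched as follows. This is an assumption-laden illustration, not the generator's code: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, the function name `validate_gold_sql` and the `orders` table are hypothetical, and the plausibility check is omitted because its criteria are not specified.

```python
import sqlite3

def validate_gold_sql(conn, sql, empty_is_correct=False):
    """Return (ok, reason) for one candidate gold query."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Check 1: the query must execute without error.
        return False, f"execution error: {exc}"
    if not rows and not empty_is_correct:
        # Check 2: the result must be non-empty unless empty is expected.
        return False, "empty result set"
    return True, "ok"

# Tiny in-memory table standing in for a generated warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, None)])

print(validate_gold_sql(conn, "SELECT COUNT(*) FROM orders WHERE total IS NOT NULL"))
print(validate_gold_sql(conn, "SELECT * FROM orders WHERE total > 100"))
print(validate_gold_sql(conn, "SELECT * FROM missing_table"))
```

A query that fails either check would be sent back for regeneration rather than shipped with the warehouse.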
The benchmark ships Parquet files as the canonical data source, and DuckDB SQL as the canonical gold SQL. Scoring compares result sets, not SQL syntax. Therefore, participants can compete in any SQL dialect they choose. A correct query run in Postgres, Snowflake, or BigQuery scores identically to one in DuckDB.
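To make "scoring compares result sets, not SQL syntax" concrete, here is a minimal sketch of a dialect-independent comparison. It assumes rows are compared as an unordered multiset and that floats are rounded before comparison; both choices, and the function names, are assumptions rather than the documented behavior of the official harness.

```python
from collections import Counter

def normalize_row(row, float_digits=6):
    """Round floats so small engine-specific differences do not matter."""
    return tuple(
        round(value, float_digits) if isinstance(value, float) else value
        for value in row
    )

def result_sets_match(gold_rows, submitted_rows):
    """Compare two result sets as multisets of rows, ignoring row order."""
    gold = Counter(normalize_row(row) for row in gold_rows)
    submitted = Counter(normalize_row(row) for row in submitted_rows)
    return gold == submitted

gold = [("card", 120.5), ("cash", 80.0)]
submitted = [("cash", 80.0), ("card", 120.5)]  # same rows, different order
print(result_sets_match(gold, submitted))
```

Because only the returned rows are inspected, it does not matter which engine or dialect produced them.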
Gold SQL is generated first, and the natural-language prompt is then derived from it. This reverses the typical annotation workflow, in which humans write a question and then write the SQL for it. That workflow is a common source of gold standard errors, because the question asked and the query written can drift apart.
Section 06: Twenty mutations that reflect real world noise
The generator introduces controlled mutations at three levels (join, value, and dataset) grounded in patterns observed in production enterprise warehouses. Each mutation has a named type, a deterministic application rule, and a parameterized intensity range.
Join mutations
| Mutation | Description |
|---|---|
| prefix | Prepend string to ID values |
| suffix | Append string to ID values |
| leading_zeros | Zero-pad numeric IDs to fixed width |
| nulls | Replace 5–25% of FK values with NULL |
| name_mismatch | Rename column to opaque name |
| compound_two | Concatenate 2 columns into delimited key |
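The effect of a few join mutations on a foreign key can be sketched as below. The prefix string, pad width, and delimiter are hypothetical values chosen for illustration; the table above names the mutation types but not their parameters.

```python
def prefix(value, text="CUST-"):
    """prefix: prepend a string to an ID value."""
    return f"{text}{value}"

def leading_zeros(value, width=8):
    """leading_zeros: zero-pad a numeric ID to a fixed width."""
    return str(value).zfill(width)

def compound_two(left, right, delimiter="|"):
    """compound_two: concatenate two columns into one delimited key."""
    return f"{left}{delimiter}{right}"

# The parent table keeps clean integer IDs; the child FK has been mutated.
parent_ids = [17, 204, 4217]
child_fks = [prefix(leading_zeros(pid)) for pid in parent_ids]

# A naive equality join finds nothing, although every row has a match.
naive_matches = set(parent_ids) & set(child_fks)

def normalize(fk, text="CUST-"):
    """Undo the prefix and zero-padding to recover the original ID."""
    stripped = fk[len(text):] if fk.startswith(text) else fk
    return int(stripped)

recovered = {normalize(fk) for fk in child_fks}
print(child_fks, naive_matches, recovered == set(parent_ids))
```

The join still works, but only for a system that notices the transformation and reverses it.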
Value mutations
| Mutation | Description |
|---|---|
| miscast | Cast numeric column to VARCHAR |
| null_missing | Replace 10–40% of values with NULL or sentinel values |
| duplicate_records | Duplicate 5–15% of rows |
| test_records | Insert 3–8 fake test rows |
| date_proliferation | Add 2–4 redundant derived date columns |
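A few value mutations can be sketched with seeded randomness, which is one simple way to make an application rule deterministic. The function bodies, the row layout, and the seeds are assumptions for illustration; only the mutation names and percentage ranges come from the table above.

```python
import random

def miscast(rows, column):
    """miscast: store a numeric column as strings (the VARCHAR analogue)."""
    return [{**row, column: str(row[column])} for row in rows]

def null_missing(rows, column, rate, seed):
    """null_missing: replace roughly `rate` of the values with None."""
    rng = random.Random(seed)
    return [
        {**row, column: None} if rng.random() < rate else dict(row)
        for row in rows
    ]

def duplicate_records(rows, rate, seed):
    """duplicate_records: append copies of a random `rate` share of rows."""
    rng = random.Random(seed)
    extra = rng.sample(rows, round(len(rows) * rate))
    return rows + [dict(row) for row in extra]

# Hypothetical 20-row table.
rows = [{"order_id": i, "total": i * 10.0} for i in range(1, 21)]
mutated = duplicate_records(miscast(rows, "total"), rate=0.10, seed=42)
print(len(rows), len(mutated), mutated[0])
```

Reusing the same seed reproduces the same mutated table, which is what allows ground truth to be tracked through the noise.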
Dataset mutations
| Mutation | Description |
|---|---|
| duplicate_table | Exact copy with different name |
| empty_table | Valid schema, zero rows |
| name_versioned | Copy with _v2 / _backup / _deprecated suffix |
| table_subset | Column subset (30–50% columns removed) |
| near_duplicate | Copy with 1–3 columns renamed or reordered |
| table_segment | Filtered subset via WHERE clause |
| orphaned_table | Valid schema, no FK relationships |
| abandoned_table | Copy with dates shifted 2–5 years back |
| defective_pk | Copy with duplicate or null primary keys |
Section 07: Every evaluation uses a unique warehouse
Each run begins with a randomly selected industry vertical drawn from 76 subverticals spanning agriculture, healthcare, hospitality, logistics, finance, government, and more, and a randomly generated set of business entities. Those entities are assigned to randomly selected source systems, each with its own naming persona: legacy systems with truncated column names, modern SaaS platforms with clear naming, and systems in between. The assignment of entities to source systems, the naming conventions applied, and the bridge tables connecting them are all randomized per run.
The data itself is generated from a random seed. LLMs are used throughout the pipeline to produce structurally unique warehouses: the specific tables, columns, join paths, analytical concepts, query history, gold SQL, and natural-language prompts differ on every invocation. A fixed set of 20 data quality mutations is applied, but which tables and columns are affected is randomized. The result is a warehouse where no two runs share the same schema, the same entity names, the same join structure, or the same prompts.
After generation, the internal semantic model is stripped. Participants receive only raw data (mutated Parquet files), a bare schema (table names, column names, types), a query log (raw SQL with timestamps), and natural-language prompts. The semantic structure that produced the data must be rediscovered from these outputs alone.
Section 08: How we help ensure scores are comparable across runs
If every warehouse is unique, a second concern arises: is one participant's test harder than another's?
The content varies across runs, with different entities, different table names, different systems with different naming conventions. But the difficulty profile is largely constrained during generation. Each run uses a similar range of entity counts, source system counts, table counts, mutation types and coverage, bridge table quality tiers, and prompt difficulty across the difficulty matrix. This is validated with retry logic: if a generated warehouse falls outside the target range across any of these dimensions, it is rejected and regenerated.
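The reject-and-regenerate logic might look like the sketch below. Only the 50–95 table range comes from this document; the source-system range, the stand-in `generate_profile` function, and the retry limit are hypothetical.

```python
import random

# Hypothetical target ranges; only the table count range is from the text.
TARGET_RANGES = {"tables": (50, 95), "source_systems": (4, 9)}

def generate_profile(seed):
    """Stand-in for the real generator: returns summary statistics only."""
    rng = random.Random(seed)
    return {
        "tables": rng.randint(40, 105),
        "source_systems": rng.randint(3, 10),
    }

def in_range(profile):
    """True when every tracked dimension falls inside its target range."""
    return all(low <= profile[key] <= high
               for key, (low, high) in TARGET_RANGES.items())

def generate_until_valid(first_seed, max_attempts=100):
    """Try successive seeds until a profile meets every target range."""
    for attempt in range(max_attempts):
        seed = first_seed + attempt
        profile = generate_profile(seed)
        if in_range(profile):
            return seed, profile
    raise RuntimeError("no warehouse met the target ranges")

seed, profile = generate_until_valid(first_seed=0)
print(seed, profile)
```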
To verify that difficulty is comparable run-to-run, datasets are baselined against three static reference text-to-SQL systems before release. Each system must score within a tight calibration band:
| Reference model | Low threshold | Target | High threshold |
|---|---|---|---|
| GPT-5.4 Nano | 4.4% | 5.4% | 6.4% |
| Claude Haiku 4.5 | 8.4% | 9.4% | 10.4% |
| Gemini 3.1 Flash Lite | 13.3% | 14.3% | 15.3% |
All three models must fall within ±100 basis points of their target accuracy for a dataset to be eligible. For example, the chart below shows eleven candidate datasets: only four passed calibration (all three models within band simultaneously). The remaining seven were rejected because at least one model scored outside its band.
[Chart: calibration results for eleven candidate datasets, with per-model acceptance bands of 4.4–6.4%, 8.4–10.4%, and 13.3–15.3%]
The analogy is a standardized exam: each test contains a similar distribution of easy, moderate, and hard questions even though the specific questions change between administrations. The calibration step is the psychometric review that measures (and constrains) relative difficulty. Baseline scores are included in the released packages.
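The calibration rule in this section reduces to a simple check, sketched here with the targets from the reference model table. Treating the low and high thresholds as inclusive is an assumption, and the function name and sample scores are hypothetical.

```python
# Target accuracies from the calibration table, in basis points.
TARGETS_BP = {
    "GPT-5.4 Nano": 540,
    "Claude Haiku 4.5": 940,
    "Gemini 3.1 Flash Lite": 1430,
}
BAND_BP = 100  # +/- 100 basis points, i.e. 1.0 percentage point

def passes_calibration(scores_percent):
    """True only if every reference model lands inside its band."""
    for model, target_bp in TARGETS_BP.items():
        # Work in integer basis points to avoid float edge cases.
        score_bp = round(scores_percent[model] * 100)
        if abs(score_bp - target_bp) > BAND_BP:
            return False
    return True

candidate = {
    "GPT-5.4 Nano": 5.0,
    "Claude Haiku 4.5": 9.9,
    "Gemini 3.1 Flash Lite": 14.0,
}
print(passes_calibration(candidate))
```

A dataset is rejected as soon as a single model falls outside its band, which matches the "all three simultaneously" requirement described above.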
Section 09: Participants submit results, not IP
Participants submit SQL output only. No source code, model weights, architectural details, or intermediate artifacts are required or collected. The benchmark evaluates what a system produces, not how it produces it. Any approach, including RAG, fine-tuning, semantic layer pre-processing, deterministic pipelines, or methods that do not yet exist, can compete on equal terms without exposing proprietary methods.
The BIRD team may require participants to upload their complete codebase for evaluation. Members of that org have historically been funded by and embedded within a commercial entity operating in the data platform and AI analytics space.
Commercial teams with patented or trade secret methods in semantic discovery or text-to-SQL cannot reasonably participate under those terms. Standard IP protection practice precludes submitting source code to an evaluation pipeline operated by a competing organization. The result: a growing category of systems is excluded from the leaderboard by structure, not by capability.
Section 10: Register and get started
We are in beta through Q2 2026. Researchers interested in early access, stress-testing generated warehouses against their own systems, or proposing new difficulty dimensions: hello@schemantic.io or simply register here.
Step 1: Register and acquire a training data set to practice on.
Step 2: Training packages include gold SQL and a scoring harness. Use these to measure your system's performance — the harness calculates both accuracy and F1 score for deeper error analysis. The same scoring harness is used in the official evaluation.
Step 3: When ready, request an official evaluation. A procedural generator constructs a novel synthetic data warehouse containing Parquet files, a schema, and a synthetic query history. No schema documentation, join definitions, entity mappings, or semantic metadata is provided.
Step 4: Using only these, answer a bank of analytical prompts within a 12-hour time limit.
Step 5: Submissions are scored on accuracy. Leaderboard placement is opt-in; scores are not published automatically. Participants may request one evaluation per quarter, unless otherwise permitted by the Schemantic team.
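Step 2 mentions that the harness reports F1 alongside accuracy. The exact F1 definition is not given here, so the sketch below shows only one plausible reading: a row-level F1 over the multiset overlap between gold and submitted rows. The function name and sample rows are hypothetical.

```python
from collections import Counter

def row_f1(gold_rows, submitted_rows):
    """Row-level F1 between two result sets, treated as multisets."""
    gold = Counter(map(tuple, gold_rows))
    submitted = Counter(map(tuple, submitted_rows))
    if not gold and not submitted:
        return 1.0  # both empty: treat as a perfect match
    overlap = sum((gold & submitted).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(submitted.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

gold = [("a", 1), ("b", 2), ("c", 3)]
submitted = [("a", 1), ("b", 2), ("d", 4), ("e", 5)]
print(round(row_f1(gold, submitted), 4))
```

Unlike a pass/fail accuracy score, a partial-credit measure of this kind shows how close a wrong answer was, which is what makes it useful for error analysis during training.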
Section 11: Who built this
Kirsten Lum (@kirsten_lum_) is the architect and lead developer of the (S)uper (H)ard r(EL)ati(O)nal (B)enchmark. She is also the CEO/CTO of Schemantic.io. Previously, she held senior technical leadership roles at Amazon, where she led data science, engineering, architecture, and analysis, along with economics, instrumentation, and technical product management.
Tremendous gratitude to Wing Yew Lum, Garrett Fiddler, Sheffield Leithart, and Jonathan Brownell for their support in benchmark design, engineering, and validation.
Thank you to the dozens of experts who contributed feedback and domain knowledge, including the amazing, brilliant, and kind Victor Cheng, Hannah Yuen, Erica Davis, Amos Yuen, Nathan Lee, and Christopher Gutierrez.
@misc{lum2026shelob,
title = {SHELOB: A Generative Benchmark for Semantic Discovery and Enterprise Text-to-SQL},
author = {Lum, Kirsten and Fiddler, Garrett and Leithart, Sheffield and Brownell, Jonathan and Torgerson, Erin and Lum, Wing Yew},
year = {2026},
url = {https://schemantic.io/benchmark},
note = {Operating in Beta through Q2 2026}
}