A Context Layer Benchmark at Enterprise Scale | Last Published: 3/27/2026

(S)uper (H)ard r(EL)ati(O)nal (B)enchmark

In search of the context layer: an enterprise-difficulty relational data benchmark for AI.

Table of Contents

Why this benchmark exists
How this benchmark compares to BIRD
Five dimensions of enterprise data complexity
How prompts are constructed and validated
How warehouses and gold SQL are generated
Twenty mutations that reflect real world noise
Every evaluation uses a unique warehouse
How we help ensure scores are comparable across runs
Participants submit results, not IP
Register and get started
Who built this

Section 01: Why this benchmark exists

The mission. We set out to help researchers and AI systems make progress on enterprise relational data — which meant building a benchmark that looks like the environments we work in. Messy, undocumented, and large, but with coherent underlying structure and validated gold SQL throughout. Each warehouse is procedurally generated from a unique seed, with twenty mutation types applied across joins, values, and tables, and entities fragmented across multiple source systems. No two evaluations share the same schema.

That level of realism currently makes this the hardest (and perhaps the most indicative of real-world conditions) benchmark in this domain.

The problem. Industry-standard text-to-SQL benchmarks provide the semantic information needed to write queries: foreign keys, table descriptions, column descriptions, and evidence hints. None tests whether a system can discover these relationships from raw data. In production enterprise environments, semantic discovery is the problem. Writing SQL is the easy part once you know what the data means.

This benchmark will provide what no published benchmark currently does: procedurally generated enterprise-scale warehouses with known ground truth, cross-system entity resolution through bridge tables of variable quality, bronze-to-silver noise inheritance, derived concepts encoded only in synthetic query history, source-system naming personas that obfuscate column semantics, three-axis difficulty parameterization, and an IP-safe submission system that evaluates output without requiring codebase disclosure.

Section 02: How this benchmark compares to BIRD

We honor the groundbreaking BIRD benchmark but see an opportunity to further support researchers with a new benchmark. The table below compares key dimensions. Independent research, summarized after the table, has identified areas where current evaluation methodology could be strengthened.

Dimension	BIRD	This benchmark
Tables per database	Avg. 7 (max 13)	50–95 tables, millions of rows
Semantic metadata provided	FK, descriptions, evidence hints	None
Data exposure risk	High — train data public since 2023	Zero — generated per run
Cross-system entity resolution	No	Yes
Bridge tables (variable quality)	No	Yes
Concepts in query history only	No	Yes
Bronze→silver noise inheritance	No	Yes
Scoring method	Execution accuracy (EX)	Accuracy (competition) + F1 (self-score in train)
SQL dialect	SQLite only	Any — scoring compares result sets, not SQL syntax
Codebase disclosure required	In certain circumstances	Never

Independent evaluation has identified quality issues in BIRD's gold SQL. Wretblad et al. (ACL 2024) documented how noise in gold queries distorts text-to-SQL benchmark scores, sometimes flipping which system appears stronger. A separate analysis of 151 BIRD training questions found 49 where the gold SQL is demonstrably incorrect (Matson, MotherDuck, Feb 2026) — systems may be reproducing the benchmark's errors rather than learning correct SQL.

Section 03: Five dimensions of enterprise data complexity

Each dimension isolates an aspect of data complexity absent from existing relational data benchmarks.

5.1 Semantic opacity

Column names are obfuscated by source-system personas. The system cannot rely on naming conventions to infer meaning. It must examine the data itself: value distributions, co-occurrence patterns, statistical properties.

5.2 Cross-system entity resolution

One entity (a customer, an order, a product) exists across multiple source systems with different identifiers, different column names, and different granularities. Bridge tables connecting these systems have variable quality: some complete, some partial, some stale.
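
To make "variable quality" concrete, here is a minimal sketch of how a participant might measure how much of one system a bridge table actually covers. The function name, the ID formats, and the CRM/POS example data are all hypothetical; the benchmark does not prescribe this check.

```python
def bridge_coverage(source_ids, bridge_pairs, target_ids):
    """Return the share of source IDs that the bridge maps to an existing target ID."""
    source_ids = set(source_ids)
    target_ids = set(target_ids)
    if not source_ids:
        return 0.0
    # Keep only bridge rows whose target actually exists (stale rows point nowhere).
    mapped = {src for src, tgt in bridge_pairs if tgt in target_ids}
    return len(mapped & source_ids) / len(source_ids)


# Hypothetical IDs from two source systems and a partial, partly stale bridge.
crm_contacts = ["C1", "C2", "C3", "C4"]
pos_customers = ["P10", "P11", "P12"]
bridge = [("C1", "P10"), ("C2", "P11"), ("C3", "P99")]  # P99 does not exist

coverage = bridge_coverage(crm_contacts, bridge, pos_customers)
print(f"bridge coverage: {coverage:.0%}")  # prints "bridge coverage: 50%"
```

A low coverage figure suggests the bridge is partial or stale and that a second resolution path may be needed.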

5.3 Bronze-layer noise surviving into silver

The silver layer inherits noise from bronze: 90-day data gaps, implausible dates, all-null columns, junk sentinel values, segments masquerading as subsets, type mismatches, duplicate records, test records, and abandoned tables. Silver does not clean these because in production, it rarely does.
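
Two of the noise types listed above (all-null columns and junk sentinel values) can be surfaced with a simple column profile. The sketch below is illustrative only: the sentinel list is an assumption, not the set the generator uses.

```python
# Assumed junk markers; the benchmark's actual sentinel values are not published here.
SENTINELS = {"", "N/A", "UNKNOWN", "-1", "9999-12-31"}


def profile_column(values):
    """Summarize null and sentinel rates for one column's values."""
    total = len(values)
    nulls = sum(v is None for v in values)
    sentinels = sum(
        str(v).strip().upper() in SENTINELS for v in values if v is not None
    )
    return {
        "all_null": total > 0 and nulls == total,
        "null_rate": nulls / total if total else 0.0,
        "sentinel_rate": sentinels / total if total else 0.0,
    }


dead_column = profile_column([None, None, None])
noisy_dates = profile_column(["2021-03-01", "9999-12-31", None, "n/a"])
print(dead_column)
print(noisy_dates)
```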

5.4 Concepts that exist only in query history

Derived metrics and business events are not defined in any schema metadata. They exist only as patterns in the synthetic query history, embedded in computed aggregations, filtered subsets, and windowed calculations. Ignoring query history makes these prompts unanswerable.
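
One rough way to find such concepts is to look for aggregate expressions that recur across the query log. The sketch below uses a regular expression rather than a real SQL parser, so it only handles non-nested aggregates. The table and column names in the sample log are invented.

```python
import re
from collections import Counter

# Matches simple, non-nested aggregate calls such as SUM(a - b) or COUNT(*).
AGG_PATTERN = re.compile(r"\b(?:SUM|AVG|COUNT|MIN|MAX)\s*\([^()]*\)", re.IGNORECASE)


def recurring_aggregates(query_log, min_count=2):
    """Count aggregate expressions that appear at least min_count times in the log."""
    counts = Counter()
    for sql in query_log:
        for expr in AGG_PATTERN.findall(sql):
            normalized = re.sub(r"\s+", " ", expr).upper()
            counts[normalized] += 1
    return {expr: n for expr, n in counts.items() if n >= min_count}


# Hypothetical query history entries.
query_log = [
    "SELECT SUM(amt - disc) FROM t_ord WHERE st = 'C'",
    "select region, sum(amt - disc) from t_ord where st = 'C' group by region",
    "SELECT COUNT(*) FROM t_cust",
]

candidates = recurring_aggregates(query_log)
print(candidates)  # {'SUM(AMT - DISC)': 2}
```

An expression that recurs with the same filter, as here, is a candidate for a derived metric such as net revenue on completed orders.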

5.5 Dynamic test generation eliminates contamination

No generated warehouse, schema, or prompt set appears in any training corpus. Every evaluation is against data the model has never seen.

Section 04: How prompts are constructed and validated

The difficulty dimensions above are reflected in how prompts are constructed.

Prompt difficulty is parameterized along three independent axes, enabling stratified analysis of where systems fail.

Axis	Easy	Moderate	Hard
SQL complexity	Single-table, counts, filters	Multi-join, aggregations, subqueries	Complex SQL operations, many tables, nested logic
Concept complexity	Direct column references	Derived metrics from query history	Multi-step derived concepts, cross-system joins required
Prompt clarity	Clean, precise question	Minor ambiguity, extraneous context	Grammatical errors, verbose persona, irrelevant detail

Prompt personas range from a senior analyst who asks precise questions to a new hire who adds extraneous context and misspells terms. Difficulty distribution is reported alongside accuracy so that systems are evaluated on where they fail, not collapsed into a single number.
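
Stratified reporting of this kind can be sketched in a few lines. The record layout below (keys `sql`, `concept`, `clarity`, `correct`) is an assumption for illustration, not the harness's actual output format.

```python
from collections import defaultdict


def stratified_accuracy(results, axis):
    """Return accuracy per difficulty tier along one axis."""
    buckets = defaultdict(lambda: [0, 0])  # tier -> [correct, total]
    for record in results:
        tier = record[axis]
        buckets[tier][0] += int(record["correct"])
        buckets[tier][1] += 1
    return {tier: correct / total for tier, (correct, total) in buckets.items()}


# Hypothetical per-prompt results tagged on all three axes.
results = [
    {"sql": "easy", "concept": "easy", "clarity": "hard", "correct": True},
    {"sql": "easy", "concept": "hard", "clarity": "easy", "correct": False},
    {"sql": "hard", "concept": "hard", "clarity": "easy", "correct": False},
    {"sql": "hard", "concept": "easy", "clarity": "moderate", "correct": True},
]

by_concept = stratified_accuracy(results, "concept")
by_sql = stratified_accuracy(results, "sql")
print(by_concept)  # {'easy': 1.0, 'hard': 0.0}
print(by_sql)      # {'easy': 0.5, 'hard': 0.5}
```

In this toy data the overall accuracy is 50%, but the breakdown shows that every failure comes from hard concepts rather than hard SQL.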

Example: same query pattern, two prompt tiers

Tier 1 — Senior Analyst

Clean, precise, no ambiguity

Revenue breakdown by payment method, highest first. Exclude any line items missing a total. Need this for the board deck.

Tier 3 — New Hire

Verbose, uncertain, irrelevant context

Hi there! I'm new to the team (just started last week actually, still figuring out where the coffee machine is haha) and my manager asked me to pull together some shipping data but I want to make sure I'm describing this correctly before I waste anyone's time. So basically what I think we need is a breakdown by each shipping carrier we work with. For each of those carriers, we want to know two things: how many times shipments were actually dispatched where we have a weight recorded, and the average weight of those shipments in kilograms. Oh and I think it would make the most sense to sort the whole thing so the carrier with the highest number of dispatches shows up at the top. Sorry if this is overly detailed!

Both prompts target the same SQL pattern, a grouped aggregation with a filter and ORDER BY. But Tier 3 buries the requirements in conversational noise, hedging, and irrelevant detail. Systems must extract the analytical intent from both.
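
As an illustration of that pattern, the Tier 3 request could reduce to a query like the one below. This is a sketch: the `shipments` table and its columns are invented, and Python's built-in `sqlite3` stands in for DuckDB so the example runs without extra dependencies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (carrier TEXT, dispatched_at TEXT, weight_kg REAL)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?)",
    [
        ("Acme", "2026-01-02", 10.0),
        ("Acme", "2026-01-03", 20.0),
        ("Acme", "2026-01-04", None),  # no weight recorded: filtered out
        ("Bolt", "2026-01-02", 5.0),
        ("Bolt", None, 7.0),           # never dispatched: filtered out
    ],
)

# Grouped aggregation with a filter and ORDER BY.
rows = conn.execute(
    """
    SELECT carrier,
           COUNT(*)       AS dispatches,
           AVG(weight_kg) AS avg_weight_kg
    FROM shipments
    WHERE dispatched_at IS NOT NULL
      AND weight_kg IS NOT NULL
    GROUP BY carrier
    ORDER BY dispatches DESC
    """
).fetchall()
print(rows)  # [('Acme', 2, 15.0), ('Bolt', 1, 5.0)]
```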

Section 05: How warehouses and gold SQL are generated

The warehouse generator was initially built for internal use. Schemantic's Semantic Acquisition Model (SAM) needed to be tested against warehouses that look like production enterprise environments. That meant we needed a dataset with ground truth semantics, something that is vanishingly rare in the data world. A static test set would allow SAM to overfit. The generator exists so that every internal test run is against a warehouse SAM has never seen.

Each train and test warehouse is built through a seven-stage pipeline. The generator selects an industry vertical, creates entities and source systems, layers in realistic noise, and produces gold SQL with natural-language prompts — all from a unique random seed.

01 Entity relationship model
02 Source system personas
03 Bronze layer
04 Silver layer
05 Semantic terms
06 Query history
07 Gold SQL + prompts

Each generated warehouse contains 50–95 tables with thousands of columns and millions of rows, working but undeclared join relationships, entities spread across multiple source systems with variant identifiers, temporal event data, controlled data quality issues, and derived tables with traceable lineage.

Source systems are assigned naming "personas": legacy financial systems with truncated column names (cst_acq_dt), modern SaaS platforms with transparent naming, and systems in between. Beyond the industry-specific systems, each warehouse also includes the cross-functional systems that real enterprises run, such as CRMs, marketing automation tools, HR systems, and support ticketing portals.

These functional systems share entities with the vertical systems (a CRM Contact is the same person as a POS Customer), creating additional bridge tables and cross-system joins that participants must discover.

One real-world entity may exist across dozens of tables under different names. This mirrors production environments where the same customer appears in 40 tables with 6 different column name conventions.

Gold SQL validation

During generation, every gold SQL query (the reference-correct SQL query for each prompt) is executed against the generated warehouse. Validation confirms: the query executes without error, returns a non-empty result set (unless empty is correct), and the result set is plausible. Queries that fail validation are regenerated. This addresses the accumulation of gold standard errors documented in fixed-dataset benchmarks.
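
The first two checks (executes without error, returns rows unless empty is expected) can be sketched as follows. This is not the generator's code: `validate_gold_sql` and its `allow_empty` flag are hypothetical, `sqlite3` stands in for DuckDB, and the plausibility check is omitted because the text does not define it.

```python
import sqlite3


def validate_gold_sql(conn, sql, allow_empty=False):
    """Return (ok, reason) for the 'executes' and 'non-empty' checks."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, f"execution error: {exc}"
    if not rows and not allow_empty:
        return False, "empty result set"
    return True, "ok"


# Tiny stand-in warehouse with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.5)")

good = validate_gold_sql(conn, "SELECT id FROM orders")
empty = validate_gold_sql(conn, "SELECT id FROM orders WHERE total > 100")
broken = validate_gold_sql(conn, "SELECT id FROM missing_table")
print(good, empty, broken[0])
```

A query that fails either check would be sent back for regeneration, as described above.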

The benchmark ships Parquet files as the canonical data source, and DuckDB SQL as the canonical gold SQL. Scoring compares result sets, not SQL syntax. Therefore, participants can compete in any SQL dialect they choose. A correct query run in Postgres, Snowflake, or BigQuery scores identically to one in DuckDB.
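
Comparing result sets rather than SQL text might look like the sketch below. It treats results as multisets of rows, ignoring row order and rounding floats. Both choices are assumptions, since the text does not say how the official harness handles ordering or floating-point tolerance.

```python
from collections import Counter


def normalize_row(row, ndigits=6):
    """Round floats so engines with slightly different arithmetic still agree."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def result_sets_match(gold_rows, submitted_rows):
    """Order-insensitive comparison that still respects duplicate rows."""
    gold = Counter(normalize_row(r) for r in gold_rows)
    submitted = Counter(normalize_row(r) for r in submitted_rows)
    return gold == submitted


gold = [("card", 120.0), ("cash", 80.5)]
same_different_order = [("cash", 80.5000000001), ("card", 120.0)]
missing_row = [("card", 120.0)]

print(result_sets_match(gold, same_different_order))  # True
print(result_sets_match(gold, missing_row))           # False
```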

Gold SQL is generated first, and the natural-language prompt is then derived from it. This reverses the typical annotation workflow, in which humans write a question and then write SQL for it. That workflow is a common source of gold-standard errors: misalignment between the question asked and the query written.

Section 06: Twenty mutations that reflect real world noise

The generator introduces controlled mutations at three levels (join, value, and dataset) grounded in patterns observed in production enterprise warehouses. Each mutation has a named type, a deterministic application rule, and a parameterized intensity range.

Join mutations

Mutation	Description
prefix	Prepend string to ID values
suffix	Append string to ID values
leading_zeros	Zero-pad numeric IDs to fixed width
nulls	Replace 5–25% of FK values with NULL
name_mismatch	Rename column to opaque name
compound_two	Concatenate 2 columns into delimited key
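
Several of these join mutations (prefix, suffix, leading_zeros) can be undone with key normalization before joining. The sketch below assumes the affixes are already known. In a real run they would have to be discovered from the data, and the specific strings shown are invented.

```python
def normalize_key(value, prefixes=("CUST-", "ID_"), suffixes=("-US", "_X")):
    """Strip assumed prefixes/suffixes and zero padding from an ID value."""
    if value is None:
        return None  # the nulls mutation leaves nothing to recover
    key = str(value).strip()
    for prefix in prefixes:
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    for suffix in suffixes:
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    # Drop leading zeros on purely numeric IDs, keeping a lone "0".
    if key.isdigit():
        key = key.lstrip("0") or "0"
    return key


# The same hypothetical customer ID as three systems might store it.
variants = ["CUST-00042", "42-US", 42]
print([normalize_key(v) for v in variants])  # ['42', '42', '42']
```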

Value mutations

Mutation	Description
miscast	Cast numeric column to VARCHAR
null_missing	Replace 10–40% of values with NULL or sentinel values
duplicate_records	Duplicate 5–15% of rows
test_records	Insert 3–8 fake test rows
date_proliferation	Add 2–4 redundant derived date columns

Dataset mutations

Mutation	Description
duplicate_table	Exact copy with different name
empty_table	Valid schema, zero rows
name_versioned	Copy with _v2 / _backup / _deprecated suffix
table_subset	Column subset (30–50% columns removed)
near_duplicate	Copy with 1–3 columns renamed or reordered
table_segment	Filtered subset via WHERE clause
orphaned_table	Valid schema, no FK relationships
abandoned_table	Copy with dates shifted 2–5 years back
defective_pk	Copy with duplicate or null primary keys
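
A first pass at spotting some dataset mutations (empty_table, name_versioned, duplicate_table) needs only table names and row contents. The sketch below is a heuristic with invented table names. It compares full row contents, so it would not catch near_duplicate or table_subset.

```python
import re
from collections import Counter

VERSION_SUFFIX = re.compile(r"_(v\d+|backup|deprecated)$", re.IGNORECASE)


def flag_suspect_tables(tables):
    """tables: dict of name -> list of row tuples. Returns name -> list of flags."""
    flags = {name: [] for name in tables}
    seen = {}  # content fingerprint -> first table (alphabetically) with that content
    for name, rows in sorted(tables.items()):
        if not rows:
            flags[name].append("empty_table")
            continue
        if VERSION_SUFFIX.search(name):
            flags[name].append("name_versioned")
        fingerprint = frozenset(Counter(rows).items())  # order-insensitive content hash
        if fingerprint in seen:
            flags[name].append(f"duplicate_of:{seen[fingerprint]}")
        else:
            seen[fingerprint] = name
    return flags


# Hypothetical warehouse contents.
tables = {
    "orders": [(1, "a"), (2, "b")],
    "orders_backup": [(2, "b"), (1, "a")],
    "tmp_load": [],
    "customers": [(1, "x")],
}
suspects = flag_suspect_tables(tables)
print(suspects)
```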

Section 07: Every evaluation uses a unique warehouse

Each run begins with a randomly selected industry vertical drawn from 76 subverticals spanning agriculture, healthcare, hospitality, logistics, finance, government, and more, and a randomly generated set of business entities. Those entities are assigned to randomly selected source systems, each with its own naming persona: legacy systems with truncated column names, modern SaaS platforms with clear naming, and systems in between. The assignment of entities to source systems, the naming conventions applied, and the bridge tables connecting them are all randomized per run.

The data itself is generated from a random seed. LLMs are used throughout the pipeline to produce structurally unique warehouses: the specific tables, columns, join paths, analytical concepts, query history, gold SQL, and natural-language prompts differ on every invocation. A fixed set of 20 data quality mutations is applied, but which tables and columns are affected is randomized. The result is a warehouse where no two runs share the same schema, the same entity names, the same join structure, or the same prompts.

After generation, the internal semantic model is stripped. Participants receive only raw data (mutated Parquet files), a bare schema (table names, column names, types), a query log (raw SQL with timestamps), and natural-language prompts. The semantic structure that produced the data must be rediscovered from these outputs alone.

Section 08: How we help ensure scores are comparable across runs

If every warehouse is unique, a natural concern follows: is one participant's test harder than another's?

The content varies across runs, with different entities, different table names, different systems with different naming conventions. But the difficulty profile is largely constrained during generation. Each run uses a similar range of entity counts, source system counts, table counts, mutation types and coverage, bridge table quality tiers, and prompt difficulty across the difficulty matrix. This is validated with retry logic: if a generated warehouse falls outside the target range across any of these dimensions, it is rejected and regenerated.
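
The reject-and-regenerate loop can be sketched as follows. Only the 50–95 table range comes from this document. The source-system range, the toy `generate_profile` function, and the seed-increment strategy are placeholders for illustration.

```python
import random

# "tables" range is from the text; the "source_systems" range is an assumption.
TARGET_RANGES = {"tables": (50, 95), "source_systems": (3, 8)}


def generate_profile(seed):
    """Toy stand-in for the warehouse generator: deterministic for a given seed."""
    rng = random.Random(seed)
    return {"tables": rng.randint(40, 110), "source_systems": rng.randint(2, 10)}


def within_targets(profile):
    return all(lo <= profile[key] <= hi for key, (lo, hi) in TARGET_RANGES.items())


def generate_until_valid(start_seed, max_attempts=1000):
    """Try successive seeds until a candidate falls inside every target range."""
    for attempt in range(max_attempts):
        seed = start_seed + attempt
        profile = generate_profile(seed)
        if within_targets(profile):
            return seed, profile
    raise RuntimeError("no candidate met the target ranges")


seed, profile = generate_until_valid(start_seed=0)
print(seed, profile)
```

Because each candidate is a pure function of its seed, an accepted warehouse can be regenerated exactly from the seed alone.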

To verify that difficulty is comparable run-to-run, datasets are baselined against three static reference text-to-SQL systems before release. Each system must score within a tight calibration band:

Reference model	Low threshold	Target	High threshold
GPT-5.4 Nano	4.4%	5.4%	6.4%
Claude Haiku 4.5	8.4%	9.4%	10.4%
Gemini 3.1 Flash Lite	13.3%	14.3%	15.3%

All three models must fall within ±100 basis points of their target accuracy for a dataset to be eligible. For example, the chart below shows eleven candidate datasets: only four passed calibration (all three models within band simultaneously). The remaining seven were rejected because at least one model scored outside its band.

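Chart data: accuracy (%) of each reference model on eleven candidate datasets. Bands: GPT-5.4 Nano 4.4–6.4%, Haiku 4.5 8.4–10.4%, Gemini Flash 13.3–15.3%.

Dataset	GPT-5.4 Nano	Haiku 4.5	Gemini Flash	Result
1	5.4	10.0	14.9	Eligible
2	10.5	12.7	22.6	Rejected
3	6.2	9.7	13.5	Eligible
4	4.4	8.8	14.2	Eligible
5	5.7	8.6	13.4	Eligible
6	6.6	10.6	18.8	Rejected
7	12.1	16.2	21.7	Rejected
8	9.7	12.5	22.8	Rejected
9	6.2	11.4	15.9	Rejected
10	5.1	10.0	15.4	Rejected
11	4.4	6.9	14.8	Rejected

Eligible: all 3 models within band. Rejected: at least one model outside its band.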

The analogy is a standardized exam: each test contains a similar distribution of easy, moderate, and hard questions even though the specific questions change between administrations. The calibration step is the psychometric review that measures (and constrains) relative difficulty. Baseline scores are included in the released packages.
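
The eligibility rule can be checked directly against the chart values. The sketch below assumes the scores are listed in dataset order 1 to 11 and that band limits are inclusive. The inclusive reading is consistent with four datasets passing, since dataset 4 sits exactly on the GPT-5.4 Nano lower limit.

```python
# (low, high) accuracy thresholds in percent, from the calibration table.
BANDS = {
    "GPT-5.4 Nano": (4.4, 6.4),
    "Claude Haiku 4.5": (8.4, 10.4),
    "Gemini 3.1 Flash Lite": (13.3, 15.3),
}

# Per-dataset scores read off the chart, assumed to be in dataset order 1..11.
SCORES = {
    "GPT-5.4 Nano": [5.4, 10.5, 6.2, 4.4, 5.7, 6.6, 12.1, 9.7, 6.2, 5.1, 4.4],
    "Claude Haiku 4.5": [10.0, 12.7, 9.7, 8.8, 8.6, 10.6, 16.2, 12.5, 11.4, 10.0, 6.9],
    "Gemini 3.1 Flash Lite": [14.9, 22.6, 13.5, 14.2, 13.4, 18.8, 21.7, 22.8, 15.9, 15.4, 14.8],
}


def eligible_datasets(scores, bands):
    """Return 1-based dataset numbers where every model lands inside its band."""
    n = len(next(iter(scores.values())))
    return [
        i + 1
        for i in range(n)
        if all(bands[m][0] <= scores[m][i] <= bands[m][1] for m in bands)
    ]


eligible = eligible_datasets(SCORES, BANDS)
print(eligible)  # [1, 3, 4, 5]
```

Dataset 10 shows how tight the band is: two models pass, but Gemini's 15.4% misses its 15.3% ceiling by a tenth of a point, so the dataset is rejected.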

Section 09: Participants submit results, not IP

ARTKYS

All Right Then, Keep Your Secrets

Participants submit SQL output only. No source code, model weights, architectural details, or intermediate artifacts are required or collected. The benchmark evaluates what a system produces, not how it produces it. Any approach, including RAG, fine-tuning, semantic layer pre-processing, deterministic pipelines, or methods that do not yet exist, can compete on equal terms without exposing proprietary methods.

The BIRD team may require participants to upload their complete codebase for evaluation. Members of that org have historically been funded by and embedded within a commercial entity operating in the data platform and AI analytics space.

Commercial teams with patented or trade-secret methods in semantic discovery or text-to-SQL cannot reasonably participate under those terms. Standard IP protection practice precludes submitting source code to an evaluation pipeline operated by a competing organization. The result: a growing category of systems is excluded from the leaderboard by structure, not by capability.

Section 10: Register and get started

We are in beta through Q2 2026. Researchers interested in early access, stress-testing generated warehouses against their own systems, or proposing new difficulty dimensions can write to hello@schemantic.io or simply register.

Step 1: Register and acquire a training data set to practice on.

Step 2: Training packages include gold SQL and a scoring harness. Use these to measure your system's performance — the harness calculates both accuracy and F1 score for deeper error analysis. The same scoring harness is used in the official evaluation.

Step 3: When ready, request an official evaluation. A procedural generator constructs a novel synthetic data warehouse containing Parquet files, a schema, and a synthetic query history. No schema documentation, join definitions, entity mappings, or semantic metadata is provided.

Step 4: Using only these, answer a bank of analytical prompts within a 12-hour time limit.

Step 5: Submissions are scored on accuracy. Leaderboard placement is opt-in; scores are not published automatically. Participants may request one evaluation per quarter, unless otherwise permitted by the Schemantic team.
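
Step 2 mentions that the harness reports F1 alongside accuracy. The text does not define how F1 is computed, so the sketch below shows one plausible reading: row-level precision and recall between the gold and submitted result sets. Treat it as an assumption, not the harness's definition.

```python
from collections import Counter


def row_f1(gold_rows, submitted_rows):
    """Row-level F1 between two result sets, treating each as a multiset of rows."""
    gold, submitted = Counter(gold_rows), Counter(submitted_rows)
    if not gold and not submitted:
        return 1.0  # both empty: treated as a perfect match
    overlap = sum((gold & submitted).values())  # rows present in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(submitted.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


gold = [("a", 1), ("b", 2), ("c", 3)]
submitted = [("a", 1), ("b", 2), ("d", 4)]

score = row_f1(gold, submitted)
print(round(score, 3))  # 0.667
```

Under this reading, a submission that gets two of three rows right scores zero on exact accuracy but about 0.67 on F1, which is what makes F1 useful for error analysis during training.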

Section 11: Who built this

Kirsten Lum (@kirsten_lum_) is the architect and lead developer of the (S)uper (H)ard r(EL)ati(O)nal (B)enchmark. She is also the CEO/CTO of Schemantic.io. Previously, she was in senior technical leadership at Amazon, where she led data science, engineering, architecture, and analysis, along with economics, instrumentation, and technical product management.

Tremendous gratitude to Wing Yew Lum, Garrett Fiddler, Sheffield Leithart, and Jonathan Brownell for their support in benchmark design, engineering, and validation.

Thank you to the dozens of experts who contributed feedback and domain knowledge, including the amazing, brilliant, and kind Victor Cheng, Hannah Yuen, Erica Davis, Amos Yuen, Nathan Lee, and Christopher Gutierrez.

Cite this benchmark

@misc{lum2026shelob,
  title   = {SHELOB: A Generative Benchmark for Semantic Discovery and Enterprise Text-to-SQL},
  author  = {Lum, Kirsten and Fiddler, Garrett and Leithart, Sheffield and Brownell, Jonathan and Torgerson, Erin and Lum, Wing Yew},
  year    = {2026},
  url     = {https://schemantic.io/benchmark},
  note    = {Operating in Beta through Q2 2026}
}