EAAPL-KNW002: Semantic Data Layer

Pattern ID: EAAPL-KNW002
Status: Proven
Complexity: High
Tags: knowledge-graph, llm, traceability, high-complexity
Version: 1.0
Last Updated: 2026-06-12

1. Executive Summary

The Semantic Data Layer (SDL) is a governed translation layer that sits between an enterprise's raw data sources and its AI applications. It maps enterprise data to a shared business ontology, enabling natural language queries to be translated into precise, governed data access without requiring AI applications to understand the underlying physical data model.

The SDL solves a critical enterprise AI problem: LLMs given raw schema access (column names, table structures) produce inconsistent query interpretations because the same business concept — "revenue," "active customer," "exposure" — is defined differently across systems. The SDL establishes a single authoritative definition for every business term and maps each source system to it.

For CIOs and CTOs, the SDL delivers three compounding benefits: (1) AI applications become source-system-agnostic, so migrations and system changes do not break AI behaviour; (2) the business glossary enforces consistent AI answers because all AI applications share the same term definitions; (3) semantic caching, which re-uses translated queries for equivalent natural language questions, reduces LLM translation API costs by roughly 30–70% in high-volume deployments, in line with the cache hit rates described in Section 4.5.

The pattern is most valuable in organisations with ≥5 source systems, cross-domain AI use cases, and active data governance programmes. Implementation requires 3–6 months to reach production maturity.

2. Problem Statement

2.1 Business Problem

Enterprise data is physically distributed across ERP, CRM, data warehouse, operational databases, and SaaS platforms. Each system defines business concepts independently: "customer" in Salesforce may be a legal entity, while "customer" in the billing system may be an individual account, and "customer" in the analytics warehouse may be a household cluster. AI applications trained to answer business questions across these systems produce inconsistent and sometimes contradictory answers because they resolve the same term differently depending on which system they access.

2.2 Technical Problem

LLMs translating natural language questions to SQL or SPARQL queries against raw schemas frequently misinterpret column names, join conditions, and aggregation logic. Without a semantic layer, prompt engineering must embed physical schema details and business rules directly into every AI application — creating brittle integrations that break when schemas change and accumulate contradictory business rule definitions across applications.

2.3 Symptoms

Two AI applications return different "total revenue" figures for the same time period
AI-generated SQL queries fail intermittently due to schema changes in source systems
Business analysts must manually verify every AI-generated data answer against known benchmarks
Adding a new data source requires updating every AI application's prompt separately
Data governance cannot locate where a specific business definition is operationalised in AI systems

2.4 Cost of Inaction

Trust collapse: business users stop relying on AI-generated data insights within weeks of launch when inconsistencies are discovered
Compliance exposure: regulatory reporting generated by AI applications with inconsistent term definitions produces incorrect submissions
Engineering debt: each new AI application rebuilds the same business rule logic independently, creating N maintenance obligations
Data migration risk: any source system migration risks breaking all AI applications that rely on physical schema knowledge

3. Context

3.1 When to Apply

≥5 source systems that must be queryable by AI applications using shared business terminology
Active enterprise data governance programme with a business glossary in progress or completed
Cross-domain AI use cases (e.g., finance + operations + customer) that require consistent term definitions
High query volume AI deployments where semantic caching can deliver measurable cost reduction
Regulatory reporting requirements that demand consistent definitions across AI outputs

3.2 When NOT to Apply

Single source system AI applications — the abstraction overhead is not justified
Organisations without a data governance programme — SDL without ontology ownership degrades to an unmaintained mapping layer
Real-time streaming AI use cases where query translation latency is unacceptable
Early MVP/PoC phases: validate the AI value proposition first, then add the semantic governance layer once production use is confirmed

3.3 Prerequisites

Business glossary with ≥80% coverage of key business terms used in target AI use cases
Data catalogue with documented source system schemas and ownership
Data steward function with clear domain ownership responsibilities
API or direct connection access to all source systems that will be mapped to the semantic layer

3.4 Industry Applicability

Industry	Applicability	Primary Use Case
Financial Services	Critical	Regulatory reporting consistency, risk metric definitions, customer exposure calculation
Healthcare	High	Clinical terminology standardisation (SNOMED, LOINC mapping), patient data access
Retail / CPG	High	Product taxonomy, sales metrics consistency, customer segmentation definitions
Manufacturing	High	Product hierarchy, BOM relationships, operational KPI definitions
Telecommunications	High	Network entity relationships, service definitions, customer hierarchy
Government	High	Policy term consistency, inter-agency data sharing, citizen service definitions

4. Architecture Overview

The Semantic Data Layer is structured into five functional layers that together form a pipeline from business definition to data retrieval.

4.1 Business Glossary Foundation

The business glossary is the authoritative source of business term definitions. It precedes the SDL and must be governed independently. Each glossary entry specifies: term name, canonical definition, synonyms, related terms, owning business domain, and the data steward responsible for maintaining the definition. The SDL treats the business glossary as read-only input — it does not own definitions, it operationalises them.

4.2 Ontology Layer

The ontology translates the business glossary into a machine-readable formal specification using OWL (Web Ontology Language) or a property graph schema. Business entities become ontology classes. Relationships between entities become ontology properties. Business metrics and derived measures become calculated properties with defined formulas. The ontology is maintained by the ontology governance committee and versioned in source control. Schema changes go through a formal change control process with impact analysis.

4.3 Semantic Mapping Layer

The semantic mapping layer connects the ontology to the physical source systems. For each ontology class and property, a mapping definition specifies: source system, schema/table/column path, transformation logic (type casts, aggregations, filters), validity constraints, and effective date range. Mappings are authored by data engineers in collaboration with domain data stewards. They are stored in a mapping registry — a versioned repository of mapping definitions that can be audited and rolled back.
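A mapping definition with the fields listed above could be modelled roughly as follows. This is a minimal sketch, not a prescribed schema: the class name `MappingDefinition`, the helper `active_mappings`, and the example source systems and column paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class MappingDefinition:
    concept: str                  # ontology class or property being mapped
    source_system: str            # hypothetical system identifier
    physical_path: str            # schema.table.column
    transformation: str           # SQL expression applied to the column
    effective_from: date
    effective_to: Optional[date]  # None means the mapping is open-ended
    version: int

    def is_effective(self, on: date) -> bool:
        """True when `on` falls inside the effective date range (inclusive)."""
        ends_ok = self.effective_to is None or on <= self.effective_to
        return self.effective_from <= on and ends_ok


def active_mappings(registry: List[MappingDefinition], concept: str,
                    on: date) -> List[MappingDefinition]:
    """Return every mapping for `concept` that is effective on the given date."""
    return [m for m in registry if m.concept == concept and m.is_effective(on)]


# Illustrative registry: one concept mapped to two systems over time.
registry = [
    MappingDefinition("FinancialMetric.GrossRevenue", "legacy_erp",
                      "fin.invoices.amt", "CAST(amt AS DECIMAL(18,2))",
                      date(2020, 1, 1), date(2024, 12, 31), 1),
    MappingDefinition("FinancialMetric.GrossRevenue", "new_erp",
                      "finance.invoice.gross_amount", "gross_amount",
                      date(2025, 1, 1), None, 2),
]
current = active_mappings(registry, "FinancialMetric.GrossRevenue", date(2026, 6, 12))
```

Keeping the effective date range on the mapping itself means a historical query can resolve to the system that was authoritative at the time, which is one plausible reason to version mappings instead of overwriting them.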

Automated mapping suggestions use LLM-assisted column name analysis to propose initial mappings for human review, accelerating the mapping authoring process. All automated suggestions require human validation before activation. Mapping confidence is tracked: manually authored and validated mappings are marked HIGH confidence; LLM-suggested and human-validated are MEDIUM; any auto-activated mappings would be LOW (not permitted in production).
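The confidence rules above can be expressed as a small gate. The sketch below assumes that any mapping lacking human validation is treated as LOW; the source only states this for auto-activated mappings, so that generalisation, along with the function names, is an assumption.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


def mapping_confidence(llm_suggested: bool, human_validated: bool) -> Confidence:
    """Classify a mapping by how it was authored and whether a human validated it."""
    if not human_validated:
        return Confidence.LOW  # assumption: unvalidated mappings are never trusted
    return Confidence.MEDIUM if llm_suggested else Confidence.HIGH


def can_activate_in_production(confidence: Confidence) -> bool:
    """LOW-confidence mappings are not permitted in production."""
    return confidence is not Confidence.LOW
```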

4.4 Natural Language to Query Translation

When an AI application or end user poses a natural language question, the SDL's query translation component processes it in three stages.

Semantic Disambiguation resolves ambiguous terms by referencing the ontology. If the question uses "revenue," disambiguation resolves it to the canonical FinancialMetric.GrossRevenue definition, including its precise calculation formula and applicable source systems. The disambiguated intent is represented as a structured semantic query.

Query Generation translates the structured semantic intent into an executable query (SQL, SPARQL, GraphQL, or a graph traversal) against the appropriate source system. The generated query uses the mapping definitions to navigate from ontology concepts to physical schema paths. Query templates for common patterns (aggregations, time-series, entity lookups) are pre-verified by data engineers and reused wherever possible to avoid LLM query hallucination.

Result Enrichment annotates the query result with semantic metadata: which ontology concepts were queried, which source systems were accessed, which mapping versions were used, and the data freshness timestamp. This metadata is returned to the calling application and can be surfaced to end users or logged for audit.
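The three stages can be sketched end to end as follows. This is an illustration under stated assumptions, not the pattern's reference implementation: the synonym index, the mapping contents, the table and column names, and the function names are hypothetical, and an in-memory SQLite database stands in for a source system. The template path is shown; LLM-based translation of novel questions is out of scope for the sketch.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical synonym index and mapping registry contents, for illustration only.
SYNONYMS = {"revenue": "FinancialMetric.GrossRevenue",
            "gross revenue": "FinancialMetric.GrossRevenue"}
MAPPINGS = {
    "FinancialMetric.GrossRevenue": {
        "source_system": "erp", "table": "erp_invoices", "column": "amount",
        "date_column": "invoice_date", "aggregation": "SUM", "version": 3,
    }
}


def disambiguate(term, start, end):
    """Stage 1: resolve a business term to a structured semantic query."""
    concept = SYNONYMS.get(term.strip().lower())
    if concept is None:
        raise LookupError(f"term not understood: {term!r}")
    return {"concept": concept, "start": start, "end": end}


def generate_query(intent):
    """Stage 2: fill a pre-verified aggregation template from the mapping.

    Identifiers come from the governed mapping registry; caller-supplied
    values are bound as parameters, not concatenated into the SQL text.
    """
    m = MAPPINGS[intent["concept"]]
    sql = (f"SELECT {m['aggregation']}({m['column']}) FROM {m['table']} "
           f"WHERE {m['date_column']} >= ? AND {m['date_column']} < ?")
    return sql, (intent["start"], intent["end"])


def enrich(value, intent, data_as_of):
    """Stage 3: attach semantic metadata to the raw result."""
    m = MAPPINGS[intent["concept"]]
    return {
        "value": value,
        "metadata": {
            "concepts": [intent["concept"]],
            "source_systems": [m["source_system"]],
            "mapping_versions": {intent["concept"]: m["version"]},
            "data_freshness": data_as_of.isoformat(),
        },
    }


# In-memory SQLite stands in for a source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE erp_invoices (amount REAL, invoice_date TEXT)")
conn.executemany("INSERT INTO erp_invoices VALUES (?, ?)",
                 [(100.0, "2026-01-15"), (250.0, "2026-02-01"), (75.0, "2026-04-02")])

intent = disambiguate("Revenue", "2026-01-01", "2026-04-01")
sql, params = generate_query(intent)
raw_value = conn.execute(sql, params).fetchone()[0]
answer = enrich(raw_value, intent, datetime(2026, 4, 3, tzinfo=timezone.utc))
```

The calling application receives `answer["value"]` together with the metadata block, so the concept, source system, and mapping version behind the number travel with it.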

4.5 Semantic Caching Layer

Translated queries (NL input → structured query) are cached using a dual-key strategy: (1) exact match on the normalised NL question string; (2) semantic equivalence via embedding similarity comparison against cached question embeddings. When a semantically equivalent question is detected, the cached translated query is returned without re-invoking the LLM translation step.
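A minimal sketch of the dual-key lookup is shown below. The bag-of-words `embed` function is a toy stand-in for a real embedding model and is only there to keep the example self-contained; the class name, the normalisation rules, and the 0.8 similarity threshold are likewise assumptions. A toy embedding like this cannot tell "last quarter" from "this quarter" reliably, so a production threshold would need tuning against the golden question set.

```python
import math
from collections import Counter


def normalise(question: str) -> str:
    """Lower-case, drop question marks, and collapse whitespace."""
    return " ".join(question.lower().replace("?", " ").split())


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real deployment would call an embedding model."""
    return Counter(text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.exact = {}      # key 1: normalised question string -> translated query
        self.entries = []    # key 2: (embedding, translated query) pairs
        self.threshold = threshold

    def put(self, question: str, translated_query: str) -> None:
        key = normalise(question)
        self.exact[key] = translated_query
        self.entries.append((embed(key), translated_query))

    def get(self, question: str):
        key = normalise(question)
        if key in self.exact:                      # exact-match hit
            return self.exact[key]
        query_vec = embed(key)
        scored = [(cosine(query_vec, vec), translated)
                  for vec, translated in self.entries]
        if scored:
            score, translated = max(scored, key=lambda pair: pair[0])
            if score >= self.threshold:            # semantic-equivalence hit
                return translated
        return None                                # miss: invoke translation


cache = SemanticCache()
cache.put("What was total revenue last quarter?", "SEMANTIC_QUERY_1")
```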

Cache invalidation is triggered by: ontology changes (any change affecting the query's concept set); mapping changes for the source systems accessed; cache TTL expiry (configurable per domain based on data freshness requirements). Cache hit rates of 30–70% are typical in production deployments with diverse but patterned question sets.
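The three invalidation triggers can be checked per entry along these lines. The sketch assumes each cache entry records the concept set and source systems its translation touched; the `CacheEntry` fields and the `is_stale` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CacheEntry:
    translated_query: str
    concepts: FrozenSet[str]        # ontology concepts the query depends on
    source_systems: FrozenSet[str]  # source systems the query accesses
    created_at: float               # epoch seconds
    ttl_seconds: float              # configured per domain


def is_stale(entry: CacheEntry, now: float,
             changed_concepts: FrozenSet[str] = frozenset(),
             changed_sources: FrozenSet[str] = frozenset()) -> bool:
    """Return True if any invalidation trigger applies to the entry."""
    if now - entry.created_at > entry.ttl_seconds:
        return True                                   # TTL expiry
    if entry.concepts & changed_concepts:
        return True                                   # ontology change
    if entry.source_systems & changed_sources:
        return True                                   # mapping change
    return False


entry = CacheEntry("SEMANTIC_QUERY_1",
                   frozenset({"FinancialMetric.GrossRevenue"}),
                   frozenset({"erp"}), created_at=1_000.0, ttl_seconds=3_600.0)
```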

5. Architecture Diagram

flowchart TD
    subgraph Sources["Source Systems"]
        A[ERP / CRM]
        B[Data Warehouse]
    end
    subgraph SDL["Semantic Data Layer"]
        C[Business Glossary]
        D[Ontology + Mapping Registry]
        E[Semantic Cache]
    end
    subgraph Translation["Query Translation"]
        F[NL Disambiguation]
        G[Query Generator]
        H[Result Enricher]
    end
    C --> D
    I[NL Question] --> E
    E -->|cache hit| H
    E -->|cache miss| F
    F --> D
    D --> G
    G --> A
    G --> B
    A --> H
    B --> H
    H --> J[AI Application]
    H --> E
    style I fill:#dbeafe,stroke:#3b82f6
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Business Glossary	Governance	Authoritative business term definitions; owned by data governance, read by SDL	Collibra, Alation, Atlan, Microsoft Purview, custom metadata store	Critical
Ontology Engine	Governance	OWL or property graph schema; formal machine-readable term definitions; version control	Protégé (OWL), custom JSON-LD registry, dbt semantic layer, AtScale	High
Mapping Registry	Storage	Versioned ontology-to-physical-schema mappings; source authoring and change history	Custom PostgreSQL registry, dbt metrics layer, Cube.dev semantic layer	Critical
LLM Mapping Suggestion	AI	Propose initial mappings via column name/description analysis	OpenAI GPT-4o, Anthropic Claude, custom fine-tuned model	Medium
Query Translation Engine	Processing	NL → ontology intent → executable query generation	LangChain SQL agent, Vanna.ai, Microsoft Semantic Kernel, custom	Critical
Semantic Disambiguation Module	Processing	Resolve NL terms to canonical ontology concepts; handle synonyms and context	Vector similarity + ontology lookup, LLM with ontology context injection	High
Semantic Cache	Storage	Cache NL queries and their translated forms; semantic equivalence matching	Redis + pgvector, Weaviate, custom embedding cache	Medium
Result Enricher	Processing	Annotate query results with semantic metadata and provenance	Custom middleware layer	High
Impact Analyser	Governance	Detect source system schema changes; assess impact on active mappings	Custom schema diff tool, Monte Carlo data observability, Great Expectations	Medium

7. Data Flow

7.1 Primary Data Flow — Natural Language Query to Result

Step	Actor	Action	Output
1	End User / AI App	Submits natural language question	NL question string
2	Semantic Cache	Checks exact and semantic match against cache	Cache hit → reuse the cached translation and skip to step 6; miss → continue
3	Semantic Disambiguation	Resolves NL terms against ontology; identifies concept intent	Structured semantic intent with resolved ontology URIs
4	Mapping Registry	Looks up physical schema paths for resolved concepts	Mapping definitions for each ontology concept
5	Query Generator	Produces executable query from semantic intent + mappings	SQL / SPARQL / GraphQL query
6	Source System	Executes query; returns raw result set	Raw data result
7	Result Enricher	Annotates result with ontology concept labels, source system, mapping version, freshness	Enriched result set with semantic metadata
8	Semantic Cache	Stores NL → query mapping with embedding for future hits	Cache entry written
9	Calling Application	Receives enriched result	Data answer with full semantic provenance

7.2 Error Flow

Error	Detection	Recovery	Escalation
Ontology term not found (unmapped NL term)	Disambiguation returns null mapping	Return "term not understood" with closest suggestions; log unmapped term	Ontology backlog: data steward creates new term
Source system query failure	Query executor exception	Retry ×2; return partial result with availability note; log failure	Alert data engineering; flag source system health
Mapping mismatch (schema drift in source)	Result validation fails post-enrichment; impact analyser detects column removal	Deactivate affected mapping; return "data unavailable" with reason; route to steward	Immediate data steward notification for mapping repair
Semantic cache stale (post-ontology change)	Cache invalidation job triggered by ontology change event	Flush affected cache entries; force re-translation	Operational log; no escalation required if automated
Query translation hallucination (LLM produces invalid SQL)	SQL validation before execution; query explainer check	Reject invalid query; fall back to template-based generation if available	Log hallucination instance; flag for translation model review

8. Security Considerations

8.1 Authentication and Authorisation

The SDL query API enforces attribute-based access control (ABAC): the calling application's identity determines which ontology concepts and source systems it is permitted to query. A concept-level permission model prevents an AI application authorised for "customer contact information" from accessing "customer financial information" even if those concepts share a source table. OAuth 2.0 client credentials flow is used for service-to-service authentication.
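A concept-level permission check might look like the sketch below. The caller identities, concept names, and the `CALLER_PERMISSIONS` table are hypothetical; in practice the permitted concepts would be derived from attributes on the authenticated OAuth client, not hard-coded.

```python
# Hypothetical permission table: caller identity -> permitted ontology concepts.
CALLER_PERMISSIONS = {
    "support-assistant": {"Customer.ContactInformation"},
    "finance-copilot": {"Customer.ContactInformation",
                        "Customer.FinancialInformation"},
}


def denied_concepts(caller_id, requested_concepts):
    """Return the requested concepts the caller is not permitted to query."""
    permitted = CALLER_PERMISSIONS.get(caller_id, set())  # unknown caller: nothing
    return set(requested_concepts) - permitted


def authorise(caller_id, requested_concepts):
    """Raise PermissionError if any requested concept is outside the caller's scope."""
    denied = denied_concepts(caller_id, requested_concepts)
    if denied:
        raise PermissionError(f"{caller_id} may not query: {sorted(denied)}")
```

Because the check operates on resolved ontology concepts, two concepts that share a physical source table are still authorised independently.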

8.2 Secrets Management

Source system connection credentials are stored in a secrets vault. The SDL query execution engine retrieves credentials at query time using short-lived dynamic secrets where the source system supports it (e.g., database IAM authentication). Connection strings are never stored in the mapping registry or logged.

8.3 Data Classification

Each ontology concept is tagged with a data classification level inherited from the most sensitive source system attribute mapped to it. Query results inherit the highest classification level of any concept in the query. Results above a calling application's authorised classification are blocked at the result enricher with an access denied response and audit log entry.
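The inheritance rule amounts to taking a maximum at two levels, which the following sketch illustrates. The four classification level names are an assumption; the document does not name the levels in use.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical levels, ordered from least to most sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def concept_classification(attribute_levels):
    """A concept inherits the level of its most sensitive mapped attribute."""
    return max(attribute_levels)


def result_classification(concept_levels):
    """A query result inherits the highest level of any concept in the query."""
    return max(concept_levels)


def is_blocked(result_level, caller_clearance):
    """Results above the caller's authorised classification are blocked."""
    return result_level > caller_clearance


customer = concept_classification([Classification.INTERNAL, Classification.RESTRICTED])
revenue = concept_classification([Classification.CONFIDENTIAL])
result_level = result_classification([customer, revenue])
```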

8.4 Encryption

All inter-component communication uses TLS 1.3. Semantic cache entries containing query results are encrypted at rest. The mapping registry is encrypted at rest. Query logs (containing potential sensitive terms) are encrypted and access-restricted to authorised operations teams.

8.5 Auditability

Every NL query, resolved semantic intent, generated physical query, and result return is logged with: caller identity, timestamp, ontology concepts accessed, source systems queried, mapping versions used, and data classification of the result. These logs provide a complete lineage record: from the AI application's question to the physical data rows accessed, with every translation step documented.

8.6 OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM01 Prompt Injection	Adversarial NL query designed to inject SQL or manipulate query generator	Parameterised query generation (LLM produces intent, not raw SQL); SQL injection prevention at execution layer
LLM02 Insecure Output Handling	LLM-generated query passed directly to database execution	Generated query validated by SQL parser before execution; reject queries with DML statements (INSERT/UPDATE/DELETE)
LLM03 Training Data Poisoning	Mapping suggestion LLM trained on data with incorrect mappings	Human validation required for all LLM-suggested mappings; training data provenance tracked
LLM04 Model Denial of Service	Adversarially complex NL queries generating expensive database queries	Query cost estimation before execution; maximum query cost limit; rate limiting per caller
LLM05 Supply Chain Vulnerabilities	Query translation LLM dependency could be compromised	Pinned model versions; model integrity verification; ability to swap translation model without architecture change
LLM06 Sensitive Information Disclosure	SDL translates NL query that inadvertently exposes restricted data	ABAC at concept level; classification check before result return; field-level masking for PII concepts
LLM07 Insecure Plugin Design	Source system connectors as plugins to the query engine	Connector authentication validated; schema-scoped connector permissions; connector code review
LLM08 Excessive Agency	Query translation agent could autonomously execute DML if not constrained	Read-only database connections for SDL execution; DML blocked at connection level
LLM09 Overreliance	Business users over-trust AI-generated data answers from SDL	Semantic metadata surfaced with every result: data freshness, mapping confidence, source system
LLM10 Model Theft	Ontology and mapping registry encode proprietary business logic	Access-controlled APIs; mapping registry not exposed externally; no bulk export endpoints
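The LLM02 and LLM08 mitigations both rely on blocking DML at the connection level. The sketch below illustrates the idea with Python's built-in `sqlite3` module, using `set_authorizer` to deny every action other than reads. SQLite is only a stand-in here; a production source system would more likely use a read-only database role or replica, and the table and function names are hypothetical.

```python
import sqlite3

# Actions a read-only query needs: the SELECT itself, column reads, SQL functions.
READ_ONLY_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow read actions; deny everything else (INSERT, UPDATE, DELETE, DDL...)."""
    return sqlite3.SQLITE_OK if action in READ_ONLY_ACTIONS else sqlite3.SQLITE_DENY


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (amount REAL)")
conn.execute("INSERT INTO invoices VALUES (100.0)")
conn.commit()
conn.set_authorizer(read_only_authorizer)  # from here on the connection is read-only


def run_generated_query(sql):
    """Execute a generated query; return None if the connection rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return None


allowed = run_generated_query("SELECT SUM(amount) FROM invoices")
blocked = run_generated_query("DELETE FROM invoices")
remaining = run_generated_query("SELECT COUNT(*) FROM invoices")
```

Enforcing the restriction at the connection means a hallucinated or injected DML statement fails even if it slips past the parser-based validation step.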

9. Governance Considerations

9.1 Responsible AI

The SDL makes AI data access deterministic and governed, which is itself a responsible AI control. However, the ontology encodes business choices (e.g., which revenue calculation formula is canonical) that may disadvantage certain business units or stakeholders. An ontology review process must include representation from all affected business domains. Decisions that favour one domain's definition over another must be documented with rationale.

9.2 Model Risk Management

The query translation LLM is a model risk management artefact. Its performance is measured on a golden question set (questions with known correct SQL outputs). Precision and recall on the golden set are monitored per query category. A model validation report is produced when the translation model is upgraded or its prompt is substantially changed.
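One way to compute per-category precision and recall on a golden set is sketched below. The definitions are assumptions, since the source does not spell them out: precision is taken as correct translations over questions the translator answered, and recall as correct translations over all golden questions in the category. Comparing normalised SQL text is also a simplification; comparing result sets would be more robust.

```python
from collections import defaultdict


def normalise_sql(sql: str) -> str:
    """Crude normalisation: lower-case, collapse whitespace, drop trailing semicolon."""
    return " ".join(sql.lower().split()).rstrip(";")


def evaluate(golden_set, translate):
    """Return {category: {"precision": ..., "recall": ...}} for a translate function."""
    counts = defaultdict(lambda: {"total": 0, "answered": 0, "correct": 0})
    for item in golden_set:
        c = counts[item["category"]]
        c["total"] += 1
        produced = translate(item["question"])
        if produced is None:          # translator declined to answer
            continue
        c["answered"] += 1
        if normalise_sql(produced) == normalise_sql(item["expected_sql"]):
            c["correct"] += 1
    return {
        category: {
            "precision": c["correct"] / c["answered"] if c["answered"] else 0.0,
            "recall": c["correct"] / c["total"],
        }
        for category, c in counts.items()
    }


# Hypothetical golden set and a canned translator, for illustration only.
golden_set = [
    {"category": "aggregation", "question": "q1",
     "expected_sql": "SELECT SUM(amount) FROM invoices"},
    {"category": "aggregation", "question": "q2",
     "expected_sql": "SELECT COUNT(*) FROM customers"},
    {"category": "lookup", "question": "q3",
     "expected_sql": "SELECT name FROM customers WHERE id = 1"},
]
canned = {"q1": "select sum(amount) from invoices;", "q2": None,
          "q3": "SELECT name FROM customers"}
report = evaluate(golden_set, canned.get)
```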

9.3 Human Approval Gates

All ontology changes require data steward approval from the affected domain plus sign-off from the ontology governance committee. Mapping changes for source systems that feed regulatory reporting require a secondary approval from the compliance team. The mapping staging environment allows testing NL queries against a proposed mapping change before production activation, with results compared against a golden answer set.

9.4 Policy Ownership

The business glossary is owned by the Chief Data Officer's organisation. Ontology is jointly owned by the data architecture function and domain data stewards. Mapping definitions are owned by the data engineering function with domain data steward sign-off. Query translation model prompts and configuration are owned by the AI engineering function. Changes in any of these domains trigger impact analysis in downstream layers.

9.5 Traceability

The SDL maintains a complete provenance record for every query result: which NL question was asked → which ontology concepts were resolved → which mappings were used (with version) → which source tables/columns were accessed → which rows were returned → what the data freshness was at time of query. This provenance record satisfies regulatory requirements for AI decision auditability and supports data lineage documentation in the data catalogue.

9.6 Governance Artefacts

Artefact	Owner	Frequency	Location
Business glossary	CDO / Domain Data Stewards	Continuously maintained	Data governance platform (Collibra/Alation)
Ontology version history	Data Architecture + Governance Committee	Per change	Version-controlled ontology repository
Mapping registry	Data Engineering + Domain Stewards	Per change	Versioned mapping registry database
Golden question set	Data Engineering + Business Analysts	Quarterly refresh	Test suite in CI/CD pipeline
Translation model performance report	AI Engineering	Per model version	ML model registry
Query audit log	Operations	Continuous	Immutable audit log store

10. Operational Considerations

10.1 Monitoring and SLOs

Metric	SLO Target	Alerting Threshold	Tool
NL query end-to-end latency p95	≤3 seconds (cache miss); ≤200ms (cache hit)	>5s p95 over 5 min	Prometheus + Grafana
Semantic cache hit rate	≥40% in steady-state production	<20% over 1 hour	Custom metric
Translation accuracy on golden set	≥90% precision on validated golden questions	<85% precision	Scheduled evaluation job
Mapping coverage (% active ontology concepts with valid mappings)	≥95% coverage	<90% coverage	Data quality dashboard
Source system query failure rate	<1% of translated queries fail execution	>5% failure rate	Query execution metrics
Stale mapping alert rate	0 unacknowledged stale mapping alerts >24 hours	Any unacknowledged alert >24h	Incident management tool

10.2 Logging

Query logs are structured JSON: {timestamp, caller_id, nl_question_hash, resolved_concepts, mappings_used, source_systems, query_execution_ms, result_row_count, data_classification, cache_hit}. NL question text is hashed in operational logs; raw text is stored in the separate audit log (access-restricted). Audit logs are immutable and retained per regulatory requirements.
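A log record with the fields listed above could be assembled as follows. SHA-256 is an assumed choice of hash (the document does not name one), and the function name and example values are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_query_log(caller_id, nl_question, resolved_concepts, mappings_used,
                    source_systems, query_execution_ms, result_row_count,
                    data_classification, cache_hit):
    """Build an operational log record; the raw question text is never included."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "caller_id": caller_id,
        # Only a hash of the question goes into operational logs.
        "nl_question_hash": hashlib.sha256(nl_question.encode("utf-8")).hexdigest(),
        "resolved_concepts": resolved_concepts,
        "mappings_used": mappings_used,
        "source_systems": source_systems,
        "query_execution_ms": query_execution_ms,
        "result_row_count": result_row_count,
        "data_classification": data_classification,
        "cache_hit": cache_hit,
    }


record = build_query_log(
    caller_id="finance-copilot",
    nl_question="What was total revenue last quarter?",
    resolved_concepts=["FinancialMetric.GrossRevenue"],
    mappings_used={"FinancialMetric.GrossRevenue": 3},
    source_systems=["erp"],
    query_execution_ms=142,
    result_row_count=1,
    data_classification="CONFIDENTIAL",
    cache_hit=False,
)
log_line = json.dumps(record)
```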

10.3 Incident Management

A P1 incident is declared when the SDL query API is unavailable or when translation accuracy on the golden set drops below 75%. The on-call data engineering team has a 15-minute response SLA. A P2 incident covers mapping staleness affecting regulatory reporting concepts and carries a 2-hour response SLA. Mapping outages affecting non-critical domains are P3, with next-business-day response.

10.4 Disaster Recovery

Scenario	RTO	RPO	Recovery Procedure
Query translation service failure	5 min (restart; stateless)	N/A (stateless)	Container restart; validate with health check query
Mapping registry unavailable	30 min	5 min (replica promotion)	Promote read replica; validate mapping count
Semantic cache corruption	15 min	0 (cache is reconstructable)	Flush cache; warm from query log replay
Business glossary platform outage	SDL continues with cached ontology snapshot	Last cached snapshot (max 1 hour)	SDL reads ontology snapshot; alert data governance to restore glossary platform

10.5 Capacity Planning

Query translation is compute-intensive only for cache misses, each of which requires an LLM invocation. At scale, cache hit rates toward the upper end of the 30–70% range typical in production (Section 4.5) keep the average translation cost manageable. Plan for bursty LLM API quota: cache miss spikes occur when new question types are introduced (e.g., a new AI application launches). Semantic cache storage grows at approximately 1 KB per cached entry; 1 million cached entries require ~1 GB, which is negligible.
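The sizing arithmetic can be checked with a few lines. The sketch assumes 1 KB = 1,000 bytes and 1 GB = 10^9 bytes, and uses a hypothetical query volume of 1 million per month and a per-miss cost of $0.01, which sits inside the range given in Section 11.1.

```python
def cache_storage_gb(entries: int, bytes_per_entry: int = 1_000) -> float:
    """Cache storage in decimal gigabytes at roughly 1 KB per cached entry."""
    return entries * bytes_per_entry / 1_000_000_000


def monthly_translation_cost(queries_per_month: int, cache_hit_rate: float,
                             cost_per_miss: float) -> float:
    """LLM translation spend: only cache misses invoke the LLM."""
    misses = queries_per_month * (1.0 - cache_hit_rate)
    return misses * cost_per_miss


storage = cache_storage_gb(1_000_000)                              # 1 million entries
cost_low_hit = monthly_translation_cost(1_000_000, 0.30, 0.01)     # 30% hit rate
cost_high_hit = monthly_translation_cost(1_000_000, 0.70, 0.01)    # 70% hit rate
```

Under these assumptions the translation spend falls in direct proportion to the hit rate, which is why cache quality has more leverage on cost than cache storage.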

11. Cost Considerations

11.1 Cost Drivers

Cost Driver	Description	Typical Range
LLM API calls for query translation	Per-query LLM cost for cache misses; depends on cache hit rate	$0.002–$0.02 per cache miss query
Semantic cache infrastructure	Vector similarity cache (Redis + embedding index)	$500–$3,000/month
Mapping registry database	PostgreSQL or equivalent; modest size but high availability required	$200–$1,000/month
Data steward and ontology governance labour	The dominant ongoing cost — human expertise to maintain mappings	2–5 FTE (shared across data governance programme)
Source system query compute	Depends on source system pricing; SDL adds query overhead	Variable; monitor via query cost analysis
Business glossary platform	Commercial platforms (Collibra, Alation) carry significant licence cost	$50,000–$500,000/year depending on scale

11.2 Scaling Risks

LLM translation cost for unique queries grows linearly without cache; organisations with highly diverse question sets see lower cache hit rates
Ontology and mapping maintenance labour scales with organisational complexity, not query volume — a large organisation needs proportionally more data stewards regardless of query load
Source system schema drift creates mapping maintenance burden that grows with the number of source systems and their change velocity

11.3 Optimisations

Semantic caching is the single highest-ROI optimisation — invest in cache quality and TTL tuning before any other cost reduction effort
Template-based query generation for the most common query patterns avoids LLM invocation entirely for those patterns
Lightweight open-source embedding models can replace commercial embeddings for cache similarity matching at substantially lower cost
Shared ontology governance across AI programmes (not SDL-specific) distributes the data steward cost across multiple value streams

11.4 Indicative Cost Ranges

Deployment Scale	Monthly Infrastructure Cost	Annual Total (incl. governance labour)
Single domain, 3 source systems	$2,000–$5,000	$150,000–$300,000
Multi-domain, 10 source systems	$8,000–$20,000	$500,000–$1,200,000
Enterprise-wide, 50+ source systems	$30,000–$80,000	$2,000,000–$5,000,000

12. Trade-Off Analysis

12.1 Semantic Layer Technology Options

Option	Strengths	Weaknesses	Best For
Custom SDL (this pattern)	Maximum control; integrates with existing data catalogue; extensible	High build and maintenance cost; requires strong data engineering capability	Large enterprises with diverse source systems and active data governance
dbt Semantic Layer	Native integration with dbt-managed data warehouse; strong SQL ecosystem	Limited to SQL sources; no NL query translation built in; weak ontology expressiveness	Data warehouse-centric organisations already using dbt
Cube.dev / AtScale	Managed semantic layer; built-in caching; BI tool integration	Commercial; primarily metric-focused; limited relationship graph expressiveness	Analytics-heavy use cases; BI + AI hybrid access patterns
Microsoft Fabric Semantic Model	Deep Azure/Power BI integration; enterprise support	Azure lock-in; Power BI-centric; limited graph relationship support	Microsoft-native organisations

12.2 Architectural Tensions

Tension	Option A	Option B	Recommended Resolution
Ontology completeness vs. maintenance burden	Complete ontology covering all business terms before any AI deployment	Minimal ontology covering only terms needed for current AI use cases	Incremental ontology: start with the concepts needed for the first 2–3 AI use cases; expand governed by demand; never build ahead of usage
Query translation accuracy vs. latency	Thorough multi-step LLM translation with disambiguation for accuracy	Single-pass template matching for low latency	Hybrid: templates for high-frequency, well-understood query patterns; LLM for novel queries; cache bridges the gap
Semantic cache freshness vs. hit rate	Short TTL for freshness (lower hit rate, higher cost)	Long TTL for high hit rate (risk of stale results)	Domain-calibrated TTL: fast-changing operational metrics get short TTL; stable reference data gets long TTL
Centralised vs. federated SDL	Single centralised SDL for enterprise-wide consistency	Domain-federated semantic layers with federation standards	Centralised ontology and business glossary; domain-federated mapping registries for source system ownership

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Ontology definition conflict (two domains define same term differently)	High	High — SDL produces contradictory answers depending on resolver	Data steward conflict reports; inconsistent AI answers	Ontology governance committee arbitration; canonical definition documented; alternate terms for domain-specific variants
Mapping staleness (source schema change breaks mapping)	High	High for affected concepts; scoped to specific queries	Impact analyser detects schema drift; query failures for affected mappings	Mapping repair by data engineer; automated schema drift alerts minimise time to detection
Semantic cache poisoning (incorrect translation cached)	Low	Medium — affects all queries hitting that cache entry	Golden set regression; user-reported incorrect answers	Flush affected cache entries; identify root cause (hallucination or mapping error); fix translation
Translation LLM unavailability	Medium	High if no fallback — all cache-miss queries fail	LLM API health check; query failure rate spike	Fallback to template-only translation for known query patterns; queue novel queries for retry
Business glossary platform outage	Low	Medium — SDL continues with snapshot; new glossary updates not reflected	Glossary platform health check	SDL operates from last cached ontology snapshot; alert data governance; acceptable degradation for max 4 hours

13.1 Cascading Failure Scenarios

Scenario 1: Mass Mapping Invalidation. A source ERP system undergoes a major version upgrade, changing 40% of table/column names. The impact analyser flags 312 mapping invalidations simultaneously. The human review queue floods beyond capacity. The SDL switches to "degraded mode" — only serving queries against concepts with valid mappings, returning "data temporarily unavailable" for others. Resolution requires a war room with data engineering and ERP administrators; a mapping batch repair tool is executed to accelerate the re-mapping process.

Scenario 2: Ontology Terminology Change Cascade. The data governance committee renames a core concept ("Client" → "Customer") to align with a new CRM system. The SDL flushes all cache entries containing the old term. All AI applications must update their prompts to use the new term. In the interim, AI applications asking about "clients" receive no results because the old ontology term is deprecated. The lesson: ontology term renames require a deprecation period where both old and new terms are accepted, with a migration window before the old term is removed.
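The deprecation period described in Scenario 2 could be implemented as an alias table consulted during concept resolution. The sketch below is one possible shape; the `DEPRECATED_TERMS` table, the sunset date, and the function name are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

# Old term -> (replacement, last day the old term is still accepted).
# The date is a hypothetical migration window, not taken from the pattern.
DEPRECATED_TERMS = {"Client": ("Customer", date(2026, 12, 31))}


def resolve_concept(name: str, today: date) -> Tuple[str, Optional[str]]:
    """Return (canonical concept, optional deprecation warning)."""
    if name not in DEPRECATED_TERMS:
        return name, None
    replacement, sunset = DEPRECATED_TERMS[name]
    if today > sunset:
        # After the migration window the old term is rejected outright.
        raise LookupError(f"'{name}' was removed; use '{replacement}'")
    warning = f"'{name}' is deprecated; use '{replacement}' before {sunset.isoformat()}"
    return replacement, warning
```

During the window, applications that still ask about "Client" keep receiving results along with a warning they can log, which gives prompt owners time to migrate.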

14. Regulatory Considerations

Regulation	Relevant Clause	Requirement	How SDL Addresses It
APRA CPS 230	§36–§38 (Service Continuity)	Critical data services must have documented availability and recovery plans	SDL availability SLOs, DR procedures, and degraded-mode operation documented
APRA CPS 234	§15 (Information Asset Management)	Information assets classified proportionate to sensitivity	Data classification on every ontology concept and query result
Australian Privacy Act 1988	APP 6 (Use or Disclosure)	Personal information only used for the purpose it was collected	ABAC at concept level prevents AI applications from accessing personal data outside their authorised purpose
EU AI Act	Article 13 (Transparency)	High-risk AI decisions must be explainable	Semantic metadata on every query result provides the translation chain: NL → concept → physical data
EU GDPR	Article 5(1)(b) (Purpose Limitation)	Data only processed for specified, explicit, legitimate purposes	Purpose-scoped access control enforced at the ontology concept level
ISO/IEC 42001	§8.4 (AI system transparency)	Organisations must document AI system data inputs and transformations	Mapping registry + query audit log provides full input documentation
NIST AI RMF	MAP 2.2 (AI Risk Characterisation)	Risks from AI data access characterised and documented	Mapping confidence levels and data classification labels quantify data access risk
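The translation chain cited against EU AI Act Article 13 (NL → concept → physical data) can be made concrete as a metadata record attached to each query result. This is a hedged sketch: the `TranslationTrace` class, its field names, and the example table and column names are hypothetical, chosen only to show what such a record might carry.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TranslationTrace:
    """Translation chain attached to one query result (illustrative fields)."""
    question: str                # the natural language input
    concepts: list[str]          # ontology concepts the question resolved to
    physical_sources: list[str]  # tables/columns reached via the mapping registry
    classification: str          # data classification label of the result
    purpose: str                 # purpose under which access was authorised


trace = TranslationTrace(
    question="What was revenue per customer last quarter?",
    concepts=["Customer", "Revenue"],
    physical_sources=["erp.ar_invoices.amount", "crm.accounts.id"],  # hypothetical
    classification="Confidential",
    purpose="financial-reporting",
)

# Serialise the trace so it can be written to the query audit log.
record = json.dumps(asdict(trace), indent=2)
print(record)
```

Keeping the classification and purpose in the same record as the translation chain lets one audit entry support the transparency, purpose limitation, and classification rows of the table above.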

15. Reference Implementations

15.1 AWS

Component	AWS Service
Ontology / mapping registry	Aurora PostgreSQL with custom schema
Query translation LLM	Amazon Bedrock (Claude or Titan)
Semantic cache	ElastiCache Redis + custom embedding index
Business glossary	AWS Glue Data Catalog (limited) or third-party Collibra on EC2
Source system connectivity	Amazon Athena (data lake), RDS direct connection, Redshift
Monitoring	CloudWatch + Managed Prometheus/Grafana
Access control	AWS IAM + Lake Formation fine-grained access

15.2 Azure

Component	Azure Service
Ontology / mapping registry	Azure SQL Database
Query translation LLM	Azure OpenAI Service (GPT-4o)
Semantic cache	Azure Cache for Redis + Azure AI Search
Business glossary	Microsoft Purview Data Catalog
Source system connectivity	Azure Synapse Analytics, Azure SQL, Fabric OneLake
Monitoring	Azure Monitor + Grafana
Access control	Microsoft Entra ID (formerly Azure AD) ABAC + Purview data policies

15.3 GCP

Component	GCP Service
Ontology / mapping registry	Cloud SQL PostgreSQL
Query translation LLM	Vertex AI Gemini
Semantic cache	Memorystore Redis + Vertex AI Vector Search
Business glossary	Dataplex Data Catalog
Source system connectivity	BigQuery, Cloud SQL, AlloyDB
Monitoring	Cloud Monitoring + Grafana

15.4 On-Premises

Component	Technology
Ontology / mapping registry	PostgreSQL + custom API layer
Query translation	Self-hosted Ollama (Llama 3.x) or on-prem LLM
Semantic cache	Redis Enterprise + pgvector
Business glossary	Collibra on-prem or open-source Amundsen/DataHub
Source connectivity	Direct JDBC/ODBC; Airbyte for data movement

16. Related Patterns

Pattern ID	Pattern Name	Relationship Type	Notes
EAAPL-KNW001	Enterprise Knowledge Graph	Complementary	SDL provides the semantic interface to the knowledge graph; together they create governed NL-to-knowledge access
EAAPL-KNW003	AI Knowledge Corpus Management	Upstream	Corpus documents are richer when the semantic layer provides entity and term context for ingestion
EAAPL-KNW006	Corpus Quality Assurance	Supporting	Quality assurance validates that corpus documents use terms consistently with the SDL ontology
EAAPL-RAG002	Text-to-SQL	Specialisation	Text-to-SQL is a simpler version of the SDL concept — SDL adds ontology governance and multi-source abstraction
EAAPL-GOV001	AI Data Governance	Dependency	SDL is an implementation of AI data governance principles — requires a functioning data governance programme
EAAPL-SEC001	AI Data Access Control	Supporting	SDL's ABAC implementation is an application of the AI data access control pattern

17. Maturity Assessment

Overall Maturity Label: Proven

Dimension	Score (1–5)	Rationale
Technology readiness	4	NL-to-SQL translation is production-proven; semantic caching is well-understood; managed semantic layers from dbt/Cube are commercial grade
Organisational capability	2	Requires mature data governance including a business glossary and data steward function — rare below large enterprise level
Standards availability	3	OWL/RDF/SPARQL are mature; property graph query standards (GQL) are emerging; semantic layer API standards are fragmented
Vendor ecosystem	4	Multiple commercial semantic layer products; multiple LLM options for translation; strong open-source tooling
Case evidence	3	Strong evidence in analytics-heavy domains (BI semantic layers); AI-specific SDL evidence is growing but less documented
Regulatory alignment	5	SDL directly addresses regulatory transparency, purpose limitation, and auditability requirements for AI data access
Overall	3.5 / 5	Proven with strong regulatory alignment; primary constraint is the prerequisite data governance programme maturity
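The Overall figure matches an unweighted mean of the six dimension scores. The source does not state the aggregation method, so the equal weighting below is an assumption; the check is shown only to make the arithmetic reproducible.

```python
from statistics import mean

# Dimension scores copied from the maturity table above.
scores = {
    "Technology readiness": 4,
    "Organisational capability": 2,
    "Standards availability": 3,
    "Vendor ecosystem": 4,
    "Case evidence": 3,
    "Regulatory alignment": 5,
}

# Assumption: every dimension carries equal weight.
overall = mean(scores.values())
print(f"{overall:.1f} / 5")  # prints "3.5 / 5"
```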

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Editorial Board	Initial publication — covers ontology governance, semantic mapping, NL query translation, semantic caching, business glossary integration, and mapping validation

