
EAAPL-KNW002: Semantic Data Layer


Pattern ID: EAAPL-KNW002
Status: Proven
Complexity: High
Tags: knowledge-graph, llm, traceability, high-complexity
Version: 1.0
Last Updated: 2026-06-12


1. Executive Summary

The Semantic Data Layer (SDL) is a governed translation layer that sits between an enterprise's raw data sources and its AI applications. It maps enterprise data to a shared business ontology, enabling natural language queries to be translated into precise, governed data access without requiring AI applications to understand the underlying physical data model.

The SDL solves a critical enterprise AI problem: LLMs given raw schema access (column names, table structures) produce inconsistent query interpretations because the same business concept — "revenue," "active customer," "exposure" — is defined differently across systems. The SDL establishes a single authoritative definition for every business term and maps each source system to it.

For CIOs and CTOs, the SDL delivers three compounding benefits: (1) AI applications become source-system-agnostic, so migrations and system changes do not break AI behaviour; (2) the business glossary enforces consistent AI answers because all AI applications share the same term definitions; (3) semantic caching — re-using translated queries for equivalent natural language questions — reduces LLM API costs by 30–70% in high-volume deployments.

The pattern is most valuable in organisations with ≥5 source systems, cross-domain AI use cases, and active data governance programmes. Implementation requires 3–6 months to reach production maturity.


2. Problem Statement

2.1 Business Problem

Enterprise data is physically distributed across ERP, CRM, data warehouse, operational databases, and SaaS platforms. Each system defines business concepts independently: "customer" in Salesforce may be a legal entity, while "customer" in the billing system may be an individual account, and "customer" in the analytics warehouse may be a household cluster. AI applications trained to answer business questions across these systems produce inconsistent and sometimes contradictory answers because they resolve the same term differently depending on which system they access.

2.2 Technical Problem

LLMs translating natural language questions to SQL or SPARQL queries against raw schemas frequently misinterpret column names, join conditions, and aggregation logic. Without a semantic layer, prompt engineering must embed physical schema details and business rules directly into every AI application — creating brittle integrations that break when schemas change and accumulate contradictory business rule definitions across applications.

2.3 Symptoms

  • Two AI applications return different "total revenue" figures for the same time period
  • AI-generated SQL queries fail intermittently due to schema changes in source systems
  • Business analysts must manually verify every AI-generated data answer against known benchmarks
  • Adding a new data source requires updating every AI application's prompt separately
  • Data governance cannot locate where a specific business definition is operationalised in AI systems

2.4 Cost of Inaction

  • Trust collapse: business users stop relying on AI-generated data insights within weeks of launch when inconsistencies are discovered
  • Compliance exposure: regulatory reporting generated by AI applications with inconsistent term definitions produces incorrect submissions
  • Engineering debt: each new AI application rebuilds the same business rule logic independently, creating N maintenance obligations
  • Data migration risk: any source system migration risks breaking all AI applications that rely on physical schema knowledge

3. Context

3.1 When to Apply

  • ≥5 source systems that must be queryable by AI applications using shared business terminology
  • Active enterprise data governance programme with a business glossary in progress or completed
  • Cross-domain AI use cases (e.g., finance + operations + customer) that require consistent term definitions
  • High query volume AI deployments where semantic caching can deliver measurable cost reduction
  • Regulatory reporting requirements that demand consistent definitions across AI outputs

3.2 When NOT to Apply

  • Single source system AI applications — the abstraction overhead is not justified
  • Organisations without a data governance programme — SDL without ontology ownership degrades to an unmaintained mapping layer
  • Real-time streaming AI use cases where query translation latency is unacceptable
  • Early MVP/PoC phases: validate the AI value proposition first, then add the semantic governance layer once production use is confirmed

3.3 Prerequisites

  • Business glossary with ≥80% coverage of key business terms used in target AI use cases
  • Data catalogue with documented source system schemas and ownership
  • Data steward function with clear domain ownership responsibilities
  • API or direct connection access to all source systems that will be mapped to the semantic layer

3.4 Industry Applicability

| Industry | Applicability | Primary Use Case |
|---|---|---|
| Financial Services | Critical | Regulatory reporting consistency, risk metric definitions, customer exposure calculation |
| Healthcare | High | Clinical terminology standardisation (SNOMED, LOINC mapping), patient data access |
| Retail / CPG | High | Product taxonomy, sales metrics consistency, customer segmentation definitions |
| Manufacturing | High | Product hierarchy, BOM relationships, operational KPI definitions |
| Telecommunications | High | Network entity relationships, service definitions, customer hierarchy |
| Government | High | Policy term consistency, inter-agency data sharing, citizen service definitions |

4. Architecture Overview

The Semantic Data Layer is structured into five functional layers that together form a pipeline from business definition to data retrieval.

4.1 Business Glossary Foundation

The business glossary is the authoritative source of business term definitions. It precedes the SDL and must be governed independently. Each glossary entry specifies: term name, canonical definition, synonyms, related terms, owning business domain, and the data steward responsible for maintaining the definition. The SDL treats the business glossary as read-only input — it does not own definitions, it operationalises them.

4.2 Ontology Layer

The ontology translates the business glossary into a machine-readable formal specification using OWL (Web Ontology Language) or a property graph schema. Business entities become ontology classes. Relationships between entities become ontology properties. Business metrics and derived measures become calculated properties with defined formulas. The ontology is maintained by the ontology governance committee and versioned in source control. Schema changes go through a formal change control process with impact analysis.

4.3 Semantic Mapping Layer

The semantic mapping layer connects the ontology to the physical source systems. For each ontology class and property, a mapping definition specifies: source system, schema/table/column path, transformation logic (type casts, aggregations, filters), validity constraints, and effective date range. Mappings are authored by data engineers in collaboration with domain data stewards. They are stored in a mapping registry — a versioned repository of mapping definitions that can be audited and rolled back.

Automated mapping suggestions use LLM-assisted column name analysis to propose initial mappings for human review, accelerating the mapping authoring process. All automated suggestions require human validation before activation. Mapping confidence is tracked: manually authored and validated mappings are marked HIGH confidence; LLM-suggested and human-validated are MEDIUM; any auto-activated mappings would be LOW (not permitted in production).
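
A minimal sketch of a mapping definition and the confidence gate described above. The field names, the `Confidence` enum and `can_activate` are illustrative assumptions, not the registry's actual schema; validity constraints are omitted for brevity.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Confidence(Enum):
    HIGH = "manually authored and validated"
    MEDIUM = "LLM-suggested, human-validated"
    LOW = "auto-activated (not permitted in production)"


@dataclass(frozen=True)
class MappingDefinition:
    concept: str                   # ontology class or property
    source_system: str
    physical_path: str             # schema.table.column
    transformation: str            # type casts, aggregations, filters
    effective_from: date
    effective_to: Optional[date]   # None means open-ended
    version: int
    confidence: Confidence


def can_activate(mapping: MappingDefinition, on: date) -> bool:
    """Allow activation only for human-validated mappings inside their date range."""
    if mapping.confidence is Confidence.LOW:
        return False
    if on < mapping.effective_from:
        return False
    return mapping.effective_to is None or on <= mapping.effective_to
```

Freezing the dataclass means a change produces a new version rather than an in-place edit, which suits a registry that must support audit and rollback.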

4.4 Natural Language to Query Translation

When an AI application or end user poses a natural language question, the SDL's query translation component processes it in three stages.

Semantic Disambiguation resolves ambiguous terms by referencing the ontology. If the question uses "revenue," disambiguation resolves it to the canonical FinancialMetric.GrossRevenue definition, including its precise calculation formula and applicable source systems. The disambiguated intent is represented as a structured semantic query.
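
The disambiguation step can be illustrated with a plain synonym lookup. This is a simplified sketch: the synonym table and the `Party.ActiveCustomer` concept are hypothetical, and a production module would more likely combine vector similarity or an LLM with ontology context rather than rely on string matching alone.

```python
import re

# Hypothetical ontology fragment: canonical concept -> synonyms seen in questions.
ONTOLOGY_SYNONYMS = {
    "FinancialMetric.GrossRevenue": {"revenue", "gross revenue", "turnover"},
    "Party.ActiveCustomer": {"active customer", "active customers"},
}


def resolve_concepts(question: str) -> list:
    """Return canonical concepts whose synonyms appear in the question.

    Longer synonyms are tried first so "gross revenue" is consumed
    before the shorter "revenue" can match the same words.
    """
    cleaned = re.sub(r"[^a-z0-9 ]", " ", question.lower())
    text = " " + " ".join(cleaned.split()) + " "
    pairs = sorted(
        ((syn, concept) for concept, syns in ONTOLOGY_SYNONYMS.items() for syn in syns),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    found = []
    for synonym, concept in pairs:
        needle = f" {synonym} "
        if needle in text:
            text = text.replace(needle, "  ")  # consume the matched words
            if concept not in found:
                found.append(concept)
    return found
```

An empty list corresponds to the "term not understood" case, which the caller would report and log rather than guess.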

Query Generation translates the structured semantic intent into an executable query (SQL, SPARQL, GraphQL, or a graph traversal) against the appropriate source system. The generated query uses the mapping definitions to navigate from ontology concepts to physical schema paths. Query templates for common patterns (aggregations, time-series, entity lookups) are pre-verified by data engineers and reused wherever possible to avoid LLM query hallucination.
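
A sketch of template-based generation for one common pattern, summing a metric over a period. The table and column names, the `MAPPINGS` extract and the template itself are assumptions for illustration. Identifiers come only from the mapping registry, while user-supplied values stay as bind parameters (shown here in the `pyformat` style used by several Python database drivers).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMapping:
    table: str
    column: str
    date_column: str


# Hypothetical extract of the mapping registry.
MAPPINGS = {
    "FinancialMetric.GrossRevenue": ColumnMapping(
        table="finance.invoice_line", column="gross_amount", date_column="invoice_date"
    ),
}

# Pre-verified template: the LLM never writes this SQL, it only selects the pattern.
SUM_OVER_PERIOD = (
    "SELECT SUM({column}) AS value FROM {table} "
    "WHERE {date_column} >= %(start)s AND {date_column} < %(end)s"
)


def generate_sum_query(concept: str, start: str, end: str):
    """Fill the template from the registry and return (sql, bind_parameters)."""
    mapping = MAPPINGS.get(concept)
    if mapping is None:
        raise KeyError(f"no active mapping for concept {concept!r}")
    sql = SUM_OVER_PERIOD.format(
        table=mapping.table, column=mapping.column, date_column=mapping.date_column
    )
    return sql, {"start": start, "end": end}
```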

Result Enrichment annotates the query result with semantic metadata: which ontology concepts were queried, which source systems were accessed, which mapping versions were used, and the data freshness timestamp. This metadata is returned to the calling application and can be surfaced to end users or logged for audit.
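
One possible shape for the enriched result, assuming each mapping used is described by a dict with the hypothetical keys `concept`, `source_system` and `version`:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EnrichedResult:
    rows: list
    concepts: list            # ontology concepts that were queried
    source_systems: list      # source systems that were accessed
    mapping_versions: dict    # concept -> mapping version used
    data_freshness: datetime  # freshness timestamp of the underlying data


def enrich(rows, mappings_used, data_freshness):
    """Attach semantic metadata to a raw result set."""
    return EnrichedResult(
        rows=list(rows),
        concepts=sorted({m["concept"] for m in mappings_used}),
        source_systems=sorted({m["source_system"] for m in mappings_used}),
        mapping_versions={m["concept"]: m["version"] for m in mappings_used},
        data_freshness=data_freshness,
    )


# Example values are invented for illustration.
result = enrich(
    rows=[{"value": 1250000}],
    mappings_used=[
        {"concept": "FinancialMetric.GrossRevenue", "source_system": "erp", "version": 7}
    ],
    data_freshness=datetime(2026, 6, 1, tzinfo=timezone.utc),
)
```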

4.5 Semantic Caching Layer

Translated queries (NL input → structured query) are cached using a dual-key strategy: (1) exact match on the normalised NL question string; (2) semantic equivalence via embedding similarity comparison against cached question embeddings. When a semantically equivalent question is detected, the cached translated query is returned without re-invoking the LLM translation step.

Cache invalidation is triggered by: ontology changes (any change affecting the query's concept set); mapping changes for the source systems accessed; cache TTL expiry (configurable per domain based on data freshness requirements). Cache hit rates of 30–70% are typical in production deployments with diverse but patterned question sets.
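
The dual-key lookup and concept-based invalidation can be sketched as follows. The embedding function is injected by the caller, the 0.9 similarity threshold is an arbitrary assumption, and TTL expiry is left out to keep the example short.

```python
import math


def normalise(question: str) -> str:
    """Lower-case, trim end punctuation and collapse whitespace."""
    return " ".join(question.lower().strip(" ?.!").split())


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: str -> list of floats
        self.threshold = threshold
        self.entries = {}           # normalised question -> (embedding, query, concepts)

    def put(self, question, translated_query, concepts):
        key = normalise(question)
        self.entries[key] = (self.embed(key), translated_query, frozenset(concepts))

    def get(self, question):
        key = normalise(question)
        if key in self.entries:                      # key 1: exact match
            return self.entries[key][1]
        vector = self.embed(key)                     # key 2: semantic equivalence
        best = max(
            self.entries.values(), key=lambda e: cosine(vector, e[0]), default=None
        )
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1]
        return None

    def invalidate_concept(self, concept) -> int:
        """Drop every entry whose query touches the changed concept."""
        stale = [k for k, e in self.entries.items() if concept in e[2]]
        for key in stale:
            del self.entries[key]
        return len(stale)
```

The linear scan over entries is only suitable for illustration; at production volumes the similarity search would be delegated to a vector index.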


5. Architecture Diagram

flowchart TD
    subgraph Sources["Source Systems"]
        A[ERP / CRM]
        B[Data Warehouse]
    end
    subgraph SDL["Semantic Data Layer"]
        C[Business Glossary]
        D[Ontology + Mapping Registry]
        E[Semantic Cache]
    end
    subgraph Translation["Query Translation"]
        F[NL Disambiguation]
        G[Query Generator]
        H[Result Enricher]
    end
    C --> D
    I[NL Question] --> E
    E -->|cache hit| H
    E -->|cache miss| F
    F --> D
    D --> G
    G --> A
    G --> B
    A --> H
    B --> H
    H --> J[AI Application]
    H --> E
    style I fill:#dbeafe,stroke:#3b82f6
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style J fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Business Glossary | Governance | Authoritative business term definitions; owned by data governance, read by SDL | Collibra, Alation, Atlan, Microsoft Purview, custom metadata store | Critical |
| Ontology Engine | Governance | OWL or property graph schema; formal machine-readable term definitions; version control | Protégé (OWL), custom JSON-LD registry, dbt semantic layer, AtScale | High |
| Mapping Registry | Storage | Versioned ontology-to-physical-schema mappings; source authoring and change history | Custom PostgreSQL registry, dbt metrics layer, Cube.dev semantic layer | Critical |
| LLM Mapping Suggestion | AI | Propose initial mappings via column name/description analysis | OpenAI GPT-4o, Anthropic Claude, custom fine-tuned model | Medium |
| Query Translation Engine | Processing | NL → ontology intent → executable query generation | LangChain SQL agent, Vanna.ai, Microsoft Semantic Kernel, custom | Critical |
| Semantic Disambiguation Module | Processing | Resolve NL terms to canonical ontology concepts; handle synonyms and context | Vector similarity + ontology lookup, LLM with ontology context injection | High |
| Semantic Cache | Storage | Cache NL queries and their translated forms; semantic equivalence matching | Redis + pgvector, Weaviate, custom embedding cache | Medium |
| Result Enricher | Processing | Annotate query results with semantic metadata and provenance | Custom middleware layer | High |
| Impact Analyser | Governance | Detect source system schema changes; assess impact on active mappings | Custom schema diff tool, Monte Carlo data observability, Great Expectations | Medium |

7. Data Flow

7.1 Primary Data Flow — Natural Language Query to Result

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | End User / AI App | Submits natural language question | NL question string |
| 2 | Semantic Cache | Checks exact and semantic match against cache | Cache hit → skip to step 8; miss → continue |
| 3 | Semantic Disambiguation | Resolves NL terms against ontology; identifies concept intent | Structured semantic intent with resolved ontology URIs |
| 4 | Mapping Registry | Looks up physical schema paths for resolved concepts | Mapping definitions for each ontology concept |
| 5 | Query Generator | Produces executable query from semantic intent + mappings | SQL / SPARQL / GraphQL query |
| 6 | Source System | Executes query; returns raw result set | Raw data result |
| 7 | Result Enricher | Annotates result with ontology concept labels, source system, mapping version, freshness | Enriched result set with semantic metadata |
| 8 | Semantic Cache | Stores NL → query mapping with embedding for future hits | Cache entry written |
| 9 | Calling Application | Receives enriched result | Data answer with full semantic provenance |

7.2 Error Flow

| Error | Detection | Recovery | Escalation |
|---|---|---|---|
| Ontology term not found (unmapped NL term) | Disambiguation returns null mapping | Return "term not understood" with closest suggestions; log unmapped term | Ontology backlog: data steward creates new term |
| Source system query failure | Query executor exception | Retry ×2; return partial result with availability note; log failure | Alert data engineering; flag source system health |
| Mapping mismatch (schema drift in source) | Result validation fails post-enrichment; impact analyser detects column removal | Deactivate affected mapping; return "data unavailable" with reason; route to steward | Immediate data steward notification for mapping repair |
| Semantic cache stale (post-ontology change) | Cache invalidation job triggered by ontology change event | Flush affected cache entries; force re-translation | Operational log; no escalation required if automated |
| Query translation hallucination (LLM produces invalid SQL) | SQL validation before execution; query explainer check | Reject invalid query; fall back to template-based generation if available | Log hallucination instance; flag for translation model review |
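
The recovery for a source system query failure (retry twice, then return a partial result with an availability note) might look like the sketch below. It assumes a query that fans out over several source systems and a hypothetical `run_query(source)` callable that raises `ConnectionError` on failure.

```python
def execute_with_retry(run_query, sources, retries=2):
    """Query each source, retrying failures, and report which sources are missing."""
    rows, unavailable = [], []
    for source in sources:
        for attempt in range(1 + retries):   # first try plus two retries
            try:
                rows.extend(run_query(source))
                break
            except ConnectionError:
                if attempt == retries:       # retries exhausted for this source
                    unavailable.append(source)
    note = None
    if unavailable:
        note = f"partial result: {', '.join(unavailable)} unavailable"
    return rows, note
```

A real executor would also log each failure and add a delay between attempts; both are omitted here.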

8. Security Considerations

8.1 Authentication and Authorisation

The SDL query API enforces attribute-based access control (ABAC): the calling application's identity determines which ontology concepts and source systems it is permitted to query. A concept-level permission model prevents an AI application authorised for "customer contact information" from accessing "customer financial information" even if those concepts share a source table. OAuth 2.0 client credentials flow is used for service-to-service authentication.
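
A concept-level permission check can be sketched as a lookup keyed on caller identity. The caller names and the permission table are hypothetical, and a full ABAC policy would evaluate more attributes than identity alone.

```python
# Hypothetical policy: caller identity -> ontology concepts it may query.
PERMISSIONS = {
    "support-assistant": {"Customer.ContactInformation"},
    "credit-risk-agent": {
        "Customer.ContactInformation",
        "Customer.FinancialInformation",
    },
}


def authorise(caller_id, requested_concepts):
    """Raise PermissionError if any requested concept is outside the caller's grant."""
    allowed = PERMISSIONS.get(caller_id, set())
    denied = sorted(set(requested_concepts) - allowed)
    if denied:
        raise PermissionError(f"{caller_id} may not query: {', '.join(denied)}")
```

Because the check runs on concepts rather than tables, two concepts mapped to the same physical table can still carry different permissions.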

8.2 Secrets Management

Source system connection credentials are stored in a secrets vault. The SDL query execution engine retrieves credentials at query time using short-lived dynamic secrets where the source system supports it (e.g., database IAM authentication). Connection strings are never stored in the mapping registry or logged.

8.3 Data Classification

Each ontology concept is tagged with a data classification level inherited from the most sensitive source system attribute mapped to it. Query results inherit the highest classification level of any concept in the query. Results above a calling application's authorised classification are blocked at the result enricher with an access denied response and audit log entry.
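
The inheritance rule amounts to taking a maximum at each step. In this sketch the four classification levels are an assumption, since the pattern does not name them.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical four-level scheme, ordered from least to most sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def highest(levels):
    """Most sensitive level; applies to attributes -> concept and concepts -> result."""
    return max(levels)


def release(rows, concept_levels, caller_clearance):
    """Return rows only if the caller is cleared for the result's classification."""
    result_level = highest(concept_levels)
    if result_level > caller_clearance:
        raise PermissionError(
            f"result is {result_level.name}; caller cleared to {caller_clearance.name}"
        )
    return rows
```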

8.4 Encryption

All inter-component communication uses TLS 1.3. Semantic cache entries containing query results are encrypted at rest. The mapping registry is encrypted at rest. Query logs (containing potential sensitive terms) are encrypted and access-restricted to authorised operations teams.

8.5 Auditability

Every NL query, resolved semantic intent, generated physical query, and result return is logged with: caller identity, timestamp, ontology concepts accessed, source systems queried, mapping versions used, and data classification of the result. These logs provide a complete lineage record: from the AI application's question to the physical data rows accessed, with every translation step documented.

8.6 OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial NL query designed to inject SQL or manipulate query generator | Parameterised query generation (LLM produces intent, not raw SQL); SQL injection prevention at execution layer |
| LLM02 Insecure Output Handling | LLM-generated query passed directly to database execution | Generated query validated by SQL parser before execution; reject queries with DML statements (INSERT/UPDATE/DELETE) |
| LLM03 Training Data Poisoning | Mapping suggestion LLM trained on data with incorrect mappings | Human validation required for all LLM-suggested mappings; training data provenance tracked |
| LLM04 Model Denial of Service | Adversarially complex NL queries generating expensive database queries | Query cost estimation before execution; maximum query cost limit; rate limiting per caller |
| LLM05 Supply Chain Vulnerabilities | Query translation LLM dependency could be compromised | Pinned model versions; model integrity verification; ability to swap translation model without architecture change |
| LLM06 Sensitive Information Disclosure | SDL translates NL query that inadvertently exposes restricted data | ABAC at concept level; classification check before result return; field-level masking for PII concepts |
| LLM07 Insecure Plugin Design | Source system connectors as plugins to the query engine | Connector authentication validated; schema-scoped connector permissions; connector code review |
| LLM08 Excessive Agency | Query translation agent could autonomously execute DML if not constrained | Read-only database connections for SDL execution; DML blocked at connection level |
| LLM09 Overreliance | Business users over-trust AI-generated data answers from SDL | Semantic metadata surfaced with every result: data freshness, mapping confidence, source system |
| LLM10 Model Theft | Ontology and mapping registry encode proprietary business logic | Access-controlled APIs; mapping registry not exposed externally; no bulk export endpoints |
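
As an illustration of the LLM02 mitigation, the sketch below rejects anything that is not a single read-only statement. It is a keyword pre-check and not a substitute for a real SQL parser or for read-only connections; blocking DDL keywords as well as DML is an added assumption.

```python
import re

# DML keywords named in the table above, plus DDL keywords added as an assumption.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|TRUNCATE|CREATE|GRANT)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Accept a single SELECT/WITH statement that contains no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:                                   # stacked statements
        return False
    if not re.match(r"(?i)(SELECT|WITH)\b", statement):    # must start as a read
        return False
    return FORBIDDEN.search(statement) is None
```

The check is deliberately conservative: a query whose string literal contains a word such as "delete" is rejected and would fall back to template-based generation.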

9. Governance Considerations

9.1 Responsible AI

The SDL makes AI data access deterministic and governed, which is itself a responsible AI control. However, the ontology encodes business choices (e.g., which revenue calculation formula is canonical) that may disadvantage certain business units or stakeholders. An ontology review process must include representation from all affected business domains. Decisions that favour one domain's definition over another must be documented with rationale.

9.2 Model Risk Management

The query translation LLM is a model risk management artefact. Its performance is measured on a golden question set (questions with known correct SQL outputs). Precision and recall on the golden set are monitored per query category. A model validation report is produced when the translation model is upgraded or its prompt is substantially changed.
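
A sketch of a golden-set evaluation job. It assumes one particular reading of the metrics: precision is the share of produced queries that are correct, recall is the share of all golden questions answered correctly, and correctness is judged by comparing whitespace- and case-normalised SQL text. Comparing result sets would be a stricter alternative.

```python
from collections import defaultdict


def _norm(sql: str) -> str:
    # Crude normalisation; lower-casing would also alter string literals.
    return " ".join(sql.lower().rstrip(";").split())


def evaluate(golden, translate):
    """Score a translator per query category.

    golden: iterable of (category, question, expected_sql)
    translate: callable returning SQL text, or None when it declines to answer
    """
    stats = defaultdict(lambda: {"total": 0, "answered": 0, "correct": 0})
    for category, question, expected in golden:
        s = stats[category]
        s["total"] += 1
        produced = translate(question)
        if produced is None:
            continue
        s["answered"] += 1
        if _norm(produced) == _norm(expected):
            s["correct"] += 1
    return {
        category: {
            "precision": s["correct"] / s["answered"] if s["answered"] else 0.0,
            "recall": s["correct"] / s["total"],
        }
        for category, s in stats.items()
    }
```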

9.3 Human Approval Gates

All ontology changes require data steward approval from the affected domain plus sign-off from the ontology governance committee. Mapping changes for source systems that feed regulatory reporting require a secondary approval from the compliance team. The mapping staging environment allows testing NL queries against a proposed mapping change before production activation, with results compared against a golden answer set.

9.4 Policy Ownership

The business glossary is owned by the Chief Data Officer's organisation. Ontology is jointly owned by the data architecture function and domain data stewards. Mapping definitions are owned by the data engineering function with domain data steward sign-off. Query translation model prompts and configuration are owned by the AI engineering function. Changes in any of these domains trigger impact analysis in downstream layers.

9.5 Traceability

The SDL maintains a complete provenance record for every query result: which NL question was asked → which ontology concepts were resolved → which mappings were used (with version) → which source tables/columns were accessed → which rows were returned → what the data freshness was at time of query. This provenance record satisfies regulatory requirements for AI decision auditability and supports data lineage documentation in the data catalogue.

9.6 Governance Artefacts

| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Business glossary | CDO / Domain Data Stewards | Continuously maintained | Data governance platform (Collibra/Alation) |
| Ontology version history | Data Architecture + Governance Committee | Per change | Version-controlled ontology repository |
| Mapping registry | Data Engineering + Domain Stewards | Per change | Versioned mapping registry database |
| Golden question set | Data Engineering + Business Analysts | Quarterly refresh | Test suite in CI/CD pipeline |
| Translation model performance report | AI Engineering | Per model version | ML model registry |
| Query audit log | Operations | Continuous | Immutable audit log store |

10. Operational Considerations

10.1 Monitoring and SLOs

| Metric | SLO Target | Alerting Threshold | Tool |
|---|---|---|---|
| NL query end-to-end latency | p95 ≤3 seconds (cache miss); ≤200ms (cache hit) | >5s p95 over 5 min | Prometheus + Grafana |
| Semantic cache hit rate | ≥40% in steady-state production | <20% over 1 hour | Custom metric |
| Translation accuracy on golden set | ≥90% precision on validated golden questions | <85% precision | Scheduled evaluation job |
| Mapping coverage (% active ontology concepts with valid mappings) | ≥95% coverage | <90% coverage | Data quality dashboard |
| Source system query failure rate | <1% of translated queries fail execution | >5% failure rate | Query execution metrics |
| Stale mapping alert rate | 0 unacknowledged stale mapping alerts >24 hours | Any unacknowledged alert >24h | Incident management tool |

10.2 Logging

Query logs are structured JSON: {timestamp, caller_id, nl_question_hash, resolved_concepts, mappings_used, source_systems, query_execution_ms, result_row_count, data_classification, cache_hit}. NL question text is hashed in operational logs; raw text is stored in the separate audit log (access-restricted). Audit logs are immutable and retained per regulatory requirements.
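
A sketch of building one operational log record with the fields listed above. SHA-256 is an assumed choice of hash; the pattern only states that the question text is hashed.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_query_log(caller_id, nl_question, resolved_concepts, mappings_used,
                    source_systems, query_execution_ms, result_row_count,
                    data_classification, cache_hit):
    """Return one operational log line as JSON; the raw question is never included."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "caller_id": caller_id,
        # Only the hash is logged here; raw text belongs in the restricted audit log.
        "nl_question_hash": hashlib.sha256(nl_question.encode("utf-8")).hexdigest(),
        "resolved_concepts": resolved_concepts,
        "mappings_used": mappings_used,
        "source_systems": source_systems,
        "query_execution_ms": query_execution_ms,
        "result_row_count": result_row_count,
        "data_classification": data_classification,
        "cache_hit": cache_hit,
    }
    return json.dumps(record)
```

An unsalted hash of a short question can be guessed by hashing likely questions, so a keyed hash may be preferable where the questions themselves are sensitive.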

10.3 Incident Management

A P1 incident is declared when the SDL query API is unavailable or when the translation accuracy rate drops below 75% on the golden set. The on-call data engineering team has a 15-minute response SLA. A P2 incident covers mapping staleness affecting regulatory reporting concepts — 2-hour response SLA. Mapping outages affecting non-critical domains are P3 with next-business-day response.

10.4 Disaster Recovery

| Scenario | RTO | RPO | Recovery Procedure |
|---|---|---|---|
| Query translation service failure | 5 min (restart; stateless) | N/A (stateless) | Container restart; validate with health check query |
| Mapping registry unavailable | 30 min | 5 min (replica promotion) | Promote read replica; validate mapping count |
| Semantic cache corruption | 15 min | 0 (cache is reconstructable) | Flush cache; warm from query log replay |
| Business glossary platform outage | SDL continues with cached ontology snapshot | Last cached snapshot (max 1 hour) | SDL reads ontology snapshot; alert data governance to restore glossary platform |

10.5 Capacity Planning

Query translation compute is dominated by LLM invocation on cache misses. Cache hit rates towards the upper end of the typical 30–70% range keep the average compute cost manageable. Plan for bursty LLM API quota: cache miss spikes occur when new question types are introduced (e.g., a new AI application launches). Semantic cache storage grows at approximately 1 KB per cached entry; 1 million cached entries require ~1 GB, which is negligible.


11. Cost Considerations

11.1 Cost Drivers

| Cost Driver | Description | Typical Range |
|---|---|---|
| LLM API calls for query translation | Per-query LLM cost for cache misses; depends on cache hit rate | $0.002–$0.02 per cache miss query |
| Semantic cache infrastructure | Vector similarity cache (Redis + embedding index) | $500–$3,000/month |
| Mapping registry database | PostgreSQL or equivalent; modest size but high availability required | $200–$1,000/month |
| Data steward and ontology governance labour | The dominant ongoing cost: human expertise to maintain mappings | 2–5 FTE (shared across data governance programme) |
| Source system query compute | Depends on source system pricing; SDL adds query overhead | Variable; monitor via query cost analysis |
| Business glossary platform | Commercial platforms (Collibra, Alation) carry significant licence cost | $50,000–$500,000/year depending on scale |

11.2 Scaling Risks

  • LLM translation cost for unique queries grows linearly without cache; organisations with highly diverse question sets see lower cache hit rates
  • Ontology and mapping maintenance labour scales with organisational complexity, not query volume — a large organisation needs proportionally more data stewards regardless of query load
  • Source system schema drift creates mapping maintenance burden that grows with the number of source systems and their change velocity

11.3 Optimisations

  • Semantic caching is the single highest-ROI optimisation — invest in cache quality and TTL tuning before any other cost reduction effort
  • Template-based query generation for the most common query patterns avoids LLM invocation entirely for those patterns
  • Lightweight open-source embedding models can replace commercial embeddings for cache similarity matching at substantially lower cost
  • Shared ontology governance across AI programmes (not SDL-specific) distributes the data steward cost across multiple value streams
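
The effect of the cache hit rate on translation spend can be estimated with a simple model in which only cache misses invoke the LLM. The query volume and the $0.01 per-miss cost below are illustrative inputs chosen from within the ranges in section 11.1, not measured figures, and cache infrastructure cost is excluded.

```python
def monthly_translation_cost(queries_per_month, cache_hit_rate, cost_per_miss):
    """LLM translation spend for a month: misses times cost per miss."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return queries_per_month * (1.0 - cache_hit_rate) * cost_per_miss


def saving_versus_no_cache(queries_per_month, cache_hit_rate, cost_per_miss):
    """Translation spend avoided by the cache, as an absolute amount."""
    uncached = monthly_translation_cost(queries_per_month, 0.0, cost_per_miss)
    cached = monthly_translation_cost(queries_per_month, cache_hit_rate, cost_per_miss)
    return uncached - cached


# Illustrative inputs only: 1,000,000 queries a month at $0.01 per cache miss.
for hit_rate in (0.0, 0.4, 0.7):
    cost = monthly_translation_cost(1_000_000, hit_rate, 0.01)
    print(f"hit rate {hit_rate:.0%}: ${cost:,.0f} per month")
```

Under this model the fractional saving equals the hit rate, which is why a 30–70% hit rate corresponds to a 30–70% reduction in translation cost.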

11.4 Indicative Cost Ranges

| Deployment Scale | Monthly Infrastructure Cost | Annual Total (incl. governance labour) |
|---|---|---|
| Single domain, 3 source systems | $2,000–$5,000 | $150,000–$300,000 |
| Multi-domain, 10 source systems | $8,000–$20,000 | $500,000–$1,200,000 |
| Enterprise-wide, 50+ source systems | $30,000–$80,000 | $2,000,000–$5,000,000 |

12. Trade-Off Analysis

12.1 Semantic Layer Technology Options

| Option | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Custom SDL (this pattern) | Maximum control; integrates with existing data catalogue; extensible | High build and maintenance cost; requires strong data engineering capability | Large enterprises with diverse source systems and active data governance |
| dbt Semantic Layer | Native integration with dbt-managed data warehouse; strong SQL ecosystem | Limited to SQL sources; no NL query translation built in; weak ontology expressiveness | Data warehouse-centric organisations already using dbt |
| Cube.dev / AtScale | Managed semantic layer; built-in caching; BI tool integration | Commercial; primarily metric-focused; limited relationship graph expressiveness | Analytics-heavy use cases; BI + AI hybrid access patterns |
| Microsoft Fabric Semantic Model | Deep Azure/Power BI integration; enterprise support | Azure lock-in; Power BI-centric; limited graph relationship support | Microsoft-native organisations |

12.2 Architectural Tensions

| Tension | Option A | Option B | Recommended Resolution |
|---|---|---|---|
| Ontology completeness vs. maintenance burden | Complete ontology covering all business terms before any AI deployment | Minimal ontology covering only terms needed for current AI use cases | Incremental ontology: start with the concepts needed for the first 2–3 AI use cases; expand governed by demand; never build ahead of usage |
| Query translation accuracy vs. latency | Thorough multi-step LLM translation with disambiguation for accuracy | Single-pass template matching for low latency | Hybrid: templates for high-frequency, well-understood query patterns; LLM for novel queries; cache bridges the gap |
| Semantic cache freshness vs. hit rate | Short TTL for freshness (lower hit rate, higher cost) | Long TTL for high hit rate (risk of stale results) | Domain-calibrated TTL: fast-changing operational metrics get short TTL; stable reference data gets long TTL |
| Centralised vs. federated SDL | Single centralised SDL for enterprise-wide consistency | Domain-federated semantic layers with federation standards | Centralised ontology and business glossary; domain-federated mapping registries for source system ownership |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Ontology definition conflict (two domains define same term differently) | High | High: SDL produces contradictory answers depending on resolver | Data steward conflict reports; inconsistent AI answers | Ontology governance committee arbitration; canonical definition documented; alternate terms for domain-specific variants |
| Mapping staleness (source schema change breaks mapping) | High | High for affected concepts; scoped to specific queries | Impact analyser detects schema drift; query failures for affected mappings | Mapping repair by data engineer; automated schema drift alerts minimise time to detection |
| Semantic cache poisoning (incorrect translation cached) | Low | Medium: affects all queries hitting that cache entry | Golden set regression; user-reported incorrect answers | Flush affected cache entries; identify root cause (hallucination or mapping error); fix translation |
| Translation LLM unavailability | Medium | High if no fallback: all cache-miss queries fail | LLM API health check; query failure rate spike | Fallback to template-only translation for known query patterns; queue novel queries for retry |
| Business glossary platform outage | Low | Medium: SDL continues with snapshot; new glossary updates not reflected | Glossary platform health check | SDL operates from last cached ontology snapshot; alert data governance; acceptable degradation for max 4 hours |

13.1 Cascading Failure Scenarios

Scenario 1: Mass Mapping Invalidation. A source ERP system undergoes a major version upgrade, changing 40% of table/column names. The impact analyser flags 312 mapping invalidations simultaneously. The human review queue floods beyond capacity. The SDL switches to "degraded mode" — only serving queries against concepts with valid mappings, returning "data temporarily unavailable" for others. Resolution requires a war room with data engineering and ERP administrators; a mapping batch repair tool is executed to accelerate the re-mapping process.

Scenario 2: Ontology Terminology Change Cascade. The data governance committee renames a core concept ("Client" → "Customer") to align with a new CRM system. The SDL flushes all cache entries containing the old term. All AI applications must update their prompts to use the new term. In the interim, AI applications asking about "clients" receive no results because the old ontology term is deprecated. The lesson: ontology term renames require a deprecation period where both old and new terms are accepted, with a migration window before the old term is removed.
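
The deprecation period from Scenario 2 can be sketched as an alias table with a removal date. The table contents, the date and the function name are hypothetical.

```python
from datetime import date

# Hypothetical alias table: deprecated term -> (replacement, last day the old term is accepted).
DEPRECATED_TERMS = {"Client": ("Customer", date(2026, 12, 31))}


def resolve_term(term, today):
    """Return (canonical_term, warning); old terms resolve until their removal date."""
    if term not in DEPRECATED_TERMS:
        return term, None
    replacement, removal_date = DEPRECATED_TERMS[term]
    if today > removal_date:
        raise LookupError(f"'{term}' was removed on {removal_date}; use '{replacement}'")
    warning = f"'{term}' is deprecated; use '{replacement}' before {removal_date}"
    return replacement, warning
```

Returning the warning alongside the resolved term lets calling applications keep working during the migration window while surfacing the pending removal to their owners.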


14. Regulatory Considerations

| Regulation | Relevant Clause | Requirement | How SDL Addresses It |
| --- | --- | --- | --- |
| APRA CPS 230 | §36–§38 (Service Continuity) | Critical data services must have documented availability and recovery plans | SDL availability SLOs, DR procedures, and degraded-mode operation documented |
| APRA CPS 234 | §15 (Information Asset Management) | Information assets classified proportionate to sensitivity | Data classification on every ontology concept and query result |
| Australian Privacy Act 1988 | APP 6 (Use or Disclosure) | Personal information only used for the purpose it was collected | ABAC at concept level prevents AI applications from accessing personal data outside their authorised purpose |
| EU AI Act | Article 13 (Transparency) | High-risk AI decisions must be explainable | Semantic metadata on every query result provides the translation chain: NL → concept → physical data |
| EU GDPR | Article 5(1)(b) (Purpose Limitation) | Data only processed for specified, explicit, legitimate purposes | Purpose-scoped access control enforced at the ontology concept level |
| ISO/IEC 42001 | §8.4 (AI system transparency) | Organisations must document AI system data inputs and transformations | Mapping registry + query audit log provides full input documentation |
| NIST AI RMF | MAP 2.2 (AI Risk Characterisation) | Risks from AI data access characterised and documented | Mapping confidence levels and data classification labels quantify data access risk |
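
The translation chain cited for the EU AI Act row (NL → concept → physical data) can be pictured as a small metadata record attached to each query result. This is a sketch under stated assumptions: the field names, the `to_audit_json` helper, and the example system, table, and column names are all hypothetical and are not taken from the SDL's schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class PhysicalSource:
    system: str
    table: str
    column: str


@dataclass(frozen=True)
class TranslationChain:
    question: str              # the natural language question as asked
    concept: str               # ontology concept it was resolved to
    classification: str        # data classification label on the concept
    mapping_confidence: float  # confidence of the concept-to-source mapping
    sources: Tuple[PhysicalSource, ...]


def to_audit_json(chain: TranslationChain) -> str:
    """Serialise the chain so it can be written to a query audit log."""
    return json.dumps(asdict(chain), sort_keys=True)


# Hypothetical example values for illustration only.
chain = TranslationChain(
    question="What was revenue last quarter?",
    concept="Revenue",
    classification="confidential",
    mapping_confidence=0.97,
    sources=(PhysicalSource("erp", "gl_entries", "amount"),),
)
print(to_audit_json(chain))
```

Keeping the classification label and mapping confidence on the same record means a single audit entry can support several rows of the table above.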

15. Reference Implementations

15.1 AWS

| Component | AWS Service |
| --- | --- |
| Ontology / mapping registry | Aurora PostgreSQL with custom schema |
| Query translation LLM | Amazon Bedrock (Claude or Titan) |
| Semantic cache | ElastiCache Redis + custom embedding index |
| Business glossary | AWS Glue Data Catalog (limited) or third-party Collibra on EC2 |
| Source system connectivity | Amazon Athena (data lake), RDS direct connection, Redshift |
| Monitoring | CloudWatch + Managed Prometheus/Grafana |
| Access control | AWS IAM + Lake Formation fine-grained access |

15.2 Azure

| Component | Azure Service |
| --- | --- |
| Ontology / mapping registry | Azure SQL Database |
| Query translation LLM | Azure OpenAI Service (GPT-4o) |
| Semantic cache | Azure Cache for Redis + Azure AI Search |
| Business glossary | Microsoft Purview Data Catalog |
| Source system connectivity | Azure Synapse Analytics, Azure SQL, Fabric OneLake |
| Monitoring | Azure Monitor + Grafana |
| Access control | Azure AD ABAC + Purview data policies |

15.3 GCP

| Component | GCP Service |
| --- | --- |
| Ontology / mapping registry | Cloud SQL PostgreSQL |
| Query translation LLM | Vertex AI Gemini |
| Semantic cache | Memorystore Redis + Vertex AI Vector Search |
| Business glossary | Dataplex Data Catalog |
| Source system connectivity | BigQuery, Cloud SQL, AlloyDB |
| Monitoring | Cloud Monitoring + Grafana |

15.4 On-Premises

| Component | Technology |
| --- | --- |
| Ontology / mapping registry | PostgreSQL + custom API layer |
| Query translation | Self-hosted Ollama (Llama 3.x) or on-prem LLM |
| Semantic cache | Redis Enterprise + pgvector |
| Business glossary | Collibra on-prem or open-source Amundsen/DataHub |
| Source connectivity | Direct JDBC/ODBC; Airbyte for data movement |

16. Related Patterns

| Pattern ID | Pattern Name | Relationship Type | Notes |
| --- | --- | --- | --- |
| EAAPL-KNW001 | Enterprise Knowledge Graph | Complementary | SDL provides the semantic interface to the knowledge graph; together they create governed NL-to-knowledge access |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Upstream | Corpus documents are richer when the semantic layer provides entity and term context for ingestion |
| EAAPL-KNW006 | Corpus Quality Assurance | Supporting | Quality assurance validates that corpus documents use terms consistently with the SDL ontology |
| EAAPL-RAG002 | Text-to-SQL | Specialisation | Text-to-SQL is a simpler version of the SDL concept; SDL adds ontology governance and multi-source abstraction |
| EAAPL-GOV001 | AI Data Governance | Dependency | SDL is an implementation of AI data governance principles and requires a functioning data governance programme |
| EAAPL-SEC001 | AI Data Access Control | Supporting | SDL's ABAC implementation is an application of the AI data access control pattern |

17. Maturity Assessment

Overall Maturity Label: Proven

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Technology readiness | 4 | NL-to-SQL translation is production-proven; semantic caching is well-understood; managed semantic layers from dbt/Cube are commercial grade |
| Organisational capability | 2 | Requires mature data governance including a business glossary and data steward function, which is rare below large enterprise level |
| Standards availability | 3 | OWL/RDF/SPARQL are mature; property graph query standards (GQL) are emerging; semantic layer API standards are fragmented |
| Vendor ecosystem | 4 | Multiple commercial semantic layer products; multiple LLM options for translation; strong open-source tooling |
| Case evidence | 3 | Strong evidence in analytics-heavy domains (BI semantic layers); AI-specific SDL evidence is growing but less documented |
| Regulatory alignment | 5 | SDL directly addresses regulatory transparency, purpose limitation, and auditability requirements for AI data access |
| Overall | 3.5 / 5 | Proven with strong regulatory alignment; primary constraint is the prerequisite data governance programme maturity |
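
The overall score is consistent with an unweighted mean of the six dimension scores. The source does not state how the overall figure is derived, so the averaging method below is an assumption; the short check simply shows the arithmetic.

```python
from statistics import mean

# Dimension scores copied from the maturity table above.
scores = {
    "Technology readiness": 4,
    "Organisational capability": 2,
    "Standards availability": 3,
    "Vendor ecosystem": 4,
    "Case evidence": 3,
    "Regulatory alignment": 5,
}

# Assumption: equal weighting across dimensions.
overall = round(mean(scores.values()), 1)
print(f"Overall: {overall} / 5")
```

Under equal weighting, the low organisational capability score is offset by the high regulatory alignment score, which matches the rationale given in the Overall row.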

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-06-12 | EAAPL Editorial Board | Initial publication: covers ontology governance, semantic mapping, NL query translation, semantic caching, business glossary integration, and mapping validation |