[EAAPL-PLT001] Enterprise AI Platform
Category: Platform Engineering
Sub-category: Foundation Platform
Version: 1.4
Maturity: Mature
Tags: platform-engineering, internal-developer-platform, golden-path, shared-services, model-serving, developer-experience
Regulatory Relevance: APRA CPS230, CPS234, EU AI Act (Article 9 Risk Management), ISO 42001, NIST AI RMF (GOVERN 1.1)
1. Executive Summary
The Enterprise AI Platform pattern establishes a shared, governed infrastructure layer that enables product teams to consume AI capabilities safely and efficiently without each team solving foundational concerns independently. Rather than allowing every business unit to procure models, build integrations, and manage compliance in isolation—creating exponential risk surface and duplicated cost—this pattern centralises platform concerns while preserving product team autonomy.
The platform delivers measurable outcomes: 60–80% reduction in time-to-first-AI-feature for new teams, consolidated cost visibility with per-team chargeback, a single control plane for policy enforcement (data classification, model access tiers, rate limits), and an audit trail satisfying regulatory obligations across all AI usage. The platform team operates as an internal product team serving engineering consumers, not a gatekeeping function. Adoption is driven through golden paths—opinionated, well-documented routes to common AI use cases—that make the right thing the easy thing. This pattern is the prerequisite upon which all other EAAPL platform patterns depend.
2. Problem Statement
Business Problem
Enterprises face uncoordinated AI adoption: each team independently evaluates models, negotiates vendor contracts, builds bespoke integrations, and manages compliance obligations. This creates duplicated investment, inconsistent risk posture, and no executive visibility into total AI spend or exposure. AI incidents (data leakage, hallucination in customer-facing output, cost overruns) are discovered reactively with no systematic controls.
Technical Problem
Without a shared platform, teams build thin wrappers around foundation model APIs, each implementing authentication, logging, error handling, and cost tracking differently. There is no consistent mechanism for prompt versioning, model failover, semantic caching, or response auditing. Security review is performed ad hoc. Infrastructure drift compounds over time.
Symptoms
- Multiple AWS/Azure/GCP AI accounts with no consolidated billing or spend alerts
- Product engineers spending >30% of AI feature development time on infrastructure concerns
- Security team performing point-in-time reviews rather than continuous enforcement
- No audit trail mapping AI outputs to the model version and prompt that produced them
- Data residency violations discovered post-deployment as teams use public endpoints without restriction
- Duplicate vendor contracts for the same model provider across business units
Cost of Inaction
- Regulatory non-compliance penalties (APRA operational risk, EU AI Act fines up to 3% global turnover)
- AI security incidents with no forensic trail, increasing breach disclosure obligations
- Cost inefficiency of 30–50% above market rate due to absence of volume commitments and caching
- 6–12 month delays in AI capability delivery as teams rebuild foundational patterns from scratch
3. Context
When to Apply
- Organisation has ≥3 product teams independently consuming or planning to consume AI services
- Enterprise has data classification requirements that must be enforced before prompts leave the perimeter
- AI spend is untracked or exceeds $50K/year across business units without consolidated visibility
- Regulatory obligations (APRA, EU AI Act, privacy legislation) require audit trails for AI-assisted decisions
- Platform or infrastructure team exists with mandate to provide shared engineering services
When NOT to Apply
- Single-product startup with one team: overhead of platform exceeds benefit; use a managed API gateway directly
- Proof-of-concept or time-boxed experiment: build direct integrations, migrate to platform post-validation
- Fully air-gapped deployment with no shared infrastructure capability: consider a simplified on-premises variant
Prerequisites
- Identity provider (IdP) capable of issuing service account credentials (OIDC/OAuth2)
- Centralised secrets management (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault)
- Observability stack (metrics, logs, traces) available for platform instrumentation
- Executive sponsorship and cross-BU agreement on platform adoption mandate (voluntary adoption rarely scales past early adopters)
- Cloud landing zone or on-premises infrastructure with network segmentation capability
Industry Applicability
| Industry |
Applicability |
Primary Driver |
| Financial Services (Banking, Insurance) |
Very High |
APRA CPS230/234, data residency, audit trails for AI-assisted decisions |
| Healthcare |
Very High |
Patient data privacy, clinical AI regulatory approval, audit requirements |
| Government |
High |
Data sovereignty, security classification, procurement rules |
| Retail / E-commerce |
High |
Cost at scale, multi-team coordination, personalisation pipelines |
| Media & Entertainment |
Medium |
Cost efficiency, content moderation, creator tools |
| Technology / SaaS |
Medium-High |
Developer productivity, model diversification, cost optimisation |
4. Architecture Overview
The Enterprise AI Platform is structured as five horizontal layers stacked atop shared cross-cutting services. Each layer has a clear ownership boundary and a defined interface contract. The deliberate separation of concerns between layers is what allows the platform to evolve (e.g., swapping model providers, adding new compute tiers) without disrupting product teams.
Layer 1 — Infrastructure and Compute provides the physical and virtual compute substrate: GPU/accelerator clusters for self-hosted model serving, cloud provider AI endpoints (Amazon Bedrock, Azure OpenAI, Google Vertex AI), and VPC/network controls enforcing data residency. This layer is owned by the Platform Infrastructure team and changes infrequently. The critical design decision here is whether to use a shared GPU pool, dedicated per-tenant compute, or a hybrid—this choice has profound cost and isolation implications addressed in the Trade-Off Analysis.
Layer 2 — Model Serving and Registry abstracts individual model deployment concerns. It hosts the Model Registry (model metadata, capability cards, approved versions, deprecation notices), the Serving Layer (OpenAI-compatible inference endpoints whether models are self-hosted via vLLM/TorchServe or proxied from cloud providers), and the Model Lifecycle Manager. The OpenAI-compatible API surface is a deliberate choice: it maximises ecosystem compatibility and allows product teams to switch underlying models with zero code change.
Layer 3 — AI API Gateway is the primary integration point for all platform consumers. It enforces authentication (API keys/OIDC JWT), authorisation (RBAC/ABAC on model and capability access), rate limiting per consumer/team, cost allocation tagging, prompt/response logging for audit, semantic caching, and circuit breaking. Every request transits this layer—there are no side-door paths to models. This is the enforcement perimeter for all security and governance controls.
Layer 4 — Developer Services includes the capabilities that accelerate product team velocity: the Prompt Registry (versioned prompts with promotion workflows), the Evaluation Framework (automated benchmarking against golden datasets), the Experimentation Service (A/B routing for model comparison), and the RAG Orchestration Service. These are optional services product teams can adopt; the gateway is mandatory.
Layer 5 — Developer Portal is the human-facing surface: API catalogue, self-service onboarding, per-team dashboards, AI playgrounds, policy transparency, and documentation. This layer drives adoption and reduces platform team support burden. The portal is built as a product—it has a roadmap, user research input, and a feedback loop with consuming teams.
Cross-cutting Shared Services underpin all layers: Identity and Access Management, Secrets Management, Observability (metrics/logs/traces), Policy Engine (OPA or equivalent), Cost Management and Chargeback, and Data Classification Service. These services are not AI-specific—they extend the existing enterprise platform—but they must be explicitly wired into the AI platform's control plane.
The Platform Team vs. Product Team operating model is critical. The platform team owns Layers 1–3 and the shared services. Product teams own their applications, their prompts, and their AI feature logic. Layer 4 services are joint-owned with a platform team as service provider and product teams as co-designers. The golden path concept operationalises this: the platform team publishes opinionated starter templates, SDKs, and runbooks that encode best practice so product teams can onboard a new AI capability in hours rather than weeks.
5. Architecture Diagram
flowchart TD
subgraph Consumers["Product Teams"]
A[Applications + Pipelines]
B[Developer Portal]
end
subgraph Platform["Platform Layers"]
C[AI API Gateway]
D[Model Registry]
E[Developer Services]
end
subgraph Infra["Infrastructure + Compute"]
F[Self-Hosted GPU]
G[Cloud AI Endpoints]
end
A --> C
B -.->|onboard| A
C --> D
C --> E
D --> F
D --> G
C --> H[(Audit + Cost Store)]
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#fef9c3,stroke:#eab308
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#fef9c3,stroke:#eab308
style G fill:#fef9c3,stroke:#eab308
style H fill:#fef9c3,stroke:#eab308
6. Components
Layer 1 — Infrastructure and Compute
| Component |
Type |
Responsibility |
Technology Options |
Criticality |
| GPU / Accelerator Cluster |
Infrastructure |
Self-hosted model inference compute |
AWS EC2 P4/P5, Azure NDv4, GCP A3, on-prem NVIDIA DGX |
High |
| Cloud AI Endpoints |
Managed Service |
Access to frontier models with SLA |
AWS Bedrock, Azure OpenAI, GCP Vertex AI, Anthropic API |
Critical |
| VPC / Network Controls |
Infrastructure |
Data residency, private connectivity, egress control |
AWS VPC + PrivateLink, Azure VNet + Private Endpoint, GCP VPC-SC |
Critical |
| Data Residency Enforcer |
Policy |
Block requests violating data sovereignty rules |
Custom middleware, OPA, Cloudflare Zero Trust |
High |
Layer 2 — Model Serving and Registry
| Component |
Type |
Responsibility |
Technology Options |
Criticality |
| Model Registry |
Service |
Catalogue of approved models with metadata, capability cards, risk ratings |
MLflow, Hugging Face Hub (private), custom DB |
High |
| OpenAI-Compatible Inference |
Service |
Standardised API surface for self-hosted models |
vLLM, TGI (Hugging Face), NVIDIA Triton, BentoML |
High |
| Cloud Provider Proxy |
Service |
Unified endpoint abstracting cloud provider differences |
LiteLLM, custom proxy, Kong AI Gateway |
High |
| Model Lifecycle Manager |
Service |
Versioning, deprecation, rollout orchestration |
Custom, Argo Rollouts, Spinnaker |
Medium |
Layer 3 — AI API Gateway
| Component |
Type |
Responsibility |
Technology Options |
Criticality |
| AI API Gateway |
Service |
Authn, authz, rate limiting, routing, logging, caching |
Kong AI Gateway, AWS API Gateway + Lambda, Azure APIM, Apigee, LiteLLM Proxy |
Critical |
| Rate Limiter |
Policy |
Token-based and request-based rate limits per consumer/team |
Redis + sliding window, Kong rate-limit-advanced |
Critical |
| Semantic Cache |
Service |
Cache near-identical prompt responses to reduce cost/latency |
Redis + vector index, GPTCache, Momento |
High |
| Audit Logger |
Service |
Immutable record of all requests and responses |
Kinesis → S3, Kafka → object store, OpenTelemetry → SIEM |
Critical |
| Circuit Breaker |
Reliability |
Prevent cascade failure when model endpoints degrade |
Resilience4j, custom middleware, Envoy |
High |
Layer 4 — Developer Services
| Component |
Type |
Responsibility |
Technology Options |
Criticality |
| Prompt Registry |
Service |
Version-controlled prompt store with promotion workflow |
Custom Git-backed store, LangSmith, Promptflow |
High |
| Evaluation Framework |
Service |
Automated benchmarking of model/prompt combinations |
Ragas, DeepEval, custom harness |
Medium |
| Experimentation Service |
Service |
A/B and shadow routing for model comparison |
Custom feature-flag backed, LaunchDarkly + gateway |
Medium |
| RAG Orchestration |
Service |
Retrieval-augmented generation pipeline management |
LangChain, LlamaIndex, custom |
Medium |
Layer 5 — Developer Portal
| Component |
Type |
Responsibility |
Technology Options |
Criticality |
| API Catalogue |
Portal |
Discoverable inventory of all AI capabilities |
Backstage, Apigee Developer Portal, custom |
High |
| Self-Service Onboarding |
Portal |
Automated provisioning of API keys, rate limits, team namespaces |
Backstage scaffolder, custom workflow |
High |
| Usage Dashboards |
Portal |
Per-team cost, request volume, error rate visibility |
Grafana, Superset, PowerBI |
Medium |
| AI Playground |
Portal |
Interactive testing environment without production blast radius |
Custom, Promptflow Studio |
Medium |
7. Data Flow
Primary Flow — Product Team AI Request
| Step |
Actor |
Action |
Output |
| 1 |
Product Team Application |
Issue HTTPS POST to AI API Gateway with JWT/API key and prompt payload |
Authenticated request at gateway ingress |
| 2 |
AI API Gateway — AuthN |
Validate JWT against IdP or validate API key hash |
Authenticated identity + team namespace |
| 3 |
AI API Gateway — AuthZ |
Check RBAC/ABAC: does this team/identity have access to the requested model? |
Authorised or 403 rejection |
| 4 |
AI API Gateway — Classification |
Data Classification Service inspects prompt for PII, sensitive data, classification level |
Classification label attached to request context |
| 5 |
AI API Gateway — Policy |
OPA evaluates: is this classification allowed for this model endpoint per policy? |
Policy allow/deny decision |
| 6 |
AI API Gateway — Rate Limit |
Check token bucket / sliding window for this consumer |
Allow or 429 rate limit response |
| 7 |
AI API Gateway — Semantic Cache |
Hash prompt embedding; check vector cache for near-match |
Cache hit (return cached response) or cache miss (continue) |
| 8 |
AI API Gateway — Cost Tag |
Attach cost allocation tag (team, project, environment) to request |
Tagged request context |
| 9 |
AI API Gateway — Audit Pre-Log |
Write request record (prompt hash, metadata, timestamp) to audit log |
Immutable pre-request audit record |
| 10 |
Model Router |
Select optimal model endpoint based on routing rules (capability, cost, latency) |
Upstream target selected |
| 11 |
Model Serving Layer |
Forward request to cloud provider API or self-hosted inference endpoint |
Raw model response |
| 12 |
AI API Gateway — Response |
Return response to caller; emit token usage to cost management |
Response to product team + cost event |
| 13 |
AI API Gateway — Audit Post-Log |
Write response record (response hash, token counts, latency) to audit log |
Immutable post-response audit record |
Error Flow
| Error Condition |
Detection Point |
Action |
Consumer Experience |
| Model endpoint unavailable |
Circuit breaker (Layer 3) |
Open circuit; route to fallback model or return 503 with Retry-After |
Graceful degradation or explicit error |
| Policy denial (data classification) |
Policy engine (Layer 3) |
Reject request; log policy violation event |
403 with policy violation code |
| Rate limit exceeded |
Rate limiter (Layer 3) |
Reject with 429; include Retry-After header |
Explicit rate limit response |
| Prompt injection detected |
Guardrails layer |
Reject or sanitise; raise security alert |
400 Bad Request or sanitised response |
| Model returns error (5xx) |
Gateway upstream handler |
Retry with exponential backoff; failover if retries exhausted |
Transparent retry then degraded fallback |
8. Security Considerations
Authentication and Authorisation
- All consumers authenticate via short-lived OIDC JWT tokens or rotatable API keys stored in Secrets Manager; long-lived static credentials are prohibited
- RBAC model:
model-viewer, model-invoker, prompt-editor, platform-admin; ABAC extends this with data classification attributes
- Service-to-service communication within the platform uses mTLS with certificates managed by the service mesh (Istio/Linkerd)
Secrets Management
- All model provider API keys (OpenAI, Anthropic, AWS Bedrock IAM roles) are stored in HashiCorp Vault or cloud-native secrets manager; zero hardcoded credentials
- Secrets rotation is automated; gateway refreshes credentials on a schedule without downtime
- Audit log of every secret access event
Data Classification and Encryption
- All prompts and responses classified at ingress by the Data Classification Service; classification label persists through the audit trail
- Data at rest: AES-256 encryption for audit log store, vector cache, and model registry
- Data in transit: TLS 1.3 minimum for all internal and external communication
- PII in prompts: masked or tokenised before sending to third-party cloud endpoints if data residency policy requires
Auditability
- Cryptographic hash of every prompt and response stored in the audit log; enables non-repudiation
- Audit log is append-only and stored in a separate security account with no delete permissions for platform operators
- Audit events emitted to SIEM (Splunk/Sentinel/Chronicle) in real time
OWASP LLM Top 10 Controls
| OWASP LLM Risk |
Control Implemented in Platform |
| LLM01 Prompt Injection |
Input guardrails at gateway layer; prompt injection classifier as policy check |
| LLM02 Insecure Output Handling |
Response sanitisation middleware; output schema validation for structured outputs |
| LLM03 Training Data Poisoning |
Model Registry approvals gate; only approved model versions from trusted registries |
| LLM04 Model Denial of Service |
Rate limiting per consumer; token budget enforcement; circuit breaker |
| LLM05 Supply Chain Vulnerabilities |
Model provenance tracking in Registry; SBoM for self-hosted models; vendor attestation |
| LLM06 Sensitive Information Disclosure |
Data classification at ingress; PII masking before third-party routing; audit logging |
| LLM07 Insecure Plugin Design |
API scoping for AI-initiated actions; OAuth2 scopes on all downstream APIs called by agents |
| LLM08 Excessive Agency |
Human-in-the-loop gates for agentic actions; action whitelist in policy engine |
| LLM09 Overreliance |
Confidence thresholds; output labelling as AI-generated; mandatory human review for critical decisions |
| LLM10 Model Theft |
Self-hosted model weights encrypted at rest; access logs for model artifact downloads; network egress controls |
9. Governance Considerations
Responsible AI Framework
- Every model onboarded to the registry must have a completed Model Risk Card covering intended use, limitations, bias evaluation results, and regulatory classification
- High-risk AI use cases (as defined by EU AI Act Annex III or organisational risk policy) require additional approval and enhanced monitoring
- Data used for model fine-tuning must go through the Data Ethics Review process
Model Risk Management
- Models are classified by risk tier: Low (content summarisation), Medium (customer-facing recommendations), High (automated decisions affecting individuals)
- High-tier models require a signed-off Model Risk Assessment before production promotion
- Ongoing model monitoring for performance drift, bias drift, and output quality degradation
Human Approval Gates
- Changes to platform-wide policies (rate limits, model access tiers, data classification rules) require approval from the AI Platform Governance Board
- High-risk model promotions to production require Platform Owner + Chief Risk Officer sign-off
- Agentic use cases that can initiate real-world actions (send emails, execute transactions) require explicit human-in-the-loop gate design
Policy and Traceability
| Governance Artefact |
Owner |
Cadence |
Storage Location |
| Model Risk Card |
Model Owner + Risk Team |
Per model version |
Model Registry |
| Data Classification Policy |
Data Governance Team |
Annual review |
Policy Engine configuration |
| API Usage Policy |
Platform Team |
Quarterly review |
Developer Portal |
| Audit Log Retention Policy |
Legal / Compliance |
Annual review |
Platform Runbook |
| AI Incident Register |
CISO + Platform Team |
Per incident |
GRC system |
| Platform Governance Board Minutes |
Platform Owner |
Monthly |
Confluence / SharePoint |
| Cost Allocation Report |
FinOps / Platform Team |
Monthly |
Finance system |
10. Operational Considerations
Monitoring
| Signal |
Source |
Alert Threshold |
Owner |
| Gateway error rate |
API Gateway metrics |
>1% 5xx over 5 min |
Platform Team |
| Model endpoint latency P99 |
Tracing |
>5s for interactive, >30s for batch |
Platform Team |
| Circuit breaker state |
Circuit breaker events |
Any circuit opening |
Platform Team + Model Owner |
| Cost anomaly |
Cost management service |
>20% day-over-day spend increase |
FinOps + Platform Team |
| Audit log ingestion lag |
Log pipeline metrics |
>60s lag |
Platform Team + Security |
| Cache hit rate |
Semantic cache metrics |
<20% hit rate sustained 1h (signals cache misconfiguration) |
Platform Team |
SLOs
| SLO |
Target |
Measurement Window |
| Gateway availability |
99.9% |
Rolling 30 days |
| Interactive request P95 latency (excluding model inference) |
<100ms |
Rolling 7 days |
| Audit log completeness |
100% of requests logged |
Rolling 24 hours |
| Policy enforcement correctness |
Zero bypass incidents |
Rolling 90 days |
| Self-service onboarding success rate |
>95% of new team onboards complete without platform team intervention |
Monthly |
Logging
- Structured JSON logs emitted by all platform components; correlated by
x-request-id and x-team-id headers
- Log levels: INFO for all gateway transactions, WARN for policy near-misses, ERROR for circuit openings and auth failures
- Security-sensitive events (policy violations, auth failures) shipped to SIEM within 60 seconds
- Log retention: 90 days hot (searchable), 7 years cold (compliance archive)
Incident Response
| Incident Type |
Detection |
Response |
RTO |
| Complete gateway outage |
Synthetic probes + error rate alert |
Failover to secondary region; page platform on-call |
5 minutes |
| Model provider outage |
Circuit breaker + health check |
Switch to fallback model; notify consuming teams |
10 minutes |
| Security breach (prompt data leak) |
SIEM alert |
Isolate affected namespace; revoke credentials; notify CISO |
15 minutes |
| Cost runaway |
Cost anomaly alert |
Rate limit enforcement tightened; notify FinOps + team lead |
30 minutes |
Disaster Recovery
| Component |
RPO |
RTO |
Strategy |
| AI API Gateway |
0 (stateless) |
2 min |
Multi-AZ active-active; DNS failover |
| Audit Log Store |
<1 min |
15 min |
Cross-region replication; immutable S3 buckets |
| Model Registry |
5 min |
30 min |
Database replication; Git-backed as secondary |
| Semantic Cache |
1 hour |
5 min |
Cache is soft state; rebuild from model calls; acceptable cold-start |
| Prompt Registry |
0 |
10 min |
Git-backed; replicated; restore from tag |
11. Cost Considerations
Cost Drivers
| Driver |
Description |
Typical % of Total |
| Model inference (cloud APIs) |
Token charges for GPT-4, Claude, Gemini calls |
60–75% |
| GPU compute (self-hosted) |
On-demand or reserved GPU instances for self-hosted models |
10–20% |
| Semantic cache |
Vector store hosting + Redis cache tier |
3–8% |
| Observability infrastructure |
Log storage, metrics, tracing at platform scale |
5–10% |
| Developer portal hosting |
Always-on service, relatively low cost |
1–3% |
| Platform team labour |
Engineering + operations headcount |
Excluded (CapEx/OpEx accounting) |
Scaling Risks
- Token cost scales super-linearly with context window abuse: no-context-limit requests from one team can dominate platform spend
- Uncontrolled model tier usage: teams defaulting to most expensive model for every use case without routing intelligence
- Cache cold-start: new deployment or cache eviction causes temporary cost spike as cache warms
Optimisations
- Semantic caching: 20–40% token reduction on repetitive workloads (FAQ, summarisation)
- Model tier routing: route simple tasks to cheaper models (GPT-4o-mini, Claude Haiku); reserve frontier for complex reasoning
- Prompt compression: strip whitespace, compress system prompts via shared library; 10–15% token reduction
- Batch API for non-interactive: use provider batch APIs at 50% discount for overnight processing
- Reserved capacity: negotiate reserved throughput with cloud AI providers for predictable workloads
Indicative Cost Range
| Scale |
Monthly AI Platform Infra Cost |
Notes |
| Small (1–5 teams, <1M tokens/day) |
$3,000–$12,000 |
Mostly cloud API costs; minimal self-hosted |
| Medium (5–20 teams, 1–10M tokens/day) |
$15,000–$80,000 |
Mix of cloud API + some self-hosted; semantic cache delivers ROI |
| Large (20+ teams, >10M tokens/day) |
$80,000–$400,000+ |
Self-hosted frontier models become cost-competitive; FinOps team warranted |
12. Trade-Off Analysis
Compute Architecture Options
| Option |
Description |
Pros |
Cons |
Best For |
| Cloud-Only (Managed APIs) |
All inference via cloud provider managed APIs (Bedrock, Azure OpenAI, Vertex) |
Zero infrastructure ops; rapid access to frontier models; SLA-backed |
Data residency constraints; vendor lock-in; highest per-token cost at scale |
Organisations <$50K/month AI spend; strict no-GPU-ops mandate |
| Hybrid (Cloud + Self-Hosted) |
Cloud APIs for frontier models; self-hosted open-weight models for high-volume/lower-complexity |
Cost optimisation; data residency for sensitive workloads; model diversity |
GPU ops expertise required; model update operational burden |
Most enterprises at medium-large scale |
| Self-Hosted First |
Maximise self-hosted; cloud only for capabilities not replicable |
Maximum data control; no per-token cost; customisable |
High infrastructure investment; frontier model gap; GPU scarcity; ops complexity |
Air-gapped environments; sovereign AI requirements |
Tenant Isolation Options
| Option |
Description |
Pros |
Cons |
Best For |
| Shared Pool |
All tenants share gateway + inference endpoints; namespace isolation in software |
Lowest cost; highest utilisation |
Noisy neighbour risk; complex policy enforcement |
Internal enterprise teams with trust relationships |
| Dedicated Namespace |
Separate gateway instances per tenant; shared compute |
Balance of isolation and cost |
More infrastructure complexity |
External-facing B2B platforms |
| Dedicated Compute |
Separate inference endpoints per tenant |
Strongest isolation; predictable performance |
Highest cost; most ops overhead |
Regulated industries with data-separation requirements |
Architectural Tensions
| Tension |
Option A |
Option B |
Resolution |
| Developer autonomy vs. governance control |
Teams choose any model freely |
Platform mandates approved model list |
Approved model list with fast-track review process for new models |
| Cost optimisation vs. performance |
Route to cheapest model always |
Route to best model always |
Routing rules based on use-case classification; teams declare use case |
| Openness of audit logs vs. privacy |
Full prompt/response logging |
No logging of content |
Log metadata and hashes; content only on explicit high-risk classification |
| Platform team velocity vs. consumer customisation |
Platform publishes fixed golden paths |
Teams fully self-serve |
Golden paths as starting templates; teams can fork within policy guardrails |
13. Failure Modes
| Failure |
Likelihood |
Impact |
Detection |
Recovery |
| AI API Gateway complete outage |
Low |
Critical — all AI features unavailable |
Synthetic probes, zero traffic alert |
Multi-AZ failover; circuit breaker routes to fallback |
| Cloud model provider outage (e.g., OpenAI 5xx) |
Medium |
High — affects all consumers of that provider |
Circuit breaker opens; error rate spike |
Failover to alternate provider or self-hosted model |
| Semantic cache poisoning (incorrect cached response served) |
Low |
High — incorrect responses served silently |
Response quality monitoring; user feedback |
Cache flush; cache validation before reintroduction |
| Token budget exhaustion for a team |
High |
Medium — team's AI features degrade gracefully |
Cost management alert; 429 from gateway |
Increase quota with approval; implement back-pressure in consuming app |
| Data classification false negative (sensitive data reaches wrong model) |
Low |
Critical — data residency or privacy breach |
Retrospective audit log scan; SIEM alert |
Incident response; vendor notification if required; root cause fix to classifier |
| Prompt registry unavailable |
Medium |
Medium — teams cannot load latest prompts |
Health check failure; latency spike |
Fall back to last-known-good prompt version cached in gateway |
| Model Registry corruption |
Low |
High — wrong model versions deployed |
Registry integrity check on startup |
Restore from Git-backed backup; re-validate model versions |
Cascading Failure Scenarios
- Semantic cache failure → cold-start cost spike: Cache failure causes all requests to hit model directly; combined with a traffic spike this can exhaust token budgets across multiple teams simultaneously and trigger cloud provider rate limits. Mitigation: circuit breaker on cache with graceful bypass; pre-emptive capacity buffer in token budgets.
- Policy engine outage → open or closed failure: If OPA becomes unavailable, the gateway must fail open (allow all, risk policy bypass) or fail closed (deny all, block all AI features). This is a critical design choice; most enterprises should fail closed with a break-glass procedure.
- Identity provider outage → complete gateway authentication failure: If the IdP issuing JWTs is unavailable, all JWT-authenticated requests fail. Mitigation: API key fallback path for critical production consumers; IdP HA configuration.
14. Regulatory Considerations
APRA CPS 230 (Operational Risk)
- The platform must be classified as a Critical or Important Business Service if AI features are material to regulated activities; this triggers BCP/DR obligations including RTO/RPO targets above
- Third-party model providers (OpenAI, Anthropic) must be assessed under CPS 230 third-party risk management obligations; contracts must include sub-contracting visibility, audit rights, and incident notification requirements
- Operational incidents affecting AI services must be reportable to APRA if material
APRA CPS 234 (Information Security)
- The audit log is an information asset requiring classification, protection, and retention per CPS 234
- All platform components handling sensitive data must be within the CPS 234 information security capability boundary
- Penetration testing of the AI API Gateway is required at least annually and after significant changes
Privacy Act 1988 (Australia) / GDPR (EU)
- Personal information in prompts and responses must be handled in accordance with the Privacy Act; prompt logging of PII-containing interactions requires a Privacy Impact Assessment
- Data minimisation principle applies: prompts should not contain more PII than necessary for the AI task
- Data residency controls must enforce storage of Australian personal information within Australia if required by APP 8 considerations
EU AI Act
- Article 9 requires risk management systems for high-risk AI applications; the Model Risk Card and platform governance artefacts satisfy this requirement
- Article 13 transparency obligations require AI-generated content to be identifiable as such in consumer-facing applications
- Article 17 quality management system requirements are met by the prompt version control, evaluation framework, and change governance processes
ISO 42001 (AI Management System)
- The platform governance artefacts (Model Risk Cards, audit logs, governance board minutes) constitute the AI management system records required by ISO 42001 Clause 7
- Continual improvement processes (evaluation framework, post-incident review) satisfy Clause 10
NIST AI RMF
- GOVERN 1.1: AI risk tolerance defined via model risk tiers and data classification policies
- MAP 2.1: AI risk context mapped through Model Risk Cards and use case classification
- MEASURE 2.3: Metrics for AI risk tracked through observability stack and governance dashboards
- MANAGE 3.1: Response plans for AI incidents documented in platform runbook
15. Reference Implementations
AWS
| Component |
AWS Service |
| AI API Gateway |
Amazon API Gateway + AWS Lambda authoriser, or Kong on EKS |
| Model Serving (cloud) |
Amazon Bedrock (Claude, Llama, Titan) |
| Model Serving (self-hosted) |
Amazon SageMaker Endpoints or EKS + vLLM on P4/P5 |
| Semantic Cache |
Amazon ElastiCache (Redis) + Amazon OpenSearch for vector index |
| Audit Log |
Amazon Kinesis Data Streams → S3 (Glacier for cold) → Athena for query |
| Policy Engine |
AWS Lambda + OPA sidecar, or AWS Verified Permissions |
| Secrets |
AWS Secrets Manager |
| Observability |
Amazon CloudWatch + AWS X-Ray + OpenTelemetry |
| Developer Portal |
AWS Service Catalog + Backstage on ECS |
| Cost Management |
AWS Cost Explorer + Cost Allocation Tags + AWS Budgets |
Azure
| Component |
Azure Service |
| AI API Gateway |
Azure API Management (APIM) with AI policies |
| Model Serving (cloud) |
Azure OpenAI Service |
| Model Serving (self-hosted) |
AKS + vLLM on NC-series |
| Semantic Cache |
Azure Cache for Redis + Azure AI Search |
| Audit Log |
Azure Event Hubs → Azure Data Lake Gen2 |
| Policy Engine |
Azure Policy + OPA on AKS |
| Secrets |
Azure Key Vault |
| Observability |
Azure Monitor + Application Insights |
| Developer Portal |
Azure API Management built-in developer portal |
| Cost Management |
Azure Cost Management + Tags |
GCP
| Component |
GCP Service |
| AI API Gateway |
Cloud Endpoints / Apigee |
| Model Serving (cloud) |
Vertex AI (Gemini, Claude via Model Garden) |
| Model Serving (self-hosted) |
GKE + vLLM on A3 |
| Semantic Cache |
Memorystore (Redis) + Vertex AI Vector Search |
| Audit Log |
Cloud Pub/Sub → BigQuery |
| Policy Engine |
Binary Authorization + OPA on GKE |
| Secrets |
Secret Manager |
| Observability |
Cloud Monitoring + Cloud Trace + OpenTelemetry |
| Developer Portal |
Apigee Developer Portal |
| Cost Management |
Cloud Billing + Labels + Budget Alerts |
On-Premises
| Component |
Technology |
| AI API Gateway |
Kong Enterprise or NGINX + custom Lua/Python middleware |
| Model Serving |
vLLM or TGI on bare-metal GPU servers (NVIDIA A100/H100) |
| Semantic Cache |
Redis Enterprise + Qdrant or Weaviate |
| Audit Log |
Apache Kafka → MinIO (S3-compatible) |
| Policy Engine |
OPA (open source) |
| Secrets |
HashiCorp Vault |
| Observability |
Prometheus + Grafana + Tempo + Loki |
| Developer Portal |
Backstage (CNCF) |
| Cost Management |
Custom chargeback reporting from Kafka cost events |
| Pattern ID |
Name |
Relationship |
| EAAPL-PLT002 |
AI API Gateway |
Child pattern — PLT001 Layer 3 is implemented by PLT002 |
| EAAPL-PLT003 |
Model Routing |
Child pattern — model routing is a capability within PLT001 Layer 3 |
| EAAPL-PLT004 |
LLM Cost Control |
Specialisation — cost control mechanisms are instantiated within PLT001 |
| EAAPL-PLT005 |
Prompt Version Control |
Child pattern — Prompt Registry is Layer 4 of PLT001 |
| EAAPL-PLT006 |
LLM Caching Layer |
Child pattern — Semantic Cache is a component of PLT001 Layer 3 |
| EAAPL-PLT007 |
Multi-Tenant AI Platform |
Extension — PLT007 elaborates tenant isolation within PLT001 |
| EAAPL-PLT008 |
AI Experiment Tracking |
Child pattern — Evaluation Framework is Layer 4 of PLT001 |
| EAAPL-PLT010 |
AI Developer Portal |
Child pattern — Developer Portal is Layer 5 of PLT001 |
| EAAPL-INT001 |
Enterprise AI Service Bus |
Complementary — event bus integrates with PLT001 for async AI workflows |
| EAAPL-GOV001 |
AI Governance Framework |
Dependency — PLT001 is the enforcement vehicle for governance policies |
17. Maturity Assessment
Overall Maturity: Mature
This pattern is in production at multiple large enterprises across financial services, healthcare, and technology verticals. Reference implementations are available for all major cloud providers. Tooling ecosystem (Kong, LiteLLM, Backstage, vLLM) is stable and production-proven.
Scoring Matrix
| Dimension |
Score (1–5) |
Rationale |
| Pattern Completeness |
5 |
All 18 sections documented; no gaps |
| Implementation Evidence |
5 |
Production deployments at Fortune 500 scale documented |
| Tooling Ecosystem Stability |
4 |
Core tools stable; AI-specific gateway features still evolving rapidly |
| Regulatory Alignment |
5 |
Explicitly mapped to APRA, EU AI Act, ISO 42001, NIST AI RMF |
| Operational Complexity |
Medium |
Requires dedicated platform team; not suitable for single-team orgs |
| Cost Efficiency at Scale |
High |
Proven 30–50% cost reduction vs. unmanaged direct API access |
| Time to First Value |
Medium |
6–12 weeks to MVP platform; full capability 6–12 months |
18. Revision History
| Version |
Date |
Author |
Changes |
| 1.0 |
2024-01-15 |
EAAPL Working Group |
Initial pattern publication |
| 1.1 |
2024-04-20 |
EAAPL Working Group |
Added semantic caching component; expanded cost model |
| 1.2 |
2024-08-10 |
EAAPL Working Group |
EU AI Act Article 9/13/17 alignment; updated OWASP LLM Top 10 to 2024 edition |
| 1.3 |
2025-01-08 |
EAAPL Working Group |
Added agentic use case governance; updated reference implementations for Bedrock/Vertex |
| 1.4 |
2025-06-12 |
EAAPL Working Group |
Multi-tenant isolation options expanded; DR table updated; cost ranges recalibrated |