EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-SEC001] AI Gateway

Category: Security / API Control Plane
Sub-category: Traffic Management & Policy Enforcement
Version: 2.1
Maturity: Mature
Tags: api-gateway, rate-limiting, authentication, cost-allocation, circuit-breaker, policy-enforcement, ai-operations
Regulatory Relevance: APRA CPS234, EU AI Act Art. 9 (Risk Management), ISO 42001 §6.1, NIST AI RMF GOVERN 1.2


1. Executive Summary

The AI Gateway pattern establishes a centralised, enterprise-grade control plane through which all AI traffic flows — inbound requests from applications and users, and outbound calls to model providers. It functions as the "first and last line of defence" for every AI interaction in the enterprise.

From a business perspective, the AI Gateway solves three compounding problems that emerge when AI usage scales without discipline: uncontrolled spend (teams independently acquiring model API keys, leading to budget overruns with no visibility), inconsistent security posture (each team re-inventing authentication, logging, and abuse controls), and regulatory exposure (no single audit trail for AI interactions).

The gateway provides authentication and authorisation for every AI request, enforces rate limits and cost budgets per team or product, routes traffic intelligently across multiple model providers, captures structured logs for compliance, and breaks the circuit when downstream models are degraded. Organisations that deploy this pattern typically report 30–50% reduction in AI spend waste through visibility and quota enforcement, and can demonstrate AI audit trails to regulators within 24 hours of a request.

This pattern is the foundation upon which all other AI security and observability patterns depend. It should be the first pattern deployed in any enterprise AI programme.


2. Problem Statement

Business Problem

Enterprise organisations adopting AI at scale face ungoverned sprawl: dozens of teams independently calling OpenAI, Anthropic, Azure OpenAI, and other providers with individual API keys. There is no budget control, no unified audit trail, no abuse detection, and no single point where policy can be enforced. A single misconfigured application or compromised key can generate hundreds of thousands of dollars in model API spend within hours. Regulatory bodies (APRA, ASIC, EU regulators) increasingly require organisations to demonstrate comprehensive audit trails for AI-assisted decisions — an impossibility without a centralised control point.

Technical Problem

Without a gateway:

  • Each application must independently implement auth, rate limiting, retry logic, and logging — creating N inconsistent implementations.
  • Model provider credentials are distributed across dozens of services, dramatically increasing the blast radius of a credential leak.
  • There is no circuit breaker: a degraded model provider cascades into application failures.
  • Cost attribution is impossible: spend cannot be allocated to teams, products, or use cases.
  • Routing logic (e.g., fallback to a cheaper model for low-complexity requests) must be duplicated across every consuming application.

Symptoms of Absence

  • Unexplained spikes in model API bills.
  • Security incidents involving leaked model API keys.
  • Different applications enforcing different content policies, creating inconsistent user experiences.
  • Inability to produce AI usage reports for compliance audits.
  • Cascading application failures when a model provider has an outage.
  • No capacity to enforce organisational AI usage policies (e.g., "no patient data to external models").

Cost of Inaction

| Dimension | Impact |
|---|---|
| Financial | Uncontrolled model API spend; potential for runaway costs from abuse or bugs |
| Regulatory | Cannot demonstrate AI audit trail to APRA/EU AI Act auditors; enforcement risk |
| Security | Distributed credentials; no unified threat detection; full blast radius on key leak |
| Operational | N × duplicated retry/rate-limit/log implementations; no unified model health visibility |
| Reputational | Policy violations reach users (harmful content, data leakage) without a filter layer |

3. Context

When to Apply

  • Organisation has more than one team or application calling AI model APIs.
  • AI model API spend exceeds $5,000/month or is forecast to.
  • Organisation operates in a regulated industry (financial services, healthcare, government).
  • Multiple model providers are in use or planned.
  • Security team requires audit trails for AI interactions.
  • AI applications are user-facing and require content policy enforcement.

When NOT to Apply

  • Single-team proof-of-concept with a 90-day sunset — gateway adds operational overhead disproportionate to PoC scope.
  • Fully offline/on-premises model inference where the model is a library call within the same process — a gateway adds latency without security benefit at the network boundary.
  • A cloud-native AI platform (e.g., Azure AI Studio with built-in APIM integration) already provides all required controls natively and the team can accept vendor lock-in.

Prerequisites

| Prerequisite | Detail |
|---|---|
| Identity Provider | OIDC/SAML IdP capable of issuing JWT tokens to calling applications |
| Secrets Management | Vault or equivalent for model provider credentials |
| Observability Stack | Log aggregation and metrics platform to receive gateway telemetry |
| Network Topology | Gateway must be reachable by all AI-consuming applications; egress to model providers permitted |
| API Catalogue | Inventory of existing AI API calls to route through the gateway |

Industry Applicability

| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services | High | APRA CPS234, audit trails, cost governance |
| Healthcare | High | Patient data controls, regulatory AI traceability |
| Government | High | Sovereignty, audit, classification enforcement |
| Retail / E-commerce | Medium | Cost control, content policy |
| Technology / SaaS | Medium | Multi-team cost allocation, developer platform |
| Education | Medium | Content policy, budget governance |

4. Architecture Overview

The AI Gateway is deployed as a horizontally scalable reverse proxy that sits at the intersection of all AI-consuming workloads and all model provider endpoints. It is not a simple HTTP proxy — it is a stateful policy engine with its own data plane (real-time request processing) and control plane (policy configuration, key management, quota administration).

Why a dedicated gateway rather than embedding controls in each application?

The fundamental architectural reason is that cross-cutting concerns — authentication, rate limiting, cost allocation, audit logging, circuit breaking — are almost always implemented inconsistently when distributed across teams. The gateway externalises these concerns into a single, auditable, independently operated service. This mirrors the established API gateway pattern for REST/GraphQL APIs, extended with AI-specific capabilities.

Request Path Design

Inbound requests arrive from applications carrying a service identity token (mTLS client certificate or JWT). The gateway's authentication middleware validates the token against the enterprise IdP before any processing occurs. This ensures that unauthenticated requests fail fast and are never forwarded to model providers — preventing credential abuse if an internal application is compromised.

After authentication, the policy engine evaluates the request against a rule set: Does this caller have permission to use this model? Does this request exceed the caller's rate quota? Does this request carry a data classification label that prohibits forwarding to the requested external provider? Policy decisions are made in-process against an in-memory policy cache (refreshed from the policy store every 60 seconds) to keep decision latency under 1ms.
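
A minimal Python sketch of the in-memory policy cache described above, assuming a hypothetical `load_policies` callable that returns a caller-to-models ACL from the policy store. For brevity it refreshes lazily on the request path once the 60-second TTL lapses; a production gateway would more likely refresh in a background task so the decision path never waits on I/O. `PolicyCache` and its field names are illustrative only.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class PolicyCache:
    """In-memory snapshot of model ACLs, reloaded when older than the TTL."""

    load_policies: Callable[[], Dict[str, Set[str]]]  # hypothetical policy-store loader
    ttl_seconds: float = 60.0
    clock: Callable[[], float] = time.monotonic
    _acl: Dict[str, Set[str]] = field(default_factory=dict)
    _loaded_at: float = float("-inf")  # forces a load on first use

    def _refresh_if_stale(self) -> None:
        now = self.clock()
        if now - self._loaded_at >= self.ttl_seconds:
            self._acl = self.load_policies()
            self._loaded_at = now

    def is_allowed(self, caller_id: str, model: str) -> bool:
        """Default-deny: unknown callers and unlisted models are rejected."""
        self._refresh_if_stale()
        return model in self._acl.get(caller_id, set())
```

Between refreshes the decision is a dictionary lookup, which is what keeps it in-process and fast; the cost is that a policy change can take up to one TTL to reach every instance.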

Routing and Provider Abstraction

The gateway abstracts model provider APIs behind a unified internal schema. Consuming applications call a single internal endpoint (/v1/chat/completions) regardless of whether the request will be served by GPT-4, Claude 3.7, or an on-premises Llama deployment. The routing layer maps requests to providers based on model name, caller preference, load, cost optimisation rules, and provider health. This abstraction is critical: it allows organisations to switch providers, add fallbacks, or introduce shadow routing for model evaluation without changing consuming applications.
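
The routing decision can be sketched as a lookup from a logical model name to a list of candidate providers. This is an illustrative simplification that considers only health and cost (the text also lists caller preference and load); the `Provider` type, the `ROUTES` table, and the prices are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    healthy: bool
    cost_per_1k_tokens: float  # placeholder figure, not real pricing


# Hypothetical routing table: logical model name -> candidate providers.
ROUTES: Dict[str, List[Provider]] = {
    "chat-default": [
        Provider("external-a", healthy=True, cost_per_1k_tokens=0.03),
        Provider("external-b", healthy=True, cost_per_1k_tokens=0.01),
        Provider("on-prem-llama", healthy=False, cost_per_1k_tokens=0.002),
    ],
}


def select_provider(
    model: str, routes: Dict[str, List[Provider]] = ROUTES
) -> Optional[Provider]:
    """Return the cheapest healthy provider for a logical model, or None."""
    candidates = [p for p in routes.get(model, []) if p.healthy]
    return min(candidates, key=lambda p: p.cost_per_1k_tokens, default=None)
```

Because consumers only name the logical model, swapping or adding a provider is a change to the routing table rather than to any calling application.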

Why circuit breaking at the gateway?

Model providers have variable availability SLAs, and LLM inference latency is orders of magnitude higher than typical microservice calls. Without a circuit breaker at the gateway, a degraded provider causes cascading timeouts across all consuming applications. The gateway's circuit breaker monitors error rates and latency per provider, opens the circuit when thresholds are breached (e.g., >10% 5xx over 60 seconds), routes traffic to the fallback provider, and attempts provider recovery with exponential backoff. This dramatically improves overall application resilience.
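
A minimal sketch of the per-provider circuit breaker, assuming the example threshold from the text (more than 10% errors over 60 seconds). The minimum sample size, the backoff values, and the single-probe HALF_OPEN behaviour are assumptions added for illustration, and the class is not thread-safe.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """CLOSED -> OPEN on high error rate; OPEN -> HALF_OPEN after a backoff."""

    def __init__(self, error_threshold: float = 0.10, window_s: float = 60.0,
                 min_requests: int = 20, base_backoff_s: float = 5.0,
                 max_backoff_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.min_requests = min_requests
        self.base_backoff_s = base_backoff_s
        self.max_backoff_s = max_backoff_s
        self.clock = clock
        self.state = "CLOSED"
        self._events: Deque[Tuple[float, bool]] = deque()
        self._opened_at = 0.0
        self._backoff_s = base_backoff_s

    def allow_request(self) -> bool:
        """True if the provider may be called; False means use the fallback."""
        if self.state == "OPEN" and self.clock() - self._opened_at >= self._backoff_s:
            self.state = "HALF_OPEN"  # admit a single probe request
            return True
        return self.state == "CLOSED"

    def record(self, is_error: bool) -> None:
        """Record the outcome of a provider call."""
        now = self.clock()
        if self.state == "OPEN":
            return  # ignore late results while the circuit is open
        if self.state == "HALF_OPEN":
            if is_error:  # probe failed: back off exponentially and reopen
                self._backoff_s = min(self._backoff_s * 2, self.max_backoff_s)
                self._trip(now)
            else:  # probe succeeded: close and reset
                self.state = "CLOSED"
                self._backoff_s = self.base_backoff_s
                self._events.clear()
            return
        self._events.append((now, is_error))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        errors = sum(1 for _, failed in self._events if failed)
        if (len(self._events) >= self.min_requests
                and errors / len(self._events) > self.error_threshold):
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "OPEN"
        self._opened_at = now
```

The router would call `allow_request()` before forwarding and `record()` after each provider response, sending traffic to the fallback whenever `allow_request()` returns False.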

Cost Allocation Architecture

Each request is tagged with a cost allocation key (team, product, use-case, user) at ingress. The gateway calculates cost in real-time by multiplying token counts (extracted from the provider response) by the current pricing table (refreshed daily from a configuration store). Cost events are written to a time-series cost ledger. Budget monitors subscribe to this ledger and emit alerts or enforcement actions (soft-block, hard-block) when budgets are approached or exceeded. This gives finance teams the ability to allocate AI spend on a monthly basis without manual reconciliation.
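
As a worked illustration of the cost calculation, the sketch below multiplies token counts by a pricing table and maps cumulative spend to an enforcement action. The prices are placeholders, and the 80% soft-block threshold is an assumption; the text does not say at what point a budget counts as "approached".

```python
from decimal import Decimal
from typing import Dict

# Hypothetical pricing table in USD per 1,000 tokens (placeholder values).
PRICING: Dict[str, Dict[str, Decimal]] = {
    "model-a": {"prompt": Decimal("0.0100"), "completion": Decimal("0.0300")},
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int,
                 pricing: Dict[str, Dict[str, Decimal]] = PRICING) -> Decimal:
    """Cost of one request: token counts multiplied by per-1k-token prices."""
    rates = pricing[model]
    total = prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]
    return total / 1000


def budget_action(spent: Decimal, budget: Decimal,
                  soft_ratio: Decimal = Decimal("0.8")) -> str:
    """Map cumulative spend to allow / soft-block / hard-block."""
    if spent >= budget:
        return "hard-block"
    if spent >= budget * soft_ratio:  # assumed definition of "approached"
        return "soft-block"
    return "allow"
```

With these placeholder prices, a request with 1,000 prompt tokens and 500 completion tokens costs (1000 × 0.01 + 500 × 0.03) / 1000 = 0.025 USD. `Decimal` is used so that ledger totals do not accumulate binary floating-point error.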

Audit Logging

Every request and response traverses the audit logger, which writes a structured log record to an immutable audit log store (append-only, tamper-evident). The log record captures: caller identity, request timestamp, model requested, model served, token counts, cost, policy decisions made, response status, and a truncated hash of the request content (full content logging is optional and controlled by data classification). This log is the evidentiary foundation for regulatory compliance.
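
One common way to make an append-only log tamper-evident is to chain each record to the hash of the previous one. The text does not specify the mechanism, so the sketch below is an assumption rather than the pattern's prescribed design. Field names such as `prev_hash` and `record_hash` are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash preceding the first record


def content_hash(body: bytes) -> str:
    """SHA-256 hex digest of the request content (stored instead of the content)."""
    return hashlib.sha256(body).hexdigest()


def _digest(record: Dict) -> str:
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(log: List[Dict], fields: Dict) -> Dict:
    """Append a record that commits to the hash of the previous record."""
    prev = log[-1]["record_hash"] if log else GENESIS
    record = dict(fields, prev_hash=prev)
    record["record_hash"] = _digest(record)  # computed before the key is added
    log.append(record)
    return record


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or removed record breaks the chain."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if record["prev_hash"] != prev or _digest(body) != record["record_hash"]:
            return False
        prev = record["record_hash"]
    return True
```

Chaining complements, rather than replaces, write-once storage: the storage layer prevents modification, while the chain lets an auditor detect it independently.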


5. Architecture Diagram

flowchart TD
    subgraph Consumers["AI Consumers"]
        A[Application Request]
    end
    subgraph Gateway["AI Gateway"]
        B[Auth + Policy]
        C[Rate Limiter]
        D[Prompt Firewall]
        E[Router + Circuit Breaker]
        F[Output Filter]
    end
    subgraph Backend["Providers + Observability"]
        G[Model Providers]
        H[Audit Log]
        I[Cost Ledger]
    end
    A -->|mTLS / JWT| B
    B -->|allow| C --> D --> E --> G
    G --> F --> A
    E --> H
    F --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#fef9c3,stroke:#eab308

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| mTLS / JWT Auth | Security Middleware | Validates caller identity on every request; terminates unauthenticated requests immediately | Envoy, Kong, custom Go/Rust middleware | Critical |
| Policy Engine | Decision Engine | Evaluates per-request policy rules (model access, data classification, content type) against policy store | Open Policy Agent (OPA), Cedar, custom rule engine | Critical |
| Rate Limiter | Traffic Control | Enforces per-caller, per-model, and global token/request quotas; returns 429 on breach | Redis + Lua, Envoy rate limit service, Kong rate limiting plugin | Critical |
| Request Router | Routing Layer | Maps requests to model providers based on model name, load, cost, health; enables fallback routing | Envoy, Kong, NGINX + Lua, custom Go service | High |
| Prompt Firewall | Security Filter | Inline prompt injection and policy violation detection (see EAAPL-SEC002) | Custom classifier, AWS Guardrails, Azure Content Safety | High |
| Output Filter | Security Filter | Post-generation content and PII filtering (see EAAPL-SEC006) | Microsoft Presidio, AWS Comprehend, custom NLP pipeline | High |
| Cost Calculator | Cost Accounting | Real-time cost computation from token counts × pricing table; writes cost events | Custom service with pricing API, FinOps platform integration | Medium |
| Circuit Breaker | Resilience | Monitors provider health; opens/closes circuit; routes to fallback on failure | Hystrix, Resilience4j, Envoy outlier detection | High |
| Audit Logger | Compliance | Writes immutable, structured audit records for every request/response | Kafka → S3/GCS immutable store, Splunk, Datadog | Critical |
| Policy Store | Configuration | Authoritative store of gateway policies (model ACLs, data classification rules, content policies) | OPA Bundles, AWS S3 + IAM, HashiCorp Vault | Critical |
| Quota Store | State Store | Real-time quota counters per caller, per model, per period | Redis Cluster, DynamoDB, Dragonfly | High |
| Key Vault | Secrets | Stores and dispenses model provider credentials; see EAAPL-SEC008 | HashiCorp Vault, AWS Secrets Manager, Azure Key Vault | Critical |
| Cost Ledger | Financial | Time-series store of cost events for dashboarding and budget enforcement | InfluxDB, Prometheus, BigQuery, Snowflake | Medium |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Consumer Application | Sends HTTP POST to gateway /v1/chat/completions with mTLS client cert + JWT Bearer token in Authorization header | Inbound request at gateway TLS terminator |
| 2 | Auth Middleware | Validates mTLS client certificate against CA; validates JWT signature and claims (iss, aud, exp, scope) against IdP JWKS endpoint | Authenticated identity context attached to request |
| 3 | Policy Engine | Looks up caller in policy store; evaluates model access ACL, data classification label on request, and content type rules | ALLOW or DENY decision; deny returns 403 immediately |
| 4 | Rate Limiter | Atomically increments caller's token and request counters in Redis; checks against quota for current period | ALLOW or 429 Too Many Requests |
| 5 | Prompt Firewall | Scans request body for prompt injection patterns, PII, and policy violations | Sanitised request body or 400 Bad Request |
| 6 | Request Router | Evaluates routing rules; selects target model provider based on requested model, provider health, and load | Routing decision + provider credentials retrieved from vault |
| 7 | Circuit Breaker | Checks provider circuit state (CLOSED/OPEN/HALF-OPEN); if OPEN, routes to fallback provider | Forwarded request or fallback routing |
| 8 | Model Provider | Processes request; returns response with token usage metadata | Raw model response |
| 9 | Output Filter | Inspects response for PII leakage, harmful content, and policy violations | Filtered response or 502 if blocked |
| 10 | Cost Calculator | Extracts prompt_tokens + completion_tokens from response; multiplies by provider pricing; writes cost event | Cost-annotated response headers |
| 11 | Audit Logger | Writes structured log record (identity, model, tokens, cost, policy decisions, response status, content hash) | Audit record in immutable log store |
| 12 | Consumer Application | Receives filtered, cost-annotated response | Business logic continues |

Error Flow

| Error Condition | Gateway Behaviour | HTTP Status | Alert Triggered |
|---|---|---|---|
| Invalid/expired JWT | Reject at auth middleware; log failed auth attempt | 401 | Auth anomaly alert if >10/min |
| Policy DENY | Reject at policy engine; log policy violation | 403 | Policy violation alert |
| Rate limit exceeded | Reject at rate limiter; return Retry-After header | 429 | Quota alert to team budget owner |
| Prompt injection detected | Reject at prompt firewall; log sanitised indicator | 400 | Security incident alert |
| Provider circuit OPEN | Route to fallback; if no fallback, return 503 | 503 | Provider health alert |
| Output policy violation | Block response; return opaque error to caller | 502 | Content policy alert |
| Vault unavailable | Fail closed: all requests rejected until vault recovers | 503 | Critical infrastructure alert |
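
The inbound half of the flow (steps 2 to 5) and its error mapping can be sketched as an ordered chain of stages, where the first stage to reject a request decides the HTTP status. The request is modelled as a plain dictionary with hypothetical keys (`token_valid`, `allowed_models`, `quota_remaining`, `injection_detected`) standing in for the real checks.

```python
from typing import Callable, Dict, List, Tuple


class GatewayReject(Exception):
    """Raised by a stage to stop processing with a specific HTTP status."""

    def __init__(self, status: int, reason: str) -> None:
        super().__init__(reason)
        self.status = status
        self.reason = reason


def authenticate(req: Dict) -> None:
    if not req.get("token_valid"):
        raise GatewayReject(401, "invalid or expired JWT")


def authorise(req: Dict) -> None:
    if req.get("model") not in req.get("allowed_models", ()):
        raise GatewayReject(403, "policy DENY")


def rate_limit(req: Dict) -> None:
    if req.get("quota_remaining", 0) <= 0:
        raise GatewayReject(429, "rate limit exceeded")


def prompt_firewall(req: Dict) -> None:
    if req.get("injection_detected"):
        raise GatewayReject(400, "prompt injection detected")


# Order matters: cheap identity checks run before content inspection.
PIPELINE: List[Callable[[Dict], None]] = [
    authenticate, authorise, rate_limit, prompt_firewall,
]


def handle(req: Dict, pipeline: List[Callable[[Dict], None]] = PIPELINE) -> Tuple[int, str]:
    """Run stages in order; the first rejection short-circuits the rest."""
    try:
        for stage in pipeline:
            stage(req)
    except GatewayReject as rejection:
        return rejection.status, rejection.reason
    return 200, "forwarded to router"
```

Keeping each stage as an independent function makes the ordering explicit and lets alerting hang off a single `except` path.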

8. Security Considerations

Authentication & Authorisation

  • Mutual TLS (mTLS): All inbound connections from consumer applications require a client certificate issued by the enterprise CA. This provides cryptographic identity that cannot be forged with a stolen JWT alone.
  • JWT Validation: Bearer tokens carry caller identity, scope (which models are accessible), and expiry. Tokens are validated against the IdP's JWKS endpoint with a local cache (refreshed every 5 minutes). Short token lifetimes (15–60 minutes) limit the window of a compromised token.
  • Service-to-Service Identity: Consumer applications authenticate as service principals, not human users. Human-facing applications should not forward end-user tokens to the gateway — the application authenticates as itself and includes user context as a claim.
  • Scope-Based Model Access: JWT scopes define which model families a caller may access. A customer service application should not have scope to access GPT-4 if it only requires GPT-3.5-turbo. Principle of least privilege applies to model access.
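
The claim checks in the bullets above can be sketched as follows, assuming the token's signature has already been verified by a JWT library against the IdP's JWKS keys (not shown). The scope naming scheme (`model:<family>`) and the 30-second clock-skew leeway are assumptions for illustration.

```python
import time
from typing import Any, Dict, Optional


def validate_claims(claims: Dict[str, Any], *, issuer: str, audience: str,
                    required_scope: str, now: Optional[float] = None,
                    leeway_s: float = 30.0) -> bool:
    """Check iss, aud, exp and scope on an already signature-verified token."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        return False
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be str or list
    if audience not in audiences:
        return False
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp + leeway_s:
        return False
    scopes = str(claims.get("scope", "")).split()  # space-delimited scope string
    return required_scope in scopes
```

For example, a customer service application whose token carries only `model:small` would fail the check for `model:large`, which is the least-privilege behaviour described above.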

Secrets Management

  • Model provider API keys are never stored in gateway configuration files, environment variables, or source code. All credentials are retrieved at runtime from a vault (see EAAPL-SEC008).
  • Gateway retrieves short-lived, dynamically generated credentials where the provider supports it (e.g., AWS Bedrock via IAM role assumption, Azure OpenAI via managed identity).
  • Gateway logs never include raw API keys; only key IDs are logged for traceability.

Data Classification

  • Requests carrying data classification labels above a permitted threshold for a given provider are blocked by the policy engine. For example, requests labelled CONFIDENTIAL may not be routed to external commercial model providers; only to on-premises inference endpoints.
  • Data classification labels are injected by the consuming application or inferred by the prompt firewall (EAAPL-SEC005).
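
The threshold rule can be sketched as an ordered enumeration plus a per-provider ceiling. Only the CONFIDENTIAL label comes from the text; the other levels, the provider names, and the ceilings are hypothetical.

```python
from enum import IntEnum
from typing import Dict


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical ceilings: the highest label each provider may receive.
PROVIDER_CEILING: Dict[str, Classification] = {
    "external-commercial": Classification.INTERNAL,
    "on-prem-inference": Classification.RESTRICTED,
}


def may_route(label: Classification, provider: str,
              ceilings: Dict[str, Classification] = PROVIDER_CEILING) -> bool:
    """Allow routing only if the label does not exceed the provider's ceiling.

    Providers with no configured ceiling are denied by default.
    """
    ceiling = ceilings.get(provider)
    return ceiling is not None and label <= ceiling
```

Under these assumed ceilings, a CONFIDENTIAL request is refused for the external provider and allowed for the on-premises endpoint.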

Encryption

  • All traffic in transit uses TLS 1.3. TLS 1.0/1.1/1.2 are disabled.
  • Audit logs are encrypted at rest using AES-256. Log encryption keys are managed separately from gateway operational keys.
  • Request/response content stored in audit logs (if enabled) is encrypted with per-record keys, limiting the impact of a log store breach.

Auditability

  • Every request generates an audit record with: caller identity, timestamp (nanosecond precision), model requested, model served, token counts, cost, policy decisions, response HTTP status, and a SHA-256 hash of the request body.
  • Audit logs are written to an append-only store (S3 Object Lock, Azure Immutable Blob Storage) with a minimum 7-year retention for regulated entities.

OWASP LLM Top 10 Coverage

| OWASP LLM Risk | Gateway Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Prompt Firewall (SEC002) inline at gateway; pattern and semantic analysis | High |
| LLM02: Insecure Output Handling | Output Filter (SEC006) inspects all responses before delivery | High |
| LLM03: Training Data Poisoning | Out of scope for gateway (training-time control); gateway logs anomalous output patterns for investigation | Low |
| LLM04: Model Denial of Service | Rate limiting per caller and globally; circuit breaker prevents provider overload from cascading | High |
| LLM05: Supply Chain Vulnerabilities | Provider allow-list enforced at routing layer; only approved providers are routable | Medium |
| LLM06: Sensitive Information Disclosure | Output Filter detects PII in responses; input sanitisation prevents PII from entering prompts | High |
| LLM07: Insecure Plugin Design | Secure Tool Invocation pattern (SEC004) enforced as a gateway policy for agent tool calls | Medium |
| LLM08: Excessive Agency | Human approval gates can be enforced at gateway for high-risk request types | Medium |
| LLM09: Overreliance | Out-of-scope for gateway; addressed in application layer | None |
| LLM10: Model Theft | Model provider credentials protected in vault; no credential exposure via gateway APIs | High |

9. Governance Considerations

Responsible AI

  • The gateway is the enforcement point for the organisation's AI Acceptable Use Policy. Policy rules in the policy store codify the AUP into enforceable controls.
  • Every AI interaction is logged with sufficient context to support post-hoc review of AI-assisted decisions — a core requirement of responsible AI frameworks.

Model Risk Management

  • The gateway's routing rules enforce which models may be used for which use cases. High-risk use cases (credit decisions, medical triage) can be restricted to approved, validated models only.
  • Model version pinning at the gateway ensures that model updates do not reach production applications without going through the change management process.

Human Approval Gates

  • The policy engine can require human approval for request types flagged as high-risk (e.g., requests to execute code, send communications, or modify records). Human approval workflows are triggered via an integration with the organisation's ITSM platform.

Policy Management

  • AI usage policies are maintained as code (OPA Rego or Cedar policies) in a version-controlled repository. Changes undergo PR review, automated policy testing, and staged rollout through gateway environments (dev → staging → production).

Traceability

  • Every policy decision is logged with the rule ID that triggered it, enabling governance teams to audit which policies are most frequently triggered, identify policy gaps, and demonstrate regulatory compliance.

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| AI Usage Policy (OPA/Cedar) | AI Governance Team | Reviewed quarterly; updated as needed | Codifies AUP into enforceable gateway rules |
| Model Access Control List | AI Platform Team | Updated with each new model onboarding | Defines which teams may use which models |
| Audit Log Export | Compliance Team | Monthly extract; on-demand for incidents | Regulatory evidence; incident investigation |
| Cost Allocation Report | Finance + AI Platform | Monthly | AI spend governance; budget vs actuals |
| Policy Violation Report | Security Operations | Weekly | Identifies abuse patterns; tuning of policy rules |
| Circuit Breaker Runbook | AI Platform / SRE | Reviewed after each provider incident | Operational response to provider degradation |

10. Operational Considerations

Monitoring

  • Gateway metrics must be collected at sub-second granularity: request rate, error rate, p50/p95/p99 latency per provider, token throughput, quota utilisation per caller, circuit breaker state, and cost rate.
  • Dashboards provide both real-time operational view (SRE) and 30-day trend view (governance).

SLOs

| SLO | Target | Measurement Method |
|---|---|---|
| Gateway availability | 99.95% | Synthetic health checks from all availability zones every 30s |
| Request latency added by gateway (p99) | <10ms (excluding model latency) | Distributed trace: gateway entry → provider forward timestamp |
| Authentication success rate | >99.9% | 1 − (count of 401s / total requests) |
| Policy decision latency (p99) | <2ms | Internal span: policy_engine_start → policy_engine_end |
| Audit log write durability | 100% (zero lost records) | Log record count reconciliation; dead-letter queue for failed writes |
| Circuit breaker false positive rate | <0.1% | Manual review of circuit open events |
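
To make the availability target concrete, the short calculation below converts an availability percentage into a downtime budget. The 30-day period is an assumption; the table does not state the SLO measurement window.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability target."""
    total_minutes = period_days * 24 * 60
    return (1 - availability_pct / 100) * total_minutes


# 99.95% over an assumed 30-day window: 0.0005 * 43,200 min, about 21.6 minutes.
budget = downtime_budget_minutes(99.95)
```

At roughly 21.6 minutes per 30 days, a single 15-minute region failover (the longest RTO in the DR table) would consume most of a month's budget.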

Logging

  • Structured JSON logs. Mandatory fields: trace_id, span_id, caller_id, model_requested, model_served, request_tokens, response_tokens, cost_usd, policy_decision, http_status, latency_ms, timestamp_utc.
  • Log level INFO for all requests; WARN for policy violations; ERROR for auth failures and circuit breaker events; AUDIT for all request/response pairs (separate immutable log stream).
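
A minimal sketch of a log builder that enforces the mandatory fields listed above. The function name and the choice to raise on missing fields are assumptions; `timestamp_utc` is filled in automatically when the caller omits it.

```python
import json
from datetime import datetime, timezone
from typing import Any

MANDATORY_FIELDS = (
    "trace_id", "span_id", "caller_id", "model_requested", "model_served",
    "request_tokens", "response_tokens", "cost_usd", "policy_decision",
    "http_status", "latency_ms", "timestamp_utc",
)


def build_log_record(**fields: Any) -> str:
    """Serialise one structured log line, rejecting incomplete records."""
    fields.setdefault("timestamp_utc", datetime.now(timezone.utc).isoformat())
    missing = [name for name in MANDATORY_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing mandatory log fields: {missing}")
    return json.dumps(fields, sort_keys=True)
```

Failing loudly on an incomplete record surfaces instrumentation gaps in testing, before they become holes in the audit stream.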

Incident Management

  • P1: Gateway unavailable — all AI workloads impacted. Pager alert to AI Platform SRE + escalation to Architecture owner within 5 minutes.
  • P2: Provider circuit open with no healthy fallback. Pager alert; initiate fallback provider activation.
  • P3: Sustained rate of policy violations (>1% of requests). Alert to Security Operations for investigation.

DR

| Scenario | RTO | RPO | Recovery Approach |
|---|---|---|---|
| Single gateway instance failure | 30s | 0 (stateless data plane) | Load balancer removes unhealthy instance; autoscaling adds replacement |
| Redis quota store failure | 5min | Accept brief over-quota traffic | Fail-open mode: allow traffic with alert; quota store cluster with automatic failover |
| Vault unavailable | 2min | 0 | Gateway fails closed (no credentials = no traffic); vault HA cluster |
| Full gateway region failure | 15min | 0 | Active-active multi-region deployment; Route 53/Azure Traffic Manager DNS failover |

Capacity

  • Gateway instances hold no durable state in the data plane (policy decisions are made against an in-process cache; quota counters live in the external quota store). Scale horizontally with demand.
  • Redis quota store: size for peak token rate × TTL as a worst case. At 10M tokens/minute with a 1-minute rolling window, even one counter entry per token is ~10M entries × 20 bytes = ~200MB, comfortably within Redis memory limits; aggregated per-caller counters need far less.
  • Provision for 3× normal peak to absorb burst without autoscaling lag.

11. Cost Considerations

Cost Drivers

| Cost Driver | Description | Relative Impact |
|---|---|---|
| Compute (gateway instances) | CPU/memory for request processing, policy evaluation, TLS termination | Medium |
| Redis (quota store) | Managed Redis cluster for rate limiting state | Low |
| Vault (secrets management) | HashiCorp Vault Enterprise or cloud-native equivalent | Low–Medium |
| Log storage (audit logs) | Immutable log storage for 7 years; grows linearly with request volume | Medium (long-term) |
| Egress (model provider calls) | Dominates total cost; gateway adds ~0.1% overhead per request | Low (gateway-specific) |
| Engineering (operations) | SRE time to operate, tune, and evolve the gateway | Medium |

Scaling Risks

  • Audit log storage grows unboundedly. Implement tiered storage (hot → warm → cold → archive) with automated lifecycle policies.
  • Redis memory pressure at extreme token volumes. Use token bucket algorithm with decay to limit state size.
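
A token bucket keeps rate-limit state small because each caller needs only two numbers: the current token balance and the time of the last update. The sketch below is an in-process illustration with an injectable clock. A shared deployment would hold the same two values in the quota store and update them atomically, which is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills continuously, capped at `capacity`."""

    def __init__(self, capacity: float, refill_per_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.tokens = capacity  # start full
        self.updated_at = clock()

    def try_consume(self, amount: float = 1.0) -> bool:
        """Deduct `amount` if available; return False (HTTP 429) otherwise."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated_at = now
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False
```

For LLM traffic, `amount` can be the request's token count rather than 1, so that a single large prompt draws down the quota proportionally.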

Optimisations

  • Cache policy decisions for stable caller+model combinations (1-minute TTL) to avoid OPA evaluation on every request.
  • Use spot/preemptible instances for non-stateful gateway replicas (failover to on-demand automatically).
  • Compress audit logs before writing to object storage (LZ4/Zstandard): structured JSON typically shrinks by 70–80%.

Indicative Cost Range

| Scale | Monthly AWS Cost (USD) | Notes |
|---|---|---|
| Small (< 1M requests/day) | $500–$1,500 | 2 ECS Fargate tasks, ElastiCache t3.medium, CloudWatch Logs |
| Medium (1M–50M requests/day) | $2,000–$8,000 | 4–8 ECS tasks, ElastiCache r6g.large cluster, S3 immutable logs |
| Large (> 50M requests/day) | $15,000–$40,000 | EKS cluster, ElastiCache r6g.4xlarge, dedicated log pipeline |

These figures cover gateway infrastructure only. Model provider API costs are the dominant expenditure and are not included.


12. Trade-Off Analysis

Option Comparison

| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Build custom gateway | Develop gateway as an internal service using Envoy or Kong as a base | Full control; can add AI-specific features; no vendor lock-in | High development and maintenance cost; requires specialist expertise | Large enterprises with unique AI governance requirements |
| B: Cloud-native AI gateway | Use Azure APIM + Azure AI Content Safety, or AWS API Gateway + Bedrock Guardrails | Low operational overhead; native integration with cloud AI services; managed SLA | Vendor lock-in; limited multi-cloud support; less flexible policy engine | Organisations committed to a single cloud provider |
| C: Commercial or open-source AI gateway product | LiteLLM Proxy (open-source), Portkey, Martian, or similar | Purpose-built for LLM use cases; fast time-to-value; community support | Less mature enterprise features; vendor viability risk; may not meet all compliance requirements | Mid-market organisations; teams needing quick deployment |
| D: Service mesh with AI extensions | Extend existing Envoy-based service mesh (Istio/Consul) with AI-specific filters | Reuses existing investment; consistent with microservices observability | Significant customisation required; LLM-specific features (token counting, streaming) require custom WASM/Lua filters | Organisations with mature service mesh already deployed |

Architectural Tensions

| Tension | Trade-Off |
|---|---|
| Security vs Latency | Every security check (auth, policy, prompt firewall) adds latency. Target: <10ms gateway overhead. Achieve through in-process caching, async audit logging, and hardware-accelerated TLS. |
| Observability vs Privacy | Full request/response logging maximises audit capability but risks logging sensitive data. Resolution: log content hashes by default; full content logging opt-in per data classification level, with field-level redaction. |
| Centralisation vs Resilience | A gateway is a single logical control point; if poorly designed, it becomes a single point of failure. Resolution: active-active multi-region deployment; fail-open for quota (not for auth) to maintain availability. |
| Policy Strictness vs Developer Productivity | Overly strict policies block legitimate use; overly permissive policies defeat the purpose. Resolution: graduated enforcement (warn → soft-block → hard-block) with developer-visible explanations. |
| Cost Visibility vs Performance | Fine-grained cost tagging (per-request, per-user) requires token counting and cost ledger writes on every request. Resolution: async cost event writes to a queue; batch persist to ledger every 5 seconds. |
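
The batching resolution for cost events can be sketched as a small buffer that hands accumulated events to a sink once the flush interval has elapsed. `CostEventBatcher` and its `sink` callable are hypothetical. This simplified version only checks the interval when a new event arrives, whereas a production version would also flush from a timer and on shutdown.

```python
import time
from typing import Callable, Dict, List


class CostEventBatcher:
    """Buffer cost events and persist them in batches rather than per request."""

    def __init__(self, sink: Callable[[List[Dict]], None],
                 flush_interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.sink = sink  # e.g. a function that writes a batch to the cost ledger
        self.flush_interval_s = flush_interval_s
        self.clock = clock
        self._buffer: List[Dict] = []
        self._last_flush = clock()

    def add(self, event: Dict) -> None:
        """Queue an event; flush if the interval has elapsed."""
        self._buffer.append(event)
        if self.clock() - self._last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered events to the sink and start a new batch."""
        if self._buffer:
            self.sink(self._buffer)
            self._buffer = []
        self._last_flush = self.clock()
```

The trade-off is that up to one interval of cost events can be lost if an instance crashes, so budget enforcement built on the ledger lags by the same amount.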

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Gateway instance crash | Low | High (if single instance) | Load balancer health check failure → alert | Autoscaling replaces instance; deploy minimum 3 instances in production |
| Redis quota store timeout | Medium | Medium (brief over-quota traffic) | Latency spike on rate limit check → alert | Fail-open for quota; Redis Sentinel/Cluster for HA |
| Vault unreachable | Low | Critical (all traffic blocked) | 503 spike → critical alert | Vault HA cluster; cached credentials TTL 5min as emergency fallback |
| Policy store stale | Medium | Medium (stale policy decisions) | Policy cache age metric → alert | Cache TTL 60s; background refresh; explicit invalidation API |
| Prompt firewall false positive rate spike | Medium | High (legitimate traffic blocked) | 400 rate spike from prompt firewall → alert | Tuning runbook; emergency bypass flag per caller (audited) |
| Audit log write failure | Low | Critical (regulatory compliance gap) | Dead-letter queue depth > 0 → critical alert | Retry with exponential backoff; dead-letter queue with separate drain process |
| TLS certificate expiry | Low | Critical (all traffic blocked) | Certificate expiry monitoring → 30-day warning | Automated certificate rotation via cert-manager or ACM |
| Model provider mass outage | Medium | High | Circuit breaker opens for multiple providers simultaneously | Fallback to on-premises model; queue non-urgent requests; alert users |
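
The recovery approach for audit log write failures (retry with exponential backoff, then a dead-letter queue) can be sketched as below. The attempt count and base delay are assumptions, the dead-letter queue is modelled as a plain list, and the `write` callable stands in for whatever client writes to the audit store.

```python
import time
from typing import Callable, Dict, List


def write_with_retry(write: Callable[[Dict], None], record: Dict,
                     dead_letter: List[Dict], max_attempts: int = 4,
                     base_delay_s: float = 0.1,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to write a record, doubling the delay after each failure.

    Returns True on success. After the final failed attempt the record is
    parked in the dead-letter queue so that it is never silently dropped.
    """
    for attempt in range(max_attempts):
        try:
            write(record)
            return True
        except Exception:  # a real implementation would catch specific errors
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    dead_letter.append(record)
    return False
```

A separate drain process would replay the dead-letter queue once the store recovers, and a non-zero queue depth is the alert signal named in the table.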

Cascading Failure Scenarios

Scenario 1: Vault + Redis simultaneous failure

If both Vault (credential store) and Redis (quota store) fail simultaneously, the gateway can neither retrieve credentials nor enforce quotas. The gateway must fail closed (return 503): applying the usual quota fail-open while Vault is down would let unmetered requests reach providers on cached credentials. Mitigation: Vault and Redis must be deployed on independent infrastructure with no shared failure domain.

Scenario 2: Policy store becomes unavailable during a security incident

If the policy store becomes unavailable at the same time as a security event requiring a policy update (e.g., a compromised caller key), the gateway will continue serving the last cached policy. Mitigation: emergency policy override API that writes directly to the in-memory cache on each gateway instance; secured with break-glass credentials.


14. Regulatory Considerations

| Regulation | Requirement | Gateway Implementation |
| --- | --- | --- |
| APRA CPS234 (Information Security) | Maintain information security controls for third-party service providers | Model provider access through gateway enforces ACL; audit trail demonstrates access governance |
| APRA CPS230 (Operational Risk) | Identify and manage risks from third-party dependencies | Circuit breaker provides operational resilience; provider health metrics enable risk monitoring |
| Australian Privacy Act 1988 | Personal information must not be disclosed to overseas recipients without consent | Data classification enforcement in policy engine blocks requests containing PI from routing to non-approved providers |
| EU AI Act Article 9 (Risk Management) | High-risk AI systems must implement risk management measures | Gateway enforces model access controls for high-risk use cases; audit log supports risk documentation |
| EU AI Act Article 12 (Record-Keeping) | High-risk AI systems must maintain logs enabling post-hoc audit | Immutable audit log with 7-year retention satisfies this requirement |
| ISO/IEC 42001 §6.1 (Risk Treatment) | Implement controls for identified AI risks | Gateway operationalises risk treatment actions from AI risk register |
| NIST AI RMF GOVERN 1.2 | Accountability mechanisms for AI systems | Caller identity + audit log creates clear accountability chain for every AI request |
| NIST AI RMF MANAGE 2.4 | Monitor AI system performance | Gateway metrics and alerts implement continuous AI performance monitoring |
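
The Privacy Act row describes the policy engine blocking requests that contain PI from being routed to non-approved providers. A minimal sketch of such a check is shown below. The classification labels, provider names, and the `may_route` helper are hypothetical and chosen only for illustration; in practice the rule would more likely be written in the policy engine's own language.

```python
# Hypothetical labels and provider names, for illustration only.
APPROVED_PROVIDERS = {
    "personal_information": frozenset({"onprem-llm"}),
    "public": frozenset({"onprem-llm", "overseas-api"}),
}


def may_route(classification: str, provider: str) -> bool:
    """Allow routing only when the provider is approved for the data class.

    Unknown or missing classifications are denied (fail closed).
    """
    return provider in APPROVED_PROVIDERS.get(classification, frozenset())
```

Denying unlabelled requests by default is a design choice of this sketch: it avoids PI leaking through a request that was never classified.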

15. Reference Implementations

AWS

| Component | AWS Service |
| --- | --- |
| Gateway compute | ECS Fargate (Kong or custom Go service) or API Gateway with Lambda authoriser |
| Auth | Cognito (IdP) + Lambda JWT authoriser |
| Policy engine | Lambda function hosting OPA with S3 policy bundle |
| Rate limiting | ElastiCache for Redis (token bucket counters) |
| Secrets | AWS Secrets Manager with automatic rotation |
| Routing | Application Load Balancer + ECS service discovery |
| Audit logs | Kinesis Firehose → S3 with Object Lock (WORM) |
| Cost tracking | Custom Lambda → Cost and Usage Report + Athena |
| Monitoring | CloudWatch + X-Ray distributed tracing |
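
The "token bucket counters" named in the rate-limiting row follow a simple algorithm, sketched below for a single process. This is an illustration only: the `TokenBucket` class and its parameters are assumptions, and the counters are held in memory here, whereas the reference implementation keeps them in Redis so that all gateway instances share one view of each caller's budget.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket; a shared store would replace these fields."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        now: Callable[[], float] = time.monotonic,
    ):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._now = now
        self._tokens = float(capacity)  # start with a full bucket
        self._last = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

The `cost` argument lets a caller charge by estimated token count rather than per request, which is often a closer fit for LLM traffic. A Redis-backed version would need the refill-and-spend step to run atomically, for example in a server-side script.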

Azure

| Component | Azure Service |
| --- | --- |
| Gateway | Azure API Management (APIM) with custom policies |
| Auth | Microsoft Entra ID (formerly Azure AD) + APIM JWT validation policy |
| Policy engine | APIM policy expressions + Azure Functions for complex rules |
| Rate limiting | APIM built-in rate limiting + Azure Cache for Redis |
| Secrets | Azure Key Vault with managed identity |
| Routing | APIM backends + Azure Application Gateway |
| Audit logs | Event Hub → Azure Blob Storage with immutability (WORM) policy |
| Content safety | Azure AI Content Safety (integrates natively with APIM) |
| Monitoring | Azure Monitor + Application Insights |

GCP

| Component | GCP Service |
| --- | --- |
| Gateway | Apigee API Management or Cloud Run (Kong/Envoy) |
| Auth | Google Identity Platform + Cloud IAP |
| Policy engine | Cloud Run (OPA) with Cloud Storage policy bundles |
| Rate limiting | Memorystore for Redis |
| Secrets | Secret Manager |
| Audit logs | Cloud Logging → Cloud Storage with retention lock |
| Monitoring | Cloud Monitoring + Cloud Trace |

On-Premises

| Component | Technology |
| --- | --- |
| Gateway | Kong Enterprise or Envoy Proxy with custom filters |
| Auth | Active Directory Federation Services + OAuth2 Proxy |
| Policy engine | OPA deployed as sidecar or standalone service |
| Rate limiting | Redis Sentinel cluster |
| Secrets | HashiCorp Vault Enterprise |
| Audit logs | Kafka → Elasticsearch with ILM immutability policy |
| Monitoring | Prometheus + Grafana + Jaeger |

16. Related Patterns

| Pattern | ID | Relationship |
| --- | --- | --- |
| Prompt Firewall | EAAPL-SEC002 | Deployed inline within gateway; gateway calls firewall as a filter stage |
| LLM Input Sanitisation | EAAPL-SEC005 | Complementary to prompt firewall; deeper PII/schema validation within gateway pipeline |
| AI Output Filtering | EAAPL-SEC006 | Deployed as post-generation filter within gateway; shares audit log infrastructure |
| Zero-Trust AI Pipeline | EAAPL-SEC007 | Gateway is the primary enforcement point for zero-trust policy; SEC007 extends to intra-pipeline trust |
| Secrets Management for AI | EAAPL-SEC008 | Gateway depends on this pattern for all model provider credentials |
| AI Data Classification | EAAPL-SEC009 | Classification labels consumed by gateway policy engine for routing decisions |
| AI Telemetry | EAAPL-OBS001 | Gateway is the primary source of AI telemetry (token counts, latency, errors) |
| AI Cost Observability | EAAPL-OBS006 | Gateway's cost ledger is the primary data source for cost observability |
| Model Isolation | EAAPL-SEC003 | Gateway enforces network boundaries that complement model isolation at the compute layer |

17. Maturity Assessment

Overall Maturity: Mature

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Pattern definition clarity | 5 | Well-defined, unambiguous scope and responsibilities |
| Technology availability | 5 | Mature OSS and commercial options available across all major clouds |
| Industry adoption | 4 | Widely adopted in financial services and regulated industries; emerging in other sectors |
| Operational tooling | 4 | Strong monitoring and operations tooling; some AI-specific metrics require custom implementation |
| Regulatory alignment | 5 | Directly addresses APRA CPS234, EU AI Act, Privacy Act requirements |
| Reference implementation availability | 4 | Reference implementations available for all major clouds; AI-specific extensions require custom work |
| Community knowledge | 4 | Strong API gateway community; LLM-specific extensions are an emerging body of knowledge |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-01-15 | AI Architecture Team | Initial pattern definition |
| 1.1 | 2024-04-20 | AI Architecture Team | Added EU AI Act regulatory mapping; expanded DR scenarios |
| 2.0 | 2024-09-10 | AI Architecture Team | Major revision: added streaming support guidance; updated OWASP LLM Top 10 to 2024 edition; added GCP reference implementation |
| 2.1 | 2025-03-01 | AI Architecture Team | Added cost observability integration; expanded failure mode analysis; aligned with ISO 42001 §6.1 |