[EAAPL-SEC001] AI Gateway

Category: Security / API Control Plane
Sub-category: Traffic Management & Policy Enforcement
Version: 2.1
Maturity: Mature
Tags: api-gateway rate-limiting authentication cost-allocation circuit-breaker policy-enforcement ai-operations
Regulatory Relevance: APRA CPS234, EU AI Act Art. 9 (Risk Management), ISO 42001 §6.1, NIST AI RMF GOVERN 1.2

1. Executive Summary

The AI Gateway pattern establishes a centralised, enterprise-grade control plane through which all AI traffic flows — inbound requests from applications and users, and outbound calls to model providers. It functions as the "first and last line of defence" for every AI interaction in the enterprise.

From a business perspective, the AI Gateway solves three compounding problems that emerge when AI usage scales without discipline: uncontrolled spend (teams independently acquiring model API keys lead to budget overruns with no visibility), inconsistent security posture (each team re-inventing authentication, logging, and abuse controls), and regulatory exposure (no single audit trail for AI interactions).

The gateway provides authentication and authorisation for every AI request, enforces rate limits and cost budgets per team or product, routes traffic intelligently across multiple model providers, captures structured logs for compliance, and breaks the circuit when downstream models are degraded. Organisations that deploy this pattern typically report 30–50% reduction in AI spend waste through visibility and quota enforcement, and can demonstrate AI audit trails to regulators within 24 hours of a request.

This pattern is the foundation upon which all other AI security and observability patterns depend. It should be the first pattern deployed in any enterprise AI programme.

2. Problem Statement

Business Problem

Enterprise organisations adopting AI at scale face ungoverned sprawl: dozens of teams independently calling OpenAI, Anthropic, Azure OpenAI, and other providers with individual API keys. There is no budget control, no unified audit trail, no abuse detection, and no single point where policy can be enforced. A single misconfigured application or compromised key can generate hundreds of thousands of dollars in model API spend within hours. Regulatory bodies (APRA, ASIC, EU regulators) increasingly require organisations to demonstrate comprehensive audit trails for AI-assisted decisions — an impossibility without a centralised control point.

Technical Problem

Without a gateway:

Each application must independently implement auth, rate limiting, retry logic, and logging — creating N inconsistent implementations.
Model provider credentials are distributed across dozens of services, dramatically increasing the blast radius of a credential leak.
There is no circuit breaker: a degraded model provider cascades into application failures.
Cost attribution is impossible: spend cannot be allocated to teams, products, or use cases.
Routing logic (e.g., fallback to a cheaper model for low-complexity requests) must be duplicated across every consuming application.

Symptoms of Absence

Unexplained spikes in model API bills.
Security incidents involving leaked model API keys.
Different applications enforcing different content policies, creating inconsistent user experiences.
Inability to produce AI usage reports for compliance audits.
Cascading application failures when a model provider has an outage.
No capacity to enforce organisational AI usage policies (e.g., "no patient data to external models").

Cost of Inaction

Dimension	Impact
Financial	Uncontrolled model API spend; potential for runaway costs from abuse or bugs
Regulatory	Cannot demonstrate AI audit trail to APRA/EU AI Act auditors; enforcement risk
Security	Distributed credentials; no unified threat detection; full blast radius on key leak
Operational	N × duplicated retry/rate-limit/log implementations; no unified model health visibility
Reputational	Policy violations reach users (harmful content, data leakage) without a filter layer

3. Context

When to Apply

Organisation has more than one team or application calling AI model APIs.
AI model API spend exceeds $5,000/month or is forecast to.
Organisation operates in a regulated industry (financial services, healthcare, government).
Multiple model providers are in use or planned.
Security team requires audit trails for AI interactions.
AI applications are user-facing and require content policy enforcement.

When NOT to Apply

Single-team proof-of-concept with a 90-day sunset — gateway adds operational overhead disproportionate to PoC scope.
Fully offline/on-premises model inference where the model is a library call within the same process — a gateway adds latency without security benefit at the network boundary.
When a cloud-native AI platform (e.g., Azure AI Studio with built-in APIM integration) already provides all required controls natively and the team can accept vendor lock-in.

Prerequisites

Prerequisite	Detail
Identity Provider	OIDC/SAML IdP capable of issuing JWT tokens to calling applications
Secrets Management	Vault or equivalent for model provider credentials
Observability Stack	Log aggregation and metrics platform to receive gateway telemetry
Network Topology	Gateway must be reachable by all AI-consuming applications; egress to model providers permitted
API Catalogue	Inventory of existing AI API calls to route through the gateway

Industry Applicability

Industry	Applicability	Key Driver
Financial Services	High	APRA CPS234, audit trails, cost governance
Healthcare	High	Patient data controls, regulatory AI traceability
Government	High	Sovereignty, audit, classification enforcement
Retail / E-commerce	Medium	Cost control, content policy
Technology / SaaS	Medium	Multi-team cost allocation, developer platform
Education	Medium	Content policy, budget governance

4. Architecture Overview

The AI Gateway is deployed as a horizontally scalable reverse proxy that sits at the intersection of all AI-consuming workloads and all model provider endpoints. It is not a simple HTTP proxy — it is a stateful policy engine with its own data plane (real-time request processing) and control plane (policy configuration, key management, quota administration).

Why a dedicated gateway rather than embedding controls in each application?

The fundamental architectural reason is that cross-cutting concerns — authentication, rate limiting, cost allocation, audit logging, circuit breaking — are almost always implemented inconsistently when distributed across teams. The gateway externalises these concerns into a single, auditable, independently operated service. This mirrors the established API gateway pattern for REST/GraphQL APIs, extended with AI-specific capabilities.

Request Path Design

Inbound requests arrive from applications carrying a service identity token (mTLS client certificate or JWT). The gateway's authentication middleware validates the token against the enterprise IdP before any processing occurs. This ensures that unauthenticated requests fail fast and are never forwarded to model providers — preventing credential abuse if an internal application is compromised.

After authentication, the policy engine evaluates the request against a rule set: Does this caller have permission to use this model? Does this request exceed the caller's rate quota? Does this request carry a data classification label that prohibits forwarding to the requested external provider? Policy decisions are made in-process against an in-memory policy cache (refreshed from the policy store every 60 seconds) to keep decision latency under 1ms.
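
The in-process decision path described above can be sketched as follows. This is a minimal illustration, not the pattern's reference implementation: the names `Policy`, `PolicyCache`, `evaluate`, and the label ordering in `CLASSIFICATION_RANK` are all hypothetical, the quota check is left to the rate limiter, and the cache reloads lazily on read rather than through a background refresh.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical label ordering; a real deployment defines its own scheme.
CLASSIFICATION_RANK = {"PUBLIC": 0, "INTERNAL": 1, "CONFIDENTIAL": 2}


@dataclass(frozen=True)
class Policy:
    model_acl: dict         # caller_id -> set of model names the caller may use
    provider_ceiling: dict  # provider -> highest label that provider may receive


class PolicyCache:
    """Holds the last policy in process and reloads it once the TTL has passed."""

    def __init__(self, load: Callable[[], Policy], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load, self._ttl, self._clock = load, ttl_seconds, clock
        self._policy: Optional[Policy] = None
        self._loaded_at = 0.0

    def get(self) -> Policy:
        now = self._clock()
        if self._policy is None or now - self._loaded_at >= self._ttl:
            self._policy = self._load()  # the only call that leaves the process
            self._loaded_at = now
        return self._policy


def evaluate(policy: Policy, caller: str, model: str, provider: str, label: str):
    """Return (allowed, rule_id); both checks run in memory with no I/O."""
    if model not in policy.model_acl.get(caller, set()):
        return False, "model_not_permitted"
    ceiling = policy.provider_ceiling.get(provider)
    if ceiling is None or CLASSIFICATION_RANK[label] > CLASSIFICATION_RANK[ceiling]:
        return False, "classification_exceeds_ceiling"
    return True, "allow"
```

Returning a rule identifier alongside the decision keeps each outcome attributable to a specific rule when it is logged.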

Routing and Provider Abstraction

The gateway abstracts model provider APIs behind a unified internal schema. Consuming applications call a single internal endpoint (/v1/chat/completions) regardless of whether the request will be served by GPT-4, Claude 3.7, or an on-premises Llama deployment. The routing layer maps requests to providers based on model name, caller preference, load, cost optimisation rules, and provider health. This abstraction is critical: it allows organisations to switch providers, add fallbacks, or introduce shadow routing for model evaluation without changing consuming applications.
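
A rough sketch of the provider selection step, under simplifying assumptions: `Provider`, `select_provider`, and the single blended price per provider are hypothetical, and load-based routing is omitted. The sketch honours a caller preference when that provider is healthy and otherwise picks the cheapest healthy provider that serves the requested model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    name: str
    models: frozenset            # model names this provider can serve
    cost_per_1k_tokens: float    # illustrative blended price, not a real tariff
    healthy: bool = True


def select_provider(model: str, providers: list,
                    preferred: Optional[str] = None) -> Provider:
    """Pick a healthy provider for `model`, honouring the caller preference if possible."""
    candidates = [p for p in providers if model in p.models and p.healthy]
    if not candidates:
        raise LookupError(f"no healthy provider serves {model!r}")
    for p in candidates:
        if p.name == preferred:
            return p
    # No usable preference: fall back to the cheapest healthy candidate.
    return min(candidates, key=lambda p: p.cost_per_1k_tokens)
```

Because consumers only name a model, swapping or adding a provider changes the `providers` list and nothing in the calling applications.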

Why circuit breaking at the gateway?

Model providers have variable availability SLAs, and LLM inference latency is orders of magnitude higher than typical microservice calls. Without a circuit breaker at the gateway, a degraded provider causes cascading timeouts across all consuming applications. The gateway's circuit breaker monitors error rates and latency per provider, opens the circuit when thresholds are breached (e.g., >10% 5xx over 60 seconds), routes traffic to the fallback provider, and attempts provider recovery with exponential backoff. This dramatically improves overall application resilience.
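
The state machine behind this behaviour might look like the sketch below. It assumes one breaker per provider; the class name, the `min_requests` guard, and the backoff values are illustrative choices rather than values from this pattern, and the half-open state admits all probe traffic instead of a single request.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the error ratio in a sliding window exceeds a threshold."""

    def __init__(self, error_threshold=0.10, window_seconds=60.0, min_requests=20,
                 base_backoff=5.0, max_backoff=300.0, clock=time.monotonic):
        self._threshold, self._window = error_threshold, window_seconds
        self._min_requests = min_requests
        self._base_backoff, self._max_backoff = base_backoff, max_backoff
        self._backoff = base_backoff
        self._clock = clock
        self._events = deque()   # (timestamp, is_error) pairs inside the window
        self._open_until = 0.0
        self.state = "CLOSED"

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self._clock() < self._open_until:
                return False             # caller should route to the fallback
            self.state = "HALF_OPEN"     # backoff elapsed: let probe traffic through
        return True

    def record(self, is_error: bool) -> None:
        now = self._clock()
        if self.state == "OPEN":
            return                       # ignore late results from in-flight calls
        if self.state == "HALF_OPEN":
            if is_error:                 # probe failed: reopen with doubled backoff
                self._backoff = min(self._backoff * 2, self._max_backoff)
                self._trip(now)
            else:                        # probe succeeded: close and reset
                self.state = "CLOSED"
                self._backoff = self._base_backoff
            return
        self._events.append((now, is_error))
        while self._events and self._events[0][0] <= now - self._window:
            self._events.popleft()
        errors = sum(1 for _, failed in self._events if failed)
        if len(self._events) >= self._min_requests and errors / len(self._events) > self._threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "OPEN"
        self._open_until = now + self._backoff
        self._events.clear()
```

The `min_requests` guard is there so that a single failure on a quiet provider does not open the circuit.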

Cost Allocation Architecture

Each request is tagged with a cost allocation key (team, product, use-case, user) at ingress. The gateway calculates cost in real-time by multiplying token counts (extracted from the provider response) by the current pricing table (refreshed daily from a configuration store). Cost events are written to a time-series cost ledger. Budget monitors subscribe to this ledger and emit alerts or enforcement actions (soft-block, hard-block) when budgets are approached or exceeded. This gives finance teams the ability to allocate AI spend on a monthly basis without manual reconciliation.
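
As a worked illustration of the calculation above, the sketch below prices one request and maps cumulative spend to an enforcement action. The prices, the model name `model-a`, and the 80% / 100% / 110% thresholds are made-up placeholders; the pattern itself only states that alerts and soft or hard blocks fire as budgets are approached or exceeded.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens: (prompt, completion).
PRICING = {"model-a": (Decimal("0.0100"), Decimal("0.0300"))}

# Hypothetical enforcement thresholds as fractions of the budget.
ALERT_AT, SOFT_BLOCK_AT, HARD_BLOCK_AT = Decimal("0.8"), Decimal("1.0"), Decimal("1.1")


@dataclass(frozen=True)
class CostEvent:
    team: str
    product: str
    model: str
    cost_usd: Decimal


def request_cost(model: str, prompt_tokens: int, completion_tokens: int,
                 pricing: dict = PRICING) -> Decimal:
    """Token counts multiplied by the per-1k price for each direction."""
    prompt_price, completion_price = pricing[model]
    return (prompt_price * prompt_tokens + completion_price * completion_tokens) / 1000


def budget_action(spent: Decimal, budget: Decimal) -> str:
    """Map cumulative spend for an allocation key to an enforcement action."""
    if spent >= budget * HARD_BLOCK_AT:
        return "hard-block"
    if spent >= budget * SOFT_BLOCK_AT:
        return "soft-block"
    if spent >= budget * ALERT_AT:
        return "alert"
    return "allow"
```

`Decimal` is used instead of floats so that summing many small per-request costs does not accumulate rounding drift in the ledger.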

Audit Logging

Every request and response traverses the audit logger, which writes a structured log record to an immutable audit log store (append-only, tamper-evident). The log record captures: caller identity, request timestamp, model requested, model served, token counts, cost, policy decisions made, response status, and a truncated hash of the request content (full content logging is optional and controlled by data classification). This log is the evidentiary foundation for regulatory compliance.
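
One possible shape for such a record is sketched below. The function name, the field names, the 16-character hash truncation, and the rule that only `PUBLIC` content is stored in full are assumptions made for illustration; writing to the append-only store is out of scope here.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical rule: only these classification labels permit full content logging.
FULL_CONTENT_LABELS = {"PUBLIC"}


def build_audit_record(caller_id: str, model_requested: str, model_served: str,
                       prompt_tokens: int, completion_tokens: int, cost_usd: str,
                       policy_decisions: list, http_status: int,
                       body: bytes, classification: str, hash_chars: int = 16) -> dict:
    """Assemble one audit record; content is hashed unless its label allows storage."""
    record = {
        "caller_id": caller_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_requested": model_requested,
        "model_served": model_served,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost_usd,
        "policy_decisions": policy_decisions,
        "http_status": http_status,
        # Truncated SHA-256 digest of the request body.
        "content_hash": hashlib.sha256(body).hexdigest()[:hash_chars],
    }
    if classification in FULL_CONTENT_LABELS:
        record["content"] = body.decode("utf-8", errors="replace")
    return record
```

Storing a digest by default lets an investigator confirm that a given prompt matches a record without the log itself holding sensitive text.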

5. Architecture Diagram

flowchart TD
    subgraph Consumers["AI Consumers"]
        A[Application Request]
    end
    subgraph Gateway["AI Gateway"]
        B[Auth + Policy]
        C[Rate Limiter]
        D[Prompt Firewall]
        E[Router + Circuit Breaker]
        F[Output Filter]
    end
    subgraph Backend["Providers + Observability"]
        G[Model Providers]
        H[Audit Log]
        I[Cost Ledger]
    end
    A -->|mTLS / JWT| B
    B -->|allow| C --> D --> E --> G
    G --> F --> A
    E --> H
    F --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
mTLS / JWT Auth	Security Middleware	Validates caller identity on every request; terminates unauthenticated requests immediately	Envoy, Kong, custom Go/Rust middleware	Critical
Policy Engine	Decision Engine	Evaluates per-request policy rules (model access, data classification, content type) against policy store	Open Policy Agent (OPA), Cedar, custom rule engine	Critical
Rate Limiter	Traffic Control	Enforces per-caller, per-model, and global token/request quotas; returns 429 on breach	Redis + Lua, Envoy rate limit service, Kong rate limiting plugin	Critical
Request Router	Routing Layer	Maps requests to model providers based on model name, load, cost, health; enables fallback routing	Envoy, Kong, NGINX + Lua, custom Go service	High
Prompt Firewall	Security Filter	Inline prompt injection and policy violation detection (see EAAPL-SEC002)	Custom classifier, Amazon Bedrock Guardrails, Azure AI Content Safety	High
Output Filter	Security Filter	Post-generation content and PII filtering (see EAAPL-SEC006)	Microsoft Presidio, Amazon Comprehend, custom NLP pipeline	High
Cost Calculator	Cost Accounting	Real-time cost computation from token counts × pricing table; writes cost events	Custom service with pricing API, FinOps platform integration	Medium
Circuit Breaker	Resilience	Monitors provider health; opens/closes circuit; routes to fallback on failure	Resilience4j, Envoy outlier detection, Hystrix (now in maintenance mode)	High
Audit Logger	Compliance	Writes immutable, structured audit records for every request/response	Kafka → S3/GCS immutable store, Splunk, Datadog	Critical
Policy Store	Configuration	Authoritative store of gateway policies (model ACLs, data classification rules, content policies)	OPA Bundles, AWS S3 + IAM, HashiCorp Vault	Critical
Quota Store	State Store	Real-time quota counters per caller, per model, per period	Redis Cluster, DynamoDB, Dragonfly	High
Key Vault	Secrets	Stores and dispenses model provider credentials; see EAAPL-SEC008	HashiCorp Vault, AWS Secrets Manager, Azure Key Vault	Critical
Cost Ledger	Financial	Time-series store of cost events for dashboarding and budget enforcement	InfluxDB, Prometheus, BigQuery, Snowflake	Medium

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Consumer Application	Sends HTTP POST to gateway `/v1/chat/completions` with mTLS client cert + JWT Bearer token in Authorization header	Inbound request at gateway TLS terminator
2	Auth Middleware	Validates mTLS client certificate against CA; validates JWT signature and claims (iss, aud, exp, scope) against IdP JWKS endpoint	Authenticated identity context attached to request
3	Policy Engine	Looks up caller in policy store; evaluates model access ACL, data classification label on request, and content type rules	ALLOW or DENY decision; deny returns 403 immediately
4	Rate Limiter	Atomically increments caller's token and request counters in Redis; checks against quota for current period	ALLOW or 429 Too Many Requests
5	Prompt Firewall	Scans request body for prompt injection patterns, PII, and policy violations	Sanitised request body or 400 Bad Request
6	Request Router	Evaluates routing rules; selects target model provider based on requested model, provider health, and load	Routing decision + provider credentials retrieved from vault
7	Circuit Breaker	Checks provider circuit state (CLOSED/OPEN/HALF-OPEN); if OPEN, routes to fallback provider	Forwarded request or fallback routing
8	Model Provider	Processes request; returns response with token usage metadata	Raw model response
9	Output Filter	Inspects response for PII leakage, harmful content, and policy violations	Filtered response or 502 if blocked
10	Cost Calculator	Extracts prompt_tokens + completion_tokens from response; multiplies by provider pricing; writes cost event	Cost-annotated response headers
11	Audit Logger	Writes structured log record (identity, model, tokens, cost, policy decisions, response status, content hash)	Audit record in immutable log store
12	Consumer Application	Receives filtered, cost-annotated response	Business logic continues

Error Flow

Error Condition	Gateway Behaviour	HTTP Status	Alert Triggered
Invalid/expired JWT	Reject at auth middleware; log failed auth attempt	401	Auth anomaly alert if >10/min
Policy DENY	Reject at policy engine; log policy violation	403	Policy violation alert
Rate limit exceeded	Reject at rate limiter; return Retry-After header	429	Quota alert to team budget owner
Prompt injection detected	Reject at prompt firewall; log sanitised indicator	400	Security incident alert
Provider circuit OPEN	Route to fallback; if no fallback, return 503	503	Provider health alert
Output policy violation	Block response; return opaque error to caller	502	Content policy alert
Vault unavailable	Fail closed: all requests rejected until vault recovers	503	Critical infrastructure alert
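
The ordering of the ingress checks and their status codes in the two tables above can be sketched as a short-circuiting pipeline. Everything here is illustrative: the stage functions read pre-computed boolean flags from a plain dict so the control flow is visible, whereas real stages would perform the validation themselves.

```python
from typing import Callable, Optional, Tuple

Request = dict
Rejection = Tuple[int, str]   # (HTTP status, reason)
Stage = Callable[[Request], Optional[Rejection]]


def check_auth(req: Request) -> Optional[Rejection]:
    return None if req.get("token_valid") else (401, "invalid_or_expired_jwt")


def check_policy(req: Request) -> Optional[Rejection]:
    return None if req.get("policy_allow") else (403, "policy_deny")


def check_quota(req: Request) -> Optional[Rejection]:
    return None if req.get("within_quota") else (429, "rate_limit_exceeded")


def check_prompt(req: Request) -> Optional[Rejection]:
    return (400, "prompt_injection_detected") if req.get("injection_detected") else None


# Order mirrors steps 2 to 5 of the primary flow.
INGRESS_STAGES = [check_auth, check_policy, check_quota, check_prompt]


def run_ingress(req: Request, stages=INGRESS_STAGES) -> Optional[Rejection]:
    """Return the first rejection, or None if the request may be routed."""
    for stage in stages:
        rejection = stage(req)
        if rejection is not None:
            return rejection   # later stages never see a rejected request
    return None
```

Running authentication first means an unauthenticated caller learns nothing about policy or quota state, since the 401 is returned before those stages execute.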

8. Security Considerations

Authentication & Authorisation

Mutual TLS (mTLS): All inbound connections from consumer applications require a client certificate issued by the enterprise CA. This provides cryptographic identity that cannot be forged with a stolen JWT alone.
JWT Validation: Bearer tokens carry caller identity, scope (which models are accessible), and expiry. Tokens are validated against the IdP's JWKS endpoint with a local cache (refreshed every 5 minutes). Short token lifetimes (15–60 minutes) limit the window of a compromised token.
Service-to-Service Identity: Consumer applications authenticate as service principals, not human users. Human-facing applications should not forward end-user tokens to the gateway — the application authenticates as itself and includes user context as a claim.
Scope-Based Model Access: JWT scopes define which model families a caller may access. A customer service application should not have scope to access GPT-4 if it only requires GPT-3.5-turbo. Principle of least privilege applies to model access.
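
The claim checks described above can be sketched as follows, assuming the token signature has already been verified against the IdP's keys by a JWT library. The function name, the 30-second clock leeway, and scope strings such as `model:small` are hypothetical; the sketch treats `aud` as either a string or a list and `scope` as a space-delimited string.

```python
import time
from typing import Optional


def validate_claims(claims: dict, *, issuer: str, audience: str,
                    required_scope: str, now: Optional[float] = None,
                    leeway_seconds: float = 30.0) -> list:
    """Return the names of claims that failed validation (empty list means valid)."""
    now = time.time() if now is None else now
    problems = []

    if claims.get("iss") != issuer:
        problems.append("iss")

    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        problems.append("aud")

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now > exp + leeway_seconds:
        problems.append("exp")   # a missing expiry is treated as a failure

    scopes = str(claims.get("scope", "")).split()
    if required_scope not in scopes:
        problems.append("scope")  # least privilege: the scope must be granted explicitly

    return problems
```

Returning the list of failed claims rather than a bare boolean gives the failed-auth log entry something specific to record.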

Secrets Management

Model provider API keys are never stored in gateway configuration files, environment variables, or source code. All credentials are retrieved at runtime from a vault (see EAAPL-SEC008).
Gateway retrieves short-lived, dynamically generated credentials where the provider supports it (e.g., AWS Bedrock via IAM role assumption, Azure OpenAI via managed identity).
Gateway logs never include raw API keys; only key IDs are logged for traceability.

Data Classification

Requests carrying data classification labels above a permitted threshold for a given provider are blocked by the policy engine. For example, requests labelled CONFIDENTIAL may not be routed to external commercial model providers; only to on-premises inference endpoints.
Data classification labels are injected by the consuming application or inferred by the prompt firewall (EAAPL-SEC002).

Encryption

All traffic in transit uses TLS 1.3. TLS 1.0/1.1/1.2 are disabled.
Audit logs are encrypted at rest using AES-256. Log encryption keys are managed separately from gateway operational keys.
Request/response content stored in audit logs (if enabled) is encrypted with per-record keys, limiting the impact of a log store breach.

Auditability

Every request generates an audit record with: caller identity, timestamp (nanosecond precision), model requested, model served, token counts, cost, policy decisions, response HTTP status, and a SHA-256 hash of the request body.
Audit logs are written to an append-only store (S3 Object Lock, Azure Immutable Blob Storage) with a minimum 7-year retention for regulated entities.

OWASP LLM Top 10 Coverage (2023 list)

OWASP LLM Risk	Gateway Mitigation	Coverage
LLM01: Prompt Injection	Prompt Firewall (SEC002) inline at gateway; pattern and semantic analysis	High
LLM02: Insecure Output Handling	Output Filter (SEC006) inspects all responses before delivery	High
LLM03: Training Data Poisoning	Out of scope for gateway (training-time control); gateway logs anomalous output patterns for investigation	Low
LLM04: Model Denial of Service	Rate limiting per caller and globally; circuit breaker prevents provider overload from cascading	High
LLM05: Supply Chain Vulnerabilities	Provider allow-list enforced at routing layer; only approved providers are routable	Medium
LLM06: Sensitive Information Disclosure	Output Filter detects PII in responses; input sanitisation prevents PII from entering prompts	High
LLM07: Insecure Plugin Design	Secure Tool Invocation pattern (SEC004) enforced as a gateway policy for agent tool calls	Medium
LLM08: Excessive Agency	Human approval gates can be enforced at gateway for high-risk request types	Medium
LLM09: Overreliance	Out of scope for gateway; addressed in application layer	None
LLM10: Model Theft	Model provider credentials protected in vault; no credential exposure via gateway APIs	High

9. Governance Considerations

Responsible AI

The gateway is the enforcement point for the organisation's AI Acceptable Use Policy. Policy rules in the policy store codify the AUP into enforceable controls.
Every AI interaction is logged with sufficient context to support post-hoc review of AI-assisted decisions — a core requirement of responsible AI frameworks.

Model Risk Management

The gateway's routing rules enforce which models may be used for which use cases. High-risk use cases (credit decisions, medical triage) can be restricted to approved, validated models only.
Model version pinning at the gateway ensures that model updates do not reach production applications without going through the change management process.

Human Approval Gates

The policy engine can require human approval for request types flagged as high-risk (e.g., requests to execute code, send communications, or modify records). Human approval workflows are triggered via an integration with the organisation's ITSM platform.

Policy Management

AI usage policies are maintained as code (OPA Rego or Cedar policies) in a version-controlled repository. Changes undergo PR review, automated policy testing, and staged rollout through gateway environments (dev → staging → production).

Traceability

Every policy decision is logged with the rule ID that triggered it, enabling governance teams to audit which policies are most frequently triggered, identify policy gaps, and demonstrate regulatory compliance.

Governance Artefacts

Artefact	Owner	Frequency	Purpose
AI Usage Policy (OPA/Cedar)	AI Governance Team	Reviewed quarterly; updated as needed	Codifies AUP into enforceable gateway rules
Model Access Control List	AI Platform Team	Updated with each new model onboarding	Defines which teams may use which models
Audit Log Export	Compliance Team	Monthly extract; on-demand for incidents	Regulatory evidence; incident investigation
Cost Allocation Report	Finance + AI Platform	Monthly	AI spend governance; budget vs actuals
Policy Violation Report	Security Operations	Weekly	Identifies abuse patterns; tuning of policy rules
Circuit Breaker Runbook	AI Platform / SRE	Reviewed after each provider incident	Operational response to provider degradation

10. Operational Considerations

Monitoring

Gateway metrics must be collected at sub-second granularity: request rate, error rate, p50/p95/p99 latency per provider, token throughput, quota utilisation per caller, circuit breaker state, and cost rate.
Dashboards provide both real-time operational view (SRE) and 30-day trend view (governance).

SLOs

SLO	Target	Measurement Method
Gateway availability	99.95%	Synthetic health checks from all availability zones every 30s
Request latency added by gateway (p99)	<10ms (excluding model latency)	Distributed trace: gateway entry → provider forward timestamp
Authentication success rate	>99.9%	1 − (count of 401s / total requests)
Policy decision latency (p99)	<2ms	Internal span: policy_engine_start → policy_engine_end
Audit log write durability	100% (zero lost records)	Log record count reconciliation; dead-letter queue for failed writes
Circuit breaker false positive rate	<0.1%	Manual review of circuit open events

Logging

Structured JSON logs. Mandatory fields: trace_id, span_id, caller_id, model_requested, model_served, request_tokens, response_tokens, cost_usd, policy_decision, http_status, latency_ms, timestamp_utc.
Log level INFO for all requests; WARN for policy violations; ERROR for auth failures and circuit breaker events; AUDIT for all request/response pairs (separate immutable log stream).

Incident Management

P1: Gateway unavailable — all AI workloads impacted. Pager alert to AI Platform SRE + escalation to Architecture owner within 5 minutes.
P2: Provider circuit open with no healthy fallback. Pager alert; initiate fallback provider activation.
P3: Sustained rate of policy violations (>1% of requests). Alert to Security Operations for investigation.

DR

Scenario	RTO	RPO	Recovery Approach
Single gateway instance failure	30s	0 (stateless data plane)	Load balancer removes unhealthy instance; autoscaling adds replacement
Redis quota store failure	5min	Accept brief over-quota traffic	Fail-open mode: allow traffic with alert; quota store cluster with automatic failover
Vault unavailable	2min	0	Gateway fails closed (no credentials = no traffic); vault HA cluster
Full gateway region failure	15min	0	Active-active multi-region deployment; Route 53/Azure Traffic Manager DNS failover

Capacity

Gateway is stateless in the data plane (policy decisions made against in-process cache). Scale horizontally with demand.
Redis quota store: size for peak token rate × TTL as an upper bound. At 10M tokens/minute with a 1-minute rolling window, even one 20-byte counter per token gives ~10M counters × 20 bytes = ~200MB, comfortably within Redis memory limits; per-caller counters need far less.
Provision for 3× normal peak to absorb burst without autoscaling lag.

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Relative Impact
Compute (gateway instances)	CPU/memory for request processing, policy evaluation, TLS termination	Medium
Redis (quota store)	Managed Redis cluster for rate limiting state	Low
Vault (secrets management)	HashiCorp Vault Enterprise or cloud-native equivalent	Low–Medium
Log storage (audit logs)	Immutable log storage for 7 years; grows linearly with request volume	Medium (long-term)
Egress (model provider calls)	Dominates total cost; gateway adds ~0.1% overhead per request	Low (gateway-specific)
Engineering (operations)	SRE time to operate, tune, and evolve the gateway	Medium

Scaling Risks

Audit log storage grows unboundedly. Implement tiered storage (hot → warm → cold → archive) with automated lifecycle policies.
Redis memory pressure at extreme token volumes. Use token bucket algorithm with decay to limit state size.
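
A minimal in-memory sketch of a token bucket follows, to show why its state stays small: each caller needs only a token balance and a timestamp, and the balance is topped up lazily when the caller is next seen. The names are hypothetical, "decay" is read here as that lazy refill, and a shared deployment would have to apply the same update atomically in the quota store.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float           # maximum burst, in LLM tokens
    refill_per_second: float  # sustained rate
    tokens: float             # current balance
    updated_at: float         # time of the last refill, in seconds


def try_consume(bucket: TokenBucket, amount: float, now: float) -> bool:
    """Refill for the elapsed time, then spend `amount` if the balance allows it."""
    elapsed = max(0.0, now - bucket.updated_at)
    bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.refill_per_second)
    bucket.updated_at = now
    if amount <= bucket.tokens:
        bucket.tokens -= amount
        return True
    return False   # caller would receive 429
```

Unlike a per-window counter, no per-period keys accumulate: an idle caller's bucket simply refills to capacity and can be evicted.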

Optimisations

Cache policy decisions for stable caller+model combinations (1-minute TTL) to avoid OPA evaluation on every request.
Use spot/preemptible instances for non-stateful gateway replicas (failover to on-demand automatically).
Compress audit logs before writing to object storage (LZ4/Zstandard): typically a 70–80% size reduction on structured JSON.

Indicative Cost Range

Scale	Monthly AWS Cost (USD)	Notes
Small (< 1M requests/day)	$500–$1,500	2 ECS Fargate tasks, ElastiCache t3.medium, CloudWatch Logs
Medium (1M–50M requests/day)	$2,000–$8,000	4–8 ECS tasks, ElastiCache r6g.large cluster, S3 immutable logs
Large (> 50M requests/day)	$15,000–$40,000	EKS cluster, ElastiCache r6g.4xlarge, dedicated log pipeline

These figures cover gateway infrastructure only. Model provider API costs are the dominant expenditure and are not included.

12. Trade-Off Analysis

Option Comparison

Option	Description	Pros	Cons	Best For
A: Build custom gateway	Develop gateway as an internal service using Envoy or Kong as a base	Full control; can add AI-specific features; no vendor lock-in	High development and maintenance cost; requires specialist expertise	Large enterprises with unique AI governance requirements
B: Cloud-native AI gateway	Use Azure APIM + Azure AI Content Safety, or AWS API Gateway + Bedrock Guardrails	Low operational overhead; native integration with cloud AI services; managed SLA	Vendor lock-in; limited multi-cloud support; less flexible policy engine	Organisations committed to a single cloud provider
C: Commercial or open-source AI gateway product	LiteLLM Proxy (open-source), Portkey, Martian, or similar	Purpose-built for LLM use cases; fast time-to-value; community support	Less mature enterprise features; vendor viability risk; may not meet all compliance requirements	Mid-market organisations; teams needing quick deployment
D: Service mesh with AI extensions	Extend existing Envoy-based service mesh (Istio/Consul) with AI-specific filters	Reuses existing investment; consistent with microservices observability	Significant customisation required; LLM-specific features (token counting, streaming) require custom WASM/Lua filters	Organisations with mature service mesh already deployed

Architectural Tensions

Tension	Trade-Off
Security vs Latency	Every security check (auth, policy, prompt firewall) adds latency. Target: <10ms gateway overhead. Achieve through in-process caching, async audit logging, and hardware-accelerated TLS.
Observability vs Privacy	Full request/response logging maximises audit capability but risks logging sensitive data. Resolution: log content hashes by default; full content logging opt-in per data classification level, with field-level redaction.
Centralisation vs Resilience	A gateway is a single logical control point; if poorly designed, it becomes a single point of failure. Resolution: active-active multi-region deployment; fail-open for quota (not for auth) to maintain availability.
Policy Strictness vs Developer Productivity	Overly strict policies block legitimate use; overly permissive policies defeat the purpose. Resolution: graduated enforcement (warn → soft-block → hard-block) with developer-visible explanations.
Cost Visibility vs Performance	Fine-grained cost tagging (per-request, per-user) requires token counting and cost ledger writes on every request. Resolution: async cost event writes to a queue; batch persist to ledger every 5 seconds.
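
The asynchronous cost-write resolution in the last row can be sketched with a queue and a periodic flusher. `CostEventBatcher`, `run_flusher`, and the `persist` callback are hypothetical names; `persist` stands in for whatever writes a batch to the cost ledger, and the in-process queue here would lose unflushed events on a crash.

```python
import queue
import threading
from typing import Callable


class CostEventBatcher:
    """Request path only enqueues; a background loop persists events in batches."""

    def __init__(self, persist: Callable[[list], None], max_batch: int = 1000):
        self._queue = queue.Queue()
        self._persist = persist
        self._max_batch = max_batch

    def submit(self, event: dict) -> None:
        self._queue.put_nowait(event)   # no ledger I/O on the request path

    def flush(self) -> int:
        """Persist up to max_batch queued events; return how many were written."""
        batch = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._persist(batch)
        return len(batch)


def run_flusher(batcher: CostEventBatcher, stop: threading.Event,
                interval_seconds: float = 5.0) -> None:
    """Flush every interval until `stop` is set, then drain what is left."""
    while not stop.wait(interval_seconds):
        batcher.flush()
    while batcher.flush():
        pass
```

`run_flusher` is intended to run in its own thread, for example `threading.Thread(target=run_flusher, args=(batcher, stop), daemon=True).start()`.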

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Gateway instance crash	Low	High (if single instance)	Load balancer health check failure → alert	Autoscaling replaces instance; deploy minimum 3 instances in production
Redis quota store timeout	Medium	Medium (brief over-quota traffic)	Latency spike on rate limit check → alert	Fail-open for quota; Redis Sentinel/Cluster for HA
Vault unreachable	Low	Critical (all traffic blocked)	503 spike → critical alert	Vault HA cluster; cached credentials TTL 5min as emergency fallback
Policy store stale	Medium	Medium (stale policy decisions)	Policy cache age metric → alert	Cache TTL 60s; background refresh; explicit invalidation API
Prompt firewall false positive rate spike	Medium	High (legitimate traffic blocked)	400 rate spike from prompt firewall → alert	Tuning runbook; emergency bypass flag per caller (audited)
Audit log write failure	Low	Critical (regulatory compliance gap)	Dead-letter queue depth > 0 → critical alert	Retry with exponential backoff; dead-letter queue with separate drain process
TLS certificate expiry	Low	Critical (all traffic blocked)	Certificate expiry monitoring → 30-day warning	Automated certificate rotation via cert-manager or ACM
Model provider mass outage	Medium	High	Circuit breaker opens for multiple providers simultaneously	Fallback to on-premises model; queue non-urgent requests; alert users
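
The contrast between the fail-open quota store and the fail-closed vault in the table above can be sketched as two small wrappers. The function names, the `ServiceUnavailable` exception, and the choice of `ConnectionError` and `TimeoutError` as the failure signals are assumptions for illustration; the cached-credential fallback is not modelled.

```python
from typing import Callable


class ServiceUnavailable(Exception):
    """Raised when the gateway must reject the request with HTTP 503."""


def check_quota_fail_open(check: Callable[[str], bool], caller_id: str,
                          on_store_error: Callable[[str, Exception], None]) -> bool:
    """Quota store outage: allow the request, but raise an alert."""
    try:
        return check(caller_id)
    except (ConnectionError, TimeoutError) as exc:
        on_store_error(caller_id, exc)   # brief over-quota traffic is accepted
        return True


def get_credentials_fail_closed(fetch: Callable[[str], str], provider: str) -> str:
    """Vault outage: never forward a request without valid credentials."""
    try:
        return fetch(provider)
    except (ConnectionError, TimeoutError) as exc:
        raise ServiceUnavailable("credential store unavailable") from exc
```

The asymmetry is deliberate: a missed quota check costs money for a short period, while forwarding traffic without a working credential path has no safe outcome.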

Cascading Failure Scenarios

Scenario 1: Vault + Redis simultaneous failure

If both Vault (credential store) and Redis (quota store) fail simultaneously, the gateway can neither retrieve credentials nor enforce quotas. The gateway must fail closed (return 503): failing open on quota while Vault is down would let unlimited, unmetered requests reach providers on any credentials still held in cache. Mitigation: deploy Vault and Redis on independent infrastructure with no shared failure domain.

Scenario 2: Policy store becomes unavailable during a security incident

If the policy store becomes unavailable at the same time as a security event requiring a policy update (e.g., a compromised caller key), the gateway will continue serving the last cached policy. Mitigation: emergency policy override API that writes directly to the in-memory cache on each gateway instance; secured with break-glass credentials.

14. Regulatory Considerations

Regulation	Requirement	Gateway Implementation
APRA CPS234 (Information Security)	Maintain information security controls for third-party service providers	Model provider access through gateway enforces ACL; audit trail demonstrates access governance
APRA CPS230 (Operational Risk)	Identify and manage risks from third-party dependencies	Circuit breaker provides operational resilience; provider health metrics enable risk monitoring
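Australian Privacy Act 1988 (APP 8)	Before disclosing personal information to an overseas recipient, take reasonable steps to ensure the recipient handles it consistently with the APPs, unless an exception such as informed consent applies	Data classification enforcement in policy engine blocks requests containing PI from routing to non-approved providers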
EU AI Act Article 9 (Risk Management)	High-risk AI systems must implement risk management measures	Gateway enforces model access controls for high-risk use cases; audit log supports risk documentation
EU AI Act Article 12 (Record-Keeping)	High-risk AI systems must maintain logs enabling post-hoc audit	Immutable audit log with 7-year retention supports this requirement
ISO/IEC 42001 §6.1 (Risk Treatment)	Implement controls for identified AI risks	Gateway operationalises risk treatment actions from AI risk register
NIST AI RMF GOVERN 1.2	Accountability mechanisms for AI systems	Caller identity + audit log creates clear accountability chain for every AI request
NIST AI RMF MANAGE 2.4	Monitor AI system performance	Gateway metrics and alerts implement continuous AI performance monitoring
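
The Privacy Act row describes the policy engine blocking requests that contain personal information from reaching non-approved providers. A minimal sketch of that check follows. The classification labels, provider names, and the `may_route` function are hypothetical, and a production policy engine would more likely express this rule in its own policy language than in application code.

```python
# Hypothetical mapping from data classification to the providers approved
# to receive data of that classification.
APPROVED_PROVIDERS: dict[str, set[str]] = {
    "public": {"provider-overseas", "provider-onshore"},
    "personal_information": {"provider-onshore"},
}


def may_route(classification: str, provider: str) -> bool:
    """Return True only if the provider is approved for this classification."""
    # Unknown classifications map to an empty set, so they are denied.
    return provider in APPROVED_PROVIDERS.get(classification, set())
```

Under this mapping, `may_route("personal_information", "provider-overseas")` is `False`, so the gateway would reject or re-route the request before it leaves the approved boundary.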

15. Reference Implementations

AWS

Component	AWS Service
Gateway compute	ECS Fargate (Kong or custom Go service) or API Gateway with Lambda authoriser
Auth	Cognito (IdP) + Lambda JWT authoriser
Policy engine	Lambda function hosting OPA with S3 policy bundle
Rate limiting	ElastiCache for Redis (token bucket counters)
Secrets	AWS Secrets Manager with automatic rotation
Routing	Application Load Balancer + ECS service discovery
Audit logs	Kinesis Firehose → S3 with Object Lock (WORM)
Cost tracking	Custom Lambda → Cost and Usage Report + Athena
Monitoring	CloudWatch + X-Ray distributed tracing

Azure

Component	Azure Service
Gateway	Azure API Management (APIM) with custom policies
Auth	Microsoft Entra ID (formerly Azure AD) + APIM JWT validation policy
Policy engine	APIM policy expressions + Azure Functions for complex rules
Rate limiting	APIM built-in rate limiting + Azure Cache for Redis
Secrets	Azure Key Vault with managed identity
Routing	APIM backends + Azure Application Gateway
Audit logs	Event Hubs → Azure Blob Storage with immutability policy (WORM)
Content safety	Azure AI Content Safety (integrates natively with APIM)
Monitoring	Azure Monitor + Application Insights

GCP

Component	GCP Service
Gateway	Apigee API Management or Cloud Run (Kong/Envoy)
Auth	Google Identity Platform + Cloud IAP
Policy engine	Cloud Run (OPA) with Cloud Storage policy bundles
Rate limiting	Memorystore for Redis
Secrets	Secret Manager
Audit logs	Cloud Logging → Cloud Storage with retention lock
Monitoring	Cloud Monitoring + Cloud Trace

On-Premises

Component	Technology
Gateway	Kong Enterprise or Envoy Proxy with custom filters
Auth	Active Directory Federation Services + OAuth2 Proxy
Policy engine	OPA deployed as sidecar or standalone service
Rate limiting	Redis Sentinel cluster
Secrets	HashiCorp Vault Enterprise
Audit logs	Kafka → Elasticsearch with ILM immutability policy
Monitoring	Prometheus + Grafana + Jaeger

16. Related Patterns

Pattern	ID	Relationship
Prompt Firewall	EAAPL-SEC002	Deployed inline within gateway; gateway calls firewall as a filter stage
LLM Input Sanitisation	EAAPL-SEC005	Complementary to prompt firewall; deeper PII/schema validation within gateway pipeline
AI Output Filtering	EAAPL-SEC006	Deployed as post-generation filter within gateway; shares audit log infrastructure
Zero-Trust AI Pipeline	EAAPL-SEC007	Gateway is the primary enforcement point for zero-trust policy; SEC007 extends to intra-pipeline trust
Secrets Management for AI	EAAPL-SEC008	Gateway depends on this pattern for all model provider credentials
AI Data Classification	EAAPL-SEC009	Classification labels consumed by gateway policy engine for routing decisions
AI Telemetry	EAAPL-OBS001	Gateway is the primary source of AI telemetry (token counts, latency, errors)
AI Cost Observability	EAAPL-OBS006	Gateway's cost ledger is the primary data source for cost observability
Model Isolation	EAAPL-SEC003	Gateway enforces network boundaries that complement model isolation at the compute layer

17. Maturity Assessment

Overall Maturity: Mature

Dimension	Score (1–5)	Rationale
Pattern definition clarity	5	Well-defined, unambiguous scope and responsibilities
Technology availability	5	Mature OSS and commercial options available across all major clouds
Industry adoption	4	Widely adopted in financial services and regulated industries; emerging in other sectors
Operational tooling	4	Strong monitoring and operations tooling; some AI-specific metrics require custom implementation
Regulatory alignment	5	Directly addresses APRA CPS234, EU AI Act, Privacy Act requirements
Reference implementation availability	4	Reference implementations available for all major clouds; AI-specific extensions require custom work
Community knowledge	4	Strong API gateway community; LLM-specific extensions are an emerging body of knowledge

18. Revision History

Version	Date	Author	Changes
1.0	2024-01-15	AI Architecture Team	Initial pattern definition
1.1	2024-04-20	AI Architecture Team	Added EU AI Act regulatory mapping; expanded DR scenarios
2.0	2024-09-10	AI Architecture Team	Major revision: added streaming support guidance; updated OWASP LLM Top 10 to 2024 edition; added GCP reference implementation
2.1	2025-03-01	AI Architecture Team	Added cost observability integration; expanded failure mode analysis; aligned with ISO 42001 §6.1
