EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-PLT004] LLM Cost Control

Category: Platform Engineering
Sub-category: FinOps / Cost Management
Version: 1.2
Maturity: Proven
Tags: finops, cost-management, token-budget, prompt-caching, model-tier-routing, cost-alerting, chargeback, spending-dashboards
Regulatory Relevance: APRA CPS 230 (Operational Risk: cost controls), ISO 42001


1. Executive Summary

LLM inference costs have a dangerous property: they scale with usage in ways that often stay invisible until the cloud bill arrives. A single poorly scoped prompt with an unbounded context window can consume more compute in one request than an hour of traditional API calls. Without systematic controls, one runaway AI feature or misconfigured pipeline can generate tens of thousands of dollars in unexpected spend within hours.

This pattern establishes a comprehensive cost control framework covering the full lifecycle of an LLM request: upfront budget enforcement (token limits per request, per consumer, per time period), intelligent routing to cost-appropriate model tiers, prompt caching to eliminate redundant computation, batch versus real-time optimisation, and real-time spend alerting with dashboard visibility for FinOps and engineering leadership. Organisations that implement this pattern systematically report a 40–60% reduction in LLM spend compared with an unmanaged baseline while maintaining feature quality, so that AI investment scales with genuine business value rather than inefficiency.


2. Problem Statement

Business Problem

LLM API costs appear as undifferentiated cloud charges with no attribution to products, teams, or decisions. When spend spikes, root cause analysis takes days. Budget sign-off for AI initiatives is difficult because cost projections are unreliable. AI spend is growing faster than business value in organisations without controls, triggering executive concern about AI investment ROI.

Technical Problem

Individual LLM requests have highly variable token consumption based on prompt construction, context window usage, and response length. Without per-request token limits, a buggy prompt template can send 100K-token requests when 2K was intended. Without model tier routing, all requests use frontier model pricing. Without caching, identical or near-identical prompts are computed fresh on every call. Without budget enforcement, a single batch job can exhaust a monthly budget.

Symptoms

  • Monthly AI cloud bills with variance >50% month-to-month without corresponding business activity change
  • No ability to attribute AI spend to individual products, teams, or features
  • AI cost anomalies discovered only retrospectively, when the bill arrives, rather than through real-time alerts
  • All AI traffic routed to the most expensive model regardless of task requirements
  • Identical FAQ-style prompts computed fresh on every call with no caching

Cost of Inaction

  • AI spend growing to unsustainable levels, threatening AI investment programme shutdown
  • Executive loss of confidence in AI ROI due to uncontrolled cost growth
  • Inability to negotiate volume discounts with providers without consolidated spend data
  • Cross-team cost externalities: one team's runaway workload depletes the shared token budget for all teams

3. Context

When to Apply

  • Organisation's monthly AI API spend exceeds $5,000 or is projected to exceed this within 3 months
  • Multiple teams or use cases share AI infrastructure without cost isolation
  • FinOps team requires per-team or per-product cost attribution
  • AI cost efficiency is an explicit KPI for the AI programme

When NOT to Apply

  • Single small-scale proof of concept: overhead of full cost control not warranted
  • Single team with a single predictable, fixed-cost workload: direct budget monitoring sufficient
  • Air-gapped self-hosted deployments with no per-token cost: infrastructure cost management applies instead

Prerequisites

  • AI API Gateway (PLT002) as enforcement point for budget controls
  • Cost allocation taxonomy agreed between FinOps and engineering (team/product/environment dimensions)
  • Observability stack for real-time cost event ingestion
  • Stakeholder agreement on what constitutes a budget threshold and escalation path

Industry Applicability

| Industry | Applicability | Key Cost Driver |
| --- | --- | --- |
| Technology / SaaS | Very High | AI features at scale; customer-facing token consumption |
| Retail / E-commerce | Very High | Product descriptions, search, personalisation at catalog scale |
| Financial Services | High | Research automation, document processing, customer service |
| Healthcare | High | Clinical documentation, patient communication at volume |
| Media / Content | Very High | Content generation, summarisation, moderation at scale |
| Government | Medium | Document processing; typically lower volume |

4. Architecture Overview

The LLM Cost Control pattern operates across three time horizons: per-request controls that enforce hard limits on individual calls, per-period budget controls that enforce cumulative spending limits over time windows (daily/weekly/monthly), and strategic optimisations that systematically reduce the per-token cost of all traffic.

Per-Request Token Budget Enforcement is the first line of defence. Every request entering the AI API Gateway is evaluated for its estimated token consumption. The max_tokens parameter is enforced as a hard ceiling; requests without an explicit max_tokens receive a platform default (configurable per model tier and use case). Input token limits per request prevent context window abuse: a request exceeding the configured input token limit for its use case classification is rejected with a 413 response and a recommendation to use the batch API instead. This single control eliminates the most common cause of surprise cost spikes.
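
The per-request checks described above can be sketched as follows. This is a minimal illustration, not the gateway's actual middleware: the `USE_CASE_LIMITS` values, the `Decision` type, and the whitespace tokenizer are all hypothetical stand-ins, and the 400 status for an over-ceiling `max_tokens` is an assumption (the pattern specifies only the 413 for oversized input).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical per-use-case limits; real values would come from platform config.
USE_CASE_LIMITS = {
    "summarisation": {
        "max_input_tokens": 8192,
        "default_max_tokens": 1024,
        "max_tokens_ceiling": 2048,
    },
}


@dataclass
class Decision:
    status: int       # HTTP status the gateway would return (200 = forward upstream)
    max_tokens: int   # output ceiling forwarded upstream (0 when rejected)
    detail: str = ""


def rough_token_count(text: str) -> int:
    """Stand-in tokenizer (whitespace split). A real gateway would use the
    target model's own tokenizer instead."""
    return len(text.split())


def enforce_request_limits(
    prompt: str,
    use_case: str,
    max_tokens: Optional[int] = None,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Decision:
    limits = USE_CASE_LIMITS[use_case]

    # Input ceiling: reject oversized prompts with 413 and the token count.
    input_tokens = count_tokens(prompt)
    if input_tokens > limits["max_input_tokens"]:
        return Decision(413, 0, f"input_tokens={input_tokens} > {limits['max_input_tokens']}")

    # Missing max_tokens: apply the platform default for this use case.
    if max_tokens is None:
        return Decision(200, limits["default_max_tokens"])

    # Explicit max_tokens above the ceiling is rejected (status code assumed).
    if max_tokens > limits["max_tokens_ceiling"]:
        return Decision(400, 0, f"max_tokens={max_tokens} > {limits['max_tokens_ceiling']}")

    return Decision(200, max_tokens)
```

Passing the tokenizer in as `count_tokens` keeps the check independent of any one provider's tokenizer, which matters when the same gateway fronts several model families.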

Consumer and Team Budget Tracking maintains real-time token consumption counters per consumer, team, project, and environment. These counters are maintained in a Redis data structure (sorted sets for time-windowed aggregation) and updated atomically on every request completion. Budget thresholds are configured at multiple levels: a soft warning threshold (80% of period budget consumed → alert to team lead), a hard throttle threshold (100% → requests rate-limited to a configured percentage), and an emergency ceiling (110% → requests blocked entirely until human approval to extend). The tiered response prevents hard stops from creating operational incidents while still enforcing accountability.
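
The tiered response can be expressed as a small pure function. The sketch below assumes the 80% / 100% / 110% thresholds stated above; the `BudgetAction` enum and the function name are illustrative, not part of the pattern.

```python
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"          # alert the team lead; request continues
    THROTTLE = "throttle"  # rate-limit to a configured percentage
    BLOCK = "block"        # blocked until a human approves an extension


def evaluate_budget(
    used_tokens: int,
    period_budget: int,
    warn_at: float = 0.80,
    throttle_at: float = 1.00,
    block_at: float = 1.10,
) -> BudgetAction:
    """Map period consumption onto the tiered enforcement response."""
    if period_budget <= 0:
        raise ValueError("period_budget must be positive")
    ratio = used_tokens / period_budget
    # Check the most severe tier first so each ratio maps to exactly one action.
    if ratio >= block_at:
        return BudgetAction.BLOCK
    if ratio >= throttle_at:
        return BudgetAction.THROTTLE
    if ratio >= warn_at:
        return BudgetAction.WARN
    return BudgetAction.ALLOW
```

For example, a team that has used 9.6M of a 12M monthly budget sits at exactly 80% and receives `BudgetAction.WARN`, while the request itself still proceeds.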

Model Tier Routing (see EAAPL-PLT003 for full treatment) is the largest lever for strategic cost reduction. The cost control layer maintains a cost model for each available model endpoint (cost per 1K input tokens, cost per 1K output tokens) and uses this in conjunction with the routing strategy to route each request to the cheapest model meeting the quality requirement. The cost model is updated automatically from provider pricing APIs where available. A/B routing experiments track cost efficiency alongside quality to inform routing policy updates.
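
A minimal sketch of "cheapest model meeting the quality requirement" follows. The `quality_tier` ordinal, the model names, and the per-1K prices are illustrative assumptions; how quality is actually assessed is covered by EAAPL-PLT003 and is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelCost:
    name: str
    quality_tier: int          # hypothetical ordinal: higher = more capable
    usd_per_1k_input: float
    usd_per_1k_output: float


def estimate_cost(m: ModelCost, input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD from the per-1K-token cost model."""
    return (input_tokens / 1000) * m.usd_per_1k_input + (output_tokens / 1000) * m.usd_per_1k_output


def cheapest_adequate(
    models: List[ModelCost], required_tier: int, input_tokens: int, expected_output_tokens: int
) -> ModelCost:
    """Pick the lowest-cost model whose tier meets the requirement."""
    candidates = [m for m in models if m.quality_tier >= required_tier]
    if not candidates:
        raise LookupError(f"no model satisfies tier {required_tier}")
    return min(candidates, key=lambda m: estimate_cost(m, input_tokens, expected_output_tokens))


# Illustrative cost model (prices are placeholders, not provider list prices).
MODELS = [
    ModelCost("efficiency-model", quality_tier=1, usd_per_1k_input=0.0001, usd_per_1k_output=0.0002),
    ModelCost("frontier-model", quality_tier=2, usd_per_1k_input=0.003, usd_per_1k_output=0.015),
]
```

Ranking by estimated request cost rather than by input price alone matters for output-heavy workloads, where the output rate dominates.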

Prompt Caching operates at two levels. Provider-side prompt caching (supported by Anthropic Claude and OpenAI) caches the KV computation for prompt prefixes at the model provider level; this requires structuring prompts with stable system prompt prefixes at the beginning of the context. Platform-side semantic caching (PLT006) caches full responses for near-identical prompts at the gateway level. Both mechanisms reduce effective token consumption; the platform cost model tracks cache hit rates and attributable savings separately so the value of caching investment is visible.
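
Tracking hit rates and attributable savings separately per cache layer might look like the sketch below. The `CacheStats` class is hypothetical, and the default `discount` of 0.9 is an assumption about how much cheaper cached input tokens are; actual provider discounts vary and should be taken from the cost model. For simplicity the sketch counts only input-token savings, although a semantic cache hit also avoids output tokens.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    requests: int = 0
    hits: int = 0
    saved_usd: float = 0.0

    def record(self, tokens_served_from_cache: int, usd_per_1k_input: float,
               discount: float = 0.9) -> None:
        """Record one request; a non-zero cached token count is a hit."""
        self.requests += 1
        if tokens_served_from_cache > 0:
            self.hits += 1
            # Savings = tokens that avoided full price x price x assumed discount.
            self.saved_usd += tokens_served_from_cache / 1000 * usd_per_1k_input * discount

    @property
    def hit_rate(self) -> float:
        return self.hits / self.requests if self.requests else 0.0


# One tracker per cache layer so each layer's savings stay separately visible.
stats = {"provider": CacheStats(), "semantic": CacheStats()}
stats["provider"].record(tokens_served_from_cache=1800, usd_per_1k_input=0.0001)
stats["provider"].record(tokens_served_from_cache=0, usd_per_1k_input=0.0001)
```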

Batch vs. Real-Time Optimisation provides a structural cost reduction for non-interactive workloads. The cost control layer routes requests tagged execution-mode: batch through provider batch APIs (OpenAI Batch API, Anthropic Message Batches), which offer a 50% token cost reduction in exchange for asynchronous completion: results are returned within a window of up to 24 hours rather than in real time. Product teams are guided to tag their use cases appropriately during onboarding; the developer portal surfaces the cost differential to encourage correct classification.
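
The tag-based routing decision is simple enough to sketch directly. The function name, the path labels, and the `batch_available` flag are hypothetical; the 0.5 multiplier reflects the 50% batch discount mentioned above, and the fallback branch mirrors the error flow in which a batch-tagged request is served in real time at full price.

```python
from typing import Dict, Tuple


def select_execution_path(tags: Dict[str, str], batch_available: bool = True) -> Tuple[str, float]:
    """Return (path, price_multiplier) for a request based on its tags."""
    if tags.get("execution-mode") == "batch":
        if batch_available:
            return ("batch", 0.5)            # discounted, asynchronous completion
        return ("realtime-fallback", 1.0)    # full price; should be logged for review
    return ("realtime", 1.0)


path, multiplier = select_execution_path({"execution-mode": "batch", "team": "marketing"})
```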

Cost Alerting and Dashboards provide the operational visibility layer. Real-time cost events from all requests are streamed to the Cost Management Service, which aggregates by team/product/environment dimensions and evaluates against configured budget thresholds. Alerts are delivered via PagerDuty (emergency), Slack (warning), and email (daily digest). The FinOps dashboard (Grafana or Superset) provides spend-by-team, spend-by-model, cache savings, and projection-to-period-end views.
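
As a rough illustration of the projection-to-period-end view and the channel mapping, the sketch below uses a linear run-rate extrapolation. That projection method is an assumption (the pattern does not prescribe one), and the `ALERT_CHANNELS` keys are illustrative labels for the three delivery paths named above.

```python
from typing import Dict

# Delivery channel per alert severity, as listed in the pattern text.
ALERT_CHANNELS: Dict[str, str] = {
    "emergency": "pagerduty",
    "warning": "slack",
    "digest": "email",
}


def project_period_end(spend_to_date_usd: float, days_elapsed: int, days_in_period: int) -> float:
    """Linear run-rate projection: average daily spend x days in the period."""
    if days_elapsed <= 0 or days_in_period < days_elapsed:
        raise ValueError("need 0 < days_elapsed <= days_in_period")
    return spend_to_date_usd / days_elapsed * days_in_period


# A team that has spent $3,000 after 10 days of a 30-day period
# is on course for $9,000 at the current run rate.
projected = project_period_end(3000.0, 10, 30)
```

A run-rate projection overreacts early in the period, when one unusual day dominates the average, so a dashboard would likely show it alongside the raw spend rather than alert on it directly.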


5. Architecture Diagram

flowchart TD
    subgraph Enforcement["Request Enforcement"]
        A[Incoming Request]
        B[Token Limit Check]
        C{Budget Tracker}
    end
    subgraph Routing["Cost-Aware Routing"]
        D[Model Tier Router]
        E[Prompt Cache Check]
    end
    subgraph Models["Model Endpoints"]
        F[Efficiency Model]
        G[Frontier Model]
        H[Batch API]
    end
    A --> B
    B --> C
    C -->|within budget| D
    C -->|over budget| I[Block + Alert]
    D --> E
    E -->|cache miss| F
    E -->|complex task| G
    E -->|batch tag| H
    F --> J[(Token Counter)]
    G --> J
    H --> J
    J --> K[FinOps Dashboard]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#dbeafe,stroke:#3b82f6
    style H fill:#dbeafe,stroke:#3b82f6
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#fef9c3,stroke:#eab308
    style K fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Input Token Limit Enforcer | Middleware | Validate max_tokens parameter; enforce input token ceiling per use case | Custom gateway middleware; token counting library (tiktoken) | High |
| Consumer Budget Tracker | Service | Maintain real-time token consumption counters per consumer/team/period | Redis sorted sets (ZADD/ZRANGEBYSCORE for time windows) | Critical |
| Budget Threshold Evaluator | Service | Evaluate thresholds; trigger warnings and blocks | Custom service backed by Redis | Critical |
| Cost Model Store | Service | Maintain per-model pricing data; update from provider pricing APIs | Redis hash or PostgreSQL table | High |
| Model Tier Router | Service | Select cheapest adequate model for request (see PLT003) | LiteLLM cost-based routing, custom rule engine | Critical |
| Provider Prompt Cache Manager | Service | Structure prompts for provider-side KV cache; track cache hit rates | Custom, provider SDK integration | High |
| Semantic Cache (Platform-Side) | Service | Cache full responses for near-identical prompts (see PLT006) | GPTCache, Redis + vector index | High |
| Batch Route Classifier | Service | Classify requests as batch-eligible based on execution mode tag | Custom rule-based classifier | Medium |
| Cost Event Publisher | Service | Emit per-request cost events for aggregation | Kafka producer, CloudWatch PutMetricData | Critical |
| Alert Engine | Service | Evaluate budget thresholds; dispatch alerts | PagerDuty, Slack webhook, email (SES/Sendgrid) | High |
| Cost Dashboard | Service | Real-time and historical spend visualisation | Grafana, Apache Superset, PowerBI | Medium |
| Chargeback Report Generator | Service | Monthly per-team cost attribution reports | Custom SQL on cost events, Metabase | Medium |
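
The time-windowed aggregation that the Consumer Budget Tracker performs with sorted sets can be illustrated without a Redis server. The class below is an in-memory stand-in, not Redis client code: `add` plays the role of ZADD (timestamp as score) and `total` the role of a ZRANGEBYSCORE range query followed by a sum. Class and method names are hypothetical. In real Redis, sorted-set members must be unique, so each member would typically encode a request ID alongside the token count.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class WindowedTokenCounter:
    """Per-consumer token events kept sorted by timestamp."""

    def __init__(self) -> None:
        self._events: DefaultDict[str, List[Tuple[float, int]]] = defaultdict(list)

    def add(self, consumer: str, ts: float, tokens: int) -> None:
        # Analogous to ZADD: insert while keeping timestamp order.
        bisect.insort(self._events[consumer], (ts, tokens))

    def total(self, consumer: str, start_ts: float, end_ts: float) -> int:
        # Analogous to ZRANGEBYSCORE start end (inclusive), then summing.
        events = self._events.get(consumer, [])
        lo = bisect.bisect_left(events, (start_ts, float("-inf")))
        hi = bisect.bisect_right(events, (end_ts, float("inf")))
        return sum(tokens for _, tokens in events[lo:hi])


counter = WindowedTokenCounter()
counter.add("marketing", ts=100, tokens=712)
counter.add("marketing", ts=200, tokens=300)
counter.add("marketing", ts=5000, tokens=50)
```

Here `counter.total("marketing", 0, 1000)` sums only the first two events; the third falls outside the window.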

7. Data Flow

Primary Flow — Request with Budget Enforcement

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Consumer Application | Submit request with max_tokens: 2048, use-case: summarisation, team: marketing | Request at gateway cost control stage |
| 2 | Input Token Limit Enforcer | Count input tokens using tiktoken; compare to use-case ceiling (summarisation: 8192 input tokens) | Tokens within limit; proceed |
| 3 | Consumer Budget Tracker | Query Redis for marketing team's tokens used this month vs. monthly budget | Remaining: 2.4M tokens (80% used → warning threshold crossed) |
| 4 | Budget Threshold Evaluator | 80% threshold crossed; emit warning alert to team-marketing Slack channel | Warning alert dispatched; request continues |
| 5 | Cost Model Lookup | Retrieve cost model for routing (illustrative rates): Claude Haiku ($0.0001/1K input, $0.0002/1K output) vs. Claude Sonnet ($0.003/1K input, $0.015/1K output) | Cost delta available for routing decision |
| 6 | Model Tier Router | Complexity LOW; select Claude Haiku (cost-based); circuit breaker CLOSED | Selected: Claude Haiku |
| 7 | Provider Prompt Cache Check | Check if prompt prefix is in Anthropic KV cache; cache HIT | Provider cache hit; 90% of prompt tokens not charged |
| 8 | Upstream Call | Forward to Claude Haiku; cache hit reduces effective input tokens | Response returned; actual billed tokens: ~200 (uncached suffix) |
| 9 | Cost Event Publish | Emit cost event: {team: marketing, model: claude-haiku, input_tokens: 200 (cache hit), output_tokens: 512, cost_usd: 0.000122} | Cost event in stream |
| 10 | Budget Counter Update | Update Redis counter for marketing team: +712 effective tokens | Counter updated atomically |
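
The figures in steps 9 and 10 can be checked with a short worked calculation. The helper name is hypothetical and the rates are the ones quoted in step 5 of the flow, taken at face value.

```python
def request_cost_usd(billed_input_tokens: int, output_tokens: int,
                     usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Cost of one request from billed tokens and per-1K-token rates."""
    return (billed_input_tokens / 1000) * usd_per_1k_input \
        + (output_tokens / 1000) * usd_per_1k_output


# Step 9: 200 billed input tokens and 512 output tokens at the step 5 rates.
#   input:  200 / 1000 * 0.0001 = 0.0000200
#   output: 512 / 1000 * 0.0002 = 0.0001024
#   total:                        0.0001224 (0.000122 at six decimal places)
event = {
    "team": "marketing",
    "model": "claude-haiku",
    "input_tokens": 200,
    "output_tokens": 512,
}
event["cost_usd"] = round(request_cost_usd(200, 512, 0.0001, 0.0002), 6)

# Step 10: the budget counter is incremented by billed input plus output tokens.
effective_tokens = event["input_tokens"] + event["output_tokens"]
```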

Error Flow

| Error Condition | Detection | Response |
| --- | --- | --- |
| Input token count exceeds use-case ceiling | Token counter at step 2 | 413 Request Entity Too Large with token count details |
| Team budget at hard 100% limit | Budget tracker at step 3 | 429 with budget-exhausted code; 24h until reset or manual approval needed |
| Cost model stale/unavailable | Cost model service timeout | Log warning; proceed with routing using last-known-good cost model |
| Batch API unavailable for batch-tagged request | Batch route check failure | Fall back to real-time API; log cost increase for later review |
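
The last-known-good behaviour for a stale or unavailable cost model could be sketched as below. `CostModelStore` here is a hypothetical wrapper around whatever function fetches prices; it returns a staleness flag so the caller can log the warning described in the table.

```python
from typing import Callable, Dict, Optional, Tuple

# model name -> (USD per 1K input tokens, USD per 1K output tokens)
Prices = Dict[str, Tuple[float, float]]


class CostModelStore:
    def __init__(self, fetch: Callable[[], Prices]) -> None:
        self._fetch = fetch
        self._last_good: Optional[Prices] = None

    def get(self) -> Tuple[Prices, bool]:
        """Return (prices, is_stale). Falls back to the last successful fetch."""
        try:
            prices = self._fetch()
        except Exception:
            if self._last_good is None:
                raise  # nothing to fall back to; surface the failure
            return self._last_good, True
        self._last_good = prices
        return prices, False


# Demonstration with a fetcher that succeeds once and then times out.
_calls = {"n": 0}


def flaky_fetch() -> Prices:
    _calls["n"] += 1
    if _calls["n"] > 1:
        raise TimeoutError("pricing service timeout")
    return {"efficiency-model": (0.0001, 0.0002)}


store = CostModelStore(flaky_fetch)
first = store.get()    # fresh prices
second = store.get()   # fetch fails, last-known-good returned with stale flag
```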

8. Security Considerations

  • Budget bypass attempts (manually setting max_tokens above the enforced ceiling) are rejected at the gateway; the ceiling is a platform-enforced control, not a suggestion
  • Consumer token counters are stored in a dedicated Redis instance with no direct consumer write access; only the gateway cost accounting service can increment counters
  • Cost event stream is read-only for consumers; teams can view their own consumption data but not other teams'
  • Chargeback reports are distributed per-team; cross-team visibility requires FinOps-level access

OWASP LLM Top 10 Controls

| OWASP LLM Risk | Cost Control Layer |
| --- | --- |
| LLM04 Model DoS | Token budget per consumer prevents any single consumer exhausting platform capacity; this is both a cost and availability control |
| LLM08 Excessive Agency | Agentic loops are bounded by per-session token budgets; runaway agent loops are expensive before they are harmful |

9. Governance Considerations

Budget Governance

  • Monthly token budgets per team are approved by the AI Governance Board and FinOps team jointly; requests for budget increases require business case documentation
  • Emergency budget extensions (overriding the hard ceiling) require explicit sign-off from the team's engineering manager and FinOps; all extensions are logged

Chargeback Model

  • Costs attributed via the cost event stream are the official basis for internal chargeback; teams are responsible for their attributed costs
  • The cost model is updated quarterly as provider pricing changes; teams are notified 30 days in advance of pricing model changes
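
Because the cost event stream is the basis for chargeback, the monthly report reduces to an aggregation over events. The sketch below assumes events shaped like the one in the data flow (with `team` and `cost_usd` fields); the function name and sample values are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def chargeback_by_team(events: Iterable[Mapping]) -> Dict[str, float]:
    """Sum attributed cost per team from per-request cost events."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["team"]] += event["cost_usd"]
    return dict(totals)


# Illustrative events only; real input would be the month's cost event archive.
sample_events = [
    {"team": "marketing", "model": "claude-haiku", "cost_usd": 0.000122},
    {"team": "marketing", "model": "claude-haiku", "cost_usd": 0.000200},
    {"team": "support", "model": "claude-sonnet", "cost_usd": 0.012000},
]
report = chargeback_by_team(sample_events)
```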

Governance Artefacts

| Artefact | Owner | Cadence | Location |
| --- | --- | --- | --- |
| Team budget schedule | FinOps + AI Governance Board | Annual (reviewed quarterly) | Platform configuration + finance system |
| Budget extension approvals | Engineering Manager + FinOps | Per-event | GRC system |
| Monthly chargeback report | Platform Team | Monthly | Finance system + team dashboards |
| Cost model pricing updates | Platform Team | Quarterly | Platform configuration |
| Cost optimisation roadmap | FinOps + Platform Team | Quarterly | Internal wiki |

10. Operational Considerations

Monitoring

| Signal | Source | Alert Threshold | Owner |
| --- | --- | --- | --- |
| Team budget at 80% | Budget tracker | Event-driven | FinOps + Team Lead |
| Team budget at 100% | Budget tracker | Event-driven (high urgency) | FinOps + Team Lead + Engineering Manager |
| Daily spend > 1.5× previous day average | Cost event aggregation | Daily window | FinOps On-Call |
| Abnormal token counts per request (P99 spike) | Request metrics | >200% of rolling P99 baseline | Platform On-Call |
| Cache hit rate drop | Cache metrics | <10% sustained 1 hour | Platform Team |
| Budget tracker service unavailable | Health check | Immediate | Platform On-Call |
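
Two of the signals above are simple ratio checks and can be sketched as follows. The function names are hypothetical, and the daily-spend check interprets "previous day average" as the mean of a trailing window of daily totals, which is an assumption about the intended baseline.

```python
from statistics import mean
from typing import Sequence


def daily_spend_anomaly(today_usd: float, previous_days_usd: Sequence[float],
                        factor: float = 1.5) -> bool:
    """True when today's spend exceeds factor x the trailing daily average."""
    if not previous_days_usd:
        return False  # no baseline yet, so nothing to compare against
    return today_usd > factor * mean(previous_days_usd)


def p99_token_spike(current_p99: float, baseline_p99: float, ratio: float = 2.0) -> bool:
    """True when per-request P99 tokens exceed 200% of the rolling baseline."""
    return current_p99 > ratio * baseline_p99
```

With a trailing average of $100 per day, a $160 day trips the first check and a $140 day does not.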

SLOs

| SLO | Target | Window |
| --- | --- | --- |
| Cost event ingestion latency | <5 seconds from request completion | Rolling 7 days |
| Budget counter accuracy | <1% variance from actual provider charges | Monthly reconciliation |
| Alert delivery latency | <60 seconds from threshold breach | Per-event |
| Dashboard data freshness | <5 minutes lag | Rolling 7 days |
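
The budget counter accuracy SLO amounts to a relative-variance check at monthly reconciliation. A minimal sketch, assuming variance is measured relative to the provider invoice (the function names are illustrative):

```python
def reconciliation_variance(tracked_usd: float, invoiced_usd: float) -> float:
    """Relative difference between platform-tracked cost and the provider invoice."""
    if invoiced_usd <= 0:
        raise ValueError("invoiced_usd must be positive")
    return abs(tracked_usd - invoiced_usd) / invoiced_usd


def within_accuracy_slo(tracked_usd: float, invoiced_usd: float,
                        tolerance: float = 0.01) -> bool:
    """True when variance is under the 1% SLO target."""
    return reconciliation_variance(tracked_usd, invoiced_usd) < tolerance
```

Tracking $9,950 against a $10,000 invoice is a 0.5% variance and passes; tracking $9,800 is 2% and would trigger the manual correction described under failure modes.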

Disaster Recovery

| Component | RPO | RTO | Strategy |
| --- | --- | --- | --- |
| Budget counter (Redis) | 5 min | 5 min | Redis Sentinel; brief window of over-limit requests acceptable |
| Cost event stream (Kafka) | <1 min | 10 min | Cross-region replication |
| Dashboard (read-only) | 1 hour | 30 min | Acceptable staleness for non-critical service |
| Chargeback report data | 0 | 24 hours | Recomputable from cost event archive |

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Weight |
| --- | --- | --- |
| Redis for budget counters | Minimal memory footprint; high throughput needed | Very Low |
| Cost event stream (Kafka/Kinesis) | Volume proportional to request rate | Low |
| Dashboard hosting | Read-only service; moderate cost | Low |
| LLM API costs (controlled) | Primary cost being managed; all controls aimed here | Dominant |

Indicative Cost Range

| Scale | Monthly Cost Control Infra | LLM Savings from Controls |
| --- | --- | --- |
| Small (<1M tokens/day) | $100–$300 | $500–$2,000 from tier routing + caching |
| Medium (1–50M tokens/day) | $500–$2,000 | $5,000–$30,000 from combined controls |
| Large (>50M tokens/day) | $2,000–$8,000 | $30,000–$150,000+ from combined controls |

12. Trade-Off Analysis

Budget Enforcement Options

| Option | Description | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Hard Stop at 100% | Block all requests at budget limit | Absolute cost certainty | Operational incidents if budget misconfigured | Finance-controlled AI programmes; strict cost accountability |
| Soft Throttle at 100% | Allow requests at reduced rate (e.g., 10% of normal) after limit | Degraded not dead | Still accumulates cost above budget | Product-focused teams; uptime priority |
| Alert Only | No enforcement; only alerts at thresholds | No operational impact | No cost control; only cost visibility | Initial rollout; trust-based environment |

Caching Strategy Options

| Option | Description | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Provider-Side Cache Only | Use provider KV cache for prefix caching | Zero additional infrastructure; reduces billed input tokens | Only prefix-level caching; full responses are never reused | Workloads with long stable system prompts |
| Semantic Cache Only | Platform-level near-match response caching | Full-response reuse across requests; higher hit rate potential | Privacy considerations; false positive risk | FAQ, classification, search augmentation |
| Combined Provider + Semantic | Both layers active | Maximum cost reduction | Complexity; requires careful TTL management | High-volume mixed workloads |

Architectural Tensions

| Tension | Option A | Option B | Resolution |
| --- | --- | --- | --- |
| Strict per-request limits vs. flexible prompting | Hard input token ceiling | Soft guidance | Configurable per use-case class; creative use cases have higher limits |
| Team autonomy vs. cost governance | Teams set own budgets | Central FinOps sets all budgets | FinOps sets envelope; teams allocate within envelope by product/feature |
| Cache freshness vs. cost savings | Low TTL (fresh) | High TTL (cheap) | TTL per corpus type; static knowledge bases: long TTL; dynamic context: short/no TTL |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Budget tracker (Redis) failure | Medium | Medium: budget enforcement suspended | Redis health check fail | Fail-safe: revert to rate limiting only; alert FinOps |
| Cost model stale (pricing outdated) | Medium | Low: routing decisions suboptimal | Automatic freshness check alert | Manual pricing update; automated via provider pricing API |
| Token counter drift (Redis vs. actual spend) | Low | Medium: budget accountability gap | Monthly reconciliation vs. provider invoice | Reconciliation report triggers manual correction |
| Alert fatigue (too many budget warnings) | High | Low-Medium: alerts ignored | Alert volume metrics | Tune thresholds; consolidate daily digest vs. real-time alerts |
| Batch API failure causing real-time fallback | Medium | Medium: unexpected cost increase | Batch failure rate spike | Alert FinOps; teams approve real-time cost increase or pause workload |

14. Regulatory Considerations

APRA CPS 230 (Operational Risk)

  • Cost control mechanisms are operational risk controls for the AI platform; the budget enforcement system must itself be resilient
  • AI cost overruns that materially affect the organisation's operational budget may constitute an operational risk event reportable under CPS 230

Financial Reporting

  • Internal cost attribution data must be accurate enough to support financial reporting; the cost event reconciliation process ensures chargeback data matches actual provider invoices

15. Reference Implementations

AWS

| Component | AWS Service |
| --- | --- |
| Budget counters | ElastiCache Redis |
| Cost events | Kinesis Data Streams → S3 → Athena |
| Alerts | CloudWatch Alarms + SNS → PagerDuty / Slack |
| Dashboard | CloudWatch custom dashboards + Grafana |
| Chargeback reports | Athena queries + S3 + QuickSight |
| Provider pricing API | Bedrock pricing API (where available) |

Azure

| Component | Azure Service |
| --- | --- |
| Budget counters | Azure Cache for Redis |
| Cost events | Event Hubs → Azure Data Lake Gen2 |
| Alerts | Azure Monitor Alerts + Action Groups |
| Dashboard | Azure Monitor Workbooks + Grafana |

On-Premises

| Component | Technology |
| --- | --- |
| Budget counters | Redis Enterprise |
| Cost events | Apache Kafka → ClickHouse |
| Dashboard | Grafana + ClickHouse data source |
| Alerts | Alertmanager → PagerDuty |

16. Related Patterns

| Pattern ID | Name | Relationship |
| --- | --- | --- |
| EAAPL-PLT002 | AI API Gateway | Host: budget enforcement implemented within gateway pipeline |
| EAAPL-PLT003 | Model Routing | Component: cost-based routing is a cost control mechanism |
| EAAPL-PLT006 | LLM Caching Layer | Complementary: caching reduces effective token consumption |
| EAAPL-PLT001 | Enterprise AI Platform | Parent: cost management is a shared service |
| EAAPL-INT005 | Batch AI Processing | Complementary: batch routing reduces cost for async workloads |

17. Maturity Assessment

Overall Maturity: Proven

Token budget enforcement and model tier routing are production-proven at scale. Provider-side prompt caching is a relatively recent feature (2024) that is proving high-value. The combined pattern has strong ROI evidence across multiple enterprise deployments.

Scoring Matrix

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | Core controls proven; provider cache integration emerging |
| ROI Evidence | 5 | Consistent 40–60% spend reduction documented |
| Tooling Maturity | 4 | Redis counters and dashboards mature; provider pricing APIs variable |
| Operational Complexity | Medium | Budget configuration requires FinOps discipline; manageable |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-04-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-11-10 | EAAPL Working Group | Provider-side prompt caching section added; batch API cost models updated |
| 1.2 | 2025-06-12 | EAAPL Working Group | Cost range data updated; tiered budget enforcement (soft/hard/ceiling) documented |