EAAPL-AGT010 — AI Agent Cost Governance Architecture
Status: Proven
Tags: agent cost-optimisation observability llm medium-complexity
Version: 1.1
Last Updated: 2026-06-12
Author: Enterprise AI Architecture Pattern Library
1. Executive Summary
Agentic AI systems that autonomously execute multi-step workflows—calling tools, querying APIs, writing and executing code, and iterating on results—introduce a category of financial risk that traditional IT cost management does not address: runaway agent execution. A single misconfigured agent task can consume tens of thousands of dollars in LLM API tokens, tool execution costs, and third-party API calls within minutes, with no human in the loop to apply a circuit breaker. Enterprise deployments of AI agents without cost governance controls have experienced individual incidents consuming USD 10,000–100,000 in a single errant run.
This pattern provides a comprehensive cost governance architecture for AI agent deployments. It covers: per-agent token budget enforcement; per-task cost ceilings with hard stops; pre-flight cost estimation before task execution; real-time cost tracking mid-execution; cost anomaly detection with configurable thresholds; kill switch mechanisms for runaway agents; cost chargeback models to business units; cost dashboards per agent type; and cost optimisation strategies including model tier routing, tool result caching, and tool call batching. The architecture is designed to be agent-framework agnostic and is compatible with LangChain, AutoGen, CrewAI, Anthropic's Claude agent SDK, and custom agent implementations.
2. Problem Statement
Business Problem
Without cost governance, AI agent deployments create uncontrolled financial exposure. Business units deploying agents for automation tasks—document processing, research, code generation, customer service—may trigger agent runs that iterate excessively, call expensive tools repeatedly, or enter infinite loops. These incidents are invisible until the cloud billing alert fires at end-of-month. Monthly AI spend becomes unpredictable, chargebacks to business units are impossible to defend, and the CFO has no visibility into per-agent or per-department AI costs.
Technical Problem
LLM APIs price by token, with costs varying significantly by model tier. An agent that uses GPT-4o at USD 10/million output tokens rather than GPT-4o-mini at USD 0.60/million pays roughly 17x more in output-token cost for the same task if model selection is not governed. Agent loops, in which the agent makes tool calls, receives results, reasons about next steps, and makes more tool calls, can iterate dozens or hundreds of times on a single task if no termination condition fires. Tool calls themselves (web search APIs, code execution, database queries) may have per-call pricing. The aggregate cost of a complex agentic task is difficult to estimate in advance.
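The tier arithmetic can be made concrete with a minimal sketch. It uses only the two output-token prices quoted above; the function names, the 100-step loop, and the 2,000 tokens per step are illustrative assumptions, and input tokens are ignored.

```python
# Output-token prices quoted in the text, in USD per million tokens.
OUTPUT_PRICE_PER_MILLION = {"gpt-4o": 10.00, "gpt-4o-mini": 0.60}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost of the output tokens for one call (input tokens ignored)."""
    return OUTPUT_PRICE_PER_MILLION[model] * output_tokens / 1_000_000


def loop_cost_usd(model: str, iterations: int, tokens_per_step: int) -> float:
    """Output-token cost of a loop emitting a fixed token count per step."""
    return iterations * output_cost_usd(model, tokens_per_step)


# Price ratio between the two tiers for the same output volume.
ratio = OUTPUT_PRICE_PER_MILLION["gpt-4o"] / OUTPUT_PRICE_PER_MILLION["gpt-4o-mini"]
print(f"tier price ratio: {ratio:.1f}x")
print(f"100 steps on gpt-4o:      USD {loop_cost_usd('gpt-4o', 100, 2000):.2f}")
print(f"100 steps on gpt-4o-mini: USD {loop_cost_usd('gpt-4o-mini', 100, 2000):.2f}")
```

The same loop costs the same multiple more on the premium tier regardless of how many iterations it runs, which is why ungoverned model selection compounds with ungoverned iteration counts.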
Symptoms
- Monthly LLM API spend is 3–5x higher than forecast with no clear explanation
- Individual agent runs have occasionally consumed USD 500–5,000 in tokens before timing out
- Business unit CTOs cannot explain their AI spend to finance because there is no per-use-case cost attribution
- Agents are deployed using the most capable (most expensive) model for all tasks regardless of task complexity
- No alerts fire when an agent run is 10x more expensive than average
Cost of Inaction
| Dimension | Consequence |
| Financial | Uncontrolled AI spend; budget overruns; individual runaway incidents costing USD 10K–100K |
| Operational | Agent tasks compete for token budgets with production workloads; cascading slowdowns |
| Strategic | Finance refuses to expand AI budget due to lack of cost predictability; AI programme stalls |
| Governance | Inability to demonstrate ROI per business unit; AI investment decisions based on incomplete data |
3. Context
When to Apply
- Any production deployment of AI agents that make LLM API calls and/or tool calls
- Multi-agent systems where agents spawn sub-agents or parallel agent workflows
- Platforms offering self-service agent capabilities to multiple business units or teams
- AI agents performing open-ended research, code generation, or document processing tasks where task scope can expand unpredictably
When NOT to Apply
- Single-turn LLM API calls (chatbots, simple completions) — token budget enforcement is sufficient; full agent cost governance is over-engineered
- Agents running exclusively on fixed-cost infrastructure (self-hosted models with no per-token pricing) — cost ceiling controls are less relevant but monitoring still applies
- Development and testing environments — apply lighter-weight controls; save full governance for production
Prerequisites
| Prerequisite | Description |
| LLM API Access with Usage Metering | API provider exposes token usage per call in response headers or logs |
| Agent Observability Infrastructure | Ability to instrument agent execution with per-step metrics |
| Cost Allocation Model | Budget owners identified per agent type or business unit |
| Agent Execution Platform | Agent framework through which governance controls can be injected |
Industry Applicability
| Industry | Key Agent Cost Risk | Governance Priority |
| Financial Services | Automated research agents; regulatory document processing | High: budget control + audit trail |
| Legal / Professional Services | Document review agents; contract analysis | High: per-matter cost attribution |
| Software Development | Code generation agents; automated PR review | Medium: per-repository cost tracking |
| Healthcare | Clinical literature review; medical coding agents | High: per-patient/per-task cost |
| Media and Publishing | Content generation; translation agents | Medium: per-content-item cost |
| E-commerce | Product description; customer service agents | Medium: per-SKU/per-session cost |
4. Architecture Overview
The Agent Cost Governance Architecture implements six layers of cost control that apply at different stages of agent execution.
Layer 1 — Pre-Flight Cost Estimation
Before an agent task is submitted for execution, the Pre-Flight Cost Estimator generates an estimated cost range based on: the task type (which has historical cost distributions from completed tasks); the configured model tier; the agent's tool set (each tool has an average cost-per-call from historical data); and the expected number of iterations (based on similar past tasks). The estimate is surfaced to the requesting system or user with three tiers: expected cost, 90th-percentile cost, and maximum permissible cost (the ceiling). If the 90th-percentile estimate exceeds the task's configured cost ceiling, the task is rejected before execution begins with a recommendation to adjust scope or model tier. This prevents obviously over-budget tasks from starting at all.
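The pre-flight gate described above can be sketched as follows. This is a minimal illustration under stated assumptions: historical costs are available as a plain list of USD values, the P90 is taken with the standard library's `statistics.quantiles`, and the names `Estimate` and `preflight_estimate` are hypothetical. The tool-cost and iteration-count inputs are assumed to be already folded into the historical per-task costs.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class Estimate:
    expected_usd: float
    p90_usd: float
    ceiling_usd: float

    @property
    def approved(self) -> bool:
        # Reject before execution when the P90 estimate exceeds the ceiling.
        return self.p90_usd <= self.ceiling_usd


def preflight_estimate(historical_costs_usd: list[float], ceiling_usd: float) -> Estimate:
    """Derive expected and P90 cost from completed runs of the same task type."""
    if len(historical_costs_usd) < 2:
        raise ValueError("need at least two historical runs to estimate")
    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    p90 = quantiles(historical_costs_usd, n=10)[-1]
    return Estimate(mean(historical_costs_usd), p90, ceiling_usd)


# Hypothetical history: mostly cheap runs with one expensive outlier.
history = [0.4, 0.5, 0.6, 0.5, 0.7, 0.5, 0.6, 2.5, 0.5, 0.6]
tight = preflight_estimate(history, ceiling_usd=2.0)
loose = preflight_estimate(history, ceiling_usd=5.0)
print(tight.approved, loose.approved)
```

Gating on the P90 rather than the mean means a single expensive outlier in the history is enough to reject a task against a tight ceiling, which is the conservative behaviour the layer is meant to provide.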
Layer 2 — Per-Agent Token Budget Enforcement
Each agent type has a configured token budget per execution: a combined input+output token limit. The Token Budget Enforcer intercepts every LLM API call made by the agent and maintains a running token counter. Before each LLM call, it checks whether the remaining budget is sufficient for the anticipated call (estimated by the prompt length + a configurable output reserve). If the remaining budget is insufficient, the agent is halted cleanly: a summarisation prompt is injected to generate a partial result from work completed so far, and the execution is terminated with a budget-exhausted status. This prevents gradual token drift from running unchecked across a long agent execution.
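A minimal sketch of the check-then-record cycle, assuming the caller can count prompt tokens before each call and read reported usage after it. The class and exception names are hypothetical, and the summarisation step is left to the caller that catches the exception.

```python
class BudgetExhausted(Exception):
    """Raised when the next LLM call would not fit in the remaining budget."""


class TokenBudgetEnforcer:
    def __init__(self, budget_tokens: int, output_reserve: int = 1024):
        self.budget_tokens = budget_tokens    # combined input + output limit
        self.output_reserve = output_reserve  # tokens held back for the reply
        self.used_tokens = 0

    @property
    def remaining(self) -> int:
        return self.budget_tokens - self.used_tokens

    def check(self, prompt_tokens: int) -> None:
        """Call before each LLM request: prompt plus output reserve must fit."""
        if prompt_tokens + self.output_reserve > self.remaining:
            raise BudgetExhausted(f"only {self.remaining} tokens remain")

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call after each LLM response with the usage the provider reported."""
        self.used_tokens += input_tokens + output_tokens
```

On `BudgetExhausted`, the agent runtime would inject the summarisation prompt and terminate with a budget-exhausted status. In practice the summarisation call itself needs tokens, so a deployment would likely hold back a separate allowance for it.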
Layer 3 — Per-Task Cost Ceiling (Hard Stop)
The Per-Task Cost Ceiling operates in addition to the token budget—it tracks total monetary cost across all cost sources: LLM tokens (priced per model tier), tool calls (external API costs), code execution compute, and vector store query costs. Each task has a configurable monetary ceiling. When the real-time cost tracker detects that the cumulative cost has reached the ceiling, a hard stop is triggered: the agent's execution thread receives a termination signal, the partial result is saved, and a cost-limit-exceeded status is recorded. Unlike the token budget (which can be exhausted by a single very large prompt), the monetary ceiling catches runaway costs from repeated tool calls or expensive external API calls even when individual LLM calls are within budget.
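The monetary ceiling can be sketched as a running total across cost sources. This is an in-process illustration only; the class name, exception name, and source labels are assumptions, and a production tracker would more likely keep the counter in a shared store.

```python
from collections import defaultdict


class CostCeilingExceeded(Exception):
    """Signals the hard stop: cumulative cost has reached the task ceiling."""


class TaskCostTracker:
    def __init__(self, ceiling_usd: float):
        self.ceiling_usd = ceiling_usd
        self.by_source: dict[str, float] = defaultdict(float)

    @property
    def total_usd(self) -> float:
        return sum(self.by_source.values())

    def add(self, source: str, cost_usd: float) -> None:
        """Record a cost event, then trigger the hard stop if the ceiling is hit."""
        self.by_source[source] += cost_usd
        if self.total_usd >= self.ceiling_usd:
            raise CostCeilingExceeded(
                f"USD {self.total_usd:.2f} reached ceiling USD {self.ceiling_usd:.2f}"
            )


tracker = TaskCostTracker(ceiling_usd=1.00)
tracker.add("llm_tokens", 0.50)
tracker.add("tool_calls", 0.25)
print(f"running total: USD {tracker.total_usd:.2f}")
```

Because every source feeds the same total, many cheap tool calls trip the ceiling just as a few expensive LLM calls would.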
Layer 4 — Real-Time Cost Tracking and Anomaly Detection
The Cost Monitor aggregates all cost signals from all executing agents in real time: token usage per LLM call (from the usage metadata returned with each API response), tool call costs (from tool execution middleware), and external API costs (from billing webhooks or mock pricing models). The anomaly detector compares each agent run's cumulative cost against the baseline distribution for that agent type. If the run exceeds 3x the average cost at any point in execution, an anomaly is flagged: an alert is raised to the platform team, and the agent run enters a "watchlist" state where cost is monitored more frequently. If the run reaches a configurable fraction of the ceiling (e.g., 80%), a pre-stop warning is sent to the requesting system.
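The two thresholds in this layer reduce to a small classification function. The sketch below assumes the baseline is summarised as a single average cost per agent type; the enum and function names are hypothetical, and the defaults mirror the 3x and 80% figures from the text.

```python
from enum import Enum


class RunState(Enum):
    NORMAL = "normal"
    WATCHLIST = "watchlist"
    PRE_STOP_WARNING = "pre_stop_warning"


def classify_run(
    running_cost_usd: float,
    baseline_avg_usd: float,
    ceiling_usd: float,
    anomaly_multiple: float = 3.0,
    warn_fraction: float = 0.8,
) -> RunState:
    """Classify a run from its cumulative cost so far."""
    # The ceiling check takes priority: it is the last signal before a hard stop.
    if running_cost_usd >= warn_fraction * ceiling_usd:
        return RunState.PRE_STOP_WARNING
    # Otherwise flag runs that exceed the configured multiple of the baseline.
    if running_cost_usd > anomaly_multiple * baseline_avg_usd:
        return RunState.WATCHLIST
    return RunState.NORMAL


print(classify_run(2.0, baseline_avg_usd=0.5, ceiling_usd=10.0))
```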
Layer 5 — Kill Switch and Emergency Stop
The Kill Switch provides human-initiated emergency stop capability for any running agent task. It is exposed as: an API endpoint accessible to platform operators and privileged users; a UI control in the agent cost dashboard; and an automated trigger when the cost anomaly detector fires at the critical threshold. When the kill switch is activated, the agent's execution context receives a graceful-shutdown signal. The agent is designed to respond to this signal by: completing the current LLM call (or aborting mid-call if above a time threshold); saving all work-in-progress to the task result store; and returning a partial result with a killed status. A hard kill (process termination) is available as a fallback if graceful shutdown does not complete within 30 seconds.
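Graceful shutdown with a timed fallback can be sketched with a shared stop signal. The example below is an assumption-laden illustration: the agent is a stand-in loop in a thread, `agent_loop` and `kill` are hypothetical names, and the hard kill is left to the caller because it applies to a process or container rather than a Python thread.

```python
import threading
import time


def agent_loop(stop: threading.Event, results: list, status: dict) -> None:
    """Stand-in agent: one unit of work per step, polling the stop signal."""
    step = 0
    while not stop.is_set():
        step += 1
        results.append(f"step-{step}")  # work-in-progress saved as produced
        time.sleep(0.01)                # stands in for an LLM or tool call
    status["state"] = "killed"          # partial result returned as killed


def kill(stop: threading.Event, worker: threading.Thread, grace_seconds: float) -> bool:
    """Signal graceful shutdown; True if the worker stopped within the grace period."""
    stop.set()
    worker.join(timeout=grace_seconds)
    return not worker.is_alive()


stop_signal = threading.Event()
partial_results: list = []
run_status = {"state": "running"}
worker = threading.Thread(target=agent_loop, args=(stop_signal, partial_results, run_status))
worker.start()
time.sleep(0.05)  # let the agent do some work before the kill arrives
stopped_gracefully = kill(stop_signal, worker, grace_seconds=30.0)
print(stopped_gracefully, run_status["state"], len(partial_results))
```

If `kill` returns `False`, the caller would escalate to the hard kill (process or container termination) described above.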
Layer 6 — Cost Attribution and Chargeback
The Cost Attribution Engine assigns costs to business units, cost centres, teams, and individual users based on agent execution metadata (agent type, requesting user, department tag, project code). Costs are aggregated in near-real-time into a Cost Dashboard that shows: current month spend by business unit; spend by agent type; top-cost tasks; anomaly history; and budget utilisation vs allocation. Monthly cost allocation reports are generated for finance chargeback. Budget alerts are sent to business unit owners when utilisation reaches configured thresholds (50%, 80%, 100%).
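Attribution and budget alerting amount to a grouped sum followed by a threshold comparison. The sketch below assumes cost events arrive as dictionaries carrying a `business_unit` tag and a `cost_usd` value; those field names and the function names are illustrative, while the 50/80/100% thresholds come from the text.

```python
from collections import defaultdict

ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100% budget utilisation


def spend_by_business_unit(events: list[dict]) -> dict[str, float]:
    """Sum cost events per business unit from their execution metadata."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["business_unit"]] += event["cost_usd"]
    return dict(totals)


def crossed_thresholds(spend_usd: float, budget_usd: float) -> list[float]:
    """Return every alert threshold the current utilisation has reached."""
    utilisation = spend_usd / budget_usd
    return [t for t in ALERT_THRESHOLDS if utilisation >= t]


events = [
    {"business_unit": "legal", "agent_type": "contract-review", "cost_usd": 40.0},
    {"business_unit": "legal", "agent_type": "contract-review", "cost_usd": 45.0},
    {"business_unit": "research", "agent_type": "web-research", "cost_usd": 10.0},
]
totals = spend_by_business_unit(events)
print(totals, crossed_thresholds(totals["legal"], budget_usd=100.0))
```

A real alerting pipeline would also remember which thresholds have already fired in the current period so that each alert is sent once.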
5. Architecture Diagram
flowchart TD
subgraph Request["Task Request Stage"]
TASKDEF[Task Definition\nAgent Type + Scope + Model]
PREFLIGHT[Pre-Flight Cost Estimator\nEstimate + Ceiling Check]
REJECT{Estimate Exceeds\nCeiling?}
end
subgraph Execution["Agent Execution Stage"]
AGENT[Agent Runtime\nLangChain / AutoGen / Custom]
TOKENBUDGET[Token Budget Enforcer\nPer-Agent Token Limit]
COSTTRACK[Real-Time Cost Tracker\nLLM + Tools + Compute]
CEILING[Cost Ceiling Monitor\nHard Stop at Ceiling]
end
subgraph Tools["Tool Execution Layer"]
TOOL1[Tool: Web Search\nCost: ~USD 0.003/call]
TOOL2[Tool: Code Execution\nCost: ~USD 0.01/run]
TOOL3[Tool: Database Query\nCost: ~USD 0.001/query]
TOOLMID[Tool Cost Middleware\nRecord + Accumulate]
end
subgraph Anomaly["Anomaly Detection + Kill Switch"]
ANOMALY[Anomaly Detector\n3x Baseline = Alert]
WATCHLIST[Watchlist Monitoring\nIncreased Sampling]
KILLSWITCH[Kill Switch\nManual + Automated]
RESULT[(Partial Result Store\nSave on Stop)]
end
subgraph Attribution["Cost Attribution"]
ATTRIB[Cost Attribution Engine\nTag by BU / Team / User]
DASHBOARD[Cost Dashboard\nSpend by Agent + BU + Task]
CHARGEBACK[Monthly Chargeback\nReport for Finance]
BUDGETALERT[Budget Alerts\n50% / 80% / 100%]
end
TASKDEF --> PREFLIGHT
PREFLIGHT --> REJECT
REJECT -->|Yes| DENIED[Task Rejected\nAdjust Scope or Model]
REJECT -->|No| AGENT
AGENT --> TOKENBUDGET
TOKENBUDGET -->|Budget OK| TOOL1
TOKENBUDGET -->|Budget OK| TOOL2
TOKENBUDGET -->|Budget OK| TOOL3
TOKENBUDGET -->|Budget Exhausted| SUMMARIZE[Inject Summarisation\nReturn Partial Result]
TOOL1 --> TOOLMID
TOOL2 --> TOOLMID
TOOL3 --> TOOLMID
TOOLMID --> COSTTRACK
AGENT --> COSTTRACK
COSTTRACK --> CEILING
CEILING -->|Ceiling Hit| KILLSWITCH
COSTTRACK --> ANOMALY
ANOMALY -->|3x Baseline| WATCHLIST
WATCHLIST -->|Critical| KILLSWITCH
KILLSWITCH --> RESULT
COSTTRACK --> ATTRIB
ATTRIB --> DASHBOARD
ATTRIB --> CHARGEBACK
ATTRIB --> BUDGETALERT
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| Pre-Flight Cost Estimator | Processing | Estimate task cost from historical data; reject if over ceiling | Custom service; AWS Lambda; Azure Function | High |
| Token Budget Enforcer | Processing | Track per-agent token consumption; halt on budget exhaustion | Agent middleware (LangChain callback); custom interceptor | Critical |
| Real-Time Cost Tracker | Processing | Aggregate costs from LLM + tools + compute in real time | Custom aggregator; Apache Kafka; Redis + atomic counters | Critical |
| Cost Ceiling Monitor | Processing | Compare running cost against ceiling; trigger hard stop | Custom monitor; Lambda triggered by cost tracker | Critical |
| Tool Cost Middleware | Processing | Intercept all tool calls; record cost per call; accumulate | LangChain callback handler (on_tool_start / on_tool_end hooks); custom decorator; OpenTelemetry | High |
| Anomaly Detector | Analytics | Compare agent run cost against historical baseline; flag 3x outliers | Custom statistical model; AWS CloudWatch Anomaly Detection; Datadog | High |
| Kill Switch API | Operations | Accept manual or automated kill commands; trigger graceful shutdown | REST API + WebSocket notification to agent runtime | Critical |
| Partial Result Store | Storage | Preserve work-in-progress on kill or budget exhaustion | S3 + DynamoDB; Azure Blob + Cosmos DB; Redis | High |
| Cost Attribution Engine | Analytics | Tag costs by business unit, team, user, project code | Custom tagging + aggregation; Databricks; BigQuery | High |
| Cost Dashboard | Reporting | Real-time spend visualisation by agent type, BU, and task | Grafana; Power BI; Retool; custom React dashboard | Medium |
| Budget Alert System | Operations | Notify budget owners at 50/80/100% utilisation | Email + Slack; PagerDuty; SNS | Medium |
| Model Tier Router | Optimisation | Route tasks to lowest-cost model tier capable of the task | Custom capability-cost matrix; LiteLLM; RouteLLM | High |
| Tool Result Cache | Optimisation | Cache tool call results to avoid repeat API calls | Redis; DynamoDB; ElastiCache | Medium |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
| 1 | Requesting System | Submit task with: agent type, scope, model preference, cost ceiling | Task definition record |
| 2 | Pre-Flight Estimator | Look up historical cost distribution for this agent type and scope | Estimated cost range: expected / P90 / maximum |
| 3 | Pre-Flight Gate | Compare P90 estimate against task ceiling; reject or approve | Approved: task execution begins / Rejected: reason returned |
| 4 | Model Tier Router | Assign model tier based on task complexity vs cost optimisation policy | Model assignment (e.g., gpt-4o-mini for initial reasoning; gpt-4o only for final synthesis) |
| 5 | Agent Runtime | Begin execution; Token Budget Enforcer starts counter | Execution context with budget counter |
| 6 | Tool Middleware | Each tool call recorded: tool name, parameters, cost, latency | Tool call record |
| 7 | Real-Time Cost Tracker | Aggregate: LLM token cost + tool call costs; update running total | Updated running total per task |
| 8 | Anomaly Detector | Compare running total against the baseline cost distribution for this agent type | Alert if running total exceeds 3x the baseline average |
| 9 | Agent Runtime | Complete task; return result | Task result with full cost record |
| 10 | Cost Attribution Engine | Tag final cost by BU, team, user, project; write to cost store | Cost record in attribution store |
| 11 | Dashboard | Update real-time cost dashboard | Live cost metrics visible |
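Step 4 of the primary flow can be sketched as a lookup against a capability-cost matrix. Everything in the matrix below is a hypothetical assumption for illustration: the integer complexity scale, the `max_complexity` ratings, and the function name. Only the two model names and their output prices come from this document.

```python
# Hypothetical capability-cost matrix; complexity ratings are illustrative.
MODEL_TIERS = [
    {"model": "gpt-4o", "max_complexity": 5, "output_usd_per_million": 10.00},
    {"model": "gpt-4o-mini", "max_complexity": 2, "output_usd_per_million": 0.60},
]


def route_model(task_complexity: int) -> str:
    """Return the cheapest model whose capability rating covers the task."""
    for tier in sorted(MODEL_TIERS, key=lambda t: t["output_usd_per_million"]):
        if task_complexity <= tier["max_complexity"]:
            return tier["model"]
    raise ValueError(f"no configured tier handles complexity {task_complexity}")


print(route_model(1), route_model(4))
```

Sorting by price inside the function keeps the routing correct even if the matrix is not maintained in price order.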
Error Flow
| Scenario | Failure | Detection | Recovery |
| Pre-Flight Estimator Cold Start | No historical data for new agent type; cannot estimate | New agent type flag | Apply conservative default ceiling (e.g., USD 5); escalate after first run to set baseline |
| Token Budget Exhausted | Agent runs out of token budget mid-task | Token counter reaches limit | Inject summarisation prompt; return partial result; log budget-exhausted status |
| Cost Ceiling Hit | Running cost reaches ceiling before task completion | Cost Monitor ceiling check | Hard stop; save partial result; notify requesting system; alert budget owner |
| Kill Switch API Unavailable | Cannot terminate runaway agent | Health check on Kill Switch API | Fallback: terminate agent container/process directly; alert platform team |
| Tool Cost Middleware Miss | Tool call cost not recorded; running total underestimates | Post-execution reconciliation against vendor billing | Reconcile; update cost record; fix middleware |
8. Security Considerations
Security Controls
| Domain | Control | Implementation | Notes |
| Authentication | Kill Switch API requires elevated authentication; only platform operators and automated monitors can invoke | OAuth 2.0 + RBAC; API key with rate limit | Prevent malicious halting of legitimate agent tasks |
| Authorisation | Cost attribution data accessible only to authorised roles per BU; finance has read-all | RBAC on cost dashboard and attribution store | Prevent cross-BU cost data leakage |
| Secrets | LLM API keys stored in secrets manager; never hardcoded in agent execution context | AWS Secrets Manager; HashiCorp Vault | Prevent API key leakage through agent logs |
| Auditability | All kill switch invocations logged with initiator identity, reason, and agent state at time of kill | Immutable audit log | Provides investigation trail for anomalous kills |
| Agent Output Security | Partial results saved on kill may contain sensitive data; access-controlled | Same security controls as full task results | |
OWASP LLM Top 10 — Cost Governance Interaction
| OWASP LLM Risk | Cost Governance Relevance | Control |
| LLM01 Prompt Injection | Attacker injects prompt causing agent to execute expensive, unnecessary tool calls | Input validation; tool call whitelist; anomaly detection catches cost spike |
| LLM02 Insecure Output Handling | Agent output passed to another expensive agent in a loop | Output validation; inter-agent cost tracking; circuit breaker on agent chains |
| LLM03 Training Data Poisoning | Not directly a cost risk | N/A for cost governance |
| LLM04 Model Denial of Service | Deliberate high-cost task submission to exhaust budget | Pre-flight ceiling; rate limiting on task submission; per-user budget quotas |
| LLM05 Supply Chain Vulnerabilities | Third-party tool or plugin executes expensive operations unexpectedly | Tool cost middleware captures all tool costs; anomaly detection catches surprises |
| LLM06 Sensitive Information Disclosure | Expensive vector store queries to find sensitive data exfiltration targets | Cost governance is detective; combine with DLP controls |
| LLM07 Insecure Plugin Design | Plugin with excessive permissions makes many expensive API calls | Tool call whitelist; per-tool call budget; tool call count limit |
| LLM08 Excessive Agency | Autonomous agent takes open-ended actions generating unlimited costs | Cost ceiling is the primary control; human approval gate for tasks above threshold |
| LLM09 Overreliance | User accepts agent output without checking quality; pays for expensive bad results | Not a cost control issue; quality monitoring is separate |
| LLM10 Model Theft | Not directly a cost risk | N/A for cost governance |
9. Governance Considerations
Cost Governance Framework
| Domain | Requirement | Owner | Cadence |
| Budget Allocation | Annual AI agent budget allocated to each business unit | Finance + BU heads | Annual; revised quarterly |
| Cost Ceiling Policy | Organisation-wide policy on per-task cost ceilings by agent type and risk tier | AI Platform team + Finance | Reviewed quarterly |
| Anomaly Response Playbook | Documented process for investigating and responding to cost anomalies | AI Platform team | On anomaly |
| Chargeback Model | Agreed methodology for allocating shared AI platform costs to BUs | Finance | Reviewed annually |
| Model Tier Policy | Approved model tier assignments by task type | AI Platform team | Reviewed on model price changes |
Governance Artefacts
| Artefact | Description | Retention |
| Cost Ceiling Policy | Per-agent-type monetary ceilings and token budgets | Current version + 3 years |
| Monthly Cost Attribution Reports | Per-BU, per-agent-type, per-user cost summaries for chargeback | 7 years |
| Anomaly Investigation Records | Investigation and root cause for each flagged cost anomaly | 3 years |
| Kill Switch Audit Log | All kill switch invocations with identity, reason, agent state | 3 years |
| Budget Utilisation Reports | Monthly BU budget utilisation vs allocation | 7 years |
10. Operational Considerations
Monitoring and SLOs
| SLO | Target | Measurement | Breach Action |
| Pre-Flight Estimation Accuracy | P90 estimate within 2x of actual cost | Estimate vs actual comparison after completion | Retrain estimator; adjust confidence multiplier |
| Runaway Agent Incidents | 0 tasks exceed 5x their configured ceiling | Tasks with cost > 5x ceiling / month | Root cause analysis; tighten anomaly threshold |
| Kill Switch Response Latency | <5 seconds from kill command to agent termination | Kill-to-stop time metric | Investigate agent shutdown logic; fallback to process kill |
| Cost Dashboard Latency | Cost data visible within 60 seconds of task completion | Dashboard data freshness metric | Investigate ingestion pipeline |
| Budget Alert Delivery | 100% of threshold breaches generate alert within 2 minutes | Alert delivery rate and latency | Investigate alerting pipeline |
Disaster Recovery
| Scenario | Impact | Recovery |
| Real-Time Cost Tracker Outage | Cost ceilings cannot be enforced; anomaly detection offline | Halt agent task submission; restore tracker; batch-reconcile costs post-restoration |
| Kill Switch API Outage | Cannot terminate runaway agents | Fallback: terminate agent container directly via platform API; escalate to on-call |
| Cost Attribution Store Outage | Costs not attributed during outage period; chargeback gap | Restore from backup; estimate attributions from execution logs |
11. Cost Considerations
Cost Optimisation Strategies
| Strategy | Description | Estimated Saving | Implementation |
| Model Tier Routing | Route simple subtasks to cheaper models (GPT-4o-mini, Haiku, Llama) | 40–80% cost reduction for mixed-complexity workloads | RouteLLM; LiteLLM; custom capability-cost matrix |
| Tool Result Caching | Cache tool call results (web search, database queries) for reuse within session | 20–50% reduction in tool call costs for repetitive tasks | Redis cache on tool middleware; TTL-based invalidation |
| Tool Call Batching | Batch multiple tool calls in a single API request where tool supports it | 10–30% reduction in per-call overhead | Tool wrapper with batching logic |
| Prompt Compression | Compress verbose context before injection into LLM prompt | 20–40% input token reduction | LLMLingua; selective context; summarisation of prior history |
| Agent Loop Limit | Hard limit on maximum iterations before forced summarisation | Prevents open-ended iteration loops consuming unlimited tokens | Agent configuration parameter; enforced by framework |
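Tool result caching with TTL-based invalidation can be sketched as a thin wrapper around the tool call. This is an in-memory stand-in for the Redis cache named in the table; the class name and the injectable `clock` parameter are assumptions made to keep the example self-contained, and tool arguments must be hashable.

```python
import time
from typing import Any, Callable


class ToolResultCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: dict[tuple, tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def call(self, tool_name: str, tool: Callable[..., Any], *args: Any) -> Any:
        """Return a fresh cached result if present, else run the tool and cache it."""
        key = (tool_name, args)
        entry = self._entries.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = tool(*args)  # the only path that incurs a paid tool call
        self._entries[key] = (self.clock(), result)
        return result
```

Each cache hit avoids one paid tool call, so the hit rate feeds directly into the tool-cost saving. Any shared cache must also be scoped per user or tenant where results could contain user-specific data.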
Indicative Cost Range
| Agent Type | Typical Cost Per Run (Without Governance) | With Governance (Model Routing + Caching) | Notes |
| Document Summarisation | USD 0.10–2.00 | USD 0.05–0.80 | Significant savings from model routing |
| Research Agent (10 web searches) | USD 0.50–5.00 | USD 0.20–2.00 | Tool caching on repeat searches |
| Code Generation + Test | USD 1.00–10.00 | USD 0.50–4.00 | Model routing for planning; premium for generation |
| Multi-Agent Orchestration (5 agents) | USD 5.00–50.00 | USD 2.00–20.00 | Cumulative savings across all agents |
| Runaway Agent (unbounded loop) | USD 100–10,000 | USD 0.50–50 (ceiling enforced) | Ceiling is the critical control |
12. Trade-Off Analysis
Architecture Options
| Option | Description | Pros | Cons | Recommended For |
| Option A: Budget Enforcer Only | Implement only token budget enforcement; no pre-flight or anomaly detection | Low complexity; fast to implement | No pre-emptive cost estimation; no real-time monetary ceiling | Early-stage agent deployments; low agent task volume |
| Option B: Full Cost Governance Stack | All six layers: pre-flight, token budget, monetary ceiling, anomaly detection, kill switch, attribution | Maximum cost control; complete visibility | Higher implementation complexity; marginal latency overhead | Production multi-agent platforms; significant AI spend; multiple BUs |
| Option C: Cloud Provider Native Controls | Use cloud provider billing alerts and budget controls only | Zero implementation cost; no architecture changes | Alerts are after-the-fact (billing cycle lag); cannot stop runaway mid-execution | Acceptable only as a supplement, not a primary control |
Architectural Tensions
| Tension | Trade-Off | Resolution |
| Cost Control vs Agent Capability | Tight token budgets and model tier routing may reduce agent task quality | Risk-tier tasks: quality-critical tasks get higher budgets and premium models; routine tasks use optimised settings |
| Real-Time Control vs Latency | Synchronous cost checks add latency to every LLM call | Async cost tracking with threshold polling; synchronous only at ceiling approach (80% threshold) |
| Granularity vs Overhead | Very granular per-tool cost tracking creates instrumentation overhead | Profile high-cost tools precisely; estimate low-cost tools with cached average |
| Pre-Flight Accuracy vs Estimation Speed | Highly accurate estimates require complex models; too slow for interactive tasks | Fast heuristic estimate for interactive tasks; detailed estimate for batch/async tasks |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| Pre-Flight Estimator Underestimates | Medium | High: tasks approved that exceed ceiling; ceiling becomes primary control | Post-completion actual vs estimate comparison | Retrain estimator; lower approval threshold temporarily |
| Token Budget Enforcer Bypassed | Low | High: unlimited token consumption | Post-execution cost reconciliation | Enforce budget at API gateway level as backup; architecture review |
| Ceiling Hit During Critical Task | Medium | Medium: partial result returned; task must be retried with higher ceiling | Cost ceiling log; requesting system receives ceiling-exceeded status | Requesting system retries with higher ceiling (manual authorisation); partial result may be usable |
| Anomaly Detector False Positive | Medium | Low: legitimate expensive task flagged; human review required | Human review clears anomaly | Tune anomaly threshold; add agent type-specific baselines |
| Runaway Agent in Multi-Agent System | Low | Critical: spawned sub-agents multiply cost | Anomaly detector catches spend spike | Kill parent agent; costs for sub-agents still attributed; run root cause analysis |
Cascading Failure Scenario
A research agent is deployed to automate competitive intelligence gathering. The agent is configured to call a web search tool and a summarisation tool. The agent prompt contains an instruction to be thorough, which the agent interprets as requiring 50+ web searches per research task (each at USD 0.003). The token budget is set in tokens only; there is no monetary ceiling and no anomaly detector. The agent runs 200 research tasks overnight in batch mode. Each task costs USD 0.50–2.00. The batch costs USD 300. This is above budget but not alarming. Three months later, the agent's prompt is modified to add a competitor product deep-dive sub-task. The modified agent spawns a sub-agent per competitor (10 competitors) per research task. Cost per parent task: USD 25. 200 nightly batch tasks: USD 5,000 per night. Monthly spend: USD 150,000. The first indication is the monthly cloud bill. By the time billing alerts fire, USD 150,000 has been consumed. Remediation: add monetary ceiling; add anomaly detection; require pre-flight approval for tasks estimated above USD 10.
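The arithmetic of the scenario can be checked with a few lines. All cost figures below are taken from the scenario itself; the 30-night month is an assumption, and the comparison against the 3x anomaly threshold uses the original batch's average cost per task (USD 300 across 200 tasks) as the baseline.

```python
TASKS_PER_NIGHT = 200
NIGHTS_PER_MONTH = 30              # assumption: a 30-day month
BASELINE_BATCH_COST_USD = 300      # original nightly batch cost
COST_PER_PARENT_TASK_USD = 25      # after the sub-agent change
PREFLIGHT_APPROVAL_THRESHOLD_USD = 10
ANOMALY_MULTIPLE = 3.0

baseline_per_task = BASELINE_BATCH_COST_USD / TASKS_PER_NIGHT
nightly_spend = COST_PER_PARENT_TASK_USD * TASKS_PER_NIGHT
monthly_spend = nightly_spend * NIGHTS_PER_MONTH

# Either proposed control would have fired on the first modified task.
exceeds_anomaly_threshold = COST_PER_PARENT_TASK_USD > ANOMALY_MULTIPLE * baseline_per_task
needs_preflight_approval = COST_PER_PARENT_TASK_USD > PREFLIGHT_APPROVAL_THRESHOLD_USD

print(f"baseline per task: USD {baseline_per_task:.2f}")
print(f"nightly: USD {nightly_spend:,}  monthly: USD {monthly_spend:,}")
print(f"multiple of baseline: {COST_PER_PARENT_TASK_USD / baseline_per_task:.1f}x")
```

At roughly 17 times the baseline cost per task, the modified agent would have tripped a 3x anomaly threshold on its first run rather than surfacing a month later on the bill.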
14. Regulatory Considerations
| Regulation | Cost Governance Relevance | Architectural Control | Reference |
| APRA CPS230: Material Business Services | AI agents embedded in material business services must have cost controls to prevent service disruption | Cost ceilings prevent budget exhaustion that would halt business services | CPS230 operational resilience |
| SOX (US Public Companies) | AI-generated financial analysis costs must be attributable and auditable for financial controls | Cost attribution with immutable audit trail | SOX Section 302/404 |
| EU AI Act Article 9: Risk Management | Runaway agent cost is an operational risk that must be managed | Pre-flight estimation + ceiling + kill switch = risk management controls | EU AI Act Article 9 |
| GDPR Article 25: Privacy by Design | Cost optimisation (prompt compression, caching) must not inadvertently increase privacy risk | Cached tool results must not expose cross-user data; prompt compression must preserve PII controls | GDPR Article 25 |
| Financial Services Regulations (General) | AI operational costs in production financial services workflows require governance and auditability | Chargeback model; cost dashboard; monthly reports | OCC; APRA; FCA guidance on operational risk |
| ISO 42001 Clause 8: Operation | AI system operational controls include cost management | Cost governance is an operational control in the AI system lifecycle | ISO/IEC 42001:2023 Clause 8 |
15. Reference Implementations
AWS
| Component | AWS Service |
| Pre-Flight Cost Estimator | AWS Lambda + DynamoDB (historical cost store) |
| Token Budget Enforcer | LangChain CallbackHandler deployed in Lambda |
| Real-Time Cost Tracker | Amazon Kinesis Data Streams + Lambda consumer + ElastiCache Redis |
| Cost Ceiling Monitor | Lambda triggered by Kinesis; writes to SNS on ceiling approach |
| Kill Switch API | API Gateway + Lambda; sends SSM Run Command to agent container |
| Anomaly Detection | Amazon CloudWatch Anomaly Detection on cost metrics |
| Cost Attribution Store | Amazon DynamoDB with BU/team/user partition keys |
| Cost Dashboard | Amazon QuickSight; or Grafana + CloudWatch data source |
| Model Tier Router | Amazon Bedrock model selection + LiteLLM |
| Tool Result Cache | Amazon ElastiCache Redis |
Azure
| Component | Azure Service |
| Pre-Flight Cost Estimator | Azure Function + Cosmos DB |
| Token Budget Enforcer | Azure Function middleware in agent pipeline |
| Real-Time Cost Tracker | Azure Event Hubs + Azure Function consumer + Azure Cache for Redis |
| Kill Switch API | Azure API Management + Azure Container Apps lifecycle API |
| Anomaly Detection | Azure Monitor Anomaly Detection; or Datadog |
| Cost Attribution Store | Azure Cosmos DB |
| Cost Dashboard | Power BI + Azure Monitor data source |
| Model Tier Router | Azure AI Studio model routing + LiteLLM |
GCP
| Component | GCP Service |
| Pre-Flight Cost Estimator | Cloud Functions + Firestore |
| Token Budget Enforcer | Cloud Functions middleware |
| Real-Time Cost Tracker | Cloud Pub/Sub + Cloud Functions + Memorystore Redis |
| Kill Switch API | Cloud Run API + Cloud Run job lifecycle management |
| Anomaly Detection | Cloud Monitoring alerting policies + custom anomaly model |
| Cost Attribution Store | BigQuery (cost events table) |
| Cost Dashboard | Looker + BigQuery |
| Model Tier Router | Vertex AI model garden routing + LiteLLM |
On-Premises / Custom Agent Platforms
| Component | Technology |
| Pre-Flight Cost Estimator | FastAPI service + PostgreSQL historical data |
| Token Budget Enforcer | Python decorator on agent LLM call method; Redis counter |
| Real-Time Cost Tracker | Apache Kafka + Flink + Redis |
| Kill Switch API | FastAPI endpoint; sends SIGTERM to agent process |
| Anomaly Detection | Prophet or ARIMA on cost time series; custom threshold alerts |
| Cost Dashboard | Grafana + InfluxDB or PostgreSQL |
| Model Tier Router | LiteLLM proxy with cost-based routing rules |
16. Related Patterns
| Pattern ID | Pattern Name | Relationship | Notes |
| EAAPL-AGT003 | Human-in-the-Loop Oversight | COMPLEMENTARY | HITL gates for high-consequence agent actions complement cost ceilings for high-cost actions |
| EAAPL-AGT007 | Multi-Agent Orchestration | PREREQUISITE | Multi-agent orchestration patterns must include per-agent and per-orchestration cost accounting |
| EAAPL-PLT010 | AI Developer Portal | COMPLEMENTARY | Cost dashboards and self-service cost information for developers are part of the developer portal |
| EAAPL-PLT007 | AI Observability Platform | PREREQUISITE | Agent observability infrastructure is required before cost metrics can be collected |
| EAAPL-CMP002 | APRA CPS234 AI Security | COMPLEMENTARY | Runaway agent cost may indicate an adversarial attack (LLM04 DoS); cost anomaly detection provides security signal |
| EAAPL-AGT001 | Agent Execution Framework | PREREQUISITE | An agent execution framework with middleware injection capability is required to implement token budget enforcement |
17. Maturity Assessment
Overall Maturity Label: Proven
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Current Level |
| Token Budget Enforcement | No limits | Manual configuration per task | Automated per-agent-type budgets | Dynamic budgets based on task complexity estimation | ML-based adaptive budgets | Level 3 |
| Monetary Ceiling | No ceiling | Manual billing alert only | Automated real-time ceiling with hard stop | Ceiling auto-adjusted based on task priority | Predictive ceiling based on task characteristics | Level 3 |
| Anomaly Detection | No detection | Batch billing alerts | Real-time 3x baseline anomaly detection | Multi-signal anomaly (cost + iteration + latency) | Predictive anomaly before cost spike occurs | Level 3 |
| Cost Attribution | No attribution | BU-level attribution only | Per-agent-type + user attribution | Project/feature-level attribution | Real-time chargeback API for BU budget systems | Level 3 |
| Cost Optimisation | No optimisation | Manual model selection | Model tier routing + tool caching | Dynamic optimisation via cost-quality trade-off model | Continuous optimisation with A/B testing | Level 2–3 |
18. Revision History
| Version | Date | Author | Changes |
| 1.0 | 2025-07-01 | EAAPL Working Group | Initial draft |
| 1.1 | 2026-06-12 | EAAPL Working Group | Added RouteLLM/LiteLLM reference implementations; cascading failure scenario; expanded cost optimisation strategies |