EAAPL-INT005 — Batch AI Processing
Tags: batch cost-optimisation high-availability medium-complexity
Status: Proven | Version: 1.0 | Domain: Integration
1. Executive Summary
Batch AI Processing applies AI inference to large volumes of data through scheduled or event-triggered pipeline jobs. Where real-time stream processing targets sub-second to sub-minute latency, batch processing accepts latency measured in minutes to hours in exchange for dramatically lower cost, higher throughput, and simpler operational management.
The pattern addresses the dominant AI workload pattern in enterprise organisations: nightly document classification runs, weekly risk report generation, periodic customer communication personalisation, large-scale data enrichment for analytics, and compliance screening across historical transaction sets. These workloads do not require immediate inference results — but they do require high reliability, cost efficiency, and auditability at scale.
At enterprise scale, the architectural decisions in batch AI processing have direct financial consequences. A poorly designed batch pipeline processing 10 million documents per night with a $0.002 per-document AI cost carries $20,000 of nightly cost. Through partitioning, spot instances, parallelism tuning, retry design, and output validation, well-designed pipelines achieve the same quality at 40–60% lower cost. For CIOs and CTOs, this pattern provides the operational template to run AI at enterprise scale without the cost spiralling that characterises early AI production deployments.
2. Problem Statement
Business Problem
Enterprises accumulate large volumes of unstructured and semi-structured data — documents, contracts, emails, case notes, customer records, transaction histories — that contain insights unlockable only through AI inference. The volume and cost make real-time processing impractical. But without a structured batch processing architecture, these assets go unprocessed, and the business intelligence they contain is never extracted.
Technical Problem
Ad-hoc scripts calling AI APIs directly at large scale fail in predictable ways: rate limit errors abort jobs mid-run; no retry logic means failed items are silently lost; no partitioning means a single failure affects the entire batch; no cost controls allow runaway spending on a misconfigured job. The absence of an architectural framework for batch AI processing is the root cause of these failures.
Symptoms
- "We ran the script and got rate-limit errors halfway through — we don't know which documents were processed."
- AI processing jobs scheduled for 4 hours regularly overrun to 12+ hours without alerting.
- Job failures are discovered when downstream systems detect missing outputs — not when the job fails.
- AI API costs for a single overnight batch run exceed the monthly infrastructure budget.
- Failed items are discarded; after the job, there is no record of which items failed and why.
Cost of Inaction
- Operational: Manual AI processing of documents that should be automated consumes analyst time at $80–$200/hour rates.
- Financial: Unstructured batch jobs without cost controls routinely generate 3–10× the expected AI API spend.
- Quality: Without output validation and DLQ handling, a silent 15% failure rate in document classification produces downstream analytics on an unrepresentative sample.
- Compliance: Batch AI jobs processing regulated data with no audit trail fail the CPS 230 operational risk management standard.
3. Context
When to Apply
- Latency tolerance is minutes to hours (not seconds).
- Input volume is too large for real-time processing at acceptable cost.
- Processing can be scheduled (nightly, weekly) or triggered by an event (new document batch arrives, periodic data export ready).
- SLA can be expressed as job completion time rather than per-event latency.
When NOT to Apply
- Real-time or near-real-time response is required — use EAAPL-INT004.
- Input volume is small enough for synchronous request/response — use direct API integration.
- Interactive user experience requires AI inference results immediately — batch processing is inherently asynchronous.
- Exact processing sequence matters (e.g., each output depends on the previous output) — batch parallelism assumes independent items.
Prerequisites
- A job scheduling mechanism (cron, event trigger, workflow orchestrator).
- An AI inference provider capable of handling the target batch throughput (or on-premises model serving).
- An output storage system capable of receiving the batch output volume.
- A retry and DLQ infrastructure for failed item handling.
Industry Applicability
| Industry | Applicability | Typical Use Case | SLA |
|---|---|---|---|
| Financial Services | Very High | Nightly contract classification, customer risk narrative generation, AML document screening | 4–8 hours for overnight batch |
| Legal / Professional Services | Very High | Contract analysis, due diligence document extraction, regulatory filing review | Hours to days |
| Healthcare | High | Medical record coding, discharge summary generation, clinical trial document review | Hours |
| Government | High | Benefit application processing, permit document review, correspondence classification | Hours to days |
| Insurance | High | Claims document classification, policy comparison, fraud investigation support | Hours |
| Retail / eCommerce | Medium | Product description generation, catalogue enrichment, review sentiment analysis | Hours (overnight) |
4. Architecture Overview
Batch AI Processing is a pipeline architecture with six stages: scheduling, input partitioning, parallel execution, output aggregation, validation, and completion reporting. Each stage is described below with the key architectural decisions required at enterprise scale.
Job Scheduling. Three scheduling patterns are applicable. Cron scheduling runs jobs at fixed times — appropriate for nightly enrichment runs where SLA is defined by business day start. Event-triggered scheduling runs jobs when an input threshold is met (e.g., 10,000 documents arrived in the input bucket triggers the job) — appropriate when input arrives irregularly and processing should begin immediately when sufficient volume justifies the fixed startup cost. Threshold-triggered scheduling runs jobs when a business signal is met (e.g., end-of-month close, regulatory reporting deadline approaching). The scheduler choice drives SLA management: cron-triggered jobs have a fixed start time and calculable completion time; event-triggered jobs have variable start times requiring dynamic SLA tracking.
Input Partitioning. Large input sets must be split into partitions for parallel processing. Partitioning strategies: by document type (PDFs vs. Word vs. emails — enables type-specific AI prompts); by size tier (small documents < 5 pages vs. large documents > 50 pages — enables differently-sized worker resource allocation); by random hash (ensures even load distribution across workers; default choice when no other dimension provides better distribution). Partition skew is a common failure — if document size varies 10× across the input set, a "split into N equal-count partitions" strategy assigns the same number of items but wildly different processing times. Partition strategy must account for heterogeneous input characteristics.
Parallel Batch Execution. Worker fleet sizing: (items in batch / batch duration SLA in seconds) / (per-worker throughput in items/second) = minimum worker count. Add 25% headroom for partition skew. Auto-scaling: start the minimum worker fleet; scale out if consumer lag exceeds threshold or if job is tracking behind the 70% SLA checkpoint. Scale-to-zero after job completion to eliminate idle compute cost. Spot/preemptible instances reduce worker compute cost by 60–80% — handle instance interruption via checkpointing so interrupted partitions are re-queued rather than lost.
Checkpointing. Every worker writes a checkpoint record after completing each item (or each configurable checkpoint interval for large documents): {item_id, partition_id, worker_id, completion_timestamp, output_location}. On worker failure or spot interruption, the unprocessed items in the interrupted partition are re-queued. The checkpoint store enables recovery without reprocessing completed items. Checkpoint data is the source of truth for job progress reporting.
Output Aggregation and Validation. After all workers complete, an aggregation step merges partial outputs and validates completeness. Completeness check: count of items in output vs. count of items in input; any gap triggers investigation. Schema validation: each output item validated against the expected AI result schema; invalid outputs collected for DLQ review. Business rule validation: domain-specific checks on AI outputs (e.g., a risk score must be between 0 and 100; a classification must be from the approved taxonomy) catch AI hallucinations and schema drift before they corrupt downstream systems.
Retry and DLQ. Failed items are collected by workers into a retry queue during execution. After the primary job run, a retry sweep processes the retry queue with exponential backoff. After a configurable maximum retry count (recommend 3), unresolved failures move to the dead letter queue (DLQ). The DLQ record includes: item ID, original item payload, error message, retry count, last attempt timestamp. DLQ items require manual review and remediation — they are not silently discarded. Alert fires on any DLQ entries to prompt investigation.
SLA Management. The job orchestrator tracks progress against the SLA deadline. At 70% of elapsed SLA time, a warning alert fires if job completion projection (based on current throughput) indicates a miss. At 90% of elapsed SLA time, an escalation alert fires and a capacity increase action is triggered automatically. At SLA breach, an incident is created and the downstream consumer is notified of expected delay and partial-completion status.
Cost Controls. Each job executes within a budget envelope: (input item count) × (per-item AI cost estimate) + (worker compute estimate) = job cost estimate. The job orchestrator monitors actual cost against estimate in real time. At 80% of budget, a warning fires. At 100% of budget, the job halts and a manual approval gate is required to continue. This prevents runaway AI API spend from misconfigured jobs.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Job Orchestrator | Service | Schedule execution, monitor SLA and cost, manage job lifecycle, trigger alerts | Apache Airflow, AWS Step Functions, Azure Data Factory, Prefect, Dagster | Critical |
| Partition Strategy Engine | Library/Service | Build input manifest, apply partitioning strategy, write work queue | Custom Python, AWS Glue Crawler, Azure Data Factory partitioning | High |
| Work Queue | Infrastructure | Distribute partition work to worker fleet; track in-flight and completed items | AWS SQS, Azure Service Bus, Redis Queue, GCP Pub/Sub | Critical |
| Worker Fleet | Compute | Process assigned partition: read items, call AI inference, write outputs, checkpoint | AWS Lambda, ECS/Fargate tasks, Azure Functions, Kubernetes Jobs (spot) | Critical |
| Checkpoint Store | Storage | Track per-item completion status for recovery and progress reporting | DynamoDB, Azure Cosmos DB, Redis, PostgreSQL | High |
| AI Inference Provider | AI Service | Execute batch inference for worker-submitted items | OpenAI Batch API, Anthropic Batch, Amazon Bedrock Batch, on-premises vLLM | Critical |
| Retry Queue | Infrastructure | Collect failed items during primary run; feed retry sweep | SQS, Azure Service Bus, Redis | High |
| Output Aggregation Service | Service | Merge partial worker outputs into unified result set; validate completeness | Custom Python, AWS Glue ETL, Azure Data Factory | High |
| Schema Validator | Library | Validate each output item against expected AI result schema | Pydantic, JSON Schema validator, Great Expectations | High |
| Business Rule Validator | Service | Domain-specific output validation; detect AI hallucinations and taxonomy violations | Custom rule engine, dbt tests, Great Expectations | High |
| DLQ and Review Interface | Service + UI | Collect DLQ items; alert on DLQ growth; enable manual review and replay | Custom admin UI + SQS/Service Bus DLQ | High |
| Cost Monitor | Component | Track AI API spend per job; alert at budget thresholds; halt job at budget limit | Custom component using provider cost APIs + job metadata | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Scheduler | Triggers job at cron time or on event condition | Job configuration loaded from orchestrator |
| 2 | Partition Engine | Scans input storage; builds manifest of all items; applies partition strategy; writes N partitions to work queue | N work queue messages, each describing a partition |
| 3 | Auto-Scaler | Reads work queue depth; launches worker fleet sized to throughput target | Worker fleet active |
| 4 | Worker | Dequeues partition; reads items; calls AI batch inference API; writes results to output staging area; writes checkpoint records | Partial output files; checkpoint records per item |
| 5 | Auto-Scaler | Monitors queue depth; adds workers if behind SLA; removes workers as queue drains | Dynamic worker fleet |
| 6 | Aggregation Service | Waits for all partitions complete; merges partial outputs; validates completeness | Unified output dataset |
| 7 | Schema Validator | Validates each output item against result schema | Valid items proceed; invalid items to DLQ |
| 8 | Business Rule Validator | Applies domain rules to AI outputs | Valid items written to result store; rule violations to DLQ |
| 9 | Downstream Consumer | Reads result store; incorporates AI outputs into business process | Business process enhanced with AI outputs |
| 10 | Job Orchestrator | Records job completion: items processed, items failed, cost incurred, actual duration vs. SLA | Completion report written to audit store |
Error Flow
| Step | Error Condition | Detection | Recovery |
|---|---|---|---|
| 4 | AI API rate limit (429) | HTTP 429 from provider | Retry with exponential backoff per Retry-After header; item stays in flight |
| 4 | AI API error (5xx) | HTTP 5xx from provider | Item added to retry queue with error code; worker continues to next item |
| 4 | Worker instance interrupted (spot) | Worker health check fails; queue message visibility timeout expires | Work queue message becomes visible again after visibility timeout; another worker picks it up |
| 4 | AI result schema unexpected | Output parsing fails | Item added to retry queue; after max retries, to DLQ with raw AI response for investigation |
| 6 | Completeness check fails (missing items) | Output count < input count | Alert fires; investigate: check checkpoint store for missing items; check DLQ for failed items |
| 7–8 | Validation failure | Schema or business rule check fails | Item to DLQ with specific validation error; downstream receives only valid outputs |
| Ongoing | Job tracking behind SLA at 70% | SLA monitor projection | Warning alert; auto-scaler increases fleet size |
8. Security Considerations
Authentication and Authorisation
- Workers authenticate to AI inference API using service account credentials with least-privilege scope (inference only, no model management).
- Workers have read access to input storage and write access to output staging area only — no cross-partition read/write.
- Job orchestrator has orchestration permissions only; cannot read input data or write output data directly.
- DLQ access restricted to AI governance and on-call engineering roles.
Secrets Management
- AI provider API keys stored in centralised secrets manager; workers retrieve at job start via instance metadata or secrets injection; keys never in job configuration files.
- Separate API keys per job type and environment (prod/staging); enables per-job key rotation without affecting other jobs.
- API key rotation schedule: 90 days; automated rotation with grace period for in-flight jobs.
Data Classification
- Input items classified before job submission; job metadata includes maximum data classification level.
- Workers handling PII items must be deployed in the approved data-residency region for that classification.
- AI outputs inherit the classification of their input; output storage bucket classification tags set at job start.
Encryption
- Input storage, checkpoint store, retry queue, and output storage encrypted at rest (AES-256).
- In-transit encryption (TLS 1.3) for all API calls and storage operations.
- DLQ items may contain PII from failed AI processing; DLQ storage encrypted and access-logged.
Auditability
- Every job execution logged: job ID, configuration, item count, start time, completion time, item-level success/failure counts, AI provider cost.
- Every item processed has a corresponding checkpoint record: item ID, worker ID, timestamp, status, output location.
- Failed items in DLQ have full context: original item (or reference), error message, retry history — enabling post-hoc investigation of what was processed and why it failed.
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Relevance | Mitigation in This Pattern |
|---|---|---|
| LLM01 — Prompt Injection | Medium | Batch items are documents or structured data; prompt templates constructed by workers (not from item content); free-text document content passed as data argument, not as prompt instruction |
| LLM02 — Insecure Output Handling | High | Schema validator and business rule validator check every AI output before it reaches downstream systems; invalid outputs quarantined in DLQ |
| LLM03 — Training Data Poisoning | Low | Batch processing is inference only; no training pipeline in this pattern; if fine-tuning uses batch outputs, separate validation gate required |
| LLM04 — Model Denial of Service | Medium | Cost monitor halts job at budget limit; rate limiting per worker prevents runaway API consumption |
| LLM05 — Supply Chain Vulnerabilities | Medium | AI provider selected via enterprise procurement; contract includes data handling obligations; worker SDK versions pinned |
| LLM06 — Sensitive Information Disclosure | High | PII-classified items processed by workers in approved data-residency region only; AI provider data processing agreement required for PII; no PII in checkpoint metadata |
| LLM07 — Insecure Plugin Design | Low | Batch workers use standard inference API only; no function calling or plugins in batch inference pattern |
| LLM08 — Excessive Agency | Low | Batch pipeline produces outputs; no automated action on those outputs within this pattern; downstream consumption is a separate system |
| LLM09 — Overreliance | Medium | Confidence score in every output; downstream consumers configured to require human review for items below minimum confidence threshold |
| LLM10 — Model Theft | Low | Batch inference uses provider API; no model weights in custody; provider contract governs |
9. Governance Considerations
Responsible AI
- Batch AI outputs that influence bulk decisions (e.g., risk scores applied to a customer cohort) must be reviewed at the cohort level for demographic bias before downstream application.
- Model performance tracking: maintain ground truth for a sample of batch outputs; compute accuracy, precision, recall monthly; alert on degradation.
- Provide a mechanism for affected parties to request review of AI batch outputs that influenced decisions about them.
Model Risk Management
- Batch AI inference models subject to the same Model Risk Management framework as real-time models — purpose statement, methodology, validation, ongoing monitoring.
- Model version tracked in every output record; retrospective performance analysis by model version enabled via output store query.
- Prompt version tracked separately from model version; prompt changes require validation of output quality on sample set before production deployment.
Human Approval Gates
- For high-stakes batch outputs (credit risk narratives applied to collection decisions, medical record coding affecting billing), a sample review by subject matter experts before release to downstream systems.
- Batch outputs with confidence < configurable threshold routed to human review queue rather than automatic downstream delivery.
Policy and Traceability
- Every downstream system receiving batch AI outputs must store the job_id and item_id with each AI output so the specific model version and prompt version that generated the output is retrievable.
- AI output lineage: source document → job_id → model_version → prompt_version → output → downstream_application.
Governance Artefacts
| Artefact | Owner | Update Frequency | Storage Location |
|---|---|---|---|
| Batch AI Job Registry | Platform Engineering | Per new job type | Job catalogue repository |
| Model Risk Assessment (Batch Models) | Model Risk Team | Per model version change | MRM register |
| Job Cost Report | FinOps | Monthly | FinOps platform |
| Output Quality Report (Accuracy Sample) | Data Science | Monthly | ML platform |
| DLQ Review Log | AI Governance | Per DLQ event | Governance dashboard |
| Data Classification Map for Batch Inputs | Data Governance | Quarterly | Data catalogue |
10. Operational Considerations
Monitoring and SLOs
| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| Job completion within SLA | 99% of jobs | Actual completion time vs. configured SLA | Any miss triggers incident |
| SLA warning at 70% elapsed | Warning triggers | Job progress projection at 70% of SLA time | Worker fleet auto-scales; manual review if projection shows miss |
| Item failure rate (primary run) | < 2% | Failed items / total items before retry | > 5% → manual investigation before retry sweep |
| Post-retry DLQ rate | < 0.1% | DLQ items / total items | Any DLQ entries → alert |
| Cost overrun | 0% of jobs exceed budget | Actual cost vs. job budget | At 80% → alert; at 100% → halt |
| Output validation pass rate | > 99.5% | Valid outputs / total outputs | < 99% → investigate model or prompt quality |
Logging
- Job orchestrator: job start, partition count, worker count, SLA checkpoint warnings, job completion, cost.
- Workers: partition received, item count, AI call count, item-level success/failure, checkpoint writes, errors.
- Aggregation service: completeness check result, validation summary, DLQ referral count.
Incident Response
- SLA breach: incident created automatically at SLA breach time; on-call notified; downstream consumer notified of delay and estimated completion time; investigate: worker scaling, AI provider rate limits, partition skew.
- AI provider outage: retry queue accumulates; if outage exceeds safe retry window, job paused; notifications sent to downstream consumers; resume on provider recovery.
- DLQ accumulation: investigation required before retry — root cause (AI model error, schema mismatch, data quality) must be identified and resolved before DLQ replay.
Disaster Recovery
| Scenario | RTO | RPO | Recovery Procedure |
|---|---|---|---|
| Worker fleet failure | 5 minutes | 0 (checkpoint-based recovery) | Auto-scaling replaces workers; incomplete partitions re-queued via visibility timeout |
| Work queue failure | 15 minutes | 0 (partition manifest in durable storage) | Restore queue; rebuild from partition manifest in orchestrator state |
| Checkpoint store failure | 30 minutes | Up to checkpoint interval (per item) | Restore from backup; workers re-process uncertain items (idempotent output design prevents duplication) |
| AI provider prolonged outage | Variable | 0 | Job paused in orchestrator; resumes automatically when provider recovers; downstream consumers notified |
Capacity Planning
- Worker count: (target throughput items/hour) / (single-worker throughput items/hour) × safety factor 1.25.
- Job duration: (input items) / (total worker throughput items/hour) = expected hours.
- Output storage: (input items) × (average output size per item) × 1.2 overhead factor.
- Checkpoint store: (input items) × (checkpoint record size of ~200 bytes) = storage requirement.
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Typical Proportion |
|---|---|---|
| AI Inference API (tokens) | Per-token charges for batch inference; dominant cost | 55–75% |
| Worker Compute (spot/preemptible) | EC2 Spot, Azure Spot VMs, or Preemptible GCE; 60-80% cheaper than on-demand | 10–20% |
| Input/Output Storage | S3/ADLS/GCS costs for input scan and output write | 3–8% |
| Orchestrator | Airflow/Step Functions/Prefect compute or service cost | 2–5% |
| Checkpoint + Queue Storage | DynamoDB/Redis/SQS; proportional to input item count | 2–5% |
| Output Validation Compute | Schema and business rule validation; typically small | 2–4% |
Scaling Risks
- AI API token costs scale directly with batch input size and prompt length. Prompt length optimisation (shorter prompts for large batches) has immediate cost impact.
- Spot instance interruption rate increases during cloud provider capacity constraints; over-reliance on spot instances without on-demand headroom creates SLA risk.
- Retry amplification: a systematic AI model error causing high failure rates triggers retry sweeps that multiply AI API cost — cost monitor budget halt is the safeguard.
Cost Optimisations
- OpenAI Batch API / Anthropic Batch API: dedicated batch endpoints at 50% of real-time API cost; accept up to 24h turnaround — appropriate for overnight runs.
- Spot/preemptible workers: 60–80% compute cost reduction; handle interruption via checkpoint and work queue visibility timeout.
- Prompt caching: some providers (Anthropic, OpenAI) cache long system prompts; structure prompts with invariant content first to maximise cache hit rate.
- Partition size tuning: too-small partitions incur high per-call overhead; too-large partitions reduce parallelism. Optimal partition size is (worker throughput items/min) × 10 minutes.
- Off-peak scheduling: some providers offer lower rates during off-peak hours; overnight batch jobs can take advantage.
Indicative Cost Range
| Scale | Monthly Worker Compute | AI API (Batch Tier) | Total Monthly |
|---|---|---|---|
| Small (1M items/mo, 500 tokens avg) | $200–$800 (spot) | $500–$2,000 | $700–$2,800 |
| Medium (50M items/mo, 500 tokens avg) | $3,000–$8,000 (spot) | $15,000–$50,000 | $18,000–$58,000 |
| Large (500M items/mo, 500 tokens avg) | $20,000–$50,000 (spot) | $100,000–$350,000 | $120,000–$400,000 |
12. Trade-Off Analysis
Architectural Options Comparison
| Option | Throughput | Cost | Complexity | Reliability | Best For |
|---|---|---|---|---|---|
| Option A — Batch Pipeline (this pattern) | Very High | Low (batch tier + spot) | Medium | High (checkpoint + retry) | Overnight enrichment, large-scale classification, non-time-sensitive AI inference |
| Option B — Real-Time Stream Processing | High | High (GPU serving 24/7) | Very High | High | Sub-second to sub-minute latency requirements |
| Option C — Ad-hoc Script | Medium | Medium | Low | Low (no retry, no checkpoint) | Exploratory or one-off runs only |
| Option D — SaaS Batch Processing | High | High (SaaS margin) | Low | Medium | Teams without infrastructure capability |
Architectural Tensions
| Tension | Trade-Off | Resolution |
|---|---|---|
| Partition size vs. Parallelism vs. Overhead | Small partitions = more parallelism = more queue overhead; large partitions = less parallelism = longer recovery from failure | Optimal partition size: 5–15 minutes of work per worker at target throughput |
| Cost (spot) vs. Reliability (on-demand) | All-spot is cheapest; spot interruptions add complexity and potential SLA risk | Mixed fleet: 70% spot for throughput, 30% on-demand for SLA guarantee |
| Validation strictness vs. Yield | Strict validation catches AI errors; too-strict validation quarantines valid outputs unnecessarily | Tiered validation: schema validation mandatory; business rule validation advisory with manual DLQ review |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Spot instance mass interruption during cloud capacity event | Medium | High — significant worker fleet loss; SLA risk | Worker count metric drops sharply; queue lag increases | Auto-scaler provisions on-demand replacements; SLA alert fires if lag unrecoverable within window |
| AI provider batch API outage | Low | High — batch jobs stall | HTTP errors from all workers; DLQ growth | Retry queue; if extended outage, job pause + notification; resume on recovery |
| Input manifest build fails (storage scan error) | Low | High — job cannot start | Manifest build step fails in orchestrator | Retry manifest build; alert if persistent; manual trigger after storage issue resolved |
| Systematic AI output validation failure | Medium | Medium — high DLQ rate; downstream receives no outputs | Output validation pass rate alert | Investigate AI model version, prompt configuration, input data quality; pause downstream consumption until resolved |
| Checkpoint store unavailable | Low | Medium — no recovery for interrupted workers | Checkpoint write errors from workers | Workers retry checkpoint writes; if persistent, workers continue without checkpointing with risk of reprocessing on failure |
| Job cost overrun before completion | Medium | Medium — job halted; downstream receives partial output | Cost monitor budget halt | Manual approval gate to continue; investigate: item count vs. estimate, token usage vs. estimate, pricing change |
Cascading Failure Scenarios
- High DLQ rate + no monitoring + downstream trust: AI model quality degrades silently → 30% of outputs fail validation → DLQ accumulates → downstream continues receiving 70% of expected outputs → downstream analytics calculations based on biased sample produce incorrect business reports → decisions made on incorrect reports. Mitigation: DLQ rate alert + completeness check on downstream consumption + confidence score distribution monitoring.
- Retry amplification + no cost monitor: Systematic AI error causes 50% item failure → retry sweep triggered → doubles AI API spend → cost monitor (if absent) doesn't halt → second retry doubles again → 4× original cost incurred on a batch that is failing for a systematic reason. Mitigation: cost monitor budget halt is non-optional; investigate root cause before retry sweep.
14. Regulatory Considerations
APRA CPS 230 — Operational Risk
- Clause 36: Batch AI jobs that produce inputs to operational risk reports (credit risk scores, AML screening results) are part of the operational risk management infrastructure; SLA, checkpointing, and retry design directly address continuity requirements.
- Clause 52: Managed batch AI service providers (OpenAI Batch, Anthropic Batch, Amazon Bedrock Batch) are material service providers under CPS 230 third-party risk obligations.
APRA CPS 234 — Information Security
- Clause 15: Encrypted input/output storage, worker network isolation, and per-job API key scoping address the proportional information security control requirement for batch data handling.
Australian Privacy Act 1988
- APP 11 (Security): Batch inputs containing personal data must be destroyed or anonymised after the batch job completes (within configurable retention period); retention of raw PII input beyond the processing need requires justification.
- APP 3 (Collection): Using personal data in batch AI processing must be within the scope of the collection purpose; secondary-purpose bulk AI processing requires assessment.
EU AI Act (2024)
- Article 12 (Record-keeping): Job orchestrator completion report + item-level checkpoint records constitute the logging requirement for high-risk AI batch processing.
- Article 9 (Risk Management): Cost monitor, validation gates, DLQ review process, and sample-based quality monitoring implement the risk management requirements for batch AI systems.
ISO 42001
- Clause 9.1 (Monitoring): Monthly output quality reports, DLQ review logs, and cost reports constitute the performance monitoring evidence required under ISO 42001.
NIST AI RMF (2023)
- MANAGE 2.2: DLQ handling, retry design, and job orchestrator incident integration implement the AI risk treatment procedures required under NIST AI RMF.
- GOVERN 1.3: Job registry and per-job configuration document the organisational context and purpose for each batch AI application — supporting accountability assignment.
15. Reference Implementations
AWS
- Orchestrator: AWS Step Functions (state machine per batch job type) or Amazon MWAA (Managed Airflow)
- Worker Compute: AWS Batch with Spot Fleet integration; or ECS Fargate Spot tasks
- Work Queue: Amazon SQS with visibility timeout for at-least-once processing
- AI Inference: OpenAI Batch API (external) or Amazon Bedrock Batch Inference
- Checkpoint Store: Amazon DynamoDB (per-item conditional writes)
- Input/Output Storage: Amazon S3 with S3 Intelligent-Tiering
- Cost Monitor: AWS Cost Explorer API + custom Lambda monitoring function
- Validation: AWS Glue DataQuality or custom Lambda function
Azure
- Orchestrator: Azure Data Factory (pipeline with activities) or Azure Workflow (Logic Apps)
- Worker Compute: Azure Batch with Low-Priority VM allocation; or Azure Container Apps jobs
- Work Queue: Azure Service Bus Standard tier queues
- AI Inference: Azure OpenAI Batch or external provider
- Checkpoint Store: Azure Cosmos DB (serverless, per-item upsert)
- Input/Output Storage: Azure Data Lake Storage Gen2
- Cost Monitor: Azure Cost Management API + custom Function monitoring
- Validation: Azure Data Factory Data Flow validation
GCP
- Orchestrator: Cloud Composer (Airflow) or Workflows (GCP)
- Worker Compute: Cloud Batch jobs with Spot VM preemptible VMs
- Work Queue: Google Cloud Pub/Sub with ack deadline as visibility timeout
- AI Inference: Vertex AI Batch Prediction or external AI provider
- Checkpoint Store: Cloud Firestore (serverless, per-item conditional write)
- Input/Output Storage: Google Cloud Storage with lifecycle policies
- Cost Monitor: Cloud Billing API + custom Cloud Function monitoring
- Validation: Dataform or dbt on BigQuery
On-Premises / Private Cloud
- Orchestrator: Apache Airflow on Kubernetes (official Helm chart)
- Worker Compute: Kubernetes Jobs with preemption-tolerant pod spec
- Work Queue: Redis with BLPOP / BRPOP patterns; or RabbitMQ
- AI Inference: vLLM or Ollama serving on GPU nodes; or external provider
- Checkpoint Store: PostgreSQL with UPSERT on item_id
- Input/Output Storage: MinIO (S3-compatible) or NFS
- Cost Monitor: Custom Prometheus metric + Alertmanager rule
- Validation: Great Expectations in validation Python task
16. Related Patterns
| Pattern | Relationship | Notes |
|---|---|---|
| EAAPL-INT001 — Enterprise AI Service Bus | Complementary | Batch job completion events published to AI Service Bus for enterprise-wide visibility and cost attribution |
| EAAPL-INT004 — Real-Time AI Stream Processing | Complementary | Together form Lambda architecture for AI: batch for high-volume historical, stream for real-time current |
| EAAPL-INT007 — AI Circuit Breaker | Enables | Circuit breaker wraps AI inference API calls within workers to handle provider outages gracefully |
| EAAPL-INT008 — Bidirectional AI Sync | Complementary | Batch output results feed the sync pattern to update enterprise data stores with AI-enriched data |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Justification |
|---|---|---|
| Architectural Completeness | 5 | All six pipeline stages fully specified; spot handling, checkpointing, retry, DLQ, SLA management, cost controls all included |
| Operational Readiness | 5 | Comprehensive SLOs; incident response; DR; capacity planning all defined |
| Security Coverage | 4 | Encryption, access control, OWASP LLM Top 10 covered; PII handling in batch requires organisation-specific data residency configuration |
| Governance Coverage | 5 | Model risk, output quality monitoring, traceability, human approval gates all included |
| Cost Predictability | 5 | Budget envelope per job; cost monitor; spot instance strategy; batch tier pricing all specified |
| Implementation Complexity | 3 | Medium — well-established cloud services handle most complexity; checkpoint design and partition strategy require careful implementation |
| Industry Validation | 5 | Most common AI production pattern; deployed at scale across all regulated industries |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Working Group | Initial publication — integration patterns series |