EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-WRK012] Workflow State Machine

Category: Agentic Workflows
Sub-category: Deterministic Execution Architecture
Version: 1.0
Maturity: Proven
Tags: state-machine, FSM, workflow-states, deterministic, resumable, auditability
Regulatory Relevance: APRA CPS 230 (operational resilience), ISO 42001 §8.4, EU AI Act (Art. 9)


1. Executive Summary

The Workflow State Machine Pattern defines an explicit finite-state machine (FSM) that governs all agentic workflow transitions, making the execution deterministic, safely interruptible, and resumable from any persisted state. Rather than relying on an LLM to decide what to do next at each step, the state machine provides an explicit, auditable model of workflow states and the transitions between them — an AI agent operates within the state machine's boundaries, not outside them. This is the primary architectural pattern for achieving operational resilience in regulated agentic workflows.

For CIO/CTO audiences: without a state machine, an agentic workflow is a black box — you know the input and the output, but the in-between is an opaque LLM reasoning trace that may be different every run. With a state machine, every workflow has an explicit map of all possible states (Initialised, DataCollected, Assessed, AwaitingApproval, Approved, Completed, Failed) and the events that trigger transitions between them. This means: if the workflow is interrupted mid-execution (server restart, timeout, user cancellation), you know exactly which state it was in and can resume from there. You can audit the exact sequence of state transitions taken by any workflow execution. You can test every transition independently. This is the difference between an AI system you can deploy in a regulated environment and one you cannot.


2. Problem Statement

Business Problem

Regulated enterprise workflows require predictability, recoverability, and auditability that an unconstrained agentic reasoning loop cannot provide. A workflow that processes a loan application, a regulatory determination, or a clinical pathway must be able to: (a) resume from where it stopped if interrupted, (b) provide a complete audit trail of every decision point, (c) prevent the same step from being executed twice on resume (idempotency), and (d) support human intervention at any state.

Technical Problem

An LLM reasoning loop produces non-deterministic execution paths: given the same inputs and state, two runs may take different paths and produce different intermediate decisions. Without explicit state management, interrupted workflows cannot safely resume — there is no ground truth about what was completed before the interruption. Re-running from scratch is wasteful and may produce inconsistent results.

Symptoms of Absence

  • Interrupted workflows cannot resume; must restart from the beginning
  • No deterministic mapping of which processing steps have been completed
  • Human intervention has no defined entry points; must wait for workflow completion or cancellation
  • Audit trail shows only start and end; intermediate decision points are not recorded

Cost of Inaction

  • Resilience: Workflow failures require full restart; compute and time cost is wasted; results may be inconsistent
  • Compliance: Regulated workflows without state transition audit trails cannot satisfy audit requirements
  • Human Oversight: No defined intervention points means human oversight is post-hoc, not in-process

3. Context

When to Apply

  • Workflows must be safely resumable after interruption
  • Regulated workflows require a complete audit trail of decision points
  • Human intervention is required at specific points in the workflow
  • Workflows have complex, multi-step execution paths with conditional transitions
  • Idempotency must be guaranteed (a resumed workflow never executes an already-completed step a second time)

When NOT to Apply

  • Short, single-step workflows where state machine overhead is not justified
  • Exploratory or research tasks where the execution path is inherently unpredictable
  • Tasks where the overhead of state machine design outweighs the benefit (low-stakes, non-regulated)

Prerequisites

  • Defined state inventory (all possible workflow states)
  • Defined transition events (all triggers that cause state changes)
  • Persistent state store (state must survive process restarts)
  • State machine engine (custom or framework-provided)
  • Idempotency key per workflow execution

Industry Applicability

| Industry | Workflow State Machine Use Case | Critical States |
|---|---|---|
| Financial Services | Loan approval workflow | Applied → Assessed → Reviewed → Decided → Disbursed |
| Legal | Matter lifecycle | Opened → InProgress → ReviewPending → Approved → Closed |
| Healthcare | Clinical pathway | Registered → Assessed → TreatmentPlanned → InTreatment → Discharged |
| Government | Permit application | Submitted → UnderReview → AdditionalInfoRequired → Decided → Issued |
| Insurance | Claims lifecycle | Lodged → Assessed → Investigation → Settled → Closed |

4. Architecture Overview

The Workflow State Machine architecture separates the state model (what states exist and how to transition between them) from the agentic logic that executes within each state.

State Model Definition

The state model is defined at design time as a formal specification: a set of named states, a set of named events, a transition table (current_state × event → next_state), a set of entry actions (executed when entering a state), and a set of exit actions (executed when leaving a state). The state model is version-controlled and tested independently of the agentic logic that executes within states.
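A minimal sketch of such a transition table, using the states and events from this pattern's loan-style example. The names `State`, `TRANSITIONS`, `next_state`, and `InvalidTransition` are hypothetical and chosen for illustration; entry and exit actions are omitted.

```python
from enum import Enum


class State(str, Enum):
    INITIALISED = "Initialised"
    DATA_COLLECTED = "DataCollected"
    UNDER_ASSESSMENT = "UnderAssessment"
    AWAITING_APPROVAL = "AwaitingApproval"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Transition table: (current_state, event) -> next_state.
# Anything not listed here is, by definition, an invalid transition.
TRANSITIONS: dict[tuple[State, str], State] = {
    (State.INITIALISED, "data_received"): State.DATA_COLLECTED,
    (State.DATA_COLLECTED, "data_validated"): State.UNDER_ASSESSMENT,
    (State.UNDER_ASSESSMENT, "assessment_complete_HIGH_RISK"): State.AWAITING_APPROVAL,
    (State.UNDER_ASSESSMENT, "assessment_complete_LOW_RISK"): State.APPROVED,
    (State.UNDER_ASSESSMENT, "error"): State.FAILED,
    (State.AWAITING_APPROVAL, "human_approved"): State.APPROVED,
    (State.AWAITING_APPROVAL, "human_rejected"): State.REJECTED,
    (State.APPROVED, "processing_complete"): State.COMPLETED,
}

TERMINAL_STATES = {State.REJECTED, State.COMPLETED, State.FAILED}


class InvalidTransition(Exception):
    """Raised when an event is not defined for the current state."""


def next_state(current: State, event: str) -> State:
    """Look up the next state; reject anything the table does not define."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(
            f"no transition from {current.value} on event {event!r}"
        ) from None
```

Because the table is plain data, it can be version-controlled and unit-tested without running any agentic logic.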

State Persistence

The current state of every active workflow execution is persisted in a durable store (not in memory). Every state transition is a durable write before the transition-triggered action is executed. This ensures that if the process restarts after a state transition write but before the action completes, the state machine knows which state was reached and can execute the appropriate recovery action on resume.
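The write-before-act ordering can be sketched as follows. This is an illustration only: it uses SQLite from the Python standard library as a stand-in for the durable store, and the table name `workflow_states` and all function names are assumptions rather than part of the pattern.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # ":memory:" keeps the sketch self-contained; a real deployment would
    # use a file path or a server database so state survives restarts.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS workflow_states (
               seq INTEGER PRIMARY KEY AUTOINCREMENT,
               workflow_id TEXT NOT NULL,
               state TEXT NOT NULL,
               event TEXT NOT NULL,
               ts TEXT NOT NULL)"""
    )
    return conn


def persist_transition(conn: sqlite3.Connection, workflow_id: str,
                       state: str, event: str) -> None:
    # Append-only insert; "with conn" commits on success, rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO workflow_states (workflow_id, state, event, ts) "
            "VALUES (?, ?, ?, ?)",
            (workflow_id, state, event, datetime.now(timezone.utc).isoformat()),
        )


def current_state(conn: sqlite3.Connection, workflow_id: str) -> Optional[str]:
    row = conn.execute(
        "SELECT state FROM workflow_states WHERE workflow_id = ? "
        "ORDER BY seq DESC LIMIT 1",
        (workflow_id,),
    ).fetchone()
    return row[0] if row else None


def transition(conn: sqlite3.Connection, workflow_id: str, new_state: str,
               event: str, entry_action: Callable[[str], None]) -> None:
    # 1. Durable write first, 2. entry action second. If the action crashes,
    # the store still shows which state was reached.
    persist_transition(conn, workflow_id, new_state, event)
    entry_action(workflow_id)
```

If `entry_action` raises partway through, `current_state` still returns the new state, which is what lets a resume controller re-dispatch the entry action.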

Agentic Logic Within States

Within each state, an agentic component (LLM, chain, tool call) executes the state's action. The agentic component is free to use non-deterministic reasoning; the state machine imposes determinism at the state-transition level, not within the state's action. The state's action completes by emitting a transition event to the state machine, which persists the new state and triggers the next state's entry action.
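One way to picture this boundary: the action may reason however it likes, but the only thing it hands back to the engine is an event name, and the engine accepts only events defined for the current state. In the sketch below, `assess_risk` is a stand-in for an LLM call, and the threshold of 70 and all names are hypothetical.

```python
from typing import Callable

# Events each state's action is allowed to emit (assumed for illustration).
ALLOWED_EVENTS: dict[str, set[str]] = {
    "UnderAssessment": {
        "assessment_complete_HIGH_RISK",
        "assessment_complete_LOW_RISK",
        "error",
    },
}


def run_state_action(state: str, action: Callable[[dict], str],
                     payload: dict) -> str:
    """Run the state's action and accept only events defined for that state."""
    event = action(payload)
    if event not in ALLOWED_EVENTS.get(state, set()):
        raise ValueError(f"action in {state} emitted undefined event {event!r}")
    return event


def assess_risk(payload: dict) -> str:
    # Stand-in for non-deterministic LLM reasoning. The cut-off of 70 is a
    # made-up threshold, not one prescribed by the pattern.
    if payload["risk_score"] >= 70:
        return "assessment_complete_HIGH_RISK"
    return "assessment_complete_LOW_RISK"
```

An action that tried to return something like `"jump_to_completed"` would be rejected before any state change is persisted.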

Human Intervention Points

States that require human review are defined as "awaiting" states: the workflow enters the state, an external notification is sent (email, task queue, Slack), and the workflow waits for a human-provided event (Approved, Rejected, RequestMoreInfo) to trigger the transition. This is the formal implementation of human-in-the-loop at specific, pre-defined decision points.
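A rough sketch of an awaiting state, assuming an in-memory workflow object and a `notify` callback standing in for email, Slack, or a task queue. The `Workflow` class, the function names, and the two events shown are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable

# Human-provided events accepted while the workflow is waiting.
HUMAN_TRANSITIONS: dict[tuple[str, str], str] = {
    ("AwaitingApproval", "human_approved"): "Approved",
    ("AwaitingApproval", "human_rejected"): "Rejected",
}


@dataclass
class Workflow:
    workflow_id: str
    state: str
    decided_by: list[str] = field(default_factory=list)


def enter_awaiting(wf: Workflow, notify: Callable[[Workflow], None]) -> None:
    """Entry action for the awaiting state: notify, then stop and wait."""
    wf.state = "AwaitingApproval"
    notify(wf)
    # No further processing happens here; the workflow stays parked until
    # submit_human_decision is called from the approval portal.


def submit_human_decision(wf: Workflow, event: str, actor: str) -> str:
    """Apply a human-provided event; reject it outside an awaiting state."""
    key = (wf.state, event)
    if key not in HUMAN_TRANSITIONS:
        raise ValueError(f"{event!r} is not accepted in state {wf.state}")
    wf.state = HUMAN_TRANSITIONS[key]
    wf.decided_by.append(actor)
    return wf.state
```

The point of the shape is that a human decision is just another event in the transition table, so it is validated and audited the same way as an agent-emitted one.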

Resume from State

On workflow resume (after interruption), the state machine loads the persisted current state and executes the entry action for that state. If the interruption occurred mid-action (entry action partially complete), the entry action must be idempotent: re-executing it produces the same result as executing it for the first time. Entry actions are written to be idempotent by default (check-then-act; ON CONFLICT DO NOTHING pattern for database writes).
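The check-then-act and ON CONFLICT DO NOTHING combination might look like the sketch below. It assumes SQLite 3.24 or later (for the `ON CONFLICT ... DO NOTHING` clause), and the table `action_results`, the key format, and the function names are hypothetical.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE action_results ("
        "idempotency_key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def run_entry_action_once(conn: sqlite3.Connection, idempotency_key: str,
                          action: Callable[[], str]) -> str:
    # Check: if a result is already recorded for this key, reuse it.
    row = conn.execute(
        "SELECT result FROM action_results WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row is not None:
        return row[0]

    # Act: run the action, then record the result. DO NOTHING makes a
    # duplicate insert (for example from a concurrent resume) harmless.
    result = action()
    with conn:
        conn.execute(
            "INSERT INTO action_results (idempotency_key, result) "
            "VALUES (?, ?) ON CONFLICT(idempotency_key) DO NOTHING",
            (idempotency_key, result),
        )
    # Return whatever is stored, so every caller sees the same result.
    return conn.execute(
        "SELECT result FROM action_results WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()[0]
```

This only makes the recorded result idempotent. If the action has external side effects (payments, emails) and the process dies between the action and the insert, those side effects need their own idempotency key on the receiving system.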

State Transition Audit

Every state transition is an immutable audit record: from_state, to_state, triggering_event, timestamp, actor (human or agent), and any decision metadata produced by the agentic action. The audit trail is a sequence of state transition records, not a raw LLM reasoning trace.
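As an illustration, the audit record fields listed above map naturally onto a frozen dataclass. The class names and the JSON Lines export are assumptions for this sketch; a production audit writer would target an append-only table or event stream.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class TransitionRecord:
    """One audit entry per transition; frozen so fields cannot be reassigned."""
    from_state: str
    to_state: str
    triggering_event: str
    timestamp: str          # ISO 8601 string supplied by the engine
    actor: str              # e.g. "agent" or a human user id
    metadata: dict = field(default_factory=dict)


class AuditLog:
    """In-memory, append-only sequence of transition records."""

    def __init__(self) -> None:
        self._records: list[TransitionRecord] = []

    def append(self, record: TransitionRecord) -> None:
        self._records.append(record)

    def history(self) -> tuple[TransitionRecord, ...]:
        # Return a tuple so callers cannot reorder or drop entries.
        return tuple(self._records)

    def to_jsonl(self) -> str:
        return "\n".join(
            json.dumps(asdict(r), sort_keys=True) for r in self._records
        )
```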


5. Architecture Diagram

flowchart TD
    subgraph StateMachine["State Machine Engine"]
        A[Initialised]
        B[DataCollected]
        C[UnderAssessment]
        D[AwaitingApproval]
        E[Approved]
        F[Rejected]
        G[Completed]
        H[Failed]
    end
    subgraph Actions["State Entry Actions"]
        I[Data Extraction]
        J[Risk Assessment]
        K[Human Notification]
        L[Fulfilment Action]
    end
    subgraph Persistence["State Persistence"]
        M[(State Store)]
    end
    A -->|data_received| B
    B -->|data_validated| C
    C -->|assessment_complete HIGH_RISK| D
    C -->|assessment_complete LOW_RISK| E
    D -->|human_approved| E
    D -->|human_rejected| F
    E -->|processing_complete| G
    C -->|error| H
    B --> I
    C --> J
    D --> K
    E --> L
    A & B & C & D & E & F & G & H --> M

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| State Model Definition | Configuration | Formal spec of states, events, transitions, actions | JSON/YAML spec; XState; custom DSL | Critical |
| State Machine Engine | Orchestration | Evaluates transitions; persists state; dispatches entry/exit actions | XState; AWS Step Functions; Azure Durable Functions; Temporal; LangGraph | Critical |
| State Store | Persistence | Durable storage of current and historical states per workflow instance | PostgreSQL; DynamoDB; Cosmos DB (append-only) | Critical |
| State Transition Audit Writer | Governance | Writes immutable audit record per transition | Append-only table; CloudWatch Events; event stream | Critical |
| Agentic Action Executor | AI Component | Executes the agentic logic for each state (LLM, chain, tool calls) | Any agentic workflow pattern | Critical |
| Human Notification Service | Integration | Notifies humans of awaiting states; receives human decisions | Email; Slack; task queue; workflow portal | High |
| Resume Controller | Resilience | Loads persisted state on restart; re-dispatches entry action | Built into state machine engine | Critical |
| Idempotency Guard | Resilience | Ensures entry actions are safe to re-execute | Check-then-act; ON CONFLICT DO NOTHING | Critical |

7. Data Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | External System | Triggers workflow: loan application received | Initial event: application_received |
| 2 | State Machine Engine | Transitions Initialised → DataCollected; persists state | State written: {state: "DataCollected", event: "application_received", ts: ...} |
| 3 | Agentic Action (Data Extraction) | LLM extracts structured data from application | {applicant, income, requested_amount, purpose} |
| 4 | State Machine Engine | Agentic action emits data_validated; transitions DataCollected → UnderAssessment | State written; audit record: {from: "DataCollected", to: "UnderAssessment", event: "data_validated"} |
| 5 | Agentic Action (Risk Assessment) | LLM assesses credit risk; produces score 78 (high risk) | {risk_score: 78, risk_category: "high", recommendation: "manual_review"} |
| 6 | State Machine Engine | Agentic action emits assessment_complete_HIGH_RISK; transitions UnderAssessment → AwaitingApproval | State written; audit record |
| 7 | Human Notification | Sends approval request to credit officer | Email with case summary |
| 8 | Credit Officer | Approves via portal | Human event: human_approved |
| 9 | State Machine Engine | Transitions AwaitingApproval → Approved; triggers fulfilment action | State written; audit record |
| 10 | Audit System | Retrieves full transition history | Complete state audit trail: 5 transitions (the four above plus Approved → Completed once fulfilment finishes), timestamps, actors |

Error Flow

| Error | Detection | Recovery |
|---|---|---|
| Agentic action fails | Entry action throws exception | State machine transitions to Failed state; error logged; support ticket created |
| Process restart mid-action | State machine loads persisted state on restart | Re-executes entry action (must be idempotent); continues from current state |
| Human does not respond (timeout) | Awaiting state timeout | State machine emits approval_timeout event; transitions to escalation path |
| Transition to invalid state | State model validation | State model validator rejects invalid transitions before they execute |
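
The awaiting-state timeout row can be sketched as a periodic check that produces the `approval_timeout` event once the workflow has been waiting longer than its SLA. The four-hour default and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical SLA for the AwaitingApproval state.
APPROVAL_SLA = timedelta(hours=4)


def timeout_event(entered_at: datetime, now: datetime,
                  sla: timedelta = APPROVAL_SLA) -> Optional[str]:
    """Return the event to emit if the awaiting state has exceeded its SLA."""
    if now - entered_at > sla:
        return "approval_timeout"
    return None


entered = datetime(2025, 6, 13, 9, 0, tzinfo=timezone.utc)
still_waiting = timeout_event(entered, entered + timedelta(hours=3))
overdue = timeout_event(entered, entered + timedelta(hours=5))
```

The returned event is then fed through the normal transition table, so the escalation path is an ordinary, audited transition rather than a special case.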

8. Security Considerations

State Tampering

  • Direct modification of persisted state (bypassing the state machine engine) could put a workflow in an invalid or fraudulent state
  • Mitigation: State store is append-only; no direct update operations; state transitions go through the engine only; state integrity check on load (hash of prior state chain)
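
The "hash of prior state chain" integrity check mentioned in the mitigation could be implemented along these lines. This is a sketch under assumptions: SHA-256 over canonical JSON, an all-zero genesis hash, and hypothetical function names.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first record


def record_hash(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain: list, record: dict) -> None:
    """Append a record linked to the hash of the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"record": record, "prev_hash": prev, "hash": record_hash(prev, record)}
    )


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != record_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

Running `verify_chain` when a workflow's state is loaded means a row edited directly in the store is detected before the engine acts on it.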

OWASP LLM Top 10 (risk identifiers as numbered in the 2023 v1.1 list)

| OWASP LLM Risk | State Machine Applicability | Mitigation |
|---|---|---|
| LLM08 Excessive Agency | Agentic actions within states are bounded by state machine transitions | Agent can only emit defined transition events; cannot jump to arbitrary states |
| LLM09 Overreliance | Automated state transitions for regulated decisions | Human intervention states are mandatory for high-risk decision points |
| LLM01 Prompt Injection | Input to agentic actions within states | Input sanitisation before agentic action; state machine constraints limit action scope |

9. Governance Considerations

State Model as Regulatory Artefact

  • For regulated workflows, the state model (states, transitions, human intervention points) is a regulatory specification artefact; it documents the decision process and must be approved by compliance teams before deployment
  • Changes to the state model (adding states, changing transitions, removing human intervention points) require compliance review

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| State Model Specification | AI Governance + Business Process Owner | On change; quarterly review | Formal definition of all states, transitions, and human intervention points |
| State Transition Audit Archive | Compliance | Per workflow execution; retained per policy | Immutable record of every state transition for regulatory audit |
| State Coverage Test Report | QA | On deployment | Confirms every state and every transition is exercised by test suite |
| Awaiting State SLA Report | Operations | Weekly | Tracks time spent in human-awaiting states; identifies approval bottlenecks |
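
A State Coverage Test Report can be derived by comparing the transitions defined in the state model with the transitions a test suite actually exercised. The sketch below assumes transitions are represented as `(from_state, event, to_state)` tuples; the function name and report shape are hypothetical.

```python
Transition = tuple[str, str, str]  # (from_state, event, to_state)


def transition_coverage(model: set[Transition],
                        exercised: set[Transition]) -> dict:
    """Summarise which model transitions the test suite has exercised."""
    missing = sorted(model - exercised)
    return {
        "total": len(model),
        "covered": len(model & exercised),
        "missing": missing,
        "complete": not missing,
    }


model = {
    ("Initialised", "data_received", "DataCollected"),
    ("DataCollected", "data_validated", "UnderAssessment"),
    ("UnderAssessment", "error", "Failed"),
}
exercised = {
    ("Initialised", "data_received", "DataCollected"),
    ("DataCollected", "data_validated", "UnderAssessment"),
}
report = transition_coverage(model, exercised)
```

A deployment gate could then refuse to release a state model whose report has `complete` set to false.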

10. Operational Considerations

SLOs

| SLO | Target | Window | Alert |
|---|---|---|---|
| Workflow completion rate (reach terminal state) | ≥ 99% | 24-hour rolling | < 97% triggers P2 |
| Failed state rate | ≤ 0.5% | 24-hour rolling | > 1% triggers P2 |
| Awaiting state median duration | ≤ business SLA (e.g., 4h for approvals) | Business hours | > 2× SLA triggers P3 |
| Resume success rate (interrupted workflows) | ≥ 99.9% | Weekly | Any resume failure triggers P1 |
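
The alert thresholds in the table can be evaluated from simple counters. This sketch covers the completion, failed-state, and resume rows only; the function name, argument names, and alert tuple format are assumptions, and windowing (24-hour rolling, weekly) is left to the caller.

```python
def evaluate_slo_alerts(total: int, reached_terminal: int, failed: int,
                        resume_failures: int) -> list[tuple[str, str]]:
    """Return (priority, reason) pairs for breached alert thresholds."""
    alerts: list[tuple[str, str]] = []
    if total > 0:
        if reached_terminal / total < 0.97:
            alerts.append(("P2", "workflow completion rate below 97%"))
        if failed / total > 0.01:
            alerts.append(("P2", "failed state rate above 1%"))
    if resume_failures > 0:
        alerts.append(("P1", "resume failure detected"))
    return alerts


healthy = evaluate_slo_alerts(total=1000, reached_terminal=990,
                              failed=4, resume_failures=0)
degraded = evaluate_slo_alerts(total=1000, reached_terminal=960,
                               failed=20, resume_failures=1)
```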

11. Cost Considerations

| State Machine Component | Cost Driver | Optimisation |
|---|---|---|
| State persistence | Storage writes per transition | Minimal cost; do not optimise away, because persistence is the core value |
| Agentic actions per state | LLM inference per state | State design should consolidate expensive agentic work into minimum required states |
| State machine engine | Compute per evaluation | Negligible vs. LLM inference cost |
| Awaiting states (human holds) | Time cost; opportunity cost | SLA enforcement; escalation workflows to reduce dwell time |

12. Trade-Off Analysis

| Option | Determinism | Resilience | Flexibility | Complexity | Best For |
|---|---|---|---|---|---|
| A: Explicit FSM with durable persistence (Recommended) | Very High | Very High | Medium | High | Regulated, resumable workflows |
| B: Implicit state (in-memory only) | High | Low | High | Low | Development; non-critical short tasks |
| C: ReAct loop (EAAPL-WRK001) | Low | Low | Very High | Medium | Exploratory reasoning; not for regulated flows |
| D: Sequential chain (EAAPL-WRK002) | High | Medium | Low | Low | Fixed, non-resumable workflows |

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| State store unavailable | Low | Critical: no state persistence; workflows cannot proceed safely | Health check; store availability alarm | Queue transitions locally; replay on restore; circuit breaker to halt new workflows |
| Non-idempotent entry action executed twice on resume | Medium | High: duplicate processing, data corruption | Idempotency guard detects duplicate | ON CONFLICT DO NOTHING; idempotency key per entry action |
| State machine cycle (workflow never reaches terminal state) | Low | High: cost and resource waste | Cycle detection; max transition count alarm | Max transitions limit; forced transition to Failed state |
| State model version mismatch (old workflow on new state model) | Low–Medium | High: transition not found | Version check on state load | State model versioning; migration path for in-flight workflows |
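
Two of these recoveries, the max-transitions limit and the version check on state load, are small enough to sketch directly. The limit of 50, the version strings, and all names below are hypothetical.

```python
MAX_TRANSITIONS = 50  # assumed ceiling; tune per workflow


def apply_cycle_guard(transition_count: int, proposed_state: str) -> str:
    """Force the workflow to Failed once it exceeds the transition budget."""
    if transition_count >= MAX_TRANSITIONS:
        return "Failed"
    return proposed_state


class VersionMismatch(Exception):
    """Raised when an in-flight workflow has no migration path."""


def resolve_model_version(instance_version: str, engine_version: str,
                          migrations: dict[str, str]) -> str:
    """Walk the migration map from the instance's version to the engine's.

    `migrations` maps a version to the next version it can be migrated to.
    """
    version = instance_version
    seen = {version}
    while version != engine_version:
        if version not in migrations:
            raise VersionMismatch(
                f"no migration path from {instance_version} to {engine_version}"
            )
        version = migrations[version]
        if version in seen:  # defend against a cyclic migration map
            raise VersionMismatch("migration map contains a cycle")
        seen.add(version)
    return version
```

In this shape the version check runs once when persisted state is loaded, before any entry action is re-dispatched.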

14. Regulatory Considerations

APRA CPS 230

  • The state machine provides the operational resilience evidence APRA requires: workflows can be interrupted and resumed without data loss; the state transition audit trail proves operational continuity.

ISO 42001

  • §8.4: The state model specification is a documented AI system operational procedure; it must be version-controlled, tested, and subject to change management.

EU AI Act

  • Art. 9 (Risk Management) and Art. 14 (Human Oversight): The state machine's explicit human intervention states serve as a documented risk control under Art. 9 and support the human oversight requirement for high-risk AI systems, which is set out in Art. 14.

15. Reference Implementations

AWS

| Component | Service |
|---|---|
| State Machine Engine | AWS Step Functions Standard Workflows (durable, long-running) |
| State Persistence | Step Functions execution history (built-in) |
| Human Intervention | Step Functions waitForTaskToken pattern |
| Audit Trail | CloudWatch Events + S3 for execution history |
| Agentic Actions | Lambda functions per state action |

Azure

| Component | Service |
|---|---|
| State Machine Engine | Azure Durable Functions (stateful orchestration) |
| Human Intervention | Durable Functions human interaction pattern (external events) |
| State Persistence | Durable Functions built-in state storage |
| Audit Trail | Azure Monitor + Application Insights |

On-Premises

| Component | Technology |
|---|---|
| State Machine Engine | XState (JavaScript); python-statemachine; Temporal workflows |
| State Persistence | PostgreSQL (append-only workflow_states table) |
| Agentic Actions | LangGraph nodes within each state |

16. Related Patterns

| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Dynamic Sub-agent Spawning | EAAPL-WRK009 | Integrates With | State machine governs spawn lifecycle states |
| Long-Running Agent | EAAPL-AGT007 | Integrates With | Long-running agents use state machines for resumable execution |
| Workflow Tracing and Replay | EAAPL-WRK013 | Depends On | State transition audit is the primary input to workflow tracing |
| Conditional Routing | EAAPL-WRK011 | Peer | Conditional routing implements switch/if inside a state; state machine governs between states |
| Human Escalation | EAAPL-HITL001 | Integrates With | Awaiting states implement human escalation points |

17. Maturity Assessment

Overall Maturity: Proven

| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Research Foundation | 5 | Finite automata theory is foundational CS; workflow FSM patterns decades-proven |
| Production Deployment | 4 | AWS Step Functions, Azure Durable Functions, Temporal all production-proven at scale |
| Framework Support | 5 | Step Functions, Durable Functions, Temporal, XState, LangGraph all implement FSM |
| LLM Integration | 3 | Integrating LLM agentic actions as FSM state entry actions is newer; patterns maturing |
| Audit Tooling | 4 | Step Functions + Durable Functions provide built-in execution history |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |