[EAAPL-WRK012] Workflow State Machine
Category: Agentic Workflows
Sub-category: Deterministic Execution Architecture
Version: 1.0
Maturity: Proven
Tags: state-machine, FSM, workflow-states, deterministic, resumable, auditability
Regulatory Relevance: APRA CPS 230 (operational resilience), ISO 42001 §8.4, EU AI Act (Art. 9)
1. Executive Summary
The Workflow State Machine Pattern defines an explicit finite-state machine (FSM) that governs all agentic workflow transitions, making the execution deterministic, safely interruptible, and resumable from any persisted state. Rather than relying on an LLM to decide what to do next at each step, the state machine provides an explicit, auditable model of workflow states and the transitions between them — an AI agent operates within the state machine's boundaries, not outside them. This is the primary architectural pattern for achieving operational resilience in regulated agentic workflows.
For CIO/CTO audiences: without a state machine, an agentic workflow is a black box. You know the input and the output, but everything in between is an opaque LLM reasoning trace that may differ on every run. With a state machine, every workflow has an explicit map of all possible states (for example Initialised, DataCollected, UnderAssessment, AwaitingApproval, Approved, Rejected, Completed, Failed) and the events that trigger transitions between them. This has three practical consequences. If the workflow is interrupted mid-execution (server restart, timeout, user cancellation), you know exactly which state it was in and can resume from there. You can audit the exact sequence of state transitions taken by any workflow execution. You can test every transition independently. These properties are what make an agentic system deployable in a regulated environment.
2. Problem Statement
Business Problem
Regulated enterprise workflows require predictability, recoverability, and auditability that an unconstrained agentic reasoning loop cannot provide. A workflow that processes a loan application, a regulatory determination, or a clinical pathway must be able to: (a) resume from where it stopped if interrupted, (b) provide a complete audit trail of every decision point, (c) prevent the same step from being executed twice on resume (idempotency), and (d) support human intervention at any state.
Technical Problem
An LLM reasoning loop produces non-deterministic execution paths: given the same inputs and state, two runs may take different paths and produce different intermediate decisions. Without explicit state management, interrupted workflows cannot safely resume — there is no ground truth about what was completed before the interruption. Re-running from scratch is wasteful and may produce inconsistent results.
Symptoms of Absence
- Interrupted workflows cannot resume; must restart from the beginning
- No deterministic mapping of which processing steps have been completed
- Human intervention has no defined entry points; must wait for workflow completion or cancellation
- Audit trail shows only start and end; intermediate decision points are not recorded
Cost of Inaction
- Resilience: Workflow failures require full restart; compute and time cost is wasted; results may be inconsistent
- Compliance: Regulated workflows without state transition audit trails cannot satisfy audit requirements
- Human Oversight: No defined intervention points means human oversight is post-hoc, not in-process
3. Context
When to Apply
- Workflows must be safely resumable after interruption
- Regulated workflows require a complete audit trail of decision points
- Human intervention is required at specific points in the workflow
- Workflows have complex, multi-step execution paths with conditional transitions
- Idempotency must be guaranteed (resuming in a state must not execute the same step twice)
When NOT to Apply
- Short, single-step workflows where state machine overhead is not justified
- Exploratory or research tasks where the execution path is inherently unpredictable
- Tasks where the overhead of state machine design outweighs the benefit (low-stakes, non-regulated)
Prerequisites
- Defined state inventory (all possible workflow states)
- Defined transition events (all triggers that cause state changes)
- Persistent state store (state must survive process restarts)
- State machine engine (custom or framework-provided)
- Idempotency key per workflow execution
Industry Applicability
| Industry | Workflow State Machine Use Case | Critical States |
|---|---|---|
| Financial Services | Loan approval workflow | Applied → Assessed → Reviewed → Decided → Disbursed |
| Legal | Matter lifecycle | Opened → InProgress → ReviewPending → Approved → Closed |
| Healthcare | Clinical pathway | Registered → Assessed → TreatmentPlanned → InTreatment → Discharged |
| Government | Permit application | Submitted → UnderReview → AdditionalInfoRequired → Decided → Issued |
| Insurance | Claims lifecycle | Lodged → Assessed → Investigation → Settled → Closed |
4. Architecture Overview
The Workflow State Machine architecture separates the state model (what states exist and how to transition between them) from the agentic logic that executes within each state.
State Model Definition
The state model is defined at design time as a formal specification: a set of named states, a set of named events, a transition table (current_state × event → next_state), a set of entry actions (executed when entering a state), and a set of exit actions (executed when leaving a state). The state model is version-controlled and tested independently of the agentic logic that executes within states.
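As a minimal sketch of such a specification, the transition table can be written as a plain mapping from (current_state, event) to next_state. The state names follow the diagram in Section 5, with the risk outcome folded into the event name as in the Data Flow table. The names `State`, `TRANSITIONS`, `InvalidTransition`, and `next_state` are illustrative assumptions, not part of any framework.

```python
from enum import Enum


class State(str, Enum):
    INITIALISED = "Initialised"
    DATA_COLLECTED = "DataCollected"
    UNDER_ASSESSMENT = "UnderAssessment"
    AWAITING_APPROVAL = "AwaitingApproval"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Transition table: (current_state, event) -> next_state.
TRANSITIONS = {
    (State.INITIALISED, "data_received"): State.DATA_COLLECTED,
    (State.DATA_COLLECTED, "data_validated"): State.UNDER_ASSESSMENT,
    (State.UNDER_ASSESSMENT, "assessment_complete_HIGH_RISK"): State.AWAITING_APPROVAL,
    (State.UNDER_ASSESSMENT, "assessment_complete_LOW_RISK"): State.APPROVED,
    (State.UNDER_ASSESSMENT, "error"): State.FAILED,
    (State.AWAITING_APPROVAL, "human_approved"): State.APPROVED,
    (State.AWAITING_APPROVAL, "human_rejected"): State.REJECTED,
    (State.APPROVED, "processing_complete"): State.COMPLETED,
}


class InvalidTransition(Exception):
    """Raised when an event is not defined for the current state."""


def next_state(current: State, event: str) -> State:
    """Look up the next state; undefined pairs are rejected, never guessed."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(
            f"event {event!r} is not defined for state {current.value}"
        ) from None
```

Because the table is plain data, it can be version-controlled and unit-tested without invoking any agentic logic.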
State Persistence
The current state of every active workflow execution is persisted in a durable store (not in memory). Every state transition is written durably before the action it triggers is executed. This ensures that if the process restarts after the state transition write but before the action completes, the state machine knows which state was reached and can execute the appropriate recovery action on resume.
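The write-before-act ordering can be sketched with SQLite standing in for the durable store. This is an illustration under assumptions: the column layout of `workflow_states` and the helper names `open_store`, `current_state`, and `transition` are hypothetical, and a production store would add timestamps and actor fields.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create an append-only transitions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS workflow_states (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            workflow_id TEXT NOT NULL,
            state TEXT NOT NULL,
            event TEXT NOT NULL
        )
        """
    )
    return conn


def current_state(conn, workflow_id):
    """Return the most recently persisted state, or None if unknown."""
    row = conn.execute(
        "SELECT state FROM workflow_states "
        "WHERE workflow_id = ? ORDER BY seq DESC LIMIT 1",
        (workflow_id,),
    ).fetchone()
    return row[0] if row else None


def transition(conn, workflow_id, new_state, event, entry_action):
    """Persist the new state first, then run the state's entry action."""
    # 1. Durable write: the transaction commits when the block exits.
    with conn:
        conn.execute(
            "INSERT INTO workflow_states (workflow_id, state, event) "
            "VALUES (?, ?, ?)",
            (workflow_id, new_state, event),
        )
    # 2. Only now run the action. If the process dies here, the store
    #    already records which state was reached.
    entry_action(workflow_id)
```

If `entry_action` raises or the process is killed, `current_state` still returns the new state, which is what the resume logic relies on.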
Agentic Logic Within States
Within each state, an agentic component (LLM, chain, tool call) executes the state's action. The agentic component is free to use non-deterministic reasoning; the state machine imposes determinism at the state-transition level, not within the state's action. The state's action completes by emitting a transition event to the state machine, which persists the new state and triggers the next state's entry action.
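The boundary between non-deterministic action and deterministic transition can be sketched as follows. The `assess_risk` function is a stand-in for an LLM-backed assessment, and the threshold of 70, the `ALLOWED_EVENTS` mapping, and the `UndefinedEvent` exception are assumptions made for illustration.

```python
# Events each state's action is allowed to emit (hypothetical subset).
ALLOWED_EVENTS = {
    "UnderAssessment": {
        "assessment_complete_HIGH_RISK",
        "assessment_complete_LOW_RISK",
        "error",
    },
}

HIGH_RISK_THRESHOLD = 70  # assumed cut-off, not taken from the pattern


class UndefinedEvent(Exception):
    """Raised when an action emits an event its state does not define."""


def assess_risk(application: dict) -> str:
    """Stand-in for the agentic action; a real one would call an LLM."""
    if application["risk_score"] >= HIGH_RISK_THRESHOLD:
        return "assessment_complete_HIGH_RISK"
    return "assessment_complete_LOW_RISK"


def run_state_action(state: str, action, payload: dict) -> str:
    """Run the action, then accept its event only if the state defines it."""
    event = action(payload)
    if event not in ALLOWED_EVENTS.get(state, set()):
        raise UndefinedEvent(f"{event!r} is not defined for state {state}")
    return event
```

However the action reasons internally, the only thing it can hand back to the engine is one of the events the state model already defines.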
Human Intervention Points
States that require human review are defined as "awaiting" states: the workflow enters the state, an external notification is sent (email, task queue, Slack), and the workflow waits for a human-provided event (Approved, Rejected, RequestMoreInfo) to trigger the transition. This is the formal implementation of human-in-the-loop at specific, pre-defined decision points.
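A minimal sketch of how an awaiting state might resolve, assuming a four-hour approval window and the hypothetical event name `more_info_requested` for the RequestMoreInfo decision. The `approval_timeout` event is the one named in the Error Flow table; the function name `awaiting_event` is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Events a human reviewer may supply while the workflow is waiting.
HUMAN_EVENTS = {"human_approved", "human_rejected", "more_info_requested"}


def awaiting_event(
    entered_at: datetime,
    now: datetime,
    decision: Optional[str] = None,
    timeout: timedelta = timedelta(hours=4),  # assumed approval window
) -> Optional[str]:
    """Return the event to feed the state machine, or None to keep waiting."""
    if decision is not None:
        if decision not in HUMAN_EVENTS:
            raise ValueError(f"unknown human decision: {decision!r}")
        return decision
    if now - entered_at >= timeout:
        return "approval_timeout"
    return None
```

A scheduler or poller would call this periodically for each workflow in an awaiting state and pass any non-None result to the engine as a normal transition event.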
Resume from State
On workflow resume (after interruption), the state machine loads the persisted current state and executes the entry action for that state. If the interruption occurred mid-action (entry action partially complete), the entry action must be idempotent: re-executing it produces the same result as executing it for the first time. Entry actions are written to be idempotent by default (check-then-act; ON CONFLICT DO NOTHING pattern for database writes).
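The ON CONFLICT DO NOTHING pattern can be sketched with SQLite (version 3.24 or later is assumed for this syntax). The `notifications` table, the key format, and the function name `notify_entry_action` are hypothetical; the point is that the idempotency key makes a repeated entry action a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications ("
    " idempotency_key TEXT PRIMARY KEY,"
    " recipient TEXT NOT NULL)"
)


def notify_entry_action(conn, workflow_id: str, state: str, recipient: str) -> bool:
    """Record a notification once per (workflow, state).

    Returns True if this call did the work, False if it was a repeat.
    """
    key = f"{workflow_id}:{state}:notify"
    with conn:
        cur = conn.execute(
            "INSERT INTO notifications (idempotency_key, recipient) "
            "VALUES (?, ?) ON CONFLICT(idempotency_key) DO NOTHING",
            (key, recipient),
        )
    # rowcount is 1 when the row was inserted and 0 when the key existed.
    return cur.rowcount == 1
```

On resume, the engine can call the entry action again without checking whether it ran before; the second call returns False and leaves the table unchanged. Side effects outside the database (for example, actually sending the email) would need to be gated on the returned flag.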
State Transition Audit
Every state transition is an immutable audit record: from_state, to_state, triggering_event, timestamp, actor (human or agent), and any decision metadata produced by the agentic action. The audit trail is a sequence of state transition records, not a raw LLM reasoning trace.
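One way to model the audit record is a frozen dataclass appended to a log that exposes no update or delete operation. The class names `TransitionRecord` and `AuditLog` are assumptions for illustration; the fields mirror those listed above.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class TransitionRecord:
    from_state: str
    to_state: str
    triggering_event: str
    actor: str  # "human" or "agent", or a more specific identifier
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Note: frozen prevents reassigning fields; it does not deep-freeze
    # the contents of this dict.
    decision_metadata: dict = field(default_factory=dict)


class AuditLog:
    """In-memory stand-in for an append-only audit table."""

    def __init__(self):
        self._records = []

    def append(self, record: TransitionRecord) -> None:
        self._records.append(record)

    def history(self) -> tuple:
        # Return a tuple so callers cannot modify the log through it.
        return tuple(self._records)
```

A usage example: `log.append(TransitionRecord("UnderAssessment", "AwaitingApproval", "assessment_complete_HIGH_RISK", "agent", decision_metadata={"risk_score": 78}))`.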
5. Architecture Diagram
flowchart TD
subgraph StateMachine["State Machine Engine"]
A[Initialised]
B[DataCollected]
C[UnderAssessment]
D[AwaitingApproval]
E[Approved]
F[Rejected]
G[Completed]
H[Failed]
end
subgraph Actions["State Entry Actions"]
I[Data Extraction]
J[Risk Assessment]
K[Human Notification]
L[Fulfilment Action]
end
subgraph Persistence["State Persistence"]
M[(State Store)]
end
A -->|data_received| B
B -->|data_validated| C
C -->|assessment_complete HIGH_RISK| D
C -->|assessment_complete LOW_RISK| E
D -->|human_approved| E
D -->|human_rejected| F
E -->|processing_complete| G
C -->|error| H
B --> I
C --> J
D --> K
E --> L
A & B & C & D & E & F & G & H --> M
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| State Model Definition | Configuration | Formal spec of states, events, transitions, actions | JSON/YAML spec; XState; custom DSL | Critical |
| State Machine Engine | Orchestration | Evaluates transitions; persists state; dispatches entry/exit actions | XState; AWS Step Functions; Azure Durable Functions; Temporal; LangGraph | Critical |
| State Store | Persistence | Durable storage of current and historical states per workflow instance | PostgreSQL; DynamoDB; Cosmos DB (append-only) | Critical |
| State Transition Audit Writer | Governance | Writes immutable audit record per transition | Append-only table; CloudWatch Events; event stream | Critical |
| Agentic Action Executor | AI Component | Executes the agentic logic for each state (LLM, chain, tool calls) | Any agentic workflow pattern | Critical |
| Human Notification Service | Integration | Notifies humans of awaiting states; receives human decisions | Email; Slack; task queue; workflow portal | High |
| Resume Controller | Resilience | Loads persisted state on restart; re-dispatches entry action | Built into state machine engine | Critical |
| Idempotency Guard | Resilience | Ensures entry actions are safe to re-execute | Check-then-act; ON CONFLICT DO NOTHING | Critical |
7. Data Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | External System | Triggers workflow: loan application received | Initial event: application_received |
| 2 | State Machine Engine | Transitions Initialised → DataCollected; persists state | State written: {state: "DataCollected", event: "application_received", ts: ...} |
| 3 | Agentic Action (Data Extraction) | LLM extracts structured data from application | {applicant, income, requested_amount, purpose} |
| 4 | State Machine Engine | Agentic action emits data_validated; transitions DataCollected → UnderAssessment | State written; audit record: {from: "DataCollected", to: "UnderAssessment", event: "data_validated"} |
| 5 | Agentic Action (Risk Assessment) | LLM assesses credit risk; produces score 78 (high risk) | {risk_score: 78, risk_category: "high", recommendation: "manual_review"} |
| 6 | State Machine Engine | Agentic action emits assessment_complete_HIGH_RISK; transitions UnderAssessment → AwaitingApproval | State written; audit record |
| 7 | Human Notification | Sends approval request to credit officer | Email with case summary |
| 8 | Credit Officer | Approves via portal | Human event: human_approved |
| 9 | State Machine Engine | Transitions AwaitingApproval → Approved; triggers fulfilment action | State written; audit record |
| 10 | Audit System | Retrieves full transition history | Complete state audit trail: 5 transitions (Initialised through Completed), timestamps, actors |
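The walkthrough above can be replayed as a short sketch. Event names follow the table, plus `processing_complete` from the diagram for the final step; the actor labels and the `replay` helper are assumptions for illustration.

```python
# Happy-path subset of the loan workflow's transition table.
TRANSITIONS = {
    ("Initialised", "application_received"): "DataCollected",
    ("DataCollected", "data_validated"): "UnderAssessment",
    ("UnderAssessment", "assessment_complete_HIGH_RISK"): "AwaitingApproval",
    ("AwaitingApproval", "human_approved"): "Approved",
    ("Approved", "processing_complete"): "Completed",
}


def replay(events, start="Initialised"):
    """Apply (event, actor) pairs in order and collect the audit trail."""
    state, trail = start, []
    for event, actor in events:
        nxt = TRANSITIONS[(state, event)]  # KeyError if not defined
        trail.append({"from": state, "to": nxt, "event": event, "actor": actor})
        state = nxt
    return state, trail


LOAN_RUN = [
    ("application_received", "external_system"),
    ("data_validated", "agent"),
    ("assessment_complete_HIGH_RISK", "agent"),
    ("human_approved", "credit_officer"),
    ("processing_complete", "agent"),
]

final_state, trail = replay(LOAN_RUN)
for entry in trail:
    print(f'{entry["from"]} -> {entry["to"]} on {entry["event"]} by {entry["actor"]}')
```

The resulting trail holds one record per transition, which is the shape an audit system would retrieve in step 10.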
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Agentic action fails | Entry action throws exception | State machine transitions to Failed state; error logged; support ticket created |
| Process restart mid-action | State machine loads persisted state on restart | Re-executes entry action (must be idempotent); continues from current state |
| Human does not respond (timeout) | Awaiting state timeout | State machine emits approval_timeout event; transitions to escalation path |
| Transition to invalid state | State model validation | State model validator rejects invalid transitions before they execute |
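The restart recovery path can be sketched as a small resume controller: load the persisted state, then re-dispatch that state's entry action. The function `make_resume_controller` and its parameters are hypothetical, and the entry actions are assumed to be idempotent.

```python
def make_resume_controller(load_state, entry_actions):
    """Build a resume function from a state loader and an action registry.

    load_state: callable taking a workflow id, returning the persisted
        state name or None.
    entry_actions: dict mapping state name to an idempotent callable.
    """

    def resume(workflow_id):
        state = load_state(workflow_id)
        if state is None:
            raise LookupError(f"no persisted state for {workflow_id}")
        action = entry_actions.get(state)
        if action is not None:
            # Safe to repeat only because entry actions are idempotent.
            action(workflow_id)
        return state

    return resume
```

States with no registered entry action (for example terminal states) are simply reported back, so resuming a finished workflow does no further work.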
8. Security Considerations
State Tampering
- Direct modification of persisted state (bypassing the state machine engine) could put a workflow in an invalid or fraudulent state
- Mitigation: State store is append-only; no direct update operations; state transitions go through the engine only; state integrity check on load (hash of prior state chain)
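The integrity check on load can be sketched as a hash chain over the transition records, where each entry's hash covers the previous hash and the record. SHA-256, the JSON canonicalisation, and the helper names here are assumptions; the pattern itself only calls for a hash of the prior state chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def record_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous hash together with a canonical form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, record: dict) -> None:
    """Append a record, linking it to the hash of the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "hash": record_hash(prev, record)})


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["hash"] != record_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

Running `verify` when a workflow's history is loaded gives the engine a cheap way to refuse to resume from a state that was written outside the engine.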
OWASP LLM Top 10 (risk IDs follow the 2023 v1.1 list)
| OWASP LLM Risk | State Machine Applicability | Mitigation |
|---|---|---|
| LLM08 Excessive Agency | Agentic actions within states are bounded by state machine transitions | Agent can only emit defined transition events; cannot jump to arbitrary states |
| LLM09 Overreliance | Automated state transitions for regulated decisions | Human intervention states are mandatory for high-risk decision points |
| LLM01 Prompt Injection | Input to agentic actions within states | Input sanitisation before agentic action; state machine constraints limit action scope |
9. Governance Considerations
State Model as Regulatory Artefact
- For regulated workflows, the state model (states, transitions, human intervention points) is a regulatory specification artefact; it documents the decision process and must be approved by compliance teams before deployment
- Changes to the state model (adding states, changing transitions, removing human intervention points) require compliance review
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| State Model Specification | AI Governance + Business Process Owner | On change; quarterly review | Formal definition of all states, transitions, and human intervention points |
| State Transition Audit Archive | Compliance | Per workflow execution; retained per policy | Immutable record of every state transition for regulatory audit |
| State Coverage Test Report | QA | On deployment | Confirms every state and every transition is exercised by test suite |
| Awaiting State SLA Report | Operations | Weekly | Tracks time spent in human-awaiting states; identifies approval bottlenecks |
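The State Coverage Test Report can be produced from two inputs: the transition table and the set of (state, event) pairs the test suite actually exercised. The sketch below assumes both are available in that form; `coverage_gaps` is a hypothetical helper name.

```python
def coverage_gaps(transition_table: dict, exercised: set):
    """Return (missing_transitions, missing_states) for a test run.

    transition_table: {(state, event): next_state}
    exercised: set of (state, event) pairs hit by the test suite
    """
    missing_transitions = sorted(set(transition_table) - set(exercised))

    # Every state that appears as a source or a target in the model.
    all_states = {
        s for (src, _), dst in transition_table.items() for s in (src, dst)
    }
    # States touched by at least one exercised, defined transition.
    visited = {
        s
        for key in exercised
        if key in transition_table
        for s in (key[0], transition_table[key])
    }
    return missing_transitions, sorted(all_states - visited)
```

A deployment gate could fail the build whenever either returned list is non-empty.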
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| Workflow completion rate (reach terminal state) | ≥ 99% | 24-hour rolling | < 97% triggers P2 |
| Failed state rate | ≤ 0.5% | 24-hour rolling | > 1% triggers P2 |
| Awaiting state median duration | ≤ business SLA (e.g., 4h for approvals) | Business hours | > 2× SLA triggers P3 |
| Resume success rate (interrupted workflows) | ≥ 99.9% | Weekly | Any resume failure triggers P1 |
11. Cost Considerations
| State Machine Component | Cost Driver | Optimisation |
|---|---|---|
| State persistence | Storage writes per transition | Minimal cost; do not optimise away, because persistence is the core value |
| Agentic actions per state | LLM inference per state | State design should consolidate expensive agentic work into minimum required states |
| State machine engine | Compute per evaluation | Negligible vs. LLM inference cost |
| Awaiting states (human holds) | Time cost; opportunity cost | SLA enforcement; escalation workflows to reduce dwell time |
12. Trade-Off Analysis
| Option | Determinism | Resilience | Flexibility | Complexity | Best For |
|---|---|---|---|---|---|
| A: Explicit FSM with durable persistence (Recommended) | Very High | Very High | Medium | High | Regulated, resumable workflows |
| B: Implicit state (in-memory only) | High | Low | High | Low | Development; non-critical short tasks |
| C: ReAct loop (EAAPL-WRK001) | Low | Low | Very High | Medium | Exploratory reasoning; not for regulated flows |
| D: Sequential chain (EAAPL-WRK002) | High | Medium | Low | Low | Fixed, non-resumable workflows |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| State store unavailable | Low | Critical: no state persistence; workflows cannot proceed safely | Health check; store availability alarm | Queue transitions locally; replay on restore; circuit breaker to halt new workflows |
| Non-idempotent entry action executed twice on resume | Medium | High: duplicate processing, data corruption | Idempotency guard detects duplicate | ON CONFLICT DO NOTHING; idempotency key per entry action |
| State machine cycle (workflow never reaches terminal state) | Low | High: cost and resource waste | Cycle detection; max transition count alarm | Max transitions limit; forced transition to Failed state |
| State model version mismatch (old workflow on new state model) | Low–Medium | High: transition not found | Version check on state load | State model versioning; migration path for in-flight workflows |
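Two of the recoveries above, the max-transitions limit and the version check on load, are simple enough to sketch directly. The limit of 50, the exception name, and both function names are assumptions chosen for illustration.

```python
MAX_TRANSITIONS = 50  # assumed limit; tune per workflow


class ModelVersionMismatch(Exception):
    """Raised when a workflow was persisted under a different state model."""


def guard_cycle(transition_count: int, proposed_state: str,
                limit: int = MAX_TRANSITIONS) -> str:
    """Force the workflow to Failed once it has used up its transition budget."""
    return "Failed" if transition_count >= limit else proposed_state


def check_model_version(persisted_version: str, engine_version: str) -> None:
    """Refuse to resume a workflow whose state model version differs."""
    if persisted_version != engine_version:
        raise ModelVersionMismatch(
            f"workflow uses model {persisted_version}, "
            f"engine runs {engine_version}; migration required"
        )
```

The engine would call `check_model_version` when loading a workflow and wrap every proposed transition in `guard_cycle`, so a looping workflow ends in a defined terminal state rather than consuming resources indefinitely.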
14. Regulatory Considerations
APRA CPS 230
- The state machine supplies evidence relevant to CPS 230 operational resilience expectations: workflows can be interrupted and resumed without data loss, and the state transition audit trail documents operational continuity.
ISO 42001
- §8.4: The state model specification is a documented AI system operational procedure; it must be version-controlled, tested, and subject to change management.
EU AI Act
- Art. 9 (Risk Management) and Art. 14 (Human Oversight): Explicit human intervention states act as a documented risk control under Art. 9 and support the human oversight requirement for high-risk AI systems, which the Act sets out in Art. 14.
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| State Machine Engine | AWS Step Functions Standard Workflows (durable, long-running) |
| State Persistence | Step Functions execution history (built-in) |
| Human Intervention | Step Functions waitForTaskToken pattern |
| Audit Trail | CloudWatch Events + S3 for execution history |
| Agentic Actions | Lambda functions per state action |
Azure
| Component | Service |
|---|---|
| State Machine Engine | Azure Durable Functions (stateful orchestration) |
| Human Intervention | Durable Functions human interaction pattern (external events) |
| State Persistence | Durable Functions built-in state storage |
| Audit Trail | Azure Monitor + Application Insights |
On-Premises
| Component | Technology |
|---|---|
| State Machine Engine | XState (JavaScript); python-statemachine; Temporal workflows |
| State Persistence | PostgreSQL (append-only workflow_states table) |
| Agentic Actions | LangGraph nodes within each state |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Dynamic Sub-agent Spawning | EAAPL-WRK009 | Integrates With | State machine governs spawn lifecycle states |
| Long-Running Agent | EAAPL-AGT007 | Integrates With | Long-running agents use state machines for resumable execution |
| Workflow Tracing and Replay | EAAPL-WRK013 | Depends On | State transition audit is the primary input to workflow tracing |
| Conditional Routing | EAAPL-WRK011 | Peer | Conditional routing implements switch/if inside a state; state machine governs between states |
| Human Escalation | EAAPL-HITL001 | Integrates With | Awaiting states implement human escalation points |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Research Foundation | 5 | Finite automata theory is foundational CS; workflow FSM patterns decades-proven |
| Production Deployment | 4 | AWS Step Functions, Azure Durable Functions, Temporal all production-proven at scale |
| Framework Support | 5 | Step Functions, Durable Functions, Temporal, XState, LangGraph all implement FSM |
| LLM Integration | 3 | Integrating LLM agentic actions as FSM state entry actions is newer; patterns maturing |
| Audit Tooling | 4 | Step Functions + Durable Functions provide built-in execution history |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |