[EAAPL-WRK012] Workflow State Machine

Category: Agentic Workflows
Sub-category: Deterministic Execution Architecture
Version: 1.0
Maturity: Proven
Tags: state-machine, FSM, workflow-states, deterministic, resumable, auditability
Regulatory Relevance: APRA CPS 230 (operational resilience), ISO 42001 §8.4, EU AI Act (Art. 9)

1. Executive Summary

The Workflow State Machine Pattern defines an explicit finite-state machine (FSM) that governs all agentic workflow transitions, making the execution deterministic, safely interruptible, and resumable from any persisted state. Rather than relying on an LLM to decide what to do next at each step, the state machine provides an explicit, auditable model of workflow states and the transitions between them — an AI agent operates within the state machine's boundaries, not outside them. This is the primary architectural pattern for achieving operational resilience in regulated agentic workflows.

For CIO/CTO audiences: without a state machine, an agentic workflow is a black box — you know the input and the output, but the in-between is an opaque LLM reasoning trace that may be different every run. With a state machine, every workflow has an explicit map of all possible states (Initialised, DataCollected, Assessed, AwaitingApproval, Approved, Completed, Failed) and the events that trigger transitions between them. This means: if the workflow is interrupted mid-execution (server restart, timeout, user cancellation), you know exactly which state it was in and can resume from there. You can audit the exact sequence of state transitions taken by any workflow execution. You can test every transition independently. This is the difference between an AI system you can deploy in a regulated environment and one you cannot.

2. Problem Statement

Business Problem

Regulated enterprise workflows require predictability, recoverability, and auditability that an unconstrained agentic reasoning loop cannot provide. A workflow that processes a loan application, a regulatory determination, or a clinical pathway must be able to: (a) resume from where it stopped if interrupted, (b) provide a complete audit trail of every decision point, (c) prevent the same step from being executed twice on resume (idempotency), and (d) support human intervention at any state.

Technical Problem

An LLM reasoning loop produces non-deterministic execution paths: given the same inputs and state, two runs may take different paths and produce different intermediate decisions. Without explicit state management, interrupted workflows cannot safely resume — there is no ground truth about what was completed before the interruption. Re-running from scratch is wasteful and may produce inconsistent results.

Symptoms of Absence

Interrupted workflows cannot resume; must restart from the beginning
No deterministic mapping of which processing steps have been completed
Human intervention has no defined entry points; must wait for workflow completion or cancellation
Audit trail shows only start and end; intermediate decision points are not recorded

Cost of Inaction

Resilience: Workflow failures require full restart; compute and time cost is wasted; results may be inconsistent
Compliance: Regulated workflows without state transition audit trails cannot satisfy audit requirements
Human Oversight: No defined intervention points means human oversight is post-hoc, not in-process

3. Context

When to Apply

Workflows must be safely resumable after interruption
Regulated workflows require a complete audit trail of decision points
Human intervention is required at specific points in the workflow
Workflows have complex, multi-step execution paths with conditional transitions
Idempotency must be guaranteed (a resumed workflow must not execute the same step twice)

When NOT to Apply

Short, single-step workflows where state machine overhead is not justified
Exploratory or research tasks where the execution path is inherently unpredictable
Tasks where the overhead of state machine design outweighs the benefit (low-stakes, non-regulated)

Prerequisites

Defined state inventory (all possible workflow states)
Defined transition events (all triggers that cause state changes)
Persistent state store (state must survive process restarts)
State machine engine (custom or framework-provided)
Idempotency key per workflow execution

Industry Applicability

Industry	Workflow State Machine Use Case	Critical States
Financial Services	Loan approval workflow	Applied → Assessed → Reviewed → Decided → Disbursed
Legal	Matter lifecycle	Opened → InProgress → ReviewPending → Approved → Closed
Healthcare	Clinical pathway	Registered → Assessed → TreatmentPlanned → InTreatment → Discharged
Government	Permit application	Submitted → UnderReview → AdditionalInfoRequired → Decided → Issued
Insurance	Claims lifecycle	Lodged → Assessed → Investigation → Settled → Closed

4. Architecture Overview

The Workflow State Machine architecture separates the state model (what states exist and how to transition between them) from the agentic logic that executes within each state.

State Model Definition

The state model is defined at design time as a formal specification: a set of named states, a set of named events, a transition table (current_state × event → next_state), a set of entry actions (executed when entering a state), and a set of exit actions (executed when leaving a state). The state model is version-controlled and tested independently of the agentic logic that executes within states.
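A minimal sketch of such a specification in Python, using the states and events from the architecture diagram in Section 5. The names `State`, `TRANSITIONS`, `TERMINAL_STATES`, `next_state`, and `InvalidTransition` are illustrative assumptions, not part of any framework, and entry/exit actions are omitted for brevity.

```python
from __future__ import annotations

from enum import Enum


class State(str, Enum):
    INITIALISED = "Initialised"
    DATA_COLLECTED = "DataCollected"
    UNDER_ASSESSMENT = "UnderAssessment"
    AWAITING_APPROVAL = "AwaitingApproval"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Transition table: (current_state, event) -> next_state.
TRANSITIONS: dict[tuple[State, str], State] = {
    (State.INITIALISED, "data_received"): State.DATA_COLLECTED,
    (State.DATA_COLLECTED, "data_validated"): State.UNDER_ASSESSMENT,
    (State.UNDER_ASSESSMENT, "assessment_complete_HIGH_RISK"): State.AWAITING_APPROVAL,
    (State.UNDER_ASSESSMENT, "assessment_complete_LOW_RISK"): State.APPROVED,
    (State.UNDER_ASSESSMENT, "error"): State.FAILED,
    (State.AWAITING_APPROVAL, "human_approved"): State.APPROVED,
    (State.AWAITING_APPROVAL, "human_rejected"): State.REJECTED,
    (State.APPROVED, "processing_complete"): State.COMPLETED,
}

# Assumption: states with no outgoing edge in the diagram are terminal.
TERMINAL_STATES = {State.REJECTED, State.COMPLETED, State.FAILED}


class InvalidTransition(Exception):
    """Raised when an event is not defined for the current state."""


def next_state(current: State, event: str) -> State:
    """Look up the transition table; anything not listed is rejected."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(
            f"{event!r} is not valid in state {current.value}"
        ) from None
```

Because the table is plain data, it can be diffed in version control and unit-tested without invoking any LLM.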

State Persistence

The current state of every active workflow execution is persisted in a durable store, not held in memory. Every state transition is written durably before the action it triggers is executed. If the process restarts after the transition is written but before the action completes, the state machine therefore knows which state was reached and can run the appropriate recovery action on resume.
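The write-before-act ordering can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions: the `workflow_states` schema and the helper names are hypothetical, and an in-memory database stands in for a durable store such as PostgreSQL.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open the state store; use a file path or a real database in production."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS workflow_states (
               seq INTEGER PRIMARY KEY AUTOINCREMENT,
               workflow_id TEXT NOT NULL,
               state TEXT NOT NULL,
               event TEXT NOT NULL,
               ts TEXT NOT NULL)"""
    )
    return conn


def persist_transition(conn, workflow_id, state, event):
    """Append one row per transition; rows are never updated in place."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO workflow_states (workflow_id, state, event, ts) "
            "VALUES (?, ?, ?, ?)",
            (workflow_id, state, event, datetime.now(timezone.utc).isoformat()),
        )


def current_state(conn, workflow_id):
    """Return the most recently persisted state, or None if unknown."""
    row = conn.execute(
        "SELECT state FROM workflow_states WHERE workflow_id = ? "
        "ORDER BY seq DESC LIMIT 1",
        (workflow_id,),
    ).fetchone()
    return row[0] if row else None


def transition(conn, workflow_id, new_state, event, entry_action):
    persist_transition(conn, workflow_id, new_state, event)  # 1. durable write
    entry_action(workflow_id)                                # 2. then the action
```

If `entry_action` raises or the process dies during step 2, `current_state` still returns the new state, which is what makes a later resume possible.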

Agentic Logic Within States

Within each state, an agentic component (LLM, chain, tool call) executes the state's action. The agentic component is free to use non-deterministic reasoning; the state machine imposes determinism at the state-transition level, not within the state's action. The state's action completes by emitting a transition event to the state machine, which persists the new state and triggers the next state's entry action.
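A minimal sketch of this boundary: the action may reason however it likes, but the engine accepts only an event that the transition table defines for the current state. `assess_risk` is a deterministic stand-in for an LLM call, and `HIGH_RISK_THRESHOLD` is a hypothetical cut-off chosen only so that the score of 78 from the data flow example counts as high risk.

```python
from typing import Callable, Dict, Tuple

# Transitions out of UnderAssessment, taken from the architecture diagram.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("UnderAssessment", "assessment_complete_HIGH_RISK"): "AwaitingApproval",
    ("UnderAssessment", "assessment_complete_LOW_RISK"): "Approved",
    ("UnderAssessment", "error"): "Failed",
}

HIGH_RISK_THRESHOLD = 70  # hypothetical value, not from the pattern


def assess_risk(case: dict) -> str:
    """Stand-in for an LLM-backed action; returns the event to emit."""
    score = case["risk_score"]  # in practice produced by the model
    if score >= HIGH_RISK_THRESHOLD:
        return "assessment_complete_HIGH_RISK"
    return "assessment_complete_LOW_RISK"


def step(state: str, case: dict, action: Callable[[dict], str]) -> Tuple[str, str]:
    """Run the state's action, then validate the emitted event."""
    event = action(case)
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"action emitted undefined event {event!r} in state {state}")
    return event, TRANSITIONS[(state, event)]
```

An action that tries to emit `human_approved` from `UnderAssessment` is rejected by `step`, so a misbehaving agent cannot skip the approval state.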

Human Intervention Points

States that require human review are defined as "awaiting" states: the workflow enters the state, an external notification is sent (email, task queue, Slack), and the workflow waits for a human-provided event (Approved, Rejected, RequestMoreInfo) to trigger the transition. This is the formal implementation of human-in-the-loop at specific, pre-defined decision points.
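An awaiting state can be sketched as a state that only human-provided events can leave. The `Workflow` class and function names below are hypothetical; the event names follow the architecture diagram (`human_approved`, `human_rejected`), and a request-more-information path is left out because the pattern does not name its target state.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Human-provided events and the states they lead to.
HUMAN_TRANSITIONS = {
    "human_approved": "Approved",
    "human_rejected": "Rejected",
}


@dataclass
class Workflow:
    workflow_id: str
    state: str
    # (from_state, to_state, event, actor) per human decision
    history: List[Tuple[str, str, str, str]] = field(default_factory=list)


def enter_awaiting_approval(wf: Workflow, notify: Callable[[str], None]) -> None:
    """Park the workflow and notify a reviewer (email, task queue, Slack)."""
    wf.state = "AwaitingApproval"
    notify(wf.workflow_id)


def submit_human_decision(wf: Workflow, event: str, actor: str) -> str:
    """Apply a reviewer's decision; reject anything outside the awaiting state."""
    if wf.state != "AwaitingApproval":
        raise RuntimeError(f"workflow {wf.workflow_id} is not awaiting approval")
    if event not in HUMAN_TRANSITIONS:
        raise ValueError(f"unknown human event {event!r}")
    new_state = HUMAN_TRANSITIONS[event]
    wf.history.append((wf.state, new_state, event, actor))
    wf.state = new_state
    return new_state
```

Recording the `actor` alongside the event is what lets the audit trail distinguish human decisions from agent-emitted transitions.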

Resume from State

On workflow resume (after interruption), the state machine loads the persisted current state and executes the entry action for that state. If the interruption occurred mid-action (entry action partially complete), the entry action must be idempotent: re-executing it produces the same result as executing it for the first time. Entry actions are written to be idempotent by default (check-then-act; ON CONFLICT DO NOTHING pattern for database writes).
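A minimal sketch of an idempotent entry action that combines check-then-act with an `ON CONFLICT DO NOTHING` write, using `sqlite3` as a stand-in store. The `action_results` table, the key format, and the function names are assumptions for illustration; the upsert syntax needs SQLite 3.24 or later.

```python
import sqlite3


def open_results_store():
    conn = sqlite3.connect(":memory:")  # stand-in for a durable database
    conn.execute(
        """CREATE TABLE action_results (
               idempotency_key TEXT PRIMARY KEY,
               result TEXT NOT NULL)"""
    )
    return conn


def run_entry_action(conn, workflow_id, state, action):
    """Run `action` at most once per (workflow, state); replay returns the stored result."""
    key = f"{workflow_id}:{state}"  # hypothetical idempotency key format

    # Check: has this entry action already completed?
    row = conn.execute(
        "SELECT result FROM action_results WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row:
        return row[0]

    # Act: do the work, then record it. If another worker recorded a result
    # first, the insert is a no-op and the stored result wins.
    result = action(workflow_id)
    with conn:
        conn.execute(
            "INSERT INTO action_results (idempotency_key, result) VALUES (?, ?) "
            "ON CONFLICT(idempotency_key) DO NOTHING",
            (key, result),
        )
    return conn.execute(
        "SELECT result FROM action_results WHERE idempotency_key = ?", (key,)
    ).fetchone()[0]
```

This guards the recorded result only. Side effects inside `action` itself (payments, emails) still need their own idempotency keys at the downstream system.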

State Transition Audit

Every state transition is an immutable audit record: from_state, to_state, triggering_event, timestamp, actor (human or agent), and any decision metadata produced by the agentic action. The audit trail is a sequence of state transition records, not a raw LLM reasoning trace.
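The record shape described above could look like the following sketch. The field names come from the text; `TransitionRecord`, `record_transition`, and the JSON-lines log are assumptions. Note that `frozen=True` only blocks attribute reassignment in Python; real immutability has to be enforced by the append-only store.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TransitionRecord:
    workflow_id: str
    from_state: str
    to_state: str
    triggering_event: str
    actor: str  # "agent" or a human identifier
    timestamp: str
    decision_metadata: dict = field(default_factory=dict)


def record_transition(log: list, **fields) -> TransitionRecord:
    """Build a record, stamp it in UTC, and append it as one JSON line."""
    rec = TransitionRecord(
        timestamp=datetime.now(timezone.utc).isoformat(), **fields
    )
    log.append(json.dumps(asdict(rec), sort_keys=True))
    return rec
```

For example, `record_transition(log, workflow_id="loan-42", from_state="DataCollected", to_state="UnderAssessment", triggering_event="data_validated", actor="agent")` appends a single serialised line to `log`.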

5. Architecture Diagram

flowchart TD
    subgraph StateMachine["State Machine Engine"]
        A[Initialised]
        B[DataCollected]
        C[UnderAssessment]
        D[AwaitingApproval]
        E[Approved]
        F[Rejected]
        G[Completed]
        H[Failed]
    end
    subgraph Actions["State Entry Actions"]
        I[Data Extraction]
        J[Risk Assessment]
        K[Human Notification]
        L[Fulfilment Action]
    end
    subgraph Persistence["State Persistence"]
        M[(State Store)]
    end
    A -->|data_received| B
    B -->|data_validated| C
    C -->|assessment_complete HIGH_RISK| D
    C -->|assessment_complete LOW_RISK| E
    D -->|human_approved| E
    D -->|human_rejected| F
    E -->|processing_complete| G
    C -->|error| H
    B --> I
    C --> J
    D --> K
    E --> L
    A & B & C & D & E & F & G & H --> M

6. Components

Component	Type	Responsibility	Technology Options	Criticality
State Model Definition	Configuration	Formal spec of states, events, transitions, actions	JSON/YAML spec; XState; custom DSL	Critical
State Machine Engine	Orchestration	Evaluates transitions; persists state; dispatches entry/exit actions	XState; AWS Step Functions; Azure Durable Functions; Temporal; LangGraph	Critical
State Store	Persistence	Durable storage of current and historical states per workflow instance	PostgreSQL; DynamoDB; Cosmos DB (append-only)	Critical
State Transition Audit Writer	Governance	Writes immutable audit record per transition	Append-only table; Amazon EventBridge (formerly CloudWatch Events); event stream	Critical
Agentic Action Executor	AI Component	Executes the agentic logic for each state (LLM, chain, tool calls)	Any agentic workflow pattern	Critical
Human Notification Service	Integration	Notifies humans of awaiting states; receives human decisions	Email; Slack; task queue; workflow portal	High
Resume Controller	Resilience	Loads persisted state on restart; re-dispatches entry action	Built into state machine engine	Critical
Idempotency Guard	Resilience	Ensures entry actions are safe to re-execute	Check-then-act; ON CONFLICT DO NOTHING	Critical

7. Data Flow

Step	Actor	Action	Output
1	External System	Triggers workflow: loan application received	Initial event: `application_received`
2	State Machine Engine	Transitions Initialised → DataCollected; persists state	State written: `{state: "DataCollected", event: "application_received", ts: ...}`
3	Agentic Action (Data Extraction)	LLM extracts structured data from application	`{applicant, income, requested_amount, purpose}`
4	State Machine Engine	Agentic action emits `data_validated`; transitions DataCollected → UnderAssessment	State written; audit record: `{from: "DataCollected", to: "UnderAssessment", event: "data_validated"}`
5	Agentic Action (Risk Assessment)	LLM assesses credit risk; produces score 78 (high risk)	`{risk_score: 78, risk_category: "high", recommendation: "manual_review"}`
6	State Machine Engine	Emits `assessment_complete_HIGH_RISK`; transitions → AwaitingApproval	State written; audit record
7	Human Notification	Sends approval request to credit officer	Email with case summary
8	Credit Officer	Approves via portal	Human event: `human_approved`
9	State Machine Engine	Transitions AwaitingApproval → Approved; triggers fulfilment action	State written; audit record
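10	Audit System	Retrieves full transition history	Complete state audit trail: the 4 transitions recorded in steps 2–9 (5 once fulfilment emits `processing_complete` and the workflow reaches Completed), with timestamps and actors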

Error Flow

Error	Detection	Recovery
Agentic action fails	Entry action throws exception	State machine transitions to Failed state; error logged; support ticket created
Process restart mid-action	State machine loads persisted state on restart	Re-executes entry action (must be idempotent); continues from current state
Human does not respond (timeout)	Awaiting state timeout	State machine emits `approval_timeout` event; transitions to escalation path
Transition to invalid state	State model validation	State model validator rejects invalid transitions before they execute
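Two of the recovery rows above can be sketched as small helpers: converting an action failure into an `error` event, and emitting `approval_timeout` when an awaiting state exceeds its SLA. The function names are hypothetical, and the 4-hour default mirrors the "e.g., 4h for approvals" figure in the SLO table rather than a prescribed value. The target state of the escalation path is left to the state model.

```python
import logging
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

log = logging.getLogger(__name__)

APPROVAL_SLA = timedelta(hours=4)  # example value, set per business SLA


def run_entry_action(action: Callable[[], str]) -> str:
    """Run an entry action; any exception becomes an 'error' event."""
    try:
        return action()
    except Exception:
        log.exception("entry action failed; emitting 'error'")
        return "error"


def check_awaiting_timeout(
    entered_at: datetime, now: datetime, sla: timedelta = APPROVAL_SLA
) -> Optional[str]:
    """Return 'approval_timeout' once the awaiting state has exceeded its SLA."""
    if now - entered_at > sla:
        return "approval_timeout"
    return None
```

A scheduler would call `check_awaiting_timeout` periodically for each workflow in an awaiting state and feed any returned event back into the state machine like any other event.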

8. Security Considerations

State Tampering

Direct modification of persisted state (bypassing the state machine engine) could put a workflow in an invalid or fraudulent state
Mitigation: State store is append-only; no direct update operations; state transitions go through the engine only; state integrity check on load (hash of prior state chain)
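The "hash of prior state chain" check can be sketched as a simple hash chain over transition records, using the standard-library `hashlib`. The record layout, `GENESIS` value, and function names are assumptions for illustration, not a prescribed format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first record


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus the canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_record(chain: list, payload: dict) -> None:
    """Append a transition record linked to the hash of the previous one."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "hash": _digest(prev, payload)}
    )


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Running `verify_chain` on load detects in-place edits to history, though an attacker who can rewrite the whole chain can also recompute it, so the latest hash should be anchored somewhere the attacker cannot write.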

OWASP LLM Top 10

OWASP LLM Risk	State Machine Applicability	Mitigation
LLM08 Excessive Agency	Agentic actions within states are bounded by state machine transitions	Agent can only emit defined transition events; cannot jump to arbitrary states
LLM09 Overreliance	Automated state transitions for regulated decisions	Human intervention states are mandatory for high-risk decision points
LLM01 Prompt Injection	Input to agentic actions within states	Input sanitisation before agentic action; state machine constraints limit action scope

9. Governance Considerations

State Model as Regulatory Artefact

For regulated workflows, the state model (states, transitions, human intervention points) is a regulatory specification artefact; it documents the decision process and must be approved by compliance teams before deployment
Changes to the state model (adding states, changing transitions, removing human intervention points) require compliance review

Governance Artefacts

Artefact	Owner	Frequency	Purpose
State Model Specification	AI Governance + Business Process Owner	On change; quarterly review	Formal definition of all states, transitions, and human intervention points
State Transition Audit Archive	Compliance	Per workflow execution; retained per policy	Immutable record of every state transition for regulatory audit
State Coverage Test Report	QA	On deployment	Confirms every state and every transition is exercised by test suite
Awaiting State SLA Report	Operations	Weekly	Tracks time spent in human-awaiting states; identifies approval bottlenecks
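A State Coverage Test Report could be generated from the transition table itself. The sketch below assumes the table from the architecture diagram and a hypothetical set of transitions exercised by the test suite; it checks that every state is reachable and lists transitions no test has taken.

```python
from collections import deque

TRANSITIONS = {
    ("Initialised", "data_received"): "DataCollected",
    ("DataCollected", "data_validated"): "UnderAssessment",
    ("UnderAssessment", "assessment_complete_HIGH_RISK"): "AwaitingApproval",
    ("UnderAssessment", "assessment_complete_LOW_RISK"): "Approved",
    ("UnderAssessment", "error"): "Failed",
    ("AwaitingApproval", "human_approved"): "Approved",
    ("AwaitingApproval", "human_rejected"): "Rejected",
    ("Approved", "processing_complete"): "Completed",
}


def reachable_states(initial, transitions):
    """Breadth-first search over the transition graph from the initial state."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        for (src, _event), dst in transitions.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def uncovered_transitions(transitions, exercised):
    """Transitions defined in the model that no test has taken."""
    return sorted(set(transitions) - set(exercised))
```

A deployment gate could fail the build whenever `uncovered_transitions` is non-empty or a declared state is missing from `reachable_states`.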

10. Operational Considerations

SLOs

SLO	Target	Window	Alert
Workflow completion rate (reach terminal state)	≥ 99%	24-hour rolling	< 97% triggers P2
Failed state rate	≤ 0.5%	24-hour rolling	> 1% triggers P2
Awaiting state median duration	≤ business SLA (e.g., 4h for approvals)	Business hours	> 2× SLA triggers P3
Resume success rate (interrupted workflows)	≥ 99.9%	Weekly	Any resume failure triggers P1
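The alert thresholds in the table can be expressed as a small check. This is a sketch only: it assumes the counts have already been aggregated over the relevant window, and the function name, input shape, and alert tuples are hypothetical. The awaiting-state duration SLO is omitted because its target depends on the business SLA.

```python
def evaluate_slos(total, completed, failed, resume_failures=0):
    """Return (severity, message) alerts for the thresholds in the SLO table."""
    alerts = []
    if resume_failures > 0:
        # Any resume failure triggers P1.
        alerts.append(("P1", f"{resume_failures} resume failure(s)"))
    if total == 0:
        return alerts  # no workflows in the window, so no rates to evaluate
    completion_rate = completed / total
    failed_rate = failed / total
    if completion_rate < 0.97:
        alerts.append(("P2", f"completion rate {completion_rate:.1%} below 97%"))
    if failed_rate > 0.01:
        alerts.append(("P2", f"failed state rate {failed_rate:.1%} above 1%"))
    return alerts
```

The gap between target (99%, 0.5%) and alert threshold (97%, 1%) is kept as in the table, so a missed target does not page anyone until it crosses the alert line.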

11. Cost Considerations

State Machine Component	Cost Driver	Optimisation
State persistence	Storage writes per transition	Minimal cost; do not optimise away — persistence is the core value
Agentic actions per state	LLM inference per state	State design should consolidate expensive agentic work into minimum required states
State machine engine	Compute per evaluation	Negligible vs. LLM inference cost
Awaiting states (human holds)	Time cost; opportunity cost	SLA enforcement; escalation workflows to reduce dwell time

12. Trade-Off Analysis

Option	Determinism	Resilience	Flexibility	Complexity	Best For
A: Explicit FSM with durable persistence (Recommended)	Very High	Very High	Medium	High	Regulated, resumable workflows
B: Implicit state (in-memory only)	High	Low	High	Low	Development; non-critical short tasks
C: ReAct loop (EAAPL-WRK001)	Low	Low	Very High	Medium	Exploratory reasoning; not for regulated flows
D: Sequential chain (EAAPL-WRK002)	High	Medium	Low	Low	Fixed, non-resumable workflows

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
State store unavailable	Low	Critical — no state persistence; workflows cannot proceed safely	Health check; store availability alarm	Queue transitions locally; replay on restore; circuit breaker to halt new workflows
Non-idempotent entry action executed twice on resume	Medium	High — duplicate processing, data corruption	Idempotency guard detects duplicate	ON CONFLICT DO NOTHING; idempotency key per entry action
State machine cycle (workflow never reaches terminal state)	Low	High — cost and resource waste	Cycle detection; max transition count alarm	Max transitions limit; forced transition to Failed state
State model version mismatch (old workflow on new state model)	Low–Medium	High — transition not found	Version check on state load	State model versioning; migration path for in-flight workflows
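Two of the recoveries above, the max-transitions limit and the version check on load, can be sketched as follows. `MAX_TRANSITIONS`, the function names, and the event names in `REVIEW_LOOP` are hypothetical; the two states in the loop are taken from the permit application row in the industry table.

```python
MAX_TRANSITIONS = 50  # hypothetical limit; tune per workflow


class ModelVersionMismatch(Exception):
    """The persisted workflow was created under a different state model."""


def check_model_version(persisted_version: str, engine_version: str) -> None:
    """Refuse to resume a workflow whose state model version differs."""
    if persisted_version != engine_version:
        raise ModelVersionMismatch(
            f"workflow uses model {persisted_version}, engine runs {engine_version}"
        )


def run_with_cycle_guard(state, events, transitions, max_transitions=MAX_TRANSITIONS):
    """Apply events in order; force Failed once the transition budget is spent."""
    count = 0
    for event in events:
        if count >= max_transitions:
            return "Failed", count
        state = transitions[(state, event)]
        count += 1
    return state, count


# A loop that is legitimate a few times but must not run forever.
REVIEW_LOOP = {
    ("UnderReview", "info_requested"): "AdditionalInfoRequired",
    ("AdditionalInfoRequired", "info_supplied"): "UnderReview",
}
```

In practice a version mismatch would route the workflow to a migration path rather than simply raising, but the check itself belongs at load time, before any transition is evaluated.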

14. Regulatory Considerations

APRA CPS 230

The state machine provides the operational resilience evidence APRA requires: workflows can be interrupted and resumed without data loss; the state transition audit trail proves operational continuity.

ISO 42001

§8.4: The state model specification is a documented AI system operational procedure; it must be version-controlled, tested, and subject to change management.

EU AI Act

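Art. 9 (Risk Management) and Art. 14 (Human Oversight): The explicit state model documents where risk controls apply in the workflow (Art. 9), and the explicit human intervention states support the human oversight requirement for high-risk AI systems (Art. 14).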

15. Reference Implementations

AWS

Component	Service
State Machine Engine	AWS Step Functions Standard Workflows (durable, long-running)
State Persistence	Step Functions execution history (built-in)
Human Intervention	Step Functions waitForTaskToken pattern
Audit Trail	Amazon EventBridge (formerly CloudWatch Events) + S3 for execution history
Agentic Actions	Lambda functions per state action

Azure

Component	Service
State Machine Engine	Azure Durable Functions (stateful orchestration)
Human Intervention	Durable Functions human interaction pattern (external events)
State Persistence	Durable Functions built-in state storage
Audit Trail	Azure Monitor + Application Insights

On-Premises

Component	Technology
State Machine Engine	XState (JavaScript); python-statemachine; Temporal workflows
State Persistence	PostgreSQL (append-only workflow_states table)
Agentic Actions	LangGraph nodes within each state

16. Related Patterns

Pattern	ID	Relationship Type	Notes
Dynamic Sub-agent Spawning	EAAPL-WRK009	Integrates With	State machine governs spawn lifecycle states
Long-Running Agent	EAAPL-AGT007	Integrates With	Long-running agents use state machines for resumable execution
Workflow Tracing and Replay	EAAPL-WRK013	Depends On	State transition audit is the primary input to workflow tracing
Conditional Routing	EAAPL-WRK011	Peer	Conditional routing implements switch/if inside a state; state machine governs between states
Human Escalation	EAAPL-HITL001	Integrates With	Awaiting states implement human escalation points

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Evidence
Research Foundation	5	Finite automata theory is foundational CS; workflow FSM patterns decades-proven
Production Deployment	4	AWS Step Functions, Azure Durable Functions, Temporal all production-proven at scale
Framework Support	5	Step Functions, Durable Functions, Temporal, XState, LangGraph all implement FSM
LLM Integration	3	Integrating LLM agentic actions as FSM state entry actions is newer; patterns maturing
Audit Tooling	4	Step Functions + Durable Functions provide built-in execution history

18. Revision History

Version	Date	Author	Changes
1.0	2025-06-13	Architecture Board	Initial publication in Agentic Workflows category
