[EAAPL-WRK014] CodeAct Agent

Category: Agentic Workflows
Sub-category: Code-as-Action Architecture
Version: 1.0
Maturity: Emerging
Tags: code-generation, code-execution, sandboxed-execution, python-executor, code-act, tool-synthesis
Regulatory Relevance: EU AI Act (Art. 9, Art. 15), ISO/IEC 42001 §8.4, APRA CPS 234

1. Executive Summary

The CodeAct Agent Pattern defines an architecture in which an agent's primary action mechanism is writing and executing code (typically Python) rather than invoking pre-registered structured tool functions. When the agent needs to perform a calculation, transform data, call an API, or interact with a system, it writes a code snippet, executes it in a sandboxed environment, and uses the execution result (output or error traceback) as its observation. This "code-as-action" paradigm is far more expressive than structured tool calls, because the action space is whatever the sandbox's language and libraries can express. The cost is a significantly higher security burden: executing arbitrary generated code demands robust sandboxing.

For CIO/CTO audiences: a conventional tool-using agent can only do what its pre-registered tools allow. A CodeAct agent can do almost anything that can be expressed in code (complex calculations, data transformations, novel API calls, statistical analyses) without a developer pre-registering a tool for each action. This flexibility is valuable for data science, analysis, and research workflows. The security implication is equally significant: executing arbitrary LLM-generated code requires enterprise-grade sandboxing, code scanning, and resource limits. The pattern is appropriate for trusted enterprise environments with strong sandbox infrastructure, and inappropriate for any multi-tenant or consumer-facing environment that lacks it.

2. Problem Statement

Business Problem

Enterprise data analysis, research, and automation tasks frequently require ad-hoc computational actions that cannot be anticipated and pre-registered as structured tools. A data scientist asking an agent to "calculate the correlation between these two datasets with outlier removal and visualise it" cannot rely on a pre-registered tool for every possible statistical analysis variant.

Technical Problem

Pre-registered tool sets have a fixed action space: the agent can only invoke tools that a developer has already implemented and registered. Complex computational tasks either require very large tool libraries (unmaintainable) or cannot be handled at all. Code generation and execution provides an effectively unlimited action space through the expressiveness of general-purpose programming languages.

Symptoms of Absence

Complex computational requests require developer effort to add new tools before the agent can handle them
Data transformation and analysis tasks are outside the agent's capability because they cannot be expressed as structured tool calls
Agent cannot handle novel data formats or API structures not covered by pre-registered tools

Cost of Inaction

Capability: Data analysis and computation tasks remain entirely manual rather than being partially automated
Maintainability: Pre-registered tool libraries grow without bound as new action types are required
Time-to-Value: Every new action type requires development, testing, and deployment of a new tool

3. Context

When to Apply

Tasks require complex data transformation, computation, or statistical analysis
The set of required actions cannot be enumerated upfront (dynamic action synthesis)
The execution environment is high-trust and single-tenant, with robust sandbox infrastructure
Users are sophisticated (data scientists, analysts, engineers) who understand that code is being executed

When NOT to Apply

Multi-tenant environments where users cannot be trusted with shared code execution resources
Consumer-facing applications without enterprise-grade sandboxing
Tasks for which pre-registered structured tools are sufficient (use Tool Call Orchestration, EAAPL-WRK006)
Environments where code execution cannot be safely sandboxed (network isolation unavailable)
Regulated environments where every agent action must be from a pre-approved, audited action set

Prerequisites

Secure sandbox environment (container-based or VM-based isolation)
Code scanning prior to execution (static analysis for high-risk patterns)
Resource limits (CPU, memory, network, disk, execution time)
No persistent state across sandbox executions (stateless execution environment)
Comprehensive execution logging

Industry Applicability

Industry	CodeAct Use Case	Code Produced
Financial Services	Quantitative analysis; portfolio calculations	Python Pandas/NumPy calculations
Technology	Automated test generation; code review + fix	Python/JavaScript test code; linting scripts
Research	Scientific data analysis	Python SciPy/Statsmodels analyses
Legal	Contract data extraction + transformation	Python regex/NLP extraction scripts
Healthcare	Clinical data processing and visualisation	Python statistical analysis; chart generation

4. Architecture Overview

The CodeAct architecture wraps the agent's reasoning loop with a code generation, scanning, and sandboxed execution layer.

Code Generation Phase

Within the ReAct-style reasoning loop (EAAPL-WRK001), when the agent decides it needs to take an action, it generates a Python (or JavaScript) code snippet rather than a structured tool call. The code snippet is the "action" in the thought-action-observation sequence. The code must be self-contained (no undefined variable references), executable (syntactically valid), and bounded (written to terminate and to avoid unbounded resource consumption; the sandbox limits enforce this regardless). The LLM is instructed to write code with explicit print statements or return values, because the execution result is only usable as an observation if it appears in the captured output.

Code Pre-Execution Scanning

Before execution, the generated code is passed through a static code scanner that detects high-risk patterns: (a) network access outside the allowed whitelist, (b) file system access outside the allowed sandbox paths, (c) process spawning (subprocess, os.system), (d) import of disallowed modules (anything that can circumvent sandboxing), (e) code patterns associated with resource exhaustion (obvious unbounded loops, very large memory allocations). Static analysis cannot catch every such case, so the sandbox limits remain the enforcing control. Code that matches a high-risk pattern is rejected, and the rejection reason is returned as an observation so the agent can reason about it and produce safer code.
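A minimal sketch of checks (c) and (d), using Python's standard `ast` module. The module and call lists are hypothetical placeholders, and `scan_code` is an illustrative name rather than part of any scanner named in this pattern. A check this simple can be bypassed (for example through `getattr` indirection), so it is only a first filter in front of the sandbox.

```python
import ast

# Hypothetical policy for illustration; a real deployment maintains its own lists.
DISALLOWED_MODULES = {"subprocess", "socket", "ctypes", "multiprocessing"}
DISALLOWED_CALLS = {"os.system", "os.popen", "eval", "exec", "__import__"}


def _dotted_name(node: ast.AST) -> str:
    """Return 'os.system' for a Name/Attribute chain, or '' for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def scan_code(source: str) -> list[str]:
    """Return a list of findings; an empty list means the snippet is approved."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    findings = []
    for node in ast.walk(tree):
        # Collect module names from both `import x` and `from x import y`.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] in DISALLOWED_MODULES:
                findings.append(f"disallowed import: {module}")
        # Flag direct calls to names on the deny list.
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in DISALLOWED_CALLS:
                findings.append(f"disallowed call: {name}")
    return findings
```

The findings list doubles as the rejection reason that is returned to the agent as an observation.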

Sandboxed Execution

Approved code is executed in an isolated sandbox environment: (a) a separate container with no network access except to whitelisted endpoints, (b) CPU time limit (default: 30s), (c) memory limit (default: 512MB), (d) read-only file system except for a designated scratch directory, (e) no ability to spawn child processes. Execution output (stdout, stderr, return value) is captured and returned. Runtime errors (tracebacks) are also captured and returned as observations, so the agent can read the traceback and fix the code.
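The sketch below illustrates only the limit-and-capture part of this step: a child interpreter with a timeout, a best-effort memory cap, and a throwaway scratch directory. It is not a sandbox, since a child process still shares the host kernel, file system, and network; isolation has to come from the container or microVM around it. `run_snippet` and the result dictionary shape are assumptions, and the timeout here is wall-clock time rather than CPU time.

```python
import subprocess
import sys
import tempfile

try:
    import resource  # Unix-only; used for the memory cap
except ImportError:  # e.g. Windows
    resource = None

DEFAULT_TIMEOUT_S = 30                # default time limit from the text
DEFAULT_MEMORY_BYTES = 512 * 1024**2  # default memory limit from the text


def _limit_memory(max_bytes: int):
    """Build a pre-exec hook that caps the child's address space (best effort)."""
    def apply():
        try:
            resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
        except (ValueError, OSError):
            pass  # some platforms reject this limit; the container must enforce it
    return apply


def run_snippet(code, timeout_s=DEFAULT_TIMEOUT_S, memory_bytes=DEFAULT_MEMORY_BYTES):
    """Run code in a child interpreter and return status, stdout, and stderr."""
    with tempfile.TemporaryDirectory() as scratch:  # fresh scratch dir per run
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=scratch,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                preexec_fn=_limit_memory(memory_bytes) if resource else None,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "success" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

Creating a new scratch directory per call and discarding it afterwards mirrors the stateless-execution prerequisite: nothing written by one snippet is visible to the next.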

Execution Result as Observation

The execution result (stdout output, return value, or error traceback) is injected as an observation into the agent's scratchpad. This follows the standard ReAct observation pattern: the agent's next thought is informed by the execution result. If the code errored, the agent reads the traceback and generates corrected code. Successful execution produces data the agent can reason about to inform its next action or synthesise a final answer.
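One way to turn an execution result into observation text is sketched below. The result dictionary (`status`, `stdout`, `stderr`, `reason`) is a hypothetical shape chosen for illustration, and keeping only the last five traceback lines is an assumption intended to save context while preserving the exception name and message.

```python
def to_observation(result: dict) -> str:
    """Format an execution result as observation text for the agent's scratchpad."""
    status = result["status"]
    if status == "success":
        return f"Observation: {result['stdout'].strip() or '(no output)'}"
    if status == "rejected":
        return f"Observation: Code rejected: {result['reason']}"
    if status == "timeout":
        return "Observation: Execution timed out."
    # Runtime or syntax error: keep the tail of the traceback, which names the exception.
    tail = "\n".join(result["stderr"].strip().splitlines()[-5:])
    return f"Observation: Execution failed.\n{tail}"
```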

Tool Synthesis vs. Tool Call

An important distinction: CodeAct does not eliminate the ability to call pre-registered tools. In many implementations, the code execution environment has access to a limited set of pre-installed libraries (pandas, numpy, requests to whitelisted endpoints) and can call these as Python libraries. This "tool synthesis" blends the flexibility of CodeAct with the safety properties of pre-approved libraries.

5. Architecture Diagram


flowchart TD
    subgraph AgentLoop["ReAct Agent Loop"]
        A[Reason]
        B{Action Type?}
    end
    subgraph CodeAct["CodeAct Execution Pipeline"]
        C[Code Generator]
        D[Static Code Scanner]
        E{Scan Result?}
        F[Sandbox Executor]
        G[Output Capture]
    end
    subgraph Sandbox["Execution Sandbox"]
        H[Python Runtime]
        I[Whitelisted Libs]
        J[No Network / Read-only FS]
    end
    subgraph Observation["Observation Injection"]
        K[Execution Result]
        L[Rejection Observation]
        M[Error Observation]
    end
    A --> B
    B -->|code action| C
    B -->|tool call| N[Tool Registry]
    C --> D
    D --> E
    E -->|high risk| L
    E -->|approved| F
    F --> H
    H --> I & J
    H --> G
    G -->|success| K
    G -->|error| M
    K & M & L --> A

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Code Generator	AI Component	Generates executable Python/JS from agent reasoning	Any capable LLM (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro)	Critical
Static Code Scanner	Security	Detects high-risk patterns before execution	Bandit (Python); custom AST scanner; semgrep rules	Critical
Sandbox Executor	Infrastructure	Executes code in isolated container; enforces resource limits	E2B; Modal; Docker + gVisor; AWS Lambda; Dagger	Critical
Output Capturer	Logic	Captures stdout, stderr, return value, traceback	Custom subprocess wrapper; E2B output API	Critical
Code Execution Audit Logger	Governance	Logs every code snippet (pre- and post-scan) and execution result	Append-only log; S3; PostgreSQL	Critical
Iteration Controller	Safety	Limits code-generate-execute cycles per task	Counter; configurable max	High
Whitelisted Library Manager	Security	Manages approved Python libraries in sandbox environment	Conda environment; requirements.txt pinned versions	High
Result Size Limiter	Safety	Truncates execution output exceeding context budget	Custom truncation wrapper	High
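The Iteration Controller row can be read as a bounded loop around the generate and execute steps. The sketch below assumes two hypothetical callables: `generate`, which returns the next snippet or `None` once the agent is ready to answer, and `execute`, which returns observation text. The default cap of 5 is a placeholder, since the pattern only specifies a configurable maximum.

```python
from typing import Callable, List, Optional

MAX_ITERATIONS = 5  # hypothetical default; the pattern only says "configurable max"


def run_task(
    generate: Callable[[List[str]], Optional[str]],
    execute: Callable[[str], str],
    max_iterations: int = MAX_ITERATIONS,
) -> dict:
    """Alternate generate/execute until generate returns None or the cap is hit."""
    scratchpad: List[str] = []
    for iteration in range(1, max_iterations + 1):
        code = generate(scratchpad)
        if code is None:  # the agent has decided it can answer
            return {"status": "finished", "iterations": iteration - 1, "scratchpad": scratchpad}
        scratchpad.append(execute(code))
    return {"status": "iteration_limit", "iterations": max_iterations, "scratchpad": scratchpad}
```

Returning an explicit `iteration_limit` status, rather than raising, lets the caller decide whether to escalate to a human or return a partial answer.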

7. Data Flow

Step	Actor	Action	Output
1	Agent	Thought: "I need to calculate the Pearson correlation between columns A and B, removing outliers beyond 2 standard deviations"	Internal reasoning
2	Code Generator	Produces Python snippet	`import pandas as pd\nimport numpy as np\ndf = pd.read_json('/sandbox/input.json')\nmean, std = df['A'].mean(), df['A'].std()\ndf_clean = df[np.abs(df['A'] - mean) < 2*std]\nprint(df_clean[['A','B']].corr())`
3	Static Scanner	Checks for disallowed patterns	No subprocess, no network, no fs outside /sandbox — APPROVED
4	Audit Logger	Records code snippet (pre-execution)	Code hash + content logged
5	Sandbox Executor	Executes in isolated container (30s limit, 512MB limit)	Execution completed in 0.4s
6	Output Capturer	Captures stdout	Printed 2×2 correlation matrix for A and B; off-diagonal coefficient 0.847
7	Agent	Injects as observation; continues reasoning	`Observation:` followed by the printed matrix (A↔B coefficient 0.847)

Error Flow

Error	Detection	Recovery
Code scan rejection (disallowed network call)	Static Scanner	Inject: `Observation: Code rejected: attempted network call to non-whitelisted host. Use the approved data connector tool instead.`
Code syntax error	Python compile error in sandbox	Inject traceback as observation; agent self-corrects
Runtime error (e.g. key not found in dict)	Execution traceback	Inject traceback; agent debugs and regenerates
Execution timeout	Sandbox timeout (30s)	Inject: `Observation: Execution timed out. The code may have an infinite loop or be processing too much data. Use a smaller sample.`
Output too large (> context budget)	Result Size Limiter	Truncate; inject with marker `[TRUNCATED: 50,000 chars. First 2,000 chars shown]`
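The truncation row can be sketched as a small helper that reproduces the marker format shown in the table. `limit_output` is an illustrative name, and measuring the budget in characters is an assumption; a deployment that budgets in tokens would count tokens instead.

```python
def limit_output(text: str, shown_chars: int = 2_000) -> str:
    """Truncate long output and prefix a marker stating what was cut."""
    if len(text) <= shown_chars:
        return text
    marker = f"[TRUNCATED: {len(text):,} chars. First {shown_chars:,} chars shown]"
    return f"{marker}\n{text[:shown_chars]}"
```

Placing the marker before the retained text means the agent sees that output was cut before it reads any of it.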

8. Security Considerations

Sandboxing is the Critical Control

The entire security model rests on the sandbox being truly isolated. A sandbox escape allows arbitrary code execution with the host's permissions.
Mitigation: Use a hardened runtime such as gVisor (a user-space kernel that intercepts the container's system calls) or a microVM such as Firecracker, rather than plain Docker, which shares the host kernel; validate sandbox isolation regularly; treat any sandbox escape as a P0 security incident

OWASP LLM Top 10 (2023 edition numbering)

OWASP LLM Risk	CodeAct Applicability	Mitigation
LLM08 Excessive Agency	Arbitrary code execution is maximum agency	Sandbox isolation; static scanner; network whitelist; resource limits; audit every execution
LLM01 Prompt Injection	Malicious input could generate malicious code	Pre-execution scanning; sandbox prevents most damage even if injection succeeds
LLM04 Model DoS	Code can exhaust CPU/memory without sandbox limits	Hard resource limits in sandbox; per-task execution budget
LLM07 Insecure Plugin Design	Code can call external APIs if network is allowed	Strict network whitelist; outbound connections to approved endpoints only
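The "strict network whitelist" mitigation comes down to an exact-match check on the destination host. The sketch below shows that check with Python's standard `urllib.parse`; the host name is a hypothetical placeholder, and `is_allowed` is an illustrative helper. In practice the rule should be enforced outside the sandboxed process (egress proxy or firewall), because code inside the sandbox can skip any in-process check.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; real entries would come from the sandbox configuration.
ALLOWED_HOSTS = {"api.internal.example.com"}


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed_hosts
```

Matching on the parsed hostname, rather than searching the URL string, rejects look-alikes such as `https://api.internal.example.com.evil.test/` and URLs that place the approved name in the userinfo or query part.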

9. Governance Considerations

Code Execution Audit is Non-Negotiable

Every code snippet generated, every scan result, and every execution result must be logged to an immutable audit store
The audit log must be queryable: it may be required for incident response, regulatory inquiry, or security investigation
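A minimal, in-memory sketch of a tamper-evident audit trail follows. Each record stores the hash of the previous record, so any later alteration or reordering is detectable on verification. The `AuditLog` class and event names are hypothetical; hash chaining only makes tampering detectable, and actual immutability still depends on the storage layer (for example an append-only or write-once store).

```python
import hashlib
import json
import time


class AuditLog:
    """In-memory, hash-chained log; each record commits to the previous one."""

    def __init__(self):
        self.records = []

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, event: str, payload: dict) -> dict:
        prev_hash = self.records[-1]["hash"] if self.records else "0" * 64
        body = {"ts": time.time(), "event": event, "payload": payload, "prev": prev_hash}
        record = {**body, "hash": self._digest(body)}
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every hash; False means a record was altered or reordered."""
        prev_hash = "0" * 64
        for record in self.records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["prev"] != prev_hash or record["hash"] != self._digest(body):
                return False
            prev_hash = record["hash"]
        return True
```

A typical sequence per snippet would be three appends: the generated code, the scan result, and the execution result.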

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Code Execution Audit Log	Security + Compliance	Per execution; retained per policy	Complete record of all generated code and execution results
Sandbox Configuration Spec	Security	On change; quarterly review	Documents sandbox isolation configuration, resource limits, network whitelist
Whitelisted Library Register	Security + AI Platform	On library addition	Approved Python libraries with pinned versions and security review
Static Scanner Rule Set	Security	Monthly update	High-risk pattern rules; updated as new bypass techniques are discovered

10. Operational Considerations

SLOs

SLO	Target	Window	Alert
Code scan rejection rate	≤ 5%	24-hour rolling	> 10% triggers P3; review LLM code generation quality
Execution timeout rate	≤ 2%	24-hour rolling	> 5% triggers P3; review time limit or data volume
Code error rate (runtime errors requiring agent retry)	≤ 15%	24-hour rolling	> 30% triggers P3; review LLM code generation quality
Sandbox startup latency p95	≤ 2s	1-hour rolling	> 5s triggers P2
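The three rate-based SLOs can be evaluated from simple counters, as sketched below. The thresholds are copied from the table; the counter names and the use of a single denominator (snippets generated in the window) are assumptions made to keep the example short.

```python
# rate name -> (counter key, SLO target, alert threshold), as fractions
SLOS = {
    "scan_rejection_rate": ("rejected", 0.05, 0.10),
    "execution_timeout_rate": ("timed_out", 0.02, 0.05),
    "code_error_rate": ("errored", 0.15, 0.30),
}


def evaluate_slos(counts: dict) -> dict:
    """Classify each rate as 'ok', 'breach' (over target), or 'alert' (P3 threshold)."""
    total = counts["snippets"]  # hypothetical counter: snippets generated in the window
    report = {}
    for name, (key, target, alert) in SLOS.items():
        rate = counts[key] / total if total else 0.0
        if rate > alert:
            state = "alert"
        elif rate > target:
            state = "breach"
        else:
            state = "ok"
        report[name] = (round(rate, 4), state)
    return report
```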

11. Cost Considerations

Cost Factor	Driver	Typical Range
Code generation (LLM inference)	Tokens in generated code	$0.005–0.05 per code snippet
Sandbox execution	Compute time + container overhead	$0.001–0.01 per execution (pre-warmed sandbox)
Static scanning	Compute time for AST analysis	Negligible
Code retries (error correction)	Additional LLM + execution calls	+20–40% overhead on average
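As a rough worked estimate, the ranges above can be combined into a per-task cost. The figure of three snippets per task is a hypothetical input, and the sketch assumes the retry overhead applies to generation and execution cost alike.

```python
def cost_per_task(snippets: int, gen_cost: float, exec_cost: float, retry_overhead: float) -> float:
    """Per-task cost: snippets x (generation + execution), inflated by retry overhead."""
    return snippets * (gen_cost + exec_cost) * (1 + retry_overhead)


# Low and high ends of the ranges in the table, for a hypothetical 3-snippet task.
low = cost_per_task(3, gen_cost=0.005, exec_cost=0.001, retry_overhead=0.20)
high = cost_per_task(3, gen_cost=0.05, exec_cost=0.01, retry_overhead=0.40)
print(f"${low:.4f} to ${high:.4f} per task")  # $0.0216 to $0.2520 per task
```

Under these assumptions LLM inference dominates the total, so reducing retries matters more than reducing sandbox cost.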

12. Trade-Off Analysis

Option	Action Space	Security Risk	Complexity	Best For
A: CodeAct with sandbox + scanner (Recommended)	Very High	Medium (mitigated)	High	Data science; analysis; trusted enterprise
B: Structured tool calls only (EAAPL-WRK006)	Medium	Low	Low	Most production enterprise use cases
C: CodeAct without sandbox (NEVER)	Very High	Critical	Low	Research only; never production
D: Hybrid: code + pre-registered tools	High	Medium	Medium	Mixed tasks; preferred production shape

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Sandbox escape (container isolation bypass)	Very Low	Critical — arbitrary code execution	Sandbox isolation monitoring; anomaly detection	P0 incident; immediate sandbox shutdown; forensic investigation
Code generation loops (agent generates same error repeatedly)	Medium	Medium — cost waste	Code similarity detection between attempts	Inject loop-break observation after N identical errors; escalate to human
Data exfiltration via execution output (PII in result)	Low	High — privacy violation	PII detection on execution output before logging	PII scan on output; redact before logging and observation injection
Scanner false positive (rejects safe code)	Medium	Medium — reduced capability	High scan rejection rate monitoring	Scanner rule review; appeal mechanism for false positives
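The "code similarity detection between attempts" check for the code generation loop failure mode can be sketched with the standard library's `difflib`. The window of 3 attempts and the 0.9 similarity threshold are hypothetical values, and `is_stuck` is an illustrative name.

```python
import difflib


def is_stuck(attempts: list, window: int = 3, threshold: float = 0.9) -> bool:
    """True if each of the last `window` attempts is near-identical to the one before it."""
    if len(attempts) < window:
        return False
    recent = attempts[-window:]
    return all(
        difflib.SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in zip(recent, recent[1:])
    )
```

When this returns `True`, the controller would inject the loop-break observation or escalate to a human instead of requesting another attempt.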

14. Regulatory Considerations

EU AI Act

Art. 9 (Risk Management System): Where the deployed system is classified as high-risk, this pattern treats code execution as a high-risk action class; sandbox configuration, the static scanner, and the audit log are the controls it contributes to the documented risk management system.
Art. 15 (Accuracy, Robustness and Cybersecurity): Code generation quality (syntax error rate, logical correctness) must be monitored and documented.

APRA CPS 234

Code execution that accesses data assets must be logged (code execution audit log) and access must be limited to minimum necessary data (sandbox file system scope).

Australian Context

For data-driven regulated decisions, code-generated calculations must be reproducible: the code audit log enables recalculation audits required under financial services and insurance regulations.

15. Reference Implementations

AWS

Component	Service
Sandbox Executor	AWS Lambda with ephemeral /tmp; or Fargate per-execution
Code Scanning	Custom Lambda layer with Bandit + custom AST rules
Code Execution Audit	CloudWatch Logs + S3
LLM Code Generation	Amazon Bedrock (Claude 3.5 Sonnet)
Network Isolation	Lambda VPC with restrictive security group

Azure

Component	Service
Sandbox Executor	Azure Container Instances (ephemeral, per-execution)
Code Scanning	Azure Container Apps with custom scan sidecar
Code Execution Audit	Azure Monitor + Blob Storage

On-Premises

Component	Technology
Sandbox Executor	E2B (open source); Modal; gVisor + Docker
Code Scanning	Bandit (Python); semgrep with custom rules
Audit Log	PostgreSQL append-only table

16. Related Patterns

Pattern	ID	Relationship Type	Notes
Agent Tool Registry	EAAPL-AGT003	Complementary	Pre-registered tools remain available in CodeAct; code fills the gaps
Tool Call Orchestration	EAAPL-WRK006	Peer	Structured tool calls are preferred where available; CodeAct used for novel actions
ReAct Agent Loop	EAAPL-WRK001	Base Pattern	CodeAct operates within a ReAct loop; code generation is the Action step
Workflow Tracing and Replay	EAAPL-WRK013	Integrates With	Code execution audit is a trace input; code + output must be traced

17. Maturity Assessment

Overall Maturity: Emerging

Dimension	Score (1–5)	Evidence
Research Foundation	4	CodeAct paper (Wang et al., 2024); strong evidence of capability improvement
Production Deployment	2	Early enterprise deployments; most production use cases still on structured tools
Sandbox Tooling	3	E2B, Modal, gVisor available but enterprise integration maturing
Security Standards	2	No established standard for CodeAct sandbox requirements; evolving
Framework Support	3	OpenAI Code Interpreter; LangChain PythonREPLTool; E2B SDK

18. Revision History

Version	Date	Author	Changes
1.0	2025-06-13	Architecture Board	Initial publication in Agentic Workflows category
