[EAAPL-WRK014] CodeAct Agent
Category: Agentic Workflows
Sub-category: Code-as-Action Architecture
Version: 1.0
Maturity: Emerging
Tags: code-generation, code-execution, sandboxed-execution, python-executor, code-act, tool-synthesis
Regulatory Relevance: EU AI Act (Art. 9, Art. 15), ISO 42001 §8.4, APRA CPS 234
1. Executive Summary
The CodeAct Agent Pattern defines an architecture in which an agent's primary action mechanism is writing and executing code, typically Python, rather than invoking pre-registered structured tool functions. When the agent needs to perform a calculation, transform data, call an API, or interact with a system, it writes a code snippet, executes it in a sandboxed environment, and uses the execution result (output or error traceback) as its observation. This "code-as-action" paradigm is far more expressive than structured tool calls, because the action space is anything the language and its installed libraries can express. The cost is a much higher security bar: executing arbitrary generated code demands robust sandboxing.
For CIO/CTO audiences, the practical implication is that a conventional tool-using agent can only do what its pre-registered tools allow, whereas a CodeAct agent can do almost anything that can be expressed in code: complex calculations, data transformations, novel API calls, statistical analyses, and more, without a developer pre-registering a tool for every possible action. This flexibility is powerful for data science, analysis, and research workflows. The security implication is equally significant: executing arbitrary LLM-generated code requires enterprise-grade sandboxing, code scanning, and resource limits. The CodeAct pattern is appropriate for trusted enterprise environments with strong sandbox infrastructure and inappropriate for any multi-tenant or consumer-facing environment that lacks it.
2. Problem Statement
Business Problem
Enterprise data analysis, research, and automation tasks frequently require ad-hoc computational actions that cannot be anticipated and pre-registered as structured tools. A data scientist asking an agent to "calculate the correlation between these two datasets with outlier removal and visualise it" cannot rely on a pre-registered tool for every possible statistical analysis variant.
Technical Problem
Pre-registered tool sets have a fixed action space: the agent can only invoke tools that a developer has already implemented and registered. Complex computational tasks either require very large tool libraries (unmaintainable) or cannot be handled at all. Code generation and execution provides an effectively unlimited action space through the expressiveness of general-purpose programming languages.
Symptoms of Absence
- Complex computational requests require developer effort to add new tools before the agent can handle them
- Data transformation and analysis tasks are outside the agent's capability because they cannot be expressed as structured tool calls
- Agent cannot handle novel data formats or API structures not covered by pre-registered tools
Cost of Inaction
- Capability: Data analysis and computation tasks remain entirely manual rather than being partially automated
- Maintainability: Pre-registered tool libraries grow without bound as new action types are required
- Time-to-Value: Every new action type requires development, testing, and deployment of a new tool
3. Context
When to Apply
- Tasks require complex data transformation, computation, or statistical analysis
- The set of required actions cannot be enumerated upfront (dynamic action synthesis)
- A high-trust, single-tenant execution environment with robust sandbox infrastructure is available
- Users are sophisticated (data scientists, analysts, engineers) who understand that code is being executed
When NOT to Apply
- Multi-tenant environments where users cannot be trusted with shared code execution resources
- Consumer-facing applications without enterprise-grade sandboxing
- Tasks for which pre-registered structured tools are sufficient (use Tool Call Orchestration, EAAPL-WRK006)
- Environments where code execution cannot be safely sandboxed (network isolation unavailable)
- Regulated environments where every agent action must be from a pre-approved, audited action set
Prerequisites
- Secure sandbox environment (container-based or VM-based isolation)
- Code scanning prior to execution (static analysis for high-risk patterns)
- Resource limits (CPU, memory, network, disk, execution time)
- No persistent state across sandbox executions (stateless execution environment)
- Comprehensive execution logging
Industry Applicability
| Industry | CodeAct Use Case | Code Produced |
|---|---|---|
| Financial Services | Quantitative analysis; portfolio calculations | Python Pandas/NumPy calculations |
| Technology | Automated test generation; code review + fix | Python/JavaScript test code; linting scripts |
| Research | Scientific data analysis | Python SciPy/Statsmodels analyses |
| Legal | Contract data extraction + transformation | Python regex/NLP extraction scripts |
| Healthcare | Clinical data processing and visualisation | Python statistical analysis; chart generation |
4. Architecture Overview
The CodeAct architecture wraps the agent's reasoning loop with a code generation, scanning, and sandboxed execution layer.
Code Generation Phase
Within the ReAct-style reasoning loop (EAAPL-WRK001), when the agent decides it needs to take an action, it generates a Python (or JavaScript) code snippet rather than a structured tool call. The code snippet is the "action" in the thought-action-observation sequence. The code must be self-contained (no undefined variable references), executable (syntactically valid), and bounded (written to terminate and to avoid unbounded resource consumption; the sandbox limits enforce this in practice). The LLM is instructed to write code with explicit print statements or return values, so that every execution produces an observable result.
Code Pre-Execution Scanning
Before execution, the generated code is passed through a static code scanner that detects high-risk patterns: (a) network access to hosts outside the whitelist, (b) file system access outside the allowed sandbox paths, (c) process spawning (subprocess, os.system), (d) import of disallowed modules (anything that can circumvent sandboxing), (e) code patterns associated with resource exhaustion attacks (obviously unbounded loops, very large memory allocations). Static analysis cannot prove that code terminates, so the sandbox resource limits remain the backstop for (e). Code that matches high-risk patterns is rejected; the rejection reason is returned as an observation for the agent to reason about and produce safer code.
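The scanning step can be illustrated with a minimal AST-based sketch covering checks (c) and (d) only. The names `ALLOWED_MODULES`, `BLOCKED_CALLS`, and `scan_code` are hypothetical, and the module list is an assumption rather than a recommended policy; a production scanner would combine rules like these with tools such as Bandit or semgrep. Note that this sketch also reports syntax errors at scan time, which is a design choice not stated in the pattern.

```python
import ast

# Hypothetical policy for illustration only; a real deployment would load
# this from the reviewed Static Scanner Rule Set.
ALLOWED_MODULES = {"math", "statistics", "json", "pandas", "numpy"}
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__"}


def scan_code(source: str) -> list[str]:
    """Return a list of findings; an empty list means the code is approved."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    findings = []
    for node in ast.walk(tree):
        # Collect the module names referenced by import statements.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for name in modules:
            if name.split(".")[0] not in ALLOWED_MODULES:
                findings.append(f"disallowed import: {name}")

        # Flag direct calls to dynamic-execution built-ins.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                findings.append(f"disallowed call: {node.func.id}()")
    return findings
```

A non-empty result would be formatted as the rejection observation. Name-based checks like these are easy to bypass (for example through attribute lookups), which is why the scanner is a complement to the sandbox and not a replacement for it.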
Sandboxed Execution
Approved code is executed in an isolated sandbox environment with: (a) a separate container with no network access except to whitelisted endpoints, (b) an execution time limit (default: 30s), (c) a memory limit (default: 512MB), (d) a read-only file system except for a designated scratch directory, and (e) no ability to spawn child processes. Execution output (stdout, stderr, return value) is captured and returned. Runtime errors are captured as tracebacks and returned as observations, so the agent can read the traceback and fix the code.
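The output-capture and time-limit parts of this step can be sketched with a subprocess wrapper. This is a sketch under stated assumptions: `ExecutionResult` and `run_snippet` are hypothetical names, the limit is wall-clock time, and the wrapper provides no isolation by itself. Network, file system, memory, and process controls are assumed to come from the container or microVM that hosts it.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionResult:
    stdout: str
    stderr: str
    exit_code: Optional[int]  # None when the run was killed on timeout
    timed_out: bool


def _as_text(data) -> str:
    """Normalise partial output, which may be None or bytes after a timeout."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_snippet(code: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Run code in a fresh interpreter and capture stdout, stderr, exit code.

    NOT a sandbox: this only separates processes and enforces a wall-clock
    limit. Isolation must be provided by the surrounding container.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        return ExecutionResult(_as_text(exc.stdout), _as_text(exc.stderr), None, True)
    return ExecutionResult(proc.stdout, proc.stderr, proc.returncode, False)
```

An uncaught exception in the snippet leaves the traceback in `stderr` with a non-zero exit code, which is the material the agent needs for self-correction.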
Execution Result as Observation
The execution result (stdout output, return value, or error traceback) is injected as an observation into the agent's scratchpad. This follows the standard ReAct observation pattern: the agent's next thought is informed by the execution result. If the code errored, the agent reads the traceback and generates corrected code. Successful execution produces data the agent can reason about to inform its next action or synthesise a final answer.
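The observation cycle described above can be sketched as a small loop. Everything here is illustrative: `format_observation` and `codeact_loop` are hypothetical names, `generate` stands in for the LLM call, `execute` stands in for the scanner plus sandbox, and stopping at the first successful execution is a simplification (a real agent would continue reasoning until it has a final answer).

```python
from typing import Callable, List, Tuple


def format_observation(stdout: str, stderr: str, ok: bool) -> str:
    """Turn an execution result into the text appended to the scratchpad."""
    if ok:
        return f"Observation: {stdout.strip() or '(no output)'}"
    return f"Observation: execution failed.\n{stderr.strip()}"


def codeact_loop(
    generate: Callable[[List[str]], str],
    execute: Callable[[str], Tuple[str, str, bool]],
    max_iterations: int = 5,
) -> List[str]:
    """Generate code, execute it, and feed the result back as an observation.

    The iteration cap plays the role of the Iteration Controller: the loop
    ends after max_iterations even if no attempt succeeds.
    """
    scratchpad: List[str] = []
    for _ in range(max_iterations):
        code = generate(scratchpad)          # the model sees prior observations
        stdout, stderr, ok = execute(code)   # scan + sandboxed run (stand-in)
        scratchpad.append(format_observation(stdout, stderr, ok))
        if ok:
            break
    return scratchpad
```

Because the traceback is appended verbatim, the next call to `generate` receives the same error text a developer would read when debugging.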
Tool Synthesis vs. Tool Call
An important distinction: CodeAct does not eliminate the ability to call pre-registered tools. In many implementations, the code execution environment ships with a limited set of pre-installed libraries (pandas, numpy, requests restricted to whitelisted endpoints), and generated code imports and calls them directly. This "tool synthesis" blends the flexibility of CodeAct with the safety properties of pre-approved libraries.
5. Architecture Diagram
flowchart TD
subgraph AgentLoop["ReAct Agent Loop"]
A[Reason]
B{Action Type?}
end
subgraph CodeAct["CodeAct Execution Pipeline"]
C[Code Generator]
D[Static Code Scanner]
E{Scan Result?}
F[Sandbox Executor]
G[Output Capture]
end
subgraph Sandbox["Execution Sandbox"]
H[Python Runtime]
I[Whitelisted Libs]
J[No Network / Read-only FS]
end
subgraph Observation["Observation Injection"]
K[Execution Result]
L[Rejection Observation]
M[Error Observation]
end
A --> B
B -->|code action| C
B -->|tool call| N[Tool Registry]
C --> D
D --> E
E -->|high risk| L
E -->|approved| F
F --> H
H --> I & J
H --> G
G -->|success| K
G -->|error| M
K & M & L --> A
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Code Generator | AI Component | Generates executable Python/JS from agent reasoning | Any capable LLM (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) | Critical |
| Static Code Scanner | Security | Detects high-risk patterns before execution | Bandit (Python); custom AST scanner; semgrep rules | Critical |
| Sandbox Executor | Infrastructure | Executes code in isolated container; enforces resource limits | E2B; Modal; Docker + gVisor; AWS Lambda; Dagger | Critical |
| Output Capturer | Logic | Captures stdout, stderr, return value, traceback | Custom subprocess wrapper; E2B output API | Critical |
| Code Execution Audit Logger | Governance | Logs every code snippet (pre- and post-scan) and execution result | Append-only log; S3; PostgreSQL | Critical |
| Iteration Controller | Safety | Limits code-generate-execute cycles per task | Counter; configurable max | High |
| Whitelisted Library Manager | Security | Manages approved Python libraries in sandbox environment | Conda environment; requirements.txt pinned versions | High |
| Result Size Limiter | Safety | Truncates execution output exceeding context budget | Custom truncation wrapper | High |
7. Data Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Agent | Thought: "I need to calculate the Pearson correlation between columns A and B, removing outliers beyond 2 standard deviations" | Internal reasoning |
| 2 | Code Generator | Produces Python snippet | `import pandas as pd\ndf = pd.read_json('/sandbox/input.json')\ncols = df[['A','B']]\nz = (cols - cols.mean()) / cols.std()\ndf_clean = df[(z.abs() < 2).all(axis=1)]\nprint(df_clean[['A','B']].corr())` |
| 3 | Static Scanner | Checks for disallowed patterns | No subprocess, no network, no fs outside /sandbox: APPROVED |
| 4 | Audit Logger | Records code snippet (pre-execution) | Code hash + content logged |
| 5 | Sandbox Executor | Executes in isolated container (30s limit, 512MB limit) | Execution completed in 0.4s |
| 6 | Output Capturer | Captures stdout | 2×2 Pearson correlation matrix; A↔B coefficient 0.847 |
| 7 | Agent | Injects as observation; continues reasoning | Observation: Pearson correlation A↔B: 0.847 |
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Code scan rejection (disallowed network call) | Static Scanner | Inject: Observation: Code rejected: attempted network call to non-whitelisted host. Use the approved data connector tool instead. |
| Code syntax error | Python compile error in sandbox | Inject traceback as observation; agent self-corrects |
| Runtime error (e.g. key not found in dict) | Execution traceback | Inject traceback; agent debugs and regenerates |
| Execution timeout | Sandbox timeout (30s) | Inject: Observation: Execution timed out. The code may have an infinite loop or be processing too much data. Use a smaller sample. |
| Output too large (> context budget) | Result Size Limiter | Truncate; inject with marker [TRUNCATED: 50,000 chars. First 2,000 chars shown] |
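The truncation recovery in the last row can be sketched as follows. The function name `limit_output` and the character-based budget are assumptions; an implementation might budget in tokens instead.

```python
def limit_output(text: str, budget_chars: int = 2000) -> str:
    """Truncate execution output and prepend a marker stating what was cut."""
    if len(text) <= budget_chars:
        return text
    marker = (
        f"[TRUNCATED: {len(text):,} chars. "
        f"First {budget_chars:,} chars shown]"
    )
    return f"{marker}\n{text[:budget_chars]}"
```

Placing the marker first means the agent learns the output is incomplete before it reads any of the data, which makes a follow-up action such as sampling or aggregating more likely.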
8. Security Considerations
Sandboxing is the Critical Control
- The entire security model rests on the sandbox being truly isolated. A sandbox escape allows arbitrary code execution with the host's permissions.
- Mitigation: Use container isolation hardened with gVisor (a user-space kernel that intercepts the container's system calls) or a microVM (Firecracker) rather than plain Docker; validate sandbox isolation regularly; treat any sandbox escape as a P0 security incident
OWASP LLM Top 10 (v1.1 numbering)
| OWASP LLM Risk | CodeAct Applicability | Mitigation |
|---|---|---|
| LLM08 Excessive Agency | Arbitrary code execution is maximum agency | Sandbox isolation; static scanner; network whitelist; resource limits; audit every execution |
| LLM01 Prompt Injection | Malicious input could generate malicious code | Pre-execution scanning; sandbox prevents most damage even if injection succeeds |
| LLM04 Model DoS | Code can exhaust CPU/memory without sandbox limits | Hard resource limits in sandbox; per-task execution budget |
| LLM07 Insecure Plugin Design | Code can call external APIs if network is allowed | Strict network whitelist; outbound connections to approved endpoints only |
9. Governance Considerations
Code Execution Audit is Non-Negotiable
- Every code snippet generated, every scan result, and every execution result must be logged to an immutable audit store
- The audit log must be queryable: it may be required for incident response, regulatory inquiry, or security investigation
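One way to make tampering with such a log detectable is to chain each record to its predecessor by hash. The sketch below is an assumption about how this could be done, not a design the pattern prescribes; `make_audit_record`, `verify_chain`, and the field names are hypothetical, and storage (S3, PostgreSQL, or similar) is out of scope.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def make_audit_record(code: str, scan_result: str, output: str, prev_hash: str) -> dict:
    """Build one audit record linked to the previous record's hash."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code": code,
        "code_sha256": hashlib.sha256(code.encode()).hexdigest(),
        "scan_result": scan_result,
        "output": output,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = _digest(record)  # computed before this key exists
    return record


def verify_chain(records: list[dict], genesis: str = "0" * 64) -> bool:
    """Return True only if every record is intact and correctly linked."""
    prev = genesis
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        if rec["prev_hash"] != prev or _digest(body) != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True
```

Editing any stored field changes that record's digest, so `verify_chain` fails from the altered record onward.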
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Code Execution Audit Log | Security + Compliance | Per execution; retained per policy | Complete record of all generated code and execution results |
| Sandbox Configuration Spec | Security | On change; quarterly review | Documents sandbox isolation configuration, resource limits, network whitelist |
| Whitelisted Library Register | Security + AI Platform | On library addition | Approved Python libraries with pinned versions and security review |
| Static Scanner Rule Set | Security | Monthly update | High-risk pattern rules; updated as new bypass techniques are discovered |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| Code scan rejection rate | ≤ 5% | 24-hour rolling | > 10% triggers P3; review LLM code generation quality |
| Execution timeout rate | ≤ 2% | 24-hour rolling | > 5% triggers P3; review time limit or data volume |
| Code error rate (runtime errors requiring agent retry) | ≤ 15% | 24-hour rolling | > 30% triggers P3; review LLM code generation quality |
| Sandbox startup latency p95 | ≤ 2s | 1-hour rolling | > 5s triggers P2 |
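The three rate-based SLOs can be evaluated with a small helper. This is a sketch: `RateSlo`, `SLOS`, and `evaluate` are hypothetical names, the counts are assumed to be pre-aggregated over the rolling window, and the "breach" state between target and alert threshold is an interpretation of the table rather than something it defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateSlo:
    name: str
    target: float       # SLO target, as a fraction of executions
    alert_above: float  # alert threshold, as a fraction of executions
    severity: str


# Thresholds copied from the SLO table; latency SLO omitted (not a rate).
SLOS = [
    RateSlo("scan_rejection", 0.05, 0.10, "P3"),
    RateSlo("execution_timeout", 0.02, 0.05, "P3"),
    RateSlo("code_error", 0.15, 0.30, "P3"),
]


def evaluate(slo: RateSlo, events: int, total: int) -> str:
    """Classify a windowed event count against one SLO."""
    if total == 0:
        return "no-data"
    rate = events / total
    if rate > slo.alert_above:
        return f"alert:{slo.severity}"
    if rate > slo.target:
        return "breach"
    return "ok"
```

For example, 7 scan rejections in 100 executions exceeds the 5% target without reaching the 10% alert threshold, so it would be reported as a breach but would not page anyone.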
11. Cost Considerations
| Cost Factor | Driver | Typical Range |
|---|---|---|
| Code generation (LLM inference) | Tokens in generated code | $0.005–0.05 per code snippet |
| Sandbox execution | Compute time + container overhead | $0.001–0.01 per execution (pre-warmed sandbox) |
| Static scanning | Compute time for AST analysis | Negligible |
| Code retries (error correction) | Additional LLM + execution calls | +20–40% overhead on average |
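As a rough worked example, the ranges above can be combined into a per-task estimate. The figure of four code snippets per task is purely hypothetical, and applying the retry overhead as a multiplier on the combined generation and execution cost is an assumption about how the table is meant to be read.

```python
def task_cost(snippets: int, gen_cost: float, exec_cost: float,
              retry_overhead: float) -> float:
    """Estimated cost per task: snippets x (generation + execution) x retries."""
    return snippets * (gen_cost + exec_cost) * (1 + retry_overhead)


# Hypothetical task needing 4 snippets, at the low and high ends of each range.
low = task_cost(4, gen_cost=0.005, exec_cost=0.001, retry_overhead=0.20)
high = task_cost(4, gen_cost=0.05, exec_cost=0.01, retry_overhead=0.40)
print(f"${low:.3f} to ${high:.3f} per task")  # $0.029 to $0.336 per task
```

Under these assumptions the LLM inference term dominates, so reducing retries matters more for cost than shaving sandbox compute.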
12. Trade-Off Analysis
| Option | Action Space | Security Risk | Complexity | Best For |
|---|---|---|---|---|
| A: CodeAct with sandbox + scanner (Recommended) | Very High | Medium (mitigated) | High | Data science; analysis; trusted enterprise |
| B: Structured tool calls only (EAAPL-WRK006) | Medium | Low | Low | Most production enterprise use cases |
| C: CodeAct without sandbox (NEVER) | Very High | Critical | Low | Research only; never production |
| D: Hybrid: code + pre-registered tools | High | Medium | Medium | Mixed tasks; preferred production shape |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Sandbox escape (container isolation bypass) | Very Low | Critical: arbitrary code execution | Sandbox isolation monitoring; anomaly detection | P0 incident; immediate sandbox shutdown; forensic investigation |
| Code generation loops (agent generates same error repeatedly) | Medium | Medium: cost waste | Code similarity detection between attempts | Inject loop-break observation after N identical errors; escalate to human |
| Data exfiltration via execution output (PII in result) | Low | High: privacy violation | PII detection on execution output before logging | PII scan on output; redact before logging and observation injection |
| Scanner false positive (rejects safe code) | Medium | Medium: reduced capability | High scan rejection rate monitoring | Scanner rule review; appeal mechanism for false positives |
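The "code similarity detection between attempts" control for generation loops can be sketched with the standard library. The function name `is_stuck`, the 0.95 threshold, and the window of three attempts are illustrative assumptions, not values from the pattern.

```python
from difflib import SequenceMatcher


def is_stuck(attempts: list[str], threshold: float = 0.95, window: int = 3) -> bool:
    """Return True when the last `window` attempts are all near-identical."""
    if len(attempts) < window:
        return False
    recent = attempts[-window:]
    # Compare each attempt with the one that followed it.
    return all(
        SequenceMatcher(None, recent[i], recent[i + 1]).ratio() >= threshold
        for i in range(window - 1)
    )
```

When this returns True, the controller would inject the loop-break observation or escalate to a human instead of paying for another generation cycle.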
14. Regulatory Considerations
EU AI Act
- Art. 9 (Risk Management System): This pattern treats code execution in production AI systems as a high-risk action class. Where a system falls in scope, the sandbox configuration, static scanner, and audit log are the core controls to document within the risk management system.
- Art. 15 (Accuracy, Robustness and Cybersecurity): Code generation quality (syntax error rate, logical correctness) must be monitored and documented.
APRA CPS 234
- Code execution that accesses data assets must be logged (code execution audit log) and access must be limited to minimum necessary data (sandbox file system scope).
Australian Context
- For data-driven regulated decisions, code-generated calculations must be reproducible: the code audit log enables recalculation audits required under financial services and insurance regulations.
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Sandbox Executor | AWS Lambda with ephemeral /tmp; or Fargate per-execution |
| Code Scanning | Custom Lambda layer with Bandit + custom AST rules |
| Code Execution Audit | CloudWatch Logs + S3 |
| LLM Code Generation | Amazon Bedrock (Claude 3.5 Sonnet) |
| Network Isolation | Lambda VPC with restrictive security group |
Azure
| Component | Service |
|---|---|
| Sandbox Executor | Azure Container Instances (ephemeral, per-execution) |
| Code Scanning | Azure Container Apps with custom scan sidecar |
| Code Execution Audit | Azure Monitor + Blob Storage |
On-Premises
| Component | Technology |
|---|---|
| Sandbox Executor | E2B (open source, self-hosted); gVisor + Docker |
| Code Scanning | Bandit (Python); semgrep with custom rules |
| Audit Log | PostgreSQL append-only table |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Agent Tool Registry | EAAPL-AGT003 | Complementary | Pre-registered tools remain available in CodeAct; code fills the gaps |
| Tool Call Orchestration | EAAPL-WRK006 | Peer | Structured tool calls are preferred where available; CodeAct used for novel actions |
| ReAct Agent Loop | EAAPL-WRK001 | Base Pattern | CodeAct operates within a ReAct loop; code generation is the Action step |
| Workflow Tracing and Replay | EAAPL-WRK013 | Integrates With | Code execution audit is a trace input; code + output must be traced |
17. Maturity Assessment
Overall Maturity: Emerging
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Research Foundation | 4 | CodeAct paper (Wang et al., 2024); strong evidence of capability improvement |
| Production Deployment | 2 | Early enterprise deployments; most production use cases still on structured tools |
| Sandbox Tooling | 3 | E2B, Modal, gVisor available but enterprise integration maturing |
| Security Standards | 2 | No established standard for CodeAct sandbox requirements; evolving |
| Framework Support | 3 | OpenAI Code Interpreter; LangChain PythonREPLTool; E2B SDK |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |