EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-WRK014] CodeAct Agent

Category: Agentic Workflows
Sub-category: Code-as-Action Architecture
Version: 1.0
Maturity: Emerging
Tags: code-generation, code-execution, sandboxed-execution, python-executor, code-act, tool-synthesis
Regulatory Relevance: EU AI Act (Art. 9, Art. 15), ISO 42001 §8.4, APRA CPS 234


1. Executive Summary

The CodeAct Agent Pattern defines an architecture in which an agent's primary action mechanism is writing and executing code (typically Python) rather than invoking pre-registered structured tool functions. When the agent needs to perform a calculation, transform data, call an API, or interact with a system, it writes a code snippet, executes it in a sandboxed environment, and uses the execution result (output or error traceback) as its observation. This "code-as-action" paradigm offers far more expressive power than structured tool calls, because the action space is whatever can be expressed in code. The cost is significantly higher security requirements: arbitrary code execution demands robust sandboxing.

For CIO/CTO audiences, the practical implication is that a conventional tool-using agent can only do what its pre-registered tools allow, whereas a CodeAct agent can do almost anything that can be expressed in code: complex calculations, data transformations, novel API calls, statistical analyses, and more, without requiring a developer to pre-register a specific tool for every possible action. This flexibility is powerful for data science, analysis, and research workflows. The security implication is equally significant: executing arbitrary LLM-generated code requires enterprise-grade sandboxing, code scanning, and resource limits. The CodeAct pattern is appropriate for trusted enterprise environments with strong sandbox infrastructure, and inappropriate for any multi-tenant or consumer-facing environment that lacks equivalent sandboxing.


2. Problem Statement

Business Problem

Enterprise data analysis, research, and automation tasks frequently require ad-hoc computational actions that cannot be anticipated and pre-registered as structured tools. A data scientist asking an agent to "calculate the correlation between these two datasets with outlier removal and visualise it" cannot rely on a pre-registered tool for every possible statistical analysis variant.

Technical Problem

Pre-registered tool sets have a fixed action space: the agent can only invoke tools that a developer has already implemented and registered. Complex computational tasks either require very large tool libraries (unmaintainable) or cannot be handled at all. Code generation and execution provides an effectively unlimited action space through the expressiveness of general-purpose programming languages.

Symptoms of Absence

  • Complex computational requests require developer effort to add new tools before the agent can handle them
  • Data transformation and analysis tasks are outside the agent's capability because they cannot be expressed as structured tool calls
  • Agent cannot handle novel data formats or API structures not covered by pre-registered tools

Cost of Inaction

  • Capability: Data analysis and computation tasks remain entirely manual rather than being partially automated
  • Maintainability: Pre-registered tool libraries grow without bound as new action types are required
  • Time-to-Value: Every new action type requires development, testing, and deployment of a new tool

3. Context

When to Apply

  • Tasks require complex data transformation, computation, or statistical analysis
  • The set of required actions cannot be enumerated upfront (dynamic action synthesis)
  • A high-trust, single-tenant execution environment with robust sandbox infrastructure is available
  • Users are sophisticated (data scientists, analysts, engineers) who understand that code is being executed

When NOT to Apply

  • Multi-tenant environments where users cannot be trusted with shared code execution resources
  • Consumer-facing applications without enterprise-grade sandboxing
  • Tasks for which pre-registered structured tools are sufficient (use Tool Call Orchestration, EAAPL-WRK006)
  • Environments where code execution cannot be safely sandboxed (network isolation unavailable)
  • Regulated environments where every agent action must be from a pre-approved, audited action set

Prerequisites

  • Secure sandbox environment (container-based or VM-based isolation)
  • Code scanning prior to execution (static analysis for high-risk patterns)
  • Resource limits (CPU, memory, network, disk, execution time)
  • No persistent state across sandbox executions (stateless execution environment)
  • Comprehensive execution logging

Industry Applicability

Industry | CodeAct Use Case | Code Produced
Financial Services | Quantitative analysis; portfolio calculations | Python Pandas/NumPy calculations
Technology | Automated test generation; code review + fix | Python/JavaScript test code; linting scripts
Research | Scientific data analysis | Python SciPy/Statsmodels analyses
Legal | Contract data extraction + transformation | Python regex/NLP extraction scripts
Healthcare | Clinical data processing and visualisation | Python statistical analysis; chart generation

4. Architecture Overview

The CodeAct architecture wraps the agent's reasoning loop with a code generation, scanning, and sandboxed execution layer.

Code Generation Phase

Within the ReAct-style reasoning loop (EAAPL-WRK001), when the agent decides it needs to take an action, it generates a Python (or JavaScript) code snippet rather than a structured tool call. The code snippet is the "action" in the thought-action-observation sequence. The code must be self-contained (no undefined variable references), executable (syntactically valid), and bounded (written to avoid infinite loops and unbounded resource consumption; the sandbox limits enforce this in practice). The LLM is instructed to write code with explicit print statements or return values, so that the execution result is observable.

Code Pre-Execution Scanning

Before execution, the generated code is passed through a static code scanner that detects high-risk patterns: (a) network access outside the allowed whitelist, (b) file system access outside the allowed sandbox paths, (c) process spawning (subprocess, os.system), (d) import of disallowed modules (anything that can circumvent sandboxing), (e) code patterns associated with resource exhaustion attacks (obvious infinite loops, large memory allocations). Static analysis cannot catch every such pattern, so the sandbox resource limits remain the backstop. Code that matches a high-risk pattern is rejected; the rejection reason is returned as an observation so that the agent can reason about it and produce safer code.
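The scanning step can be illustrated with a minimal sketch built on Python's standard `ast` module. The deny lists `DISALLOWED_MODULES` and `DISALLOWED_CALLS` and the function name `scan_code` are hypothetical and deliberately incomplete; a production scanner would more likely combine a tool such as Bandit or semgrep with custom rules, and would still rely on the sandbox as the real control.

```python
import ast

# Hypothetical policy for illustration only; not a complete deny list.
DISALLOWED_MODULES = {"subprocess", "socket", "ctypes", "multiprocessing"}
DISALLOWED_CALLS = {"os.system", "os.popen", "eval", "exec", "__import__"}


def _dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def scan_code(source):
    """Return a list of findings; an empty list means the code is approved."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DISALLOWED_MODULES:
                    findings.append(f"disallowed import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in DISALLOWED_MODULES:
                findings.append(f"disallowed import: {node.module}")
        elif isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in DISALLOWED_CALLS:
                findings.append(f"disallowed call: {name}")
    return findings
```

The findings list doubles as the rejection reason that is returned to the agent. A name-based check like this is easy to bypass (for example through aliasing or `getattr`), which is why it is a first filter and not a substitute for isolation.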

Sandboxed Execution

Approved code is executed in an isolated sandbox environment: (a) a separate container with no network access except to whitelisted endpoints, (b) CPU time limit (default: 30s), (c) memory limit (default: 512MB), (d) read-only file system except for a designated scratch directory, (e) no ability to spawn child processes. Execution output (stdout, stderr, return value) is captured and returned. Runtime errors (tracebacks) are captured and returned as observations, so the agent can read the traceback and fix the code.
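The output-capture and time-limit behaviour can be sketched with the standard library alone. This is an illustration of the executor's interface, not a sandbox: a child Python process provides none of the container, network, or file system isolation described above, which an actual deployment would obtain from the sandbox infrastructure. The function name `run_snippet` and the shape of the returned dictionary are assumptions.

```python
import subprocess
import sys
import tempfile


def run_snippet(source, timeout_s=30.0):
    """Run `source` in a child Python process and capture its output.

    NOTE: illustration only. A child process is NOT an isolation boundary;
    real deployments run this step inside a container or microVM sandbox.
    """
    with tempfile.TemporaryDirectory() as scratch:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", source],  # -I: isolated mode
                cwd=scratch,           # throwaway scratch directory
                capture_output=True,   # collect stdout and stderr
                text=True,
                timeout=timeout_s,     # wall-clock limit; child is killed on expiry
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "success" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

On an uncaught exception the traceback arrives in `stderr`, which is the text the agent would receive as its error observation.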

Execution Result as Observation

The execution result (stdout output, return value, or error traceback) is injected as an observation into the agent's scratchpad. This follows the standard ReAct observation pattern: the agent's next thought is informed by the execution result. If the code errored, the agent reads the traceback and generates corrected code. Successful execution produces data the agent can reason about to inform its next action or synthesise a final answer.
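The generate, scan, execute, observe cycle can be sketched as a small control loop. Everything here is hypothetical scaffolding: `generate`, `scan`, and `execute` are caller-supplied callables standing in for the LLM, the static scanner, and the sandbox executor, and the loop stops at the first successful execution, which is a simplification of a full ReAct loop where the agent may take further actions.

```python
def codeact_loop(generate, scan, execute, task, max_iterations=5):
    """Run generate -> scan -> execute cycles, appending observations.

    generate(scratchpad) -> str    : returns the next code snippet
    scan(code)           -> list   : findings; empty means approved
    execute(code)        -> dict   : keys 'status', 'stdout', 'stderr'
    """
    scratchpad = [f"Task: {task}"]
    for _ in range(max_iterations):
        code = generate(scratchpad)

        findings = scan(code)
        if findings:
            # Rejection reason becomes the observation; code is never run.
            scratchpad.append("Observation: Code rejected: " + "; ".join(findings))
            continue

        result = execute(code)
        if result["status"] == "success":
            scratchpad.append("Observation: " + result["stdout"].strip())
            return scratchpad

        # Error or timeout: feed the traceback (or status) back to the agent.
        detail = result["stderr"].strip() or result["status"]
        scratchpad.append("Observation: " + detail)

    scratchpad.append("Observation: iteration limit reached; escalate to a human")
    return scratchpad
```

The `max_iterations` bound plays the role of the Iteration Controller: without it, an agent that keeps producing failing code would consume inference and sandbox budget indefinitely.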

Tool Synthesis vs. Tool Call

An important distinction: CodeAct does not eliminate the ability to call pre-registered tools. In many implementations, the code execution environment has access to a limited set of pre-installed libraries (pandas, numpy, requests to whitelisted endpoints) and can call these as Python libraries. This "tool synthesis" blends the flexibility of CodeAct with the safety properties of pre-approved libraries.


5. Architecture Diagram

flowchart TD
    subgraph AgentLoop["ReAct Agent Loop"]
        A[Reason]
        B{Action Type?}
    end
    subgraph CodeAct["CodeAct Execution Pipeline"]
        C[Code Generator]
        D[Static Code Scanner]
        E{Scan Result?}
        F[Sandbox Executor]
        G[Output Capture]
    end
    subgraph Sandbox["Execution Sandbox"]
        H[Python Runtime]
        I[Whitelisted Libs]
        J[No Network / Read-only FS]
    end
    subgraph Observation["Observation Injection"]
        K[Execution Result]
        L[Rejection Observation]
        M[Error Observation]
    end
    A --> B
    B -->|code action| C
    B -->|tool call| N[Tool Registry]
    C --> D
    D --> E
    E -->|high risk| L
    E -->|approved| F
    F --> H
    H --> I & J
    H --> G
    G -->|success| K
    G -->|error| M
    K & M & L --> A

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Code Generator | AI Component | Generates executable Python/JS from agent reasoning | Any capable LLM (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) | Critical
Static Code Scanner | Security | Detects high-risk patterns before execution | Bandit (Python); custom AST scanner; semgrep rules | Critical
Sandbox Executor | Infrastructure | Executes code in isolated container; enforces resource limits | E2B; Modal; Docker + gVisor; AWS Lambda; Dagger | Critical
Output Capturer | Logic | Captures stdout, stderr, return value, traceback | Custom subprocess wrapper; E2B output API | Critical
Code Execution Audit Logger | Governance | Logs every code snippet (pre- and post-scan) and execution result | Append-only log; S3; PostgreSQL | Critical
Iteration Controller | Safety | Limits code-generate-execute cycles per task | Counter; configurable max | High
Whitelisted Library Manager | Security | Manages approved Python libraries in sandbox environment | Conda environment; requirements.txt pinned versions | High
Result Size Limiter | Safety | Truncates execution output exceeding context budget | Custom truncation wrapper | High

7. Data Flow

Step | Actor | Action | Output
1 | Agent | Thought: "I need to calculate the Pearson correlation between columns A and B, removing outliers beyond 2 standard deviations" | Internal reasoning
2 | Code Generator | Produces Python snippet | import pandas as pd\nimport numpy as np\ndf = pd.read_json('/sandbox/input.json')\nmean, std = df['A'].mean(), df['A'].std()\ndf_clean = df[np.abs(df['A'] - mean) < 2*std]\nprint(df_clean[['A','B']].corr())
3 | Static Scanner | Checks for disallowed patterns | No subprocess, no network, no fs outside /sandbox: APPROVED
4 | Audit Logger | Records code snippet (pre-execution) | Code hash + content logged
5 | Sandbox Executor | Executes in isolated container (30s limit, 512MB limit) | Execution completed in 0.4s
6 | Output Capturer | Captures stdout | 2×2 correlation matrix for A and B (illustrative A↔B coefficient: 0.847); .corr() returns coefficients only, not p-values
7 | Agent | Injects as observation; continues reasoning | Observation: Pearson correlation A↔B: 0.847

Error Flow

Error | Detection | Recovery
Code scan rejection (disallowed network call) | Static Scanner | Inject: Observation: Code rejected: attempted network call to non-whitelisted host. Use the approved data connector tool instead.
Code syntax error | Python compile error in sandbox | Inject traceback as observation; agent self-corrects
Runtime error (e.g. key not found in dict) | Execution traceback | Inject traceback; agent debugs and regenerates
Execution timeout | Sandbox timeout (30s) | Inject: Observation: Execution timed out. The code may have an infinite loop or be processing too much data. Use a smaller sample.
Output too large (> context budget) | Result Size Limiter | Truncate; inject with marker [TRUNCATED: 50,000 chars. First 2,000 chars shown]
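The truncation behaviour of the Result Size Limiter can be sketched in a few lines. The function name `limit_output` and the 2,000-character default are assumptions chosen to match the marker text in the table; a real limiter would more likely budget in tokens than in characters.

```python
def limit_output(text, max_chars=2000):
    """Truncate execution output and append a marker describing the cut."""
    if len(text) <= max_chars:
        return text
    marker = f"[TRUNCATED: {len(text):,} chars. First {max_chars:,} chars shown]"
    return text[:max_chars] + "\n" + marker
```

Stating both the original and the retained length in the marker lets the agent recognise that it is seeing a partial result and, for example, regenerate code that prints a summary instead of the full data.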

8. Security Considerations

Sandboxing is the Critical Control

  • The entire security model rests on the sandbox being truly isolated. A sandbox escape allows arbitrary code execution with the host's permissions.
  • Mitigation: Use container isolation hardened with gVisor (a user-space kernel that intercepts the container's system calls) or a microVM (Firecracker) rather than plain Docker; validate sandbox isolation regularly; treat any sandbox escape as a P0 security incident

OWASP LLM Top 10

OWASP LLM Risk | CodeAct Applicability | Mitigation
LLM08 Excessive Agency | Arbitrary code execution is maximum agency | Sandbox isolation; static scanner; network whitelist; resource limits; audit every execution
LLM01 Prompt Injection | Malicious input could generate malicious code | Pre-execution scanning; sandbox prevents most damage even if injection succeeds
LLM04 Model DoS | Code can exhaust CPU/memory without sandbox limits | Hard resource limits in sandbox; per-task execution budget
LLM07 Insecure Plugin Design | Code can call external APIs if network is allowed | Strict network whitelist; outbound connections to approved endpoints only

9. Governance Considerations

Code Execution Audit is Non-Negotiable

  • Every code snippet generated, every scan result, and every execution result must be logged to an immutable audit store
  • The audit log must be queryable: it may be required for incident response, regulatory inquiry, or security investigation
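One way to make an audit record tamper-evident is to hash-chain the entries, so that altering any earlier record invalidates every later hash. The sketch below is an assumption about how this could be done, not a requirement stated by the pattern; the field names and the in-memory list standing in for the append-only store are hypothetical.

```python
import hashlib
import json
import time

GENESIS_HASH = "0" * 64  # placeholder hash preceding the first record


def append_audit_record(log, code, scan_result, execution_result):
    """Append a record whose hash covers its content and the previous hash."""
    record = {
        "timestamp": time.time(),
        "code": code,
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "scan_result": scan_result,
        "execution_result": execution_result,
        "prev_hash": log[-1]["record_hash"] if log else GENESIS_HASH,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record


def verify_chain(log):
    """Return True if no record has been altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        if record["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(payload).hexdigest() != record["record_hash"]:
            return False
        prev_hash = record["record_hash"]
    return True
```

Storing the full code alongside its hash keeps the log queryable for incident response, while the chain gives an inexpensive integrity check that can be run during audits.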

Governance Artefacts

Artefact | Owner | Frequency | Purpose
Code Execution Audit Log | Security + Compliance | Per execution; retained per policy | Complete record of all generated code and execution results
Sandbox Configuration Spec | Security | On change; quarterly review | Documents sandbox isolation configuration, resource limits, network whitelist
Whitelisted Library Register | Security + AI Platform | On library addition | Approved Python libraries with pinned versions and security review
Static Scanner Rule Set | Security | Monthly update | High-risk pattern rules; updated as new bypass techniques are discovered

10. Operational Considerations

SLOs

SLO | Target | Window | Alert
Code scan rejection rate | ≤ 5% | 24-hour rolling | > 10% triggers P3; review LLM code generation quality
Execution timeout rate | ≤ 2% | 24-hour rolling | > 5% triggers P3; review time limit or data volume
Code error rate (runtime errors requiring agent retry) | ≤ 15% | 24-hour rolling | > 30% triggers P3; review LLM code generation quality
Sandbox startup latency p95 | ≤ 2s | 1-hour rolling | > 5s triggers P2
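The three rate-based SLOs above can be evaluated with a short function. This is a sketch under stated assumptions: the caller has already selected the outcomes that fall inside the 24-hour window, each execution is labelled with exactly one outcome string, and the label and state names are hypothetical.

```python
# (target, alert) thresholds as fractions, taken from the SLO table.
SLO_THRESHOLDS = {
    "scan_rejection": (0.05, 0.10),
    "timeout": (0.02, 0.05),
    "runtime_error": (0.15, 0.30),
}


def evaluate_slos(outcomes):
    """Map each SLO name to (rate, state) for a window of outcome labels.

    state is 'ok' (within target), 'above_target', or 'alert_p3'.
    """
    total = len(outcomes)
    report = {}
    for name, (target, alert) in SLO_THRESHOLDS.items():
        rate = outcomes.count(name) / total if total else 0.0
        if rate > alert:
            state = "alert_p3"
        elif rate > target:
            state = "above_target"
        else:
            state = "ok"
        report[name] = (rate, state)
    return report
```

Separating the target from the alert threshold leaves a band in which the SLO is missed but no page is raised, which matches the gap between the Target and Alert columns.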

11. Cost Considerations

Cost Factor | Driver | Typical Range
Code generation (LLM inference) | Tokens in generated code | $0.005–0.05 per code snippet
Sandbox execution | Compute time + container overhead | $0.001–0.01 per execution (pre-warmed sandbox)
Static scanning | Compute time for AST analysis | Negligible
Code retries (error correction) | Additional LLM + execution calls | +20–40% overhead on average
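Combining the table's ranges gives a rough per-action cost envelope. The calculation below assumes that the retry overhead applies to both generation and execution, that scanning cost is zero, and that the low and high ends of each range coincide; these are simplifications for illustration, and `cost_per_action` is a hypothetical helper.

```python
def cost_per_action(generation_usd, execution_usd, retry_overhead):
    """Expected cost of one code action, including average retry overhead."""
    return (generation_usd + execution_usd) * (1.0 + retry_overhead)


# Low end: cheapest generation and execution with 20% retry overhead.
low = cost_per_action(0.005, 0.001, 0.20)
# High end: most expensive generation and execution with 40% overhead.
high = cost_per_action(0.05, 0.01, 0.40)

print(f"${low:.4f} to ${high:.4f} per code action")
```

Under these assumptions a single code action costs roughly $0.007 to $0.084, so a task that takes ten code actions lands between about $0.07 and $0.84 before any non-code reasoning tokens are counted.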

12. Trade-Off Analysis

Option | Action Space | Security Risk | Complexity | Best For
A: CodeAct with sandbox + scanner (Recommended) | Very High | Medium (mitigated) | High | Data science; analysis; trusted enterprise
B: Structured tool calls only (EAAPL-WRK006) | Medium | Low | Low | Most production enterprise use cases
C: CodeAct without sandbox (NEVER) | Very High | Critical | Low | Research only; never production
D: Hybrid: code + pre-registered tools | High | Medium | Medium | Mixed tasks; preferred production shape

13. Failure Modes

Failure Mode | Likelihood | Impact | Detection | Recovery
Sandbox escape (container isolation bypass) | Very Low | Critical: arbitrary code execution | Sandbox isolation monitoring; anomaly detection | P0 incident; immediate sandbox shutdown; forensic investigation
Code generation loops (agent generates same error repeatedly) | Medium | Medium: cost waste | Code similarity detection between attempts | Inject loop-break observation after N identical errors; escalate to human
Data exfiltration via execution output (PII in result) | Low | High: privacy violation | PII detection on execution output before logging | PII scan on output; redact before logging and observation injection
Scanner false positive (rejects safe code) | Medium | Medium: reduced capability | High scan rejection rate monitoring | Scanner rule review; appeal mechanism for false positives
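The loop-break recovery for repeated errors can be sketched as a check over the agent's recent tracebacks. Comparing only the final traceback line (exception type and message) is an assumption made here to keep the example small; the table's "code similarity detection" would compare the generated code itself. The names `should_break_loop` and `max_identical` are hypothetical.

```python
def _last_line(traceback_text):
    """Return the final non-empty line of a traceback, e.g. "KeyError: 'x'"."""
    lines = [line.strip() for line in traceback_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def should_break_loop(error_history, max_identical=3):
    """True if the last `max_identical` errors end with the same exception line."""
    if len(error_history) < max_identical:
        return False
    recent = [_last_line(tb) for tb in error_history[-max_identical:]]
    return all(line == recent[0] for line in recent)
```

When the check fires, the controller would inject a loop-break observation in place of the traceback and, if the next attempt still fails, escalate to a human as the table describes.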

14. Regulatory Considerations

EU AI Act

  • Art. 9 (Risk Management): This pattern treats code execution in production AI systems as a high-risk action class; the sandbox configuration, static scanner, and audit log are the controls it relies on to support the risk management process.
  • Art. 15 (Accuracy and Robustness): Code generation quality (syntax error rate, logical correctness) must be monitored and documented.

APRA CPS 234

  • Code execution that accesses data assets must be logged (code execution audit log) and access must be limited to minimum necessary data (sandbox file system scope).

Australian Context

  • For data-driven regulated decisions, code-generated calculations must be reproducible: the code audit log enables recalculation audits required under financial services and insurance regulations.

15. Reference Implementations

AWS

Component | Service
Sandbox Executor | AWS Lambda with ephemeral /tmp; or Fargate per-execution
Code Scanning | Custom Lambda layer with Bandit + custom AST rules
Code Execution Audit | CloudWatch Logs + S3
LLM Code Generation | Amazon Bedrock (Claude 3.5 Sonnet)
Network Isolation | Lambda VPC with restrictive security group

Azure

Component | Service
Sandbox Executor | Azure Container Instances (ephemeral, per-execution)
Code Scanning | Azure Container Apps with custom scan sidecar
Code Execution Audit | Azure Monitor + Blob Storage

On-Premises

Component | Technology
Sandbox Executor | E2B (open source); Modal; gVisor + Docker
Code Scanning | Bandit (Python); semgrep with custom rules
Audit Log | PostgreSQL append-only table

16. Related Patterns

Pattern | ID | Relationship Type | Notes
Agent Tool Registry | EAAPL-AGT003 | Complementary | Pre-registered tools remain available in CodeAct; code fills the gaps
Tool Call Orchestration | EAAPL-WRK006 | Peer | Structured tool calls are preferred where available; CodeAct used for novel actions
ReAct Agent Loop | EAAPL-WRK001 | Base Pattern | CodeAct operates within a ReAct loop; code generation is the Action step
Workflow Tracing and Replay | EAAPL-WRK013 | Integrates With | Code execution audit is a trace input; code + output must be traced

17. Maturity Assessment

Overall Maturity: Emerging

Dimension | Score (1–5) | Evidence
Research Foundation | 4 | CodeAct paper (Wang et al., 2024); strong evidence of capability improvement
Production Deployment | 2 | Early enterprise deployments; most production use cases still on structured tools
Sandbox Tooling | 3 | E2B, Modal, gVisor available but enterprise integration maturing
Security Standards | 2 | No established standard for CodeAct sandbox requirements; evolving
Framework Support | 3 | OpenAI Code Interpreter; LangChain PythonREPLTool; E2B SDK

18. Revision History

Version | Date | Author | Changes
1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category