Agentic Retrieval-Augmented Generation
[EAAPL-RAG007] Agentic Retrieval-Augmented Generation
Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Agentic and Multi-Hop RAG
Version: 1.2
Maturity: Proven
Tags: rag agentic multi-hop iterative-retrieval query-planning self-critique tool-use reasoning
Regulatory Relevance: EU AI Act Article 14 (Human oversight for autonomous systems), ISO/IEC 42001 Section 8.5 (AI system operation), NIST AI RMF (Govern 1.7 — autonomous decision-making)
1. Executive Summary
Agentic RAG places a reasoning AI agent in the orchestration loop of the retrieval-generation pipeline, enabling it to plan multi-step retrieval strategies, evaluate the quality of retrieved evidence, execute iterative searches to fill gaps, and self-critique its draft answers before returning a final response. Unlike standard RAG, which executes a single fixed retrieval cycle per query, Agentic RAG can break a complex question into sub-questions, retrieve evidence for each, synthesise intermediate answers, identify what it still doesn't know, and retrieve again, autonomously, until it has sufficient grounded evidence to answer with confidence.
For enterprise leaders, Agentic RAG addresses a category of knowledge queries that single-cycle RAG fundamentally cannot handle: multi-hop questions that require chaining across multiple documents ("Which of our suppliers are located in regions affected by the sanctions announced in the regulatory alert published this week, and what are our contractual obligations for those suppliers?"), questions requiring synthesis across multiple evidence sources, and research-style queries that require iterative exploration of a knowledge domain. The business value is the automation of knowledge work that previously required a skilled analyst to retrieve, read, synthesise, and reason across multiple documents — turning a 2-hour research task into a 30-second AI-assisted workflow.
2. Problem Statement
Business Problem
Single-cycle RAG systems retrieve a fixed set of candidate documents and generate an answer from them in one pass. This architecture is adequate for factual lookup questions but insufficient for analytical questions that require multi-document synthesis, causal reasoning, or iterative hypothesis refinement. Analysts who use RAG systems for complex research tasks frequently report that the system misses relevant documents that would only be found if the initial answer had been used to formulate a better search query.
Technical Problem
Standard RAG has no mechanism for the system to recognise that its retrieved context is insufficient to answer the question, no ability to plan a multi-step retrieval strategy before retrieving, and no capacity to iteratively refine its retrieval based on partial answers. The single retrieval cycle architecture is fundamentally limited for complex, multi-hop knowledge questions.
Symptoms
- RAG system answers complex analytical questions with shallow, incomplete responses that reference only one or two source documents
- Users manually perform "follow-up searches" after receiving an initial RAG answer, indicating the system did not exhaust the relevant knowledge
- High analyst override rate: AI-generated answers are frequently edited with additional context that the analyst had to find separately
- The system answers "I don't have enough information" for questions that are answerable from the corpus but require multi-hop reasoning
Cost of Inaction
- Complex knowledge work remains manual and unautomated, despite a capable knowledge corpus
- Analysts treat the RAG system as a first-pass tool only, not as a research assistant capable of deep synthesis
- Competitive disadvantage versus AI deployments that automate complex analytical workflows end-to-end
3. Context
When to Apply
- Complex multi-hop questions requiring evidence from multiple, sequentially discovered documents
- Research synthesis tasks: competitive analysis, regulatory impact assessment, risk analysis across multiple source documents
- Question answering where the answer to one sub-question determines which documents to retrieve next (dependent retrieval chains)
- Use cases where the appropriate retrieval strategy varies per question (some questions need one retrieval; others need five)
- Scenarios where the agent should be able to answer "I cannot find sufficient evidence" after exhausting retrieval strategies, rather than hallucinating
When NOT to Apply
- Simple factual lookups answerable in a single retrieval cycle (adds unnecessary latency and cost)
- Latency-critical applications (agentic loops add 1–10 seconds of reasoning time per iteration)
- Autonomous action-taking scenarios where the agent's retrieval findings would directly trigger writes or external API calls without human approval — this requires explicit Human-in-the-Loop governance (EAAPL-RAG007 combined with HITL patterns)
- Corpora that are too small to benefit from multi-hop retrieval (< 10,000 documents)
Prerequisites
- An underlying RAG retrieval capability (EAAPL-RAG001 or EAAPL-RAG005)
- An LLM with reliable tool-calling / function-calling capability (for example GPT-4o, Claude 3.5, or Gemini 1.5 Pro)
- A defined set of retrieval tools the agent can invoke (search, filter, summarise, compare)
- A maximum iteration limit to prevent infinite loops
- Observability instrumentation that traces each agent decision, tool call, and its output
Industry Applicability
| Industry | Complex Query Example | Multi-Hop Depth |
|---|---|---|
| Financial Services | "Which loan products in our portfolio have covenant conditions that may be triggered by the RBA rate change announced today?" | 3–4 hops |
| Legal | "Summarise all cases in our case database where a precedent from Smith v Jones was applied in an employment context" | 3–5 hops |
| Procurement | "Identify all active suppliers in our approved supplier list that are headquartered in sanctioned jurisdictions per this week's OFAC update" | 2–3 hops |
| Healthcare | "Are there any patients in our clinical notes who have been prescribed Drug X and also have contraindicated Condition Y per this month's formulary update?" | 2–4 hops |
| Strategy | "Summarise all internal research reports that reference our top 5 competitors and were published in the last 12 months" | 2–3 hops |
4. Architecture Overview
Agentic RAG wraps the retrieval-generation pipeline in a Reason-Act-Observe loop (a specialisation of the ReAct framework). The agent is an LLM with access to a defined set of retrieval tools and a scratchpad for recording its reasoning steps.
Query Planning
When a complex question is received, the agent's first step is query planning: it reasons about what information is needed to answer the question and decomposes the original query into a set of sub-queries with dependencies. For example, "Which of our suppliers are in sanctioned regions per this week's alert?" decomposes into: (1) retrieve this week's sanctions alert, (2) extract the list of sanctioned regions, (3) retrieve the supplier list, (4) cross-reference supplier regions against the sanctioned list. The query plan is recorded in the agent scratchpad and drives the retrieval sequence.
Query planning can be explicit (the agent writes a multi-step plan before executing any retrieval) or implicit (the agent uses tool calling in sequence, with each tool output informing the next call). Explicit planning produces more predictable and auditable behaviour; implicit sequential tool calling is more flexible but harder to trace.
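As an illustration of explicit planning, the supplier-sanctions decomposition above can be written as a small dependency graph and ordered before any retrieval runs. This is a minimal sketch, not a prescribed implementation: the sub-query names and the `execution_order` helper are hypothetical, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical plan for the supplier-sanctions example: each sub-query maps to
# the set of sub-queries whose results it depends on.
plan = {
    "retrieve_sanctions_alert": set(),
    "extract_sanctioned_regions": {"retrieve_sanctions_alert"},
    "retrieve_supplier_list": set(),
    "cross_reference": {"extract_sanctioned_regions", "retrieve_supplier_list"},
}


def execution_order(plan: dict[str, set[str]]) -> list[str]:
    """Return sub-queries in an order that respects their dependencies.

    Raises graphlib.CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(plan).static_order())


order = execution_order(plan)
```

Recording the ordered plan in the scratchpad before execution is what makes the explicit variant easier to audit: a reviewer can compare the planned sequence with the tool calls that actually ran.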
Retrieval Tools
The agent is given a defined set of retrieval tools as callable functions:
- search(query: str, filters: dict) → List[Chunk]: standard RAG retrieval
- get_document(doc_id: str) → Document: retrieve a full document by ID (for follow-up reading)
- summarise_results(chunks: List[Chunk]) → str: summarise a set of retrieved chunks
- compare_documents(doc_ids: List[str], aspect: str) → str: compare specific aspects across multiple documents
- extract_entities(chunks: List[Chunk], entity_type: str) → List[str]: extract named entities for use as next-query parameters
Tools are strictly typed and validated — the agent cannot invoke arbitrary code, only the defined tool set. Tool definitions include descriptions that guide the agent's tool selection.
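One way to enforce "only the defined tool set" is an allow-list dispatcher that checks the tool name and argument types before anything runs. The sketch below is illustrative only: the stub tool bodies, the `TOOLS` registry, `dispatch`, and `ToolCallRejected` are hypothetical names, and a production system would more likely validate against a full JSON schema.

```python
from typing import Any, Callable


# Hypothetical read-only tool stubs; real implementations would call the
# retrieval layer.
def search(query: str, filters: dict) -> list:
    return []


def get_document(doc_id: str) -> dict:
    return {"doc_id": doc_id}


# Allow-list: tool name -> (callable, required argument names and types).
TOOLS: dict[str, tuple[Callable[..., Any], dict[str, type]]] = {
    "search": (search, {"query": str, "filters": dict}),
    "get_document": (get_document, {"doc_id": str}),
}


class ToolCallRejected(Exception):
    """Raised when the agent requests an unknown tool or malformed arguments."""


def dispatch(name: str, args: dict[str, Any]) -> Any:
    """Run a tool only if its name and arguments match the allow-list."""
    if name not in TOOLS:
        raise ToolCallRejected(f"unknown tool: {name}")
    func, schema = TOOLS[name]
    if set(args) != set(schema):
        raise ToolCallRejected(f"arguments for {name} must be exactly {sorted(schema)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise ToolCallRejected(f"{name}.{key} must be {expected.__name__}")
    return func(**args)
```

A rejected call can be fed back to the agent together with the list of valid tool names, rather than being silently dropped.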
Iterative Retrieval and Self-Critique
After each retrieval cycle, the agent evaluates its current evidence against the original question using a self-critique prompt: "Given the question and the evidence retrieved so far, what information is still missing? What additional searches would improve the answer?" If the self-critique identifies gaps, the agent formulates and executes additional retrieval calls. The loop continues until one of three stopping conditions is met: (1) the agent's self-critique determines the evidence is sufficient, (2) the maximum iteration limit is reached, or (3) additional retrieval is producing diminishing returns (identical or near-identical results to previous retrieval steps).
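The three stopping conditions can be sketched as a single loop. This is a simplified illustration under stated assumptions: `run_retrieval_loop` is a hypothetical name, `search` and `critique` are caller-supplied callables standing in for the retrieval tool and the self-critique prompt, the critique is assumed to return a dict with `sufficient` and `next_query` fields, and "diminishing returns" is approximated as "this round returned nothing new".

```python
from typing import Callable


def run_retrieval_loop(
    question: str,
    search: Callable[[str], list[str]],
    critique: Callable[[str, list[str]], dict],
    max_iterations: int = 5,  # placeholder cap; tune per query type
) -> tuple[list[str], str]:
    """Return (evidence, stop_reason) for one agentic session."""
    evidence: list[str] = []
    query = question
    for _ in range(max_iterations):
        results = search(query)
        new_items = [r for r in results if r not in evidence]
        if not new_items:
            # Stopping condition 3: nothing new was retrieved this round.
            return evidence, "diminishing_returns"
        evidence.extend(new_items)
        verdict = critique(question, evidence)
        if verdict["sufficient"]:
            # Stopping condition 1: self-critique judges the evidence sufficient.
            return evidence, "sufficient"
        query = verdict["next_query"]
    # Stopping condition 2: the iteration cap was reached.
    return evidence, "max_iterations"
```

Returning the stop reason alongside the evidence lets the synthesis step flag answers produced under the iteration cap as potentially incomplete.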
Grounded Final Synthesis
When the retrieval loop concludes, the agent synthesises a final answer from the accumulated evidence. The synthesis step is more complex than in standard RAG: the agent must integrate evidence from multiple retrieval rounds, handle potentially conflicting evidence, attribute each claim to its source, and structure the answer appropriately (summary, list, table, or prose depending on the question type). The synthesis prompt explicitly instructs the agent to cite each claim and to acknowledge when evidence is incomplete or conflicting.
Human Oversight Gate
For high-stakes agentic tasks (those whose answers will directly inform consequential decisions), a human oversight gate is inserted before final response delivery. The gate presents the agent's reasoning trace, the evidence sources used, and the draft answer to a human reviewer, who can approve, edit, or reject the answer. The gate is optional for low-stakes use, but it is a regulatory requirement for high-risk AI use cases under EU AI Act Article 14.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Query Planner | NLP / LLM | Decompose complex query into sub-queries with dependencies | LLM function calling; LangChain Plan-and-Execute; LlamaIndex ReAct | High |
| ReAct Agent Orchestrator | Orchestration | Drive the Reason-Act-Observe loop; manage scratchpad; enforce iteration limits | LangChain AgentExecutor; LlamaIndex ReActAgent; AutoGen; custom | Critical |
| Retrieval Tool: search | Retrieval | Execute RAG retrieval (delegates to EAAPL-RAG001/005) | LangChain retriever tool; custom tool wrapper | Critical |
| Retrieval Tool: get_document | Retrieval | Fetch full document for deep reading | Document store SDK as tool | High |
| Retrieval Tool: summarise | LLM | Summarise retrieved chunks into concise evidence | LLM call within tool | Medium |
| Retrieval Tool: compare | LLM | Compare specific aspects across multiple documents | LLM call within tool | Medium |
| Self-Critique Module | LLM | Evaluate current evidence sufficiency; identify gaps | Structured LLM prompt with stopping criteria | High |
| Scratchpad / Working Memory | Storage | Record agent reasoning, tool calls, and observations per session | In-memory dict (short sessions); Redis (long sessions) | High |
| Human Oversight Gate | Workflow | Present agent reasoning to human reviewer for high-stakes decisions | Custom UI; Slack workflow; ServiceNow integration | High (regulated use) |
| Iteration Limiter | Safety | Enforce maximum loop iterations; prevent infinite loops | Hard counter in orchestrator; token budget guard | Critical |
| Agentic Audit Logger | Compliance | Record full reasoning trace, all tool calls, and all sources used | Langfuse, Arize AI, custom structured logger | Critical |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | User | Submit complex multi-hop query | Query string |
| 2 | Query Planner | Decompose query into sub-queries; record in scratchpad | Sub-query list + dependency graph |
| 3 | Agent (Reason) | Select next action based on scratchpad state | Tool call specification |
| 4 | Tool Executor | Execute selected tool (search / get_document / summarise) | Tool output (chunks / document / summary) |
| 5 | Agent (Observe) | Record tool output in scratchpad | Updated scratchpad |
| 6 | Self-Critique | Evaluate: "Is the evidence sufficient? What is still missing?" | Continue signal OR Stop signal |
| 7 | Loop (if Continue) | Return to step 3 with updated scratchpad context | New tool call based on gaps identified |
| 8 | Synthesis (if Stop) | Integrate all scratchpad evidence into final answer with citations | Draft answer + full citation list |
| 9 | Human Oversight Gate (if required) | Present reasoning trace and draft to human reviewer | Approved / Edited / Rejected decision |
| 10 | Audit Logger | Record complete reasoning trace, tool call sequence, all source IDs | Immutable audit record |
| 11 | Response Delivery | Return final answer with citations and (optionally) reasoning trace | Final response |
Error Flow
| Error Condition | Detection | Recovery |
|---|---|---|
| Maximum iteration limit reached without sufficient evidence | Iteration counter | Return best-available answer with "Incomplete evidence" flag; log for quality review |
| Tool call returns no results (empty retrieval) | Tool output validation | Agent reasons about the empty result; may reformulate query or acknowledge knowledge gap |
| Agent enters circular retrieval loop (same query repeated) | Query deduplication in scratchpad | Detect repeated tool calls; break loop; proceed to synthesis with available evidence |
| LLM tool-calling error (malformed function call) | JSON schema validation | Retry with re-prompted clarification; max 3 retries; escalate to error state |
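The malformed-tool-call recovery in the last row can be sketched as a bounded retry loop. This is an illustrative sketch only: `ask_model` is a hypothetical callable that returns the model's raw tool-call text, the `tool`/`args` object shape is an assumed format, and the structural check stands in for full JSON schema validation.

```python
import json
from typing import Callable

MAX_RETRIES = 3  # from the error-flow table


class ToolCallError(Exception):
    """Raised when a tool call cannot be parsed or validated."""


def parse_tool_call(raw: str) -> dict:
    """Minimal structural check; a real system would validate a full schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not valid JSON: {exc}") from exc
    if (
        not isinstance(call, dict)
        or not isinstance(call.get("tool"), str)
        or not isinstance(call.get("args"), dict)
    ):
        raise ToolCallError("expected an object with 'tool' (str) and 'args' (object)")
    return call


def get_valid_tool_call(ask_model: Callable[[str], str], prompt: str) -> dict:
    """One initial attempt plus up to MAX_RETRIES re-prompts, then escalate."""
    message = prompt
    last_error = None
    for _ in range(1 + MAX_RETRIES):
        raw = ask_model(message)
        try:
            return parse_tool_call(raw)
        except ToolCallError as exc:
            last_error = exc
            message = (
                f"{prompt}\n\nYour previous tool call was rejected: {exc}. "
                "Reply with valid JSON only."
            )
    raise ToolCallError(f"giving up after {MAX_RETRIES} retries: {last_error}")
```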
8. Security Considerations
Tool Boundary Enforcement
The most critical security control in Agentic RAG is ensuring the agent cannot invoke tools outside its defined tool set. All retrieval tools are read-only — the agent must have no access to write, update, or delete operations. Tool definitions must be validated against a schema; any tool invocation not matching a defined schema must be rejected. The agent runtime must run in a sandboxed environment with no outbound network access beyond the defined tool API endpoints.
Prompt Injection in Multi-Hop Context
Multi-hop retrieval creates a compounded prompt injection risk: an adversarial document retrieved in hop 1 could inject instructions into the agent's scratchpad that influence hop 2 onwards. The agent system prompt must explicitly instruct the model that retrieved content is data, not instructions, and the orchestrator must sanitise tool outputs before inserting them into the agent context window.
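A minimal sketch of the orchestrator-side sanitisation step follows. It assumes a delimiter convention (the `<retrieved_data>` tag name is hypothetical) that the system prompt describes as "data, never instructions", and it escapes markup so a retrieved document cannot close the data block early. This reduces, but does not eliminate, injection risk; it is one layer among several.

```python
import html
import re

# Hypothetical delimiters; the system prompt would state that anything between
# them is untrusted data and must never be followed as an instruction.
OPEN, CLOSE = "<retrieved_data>", "</retrieved_data>"

# Control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def wrap_tool_output(text: str, max_chars: int = 4000) -> str:
    """Strip control characters, escape markup, truncate, and wrap as data."""
    cleaned = _CONTROL_CHARS.sub("", text)
    # Escaping &, < and > means the content can never contain the delimiters.
    cleaned = html.escape(cleaned, quote=False)
    cleaned = cleaned[:max_chars]
    return f"{OPEN}\n{cleaned}\n{CLOSE}"
```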
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Agentic-Specific Concern | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Adversarial content retrieved in hop N influences agent reasoning for hop N+1 | Tool output sanitisation; treat all tool outputs as untrusted data in system prompt |
| LLM07: Insecure Plugin Design | Agentic tools are equivalent to plugins; must be strictly typed and scoped | Read-only tools only; strict JSON schema for tool calls; no shell or code execution tools |
| LLM08: Excessive Agency | Agent autonomously executes many retrieval steps; scope creep risk | Iteration limit; read-only tools; human oversight gate for consequential outputs |
| LLM04: Model Denial of Service | Runaway agent loops with expensive LLM calls | Hard iteration limit; token budget guard; per-user session cost limit |
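The hard iteration limit and token budget guard from the table can be combined in one small per-session object that the orchestrator consults before each loop iteration. The class and method names here are hypothetical and the default limits are placeholders rather than recommended values.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a session may not start another iteration."""


@dataclass
class SessionBudget:
    max_iterations: int = 8     # placeholder default
    max_tokens: int = 50_000    # placeholder default
    iterations: int = 0
    tokens: int = 0

    def start_iteration(self) -> None:
        """Call before each loop iteration; raises if either cap is reached."""
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded("iteration limit reached")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        self.iterations += 1

    def record_tokens(self, used: int) -> None:
        """Call after each LLM call with the tokens it consumed."""
        self.tokens += used
```

Because the check lives in the orchestrator rather than in the prompt, the agent cannot reason its way past it.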
9. Governance Considerations
Autonomous Decision Boundary
Agentic RAG must have a clearly defined boundary between autonomous retrieval (permitted) and autonomous action-taking (not permitted in this pattern). The agent may search, read, summarise, and synthesise — but the final answer must be delivered to a human, not used to trigger downstream automated actions without explicit human approval.
Reasoning Trace as Governance Artefact
Every agentic session must produce a complete, immutable reasoning trace: the initial query, each reasoning step, each tool call with its parameters, each tool output, and the final synthesis. This trace is the primary governance artefact for post-hoc review of agentic decisions. Regulators reviewing an AI-assisted compliance analysis must be able to trace every claim back to the retrieved document that supported it.
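The pattern does not prescribe how immutability is achieved. One common option, shown here purely as an assumption, is an append-only trace in which each event's hash covers the previous event's hash, so later tampering is detectable. The function names and event fields below are hypothetical.

```python
import hashlib
import json

_FIELDS = ("seq", "kind", "payload", "prev")


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the event body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(trace: list[dict], kind: str, payload: dict) -> list[dict]:
    """Return a new trace with one more event chained to the previous hash."""
    prev_hash = trace[-1]["hash"] if trace else ""
    body = {"seq": len(trace), "kind": kind, "payload": payload, "prev": prev_hash}
    return trace + [{**body, "hash": _digest(body)}]


def verify(trace: list[dict]) -> bool:
    """Check sequence numbers, chain links, and per-event hashes."""
    prev_hash = ""
    for seq, event in enumerate(trace):
        body = {key: event[key] for key in _FIELDS}
        if event["seq"] != seq or event["prev"] != prev_hash or event["hash"] != _digest(body):
            return False
        prev_hash = event["hash"]
    return True
```

In practice the trace would also be written to write-once storage; the hash chain only makes modification evident, it does not prevent it.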
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Agentic Session Trace | AI Operations | Per session | Full audit trail of all reasoning steps and tool calls |
| Tool Call Audit | Security | Weekly | Review tool call patterns for anomalies or scope creep |
| Human Override Rate Report | AI Governance | Monthly | Track rate at which human reviewers edit or reject agentic outputs |
| Iteration Distribution Report | AI Operations | Weekly | Monitor average and P99 iteration counts; identify expensive query types |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Notes |
|---|---|---|
| Average iterations per session | > 8 (investigate) | May indicate query types too complex for available corpus |
| Max iteration cap hit rate | > 10% of sessions | Corpus coverage or agent capability issue |
| Agentic session P95 latency | > 30 seconds | Optimise tool call latency; increase parallelism |
| Tool call error rate | > 2% | Tool API health issue |
| Human override rate | > 30% | Agent quality degradation; review self-critique prompts |
Service Level Objectives
| SLO | Target | Notes |
|---|---|---|
| Agentic session completion P95 | ≤ 20 seconds | Depends on corpus and query complexity |
| Iteration cap hit rate | < 5% | Measure of query/corpus fit |
| Reasoning trace completeness | 100% | Every session must have a complete audit trace |
11. Cost Considerations
Cost Drivers
| Cost Driver | Notes | Optimisation |
|---|---|---|
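| LLM inference per iteration | Each reasoning step is an LLM call; N iterations require at least N+1 LLM calls per session (N reasoning steps plus final synthesis), and more if self-critique or LLM-backed tools run as separate calls | Use smaller model for self-critique; use premium model only for final synthesis |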
| Tool call overhead (embedding per search) | Each search tool call re-embeds a potentially modified sub-query | Cache embeddings for sub-queries that are identical to previous iterations |
| Context window growth | Scratchpad grows with each iteration, so the input cost of each LLM call rises roughly linearly with iteration count and cumulative input cost rises faster than linearly | Summarise and compress scratchpad after every 3 iterations |
| Human oversight gate | Human reviewer time for high-stakes queries | Reserve HITL for designated high-stakes query types only |
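The "compress after every 3 iterations" optimisation might look like the following sketch. It assumes a 1-based iteration counter and a caller-supplied `summarise` callable (in practice an LLM call); `add_observation` and `COMPRESS_EVERY` are hypothetical names.

```python
from typing import Callable

COMPRESS_EVERY = 3  # from the cost-driver table


def add_observation(
    scratchpad: list[str],
    observation: str,
    iteration: int,
    summarise: Callable[[list[str]], str],
) -> list[str]:
    """Append an observation; on every third iteration, collapse the whole
    scratchpad into a single summary entry."""
    updated = scratchpad + [observation]
    if iteration % COMPRESS_EVERY == 0:
        return [summarise(updated)]
    return updated
```

Source IDs should be kept outside the compressed text (for example in the audit trace), since a summary may drop the identifiers that citations depend on.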
Indicative Cost Range
| Use Case | Sessions/Day | Average Iterations | Cost per Session | Monthly Cost (30-day month) |
|---|---|---|---|---|
| Research assistant (analysts) | 100 | 4 | $0.50–$2.00 | $1,500–$6,000 |
| Compliance Q&A | 500 | 3 | $0.30–$1.00 | $4,500–$15,000 |
| Complex legal research | 50 | 6 | $1.00–$4.00 | $1,500–$6,000 |
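The monthly figures in the table are consistent with sessions per day × cost per session × 30 days. The short helper below reproduces them; the 30-day month is inferred from the table's numbers rather than stated in it, and `monthly_cost` is a hypothetical name.

```python
DAYS_PER_MONTH = 30  # assumption that reproduces the table's monthly figures


def monthly_cost(
    sessions_per_day: int,
    cost_per_session: float,
    days: int = DAYS_PER_MONTH,
) -> float:
    """Estimated monthly spend, rounded to cents."""
    return round(sessions_per_day * cost_per_session * days, 2)


# Research assistant row: low and high ends of the range.
research_low = monthly_cost(100, 0.50)
research_high = monthly_cost(100, 2.00)
```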
12. Trade-Off Analysis
Agentic Orchestration Framework Comparison
| Framework | Flexibility | Observability | Production Readiness | Recommendation |
|---|---|---|---|---|
| LangChain AgentExecutor + ReAct | High | Good (LangSmith integration) | Proven | Strong choice; large ecosystem |
| LlamaIndex ReActAgent | High | Good (built-in trace logging) | Proven | Strong for document-heavy use cases |
| AutoGen (Microsoft) | Very High (multi-agent) | Moderate | Emerging | Complex multi-agent scenarios |
| Custom orchestrator | Maximum | Custom | Depends | When framework limitations are binding |
Query Planning Strategy
| Strategy | Predictability | Flexibility | Auditability |
|---|---|---|---|
| Explicit plan-then-execute | High | Low (plan may not adapt to unexpected retrieval results) | Highest |
| Implicit sequential tool calling | Medium | High | Medium |
| Hybrid (explicit plan, adaptive execution) | High | High | Highest |
Architectural Tensions
| Tension | Trade-off | Recommendation |
|---|---|---|
| Iteration depth vs. latency | Deep iteration: complete answer; shallow: fast | Configure max iterations per query type; user-selectable "quick" vs. "thorough" modes |
| Self-critique verbosity vs. cost | Verbose critique: better gap identification; concise: cheaper | Structured JSON self-critique with fixed fields; not free-form prose |
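A structured self-critique with fixed fields could be parsed and validated as sketched below. The field names (`sufficient`, `missing`, `next_queries`) are assumptions made for illustration; the pattern only requires that the fields be fixed rather than free-form.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Critique:
    sufficient: bool
    missing: list[str]
    next_queries: list[str]


def parse_critique(raw: str) -> Critique:
    """Parse the model's JSON critique and reject anything off-schema."""
    data = json.loads(raw)
    expected = {"sufficient", "missing", "next_queries"}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"critique must contain exactly the fields {sorted(expected)}")
    if not isinstance(data["sufficient"], bool):
        raise ValueError("'sufficient' must be a boolean")
    for key in ("missing", "next_queries"):
        value = data[key]
        if not (isinstance(value, list) and all(isinstance(x, str) for x in value)):
            raise ValueError(f"'{key}' must be a list of strings")
    if not data["sufficient"] and not data["next_queries"]:
        # A "continue" verdict with nothing to search for would stall the loop.
        raise ValueError("a 'continue' verdict must propose at least one next query")
    return Critique(data["sufficient"], data["missing"], data["next_queries"])
```

Fixed fields keep the critique short, which limits its cost, and give the orchestrator an explicit boolean to act on instead of interpreting prose.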
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Circular retrieval loop (agent retrieves same content repeatedly) | Medium | High (cost + latency) | Query deduplication in scratchpad; loop detection | Detect repeated tool calls; break loop; proceed to synthesis |
| Hallucinated tool calls (agent invents tool that doesn't exist) | Low | High | Tool call schema validation | Reject invalid tool calls; re-prompt with valid tool list |
| Scratchpad context overflow (exceeds LLM context window) | Medium | High | Token count monitoring | Compress scratchpad after N iterations; summarise older evidence |
| Agent misattributes evidence (cites wrong source) | Medium | High | Citation validation post-synthesis | Cross-reference every cited source ID against tool call outputs in audit trace |
| Self-critique always returns "continue" (runaway optimism) | Low | High | Iteration cap hit rate monitoring | Hard iteration cap; self-critique must use structured stopping criteria |
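The citation cross-reference in the misattribution row can be sketched as a set difference between the IDs cited in the answer and the IDs that actually appeared in tool outputs. The bracketed `[doc-id]` citation format and the function name are assumptions. Note that this check only catches citations to sources that were never retrieved; it does not verify that a retrieved source actually supports the claim.

```python
import re

# Hypothetical citation format: a source ID in square brackets, e.g. [doc-12].
CITATION = re.compile(r"\[([A-Za-z0-9_\-]+)\]")


def unsupported_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """Return cited IDs that never appeared in any tool output this session."""
    return set(CITATION.findall(answer)) - retrieved_ids


draft = "Supplier A is affected [doc-12]. Clause 4 applies [doc-99]."
problems = unsupported_citations(draft, {"doc-12", "doc-7"})
```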
14. Regulatory Considerations
| Regulation | Requirement | Agentic RAG Response |
|---|---|---|
| EU AI Act Article 14 | Human oversight capability for high-risk AI systems | Human oversight gate for agentic outputs used in consequential decisions |
| EU AI Act Article 13 | Transparency: users must understand how AI-generated outputs were produced | Reasoning trace available on request; session scratchpad as explainability artefact |
| ISO/IEC 42001 Section 8.5 | AI system operation includes monitoring autonomous behaviours | Iteration monitoring; tool call audit; human override rate tracking |
| NIST AI RMF Govern 1.7 | Document and manage AI system autonomy levels | Autonomy boundary documented: retrieval is autonomous; final answer requires human review for high-stakes use cases |
15. Reference Implementations
AWS
- Agent: Bedrock Agents (native multi-hop retrieval) or LangChain on Lambda
- Retrieval tool: Amazon Kendra or OpenSearch k-NN
- Scratchpad: Amazon ElastiCache (Redis) for session state
- Audit: CloudWatch Logs with structured JSON; X-Ray for trace
Azure
- Agent: Azure AI Studio Prompt Flow with agent orchestration, or LangChain on Azure Functions
- Retrieval tool: Azure AI Search
- Scratchpad: Azure Cache for Redis
- Audit: Azure Monitor + Application Insights
GCP
- Agent: Vertex AI Agent Builder (Grounding with Search) or LlamaIndex on Cloud Run
- Retrieval tool: Vertex AI Vector Search
- Scratchpad: Cloud Memorystore (Redis)
- Audit: Cloud Trace + Cloud Logging
16. Related Patterns
| Pattern ID | Pattern Name | Relationship |
|---|---|---|
| EAAPL-RAG001 | Enterprise RAG | Agentic RAG wraps and repeatedly invokes the RAG001 retrieval layer |
| EAAPL-RAG005 | Hybrid RAG | Recommended retrieval strategy for each agent search tool call |
| EAAPL-RAG009 | Graph RAG | Agent may invoke graph traversal as an additional tool alongside vector search |
| EAAPL-RAG003 | Secure RAG | ACL enforcement applies to every tool call within the agentic loop |
17. Maturity Assessment
Overall Maturity: Proven — Agentic RAG is deployed in production for research and compliance use cases; ReAct and Plan-and-Execute frameworks are mature; the primary ongoing challenges are iteration cost management and reasoning trace audit quality.
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology Readiness | 4 | LLM tool calling is GA and reliable; orchestration frameworks are production-grade |
| Tooling Ecosystem | 4 | LangChain, LlamaIndex, Bedrock Agents, Azure AI Studio support agentic patterns |
| Operational Guidance | 3 | Loop management and cost optimisation require tuning expertise |
| Security & Compliance | 3 | Prompt injection in multi-hop contexts and tool boundary enforcement require careful implementation |
| Scalability Evidence | 3 | Session-based; horizontal scaling straightforward; cost per session grows with complexity |
| Cost Predictability | 2 | Iteration count variability makes cost highly query-dependent; monitoring and alerting essential |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-07-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-10-15 | EAAPL Working Group | Self-critique module formalised; circular loop detection added |
| 1.2 | 2025-04-01 | EAAPL Working Group | Human oversight gate added; EU AI Act Article 14 mapping; scratchpad compression strategy |