[EAAPL-RAG008] Multimodal Retrieval-Augmented Generation
Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Multimodal and Cross-Modal Retrieval
Version: 1.0
Maturity: Emerging
Tags: rag multimodal vision image-retrieval table-understanding cross-modal clip colpali document-intelligence
Regulatory Relevance: EU AI Act Article 10 (Data quality across modalities), ISO/IEC 42001 Section 6.1, NIST AI RMF (Map 1.5)
1. Executive Summary
Multimodal RAG extends the retrieval-augmented generation paradigm to knowledge corpora that include images, charts, diagrams, tables, and other non-text modalities alongside prose documents. Enterprise knowledge frequently exists in forms that standard text-based RAG cannot access: engineering diagrams, medical imaging reports, financial charts, product photographs, scanned contracts, and slide decks with data visualisations. Multimodal RAG enables users to query this knowledge in natural language and receive answers grounded in visual as well as textual evidence.
For enterprise leaders in engineering, healthcare, manufacturing, finance, and legal domains, Multimodal RAG unlocks a significant portion of the knowledge corpus that text-only RAG leaves inaccessible. A maintenance engineer asking "What does the valve assembly look like on model X?" needs a diagram, not a text description. A financial analyst asking "Show me the revenue trend chart from the Q3 investor presentation" needs the chart itself, retrieved and analysed, not a text summary of revenue figures. The pattern is emerging rather than mature — the enabling technologies (multimodal embedding models, vision-language models capable of grounded QA over retrieved images) are advancing rapidly but have not yet reached the operational reliability of text-only RAG. Early adopters with visual-heavy corpora should pilot carefully and plan for ongoing model upgrades.
2. Problem Statement
Business Problem
Enterprise document corpora are not exclusively textual. Technical manuals contain engineering diagrams. Financial reports contain charts. Contracts contain tables of fees and conditions. Training materials contain screenshots. Product documentation contains photographs. Text-only RAG silently ignores all of this visual content, leaving entire knowledge domains inaccessible to AI-assisted search and Q&A.
Technical Problem
Text embedding models cannot embed images. Vector similarity search over text embeddings cannot retrieve images by semantic query. Vision-language models capable of answering questions from image content exist but require grounded evidence retrieval — they cannot answer questions about a diagram without the diagram being present in context. The architecture must therefore solve two distinct problems: (1) how to retrieve the most relevant image/diagram/table for a given query, and (2) how to present it to the generation model in a form that enables grounded visual question answering.
Symptoms
- AI assistant cannot answer questions about diagrams, charts, or photographs even though they exist in the knowledge corpus
- Users receive text-only answers to questions that require visual evidence, and must manually locate the relevant diagram
- RAG system quality evaluations show low recall on questions derived from figure captions, table contents, or diagram annotations
- Users explicitly request "show me the diagram" or "retrieve the chart" and the system cannot comply
Cost of Inaction
- Significant portions of the knowledge corpus remain unsearchable via AI, limiting the ROI of the RAG investment
- Engineers, clinicians, and financial analysts must perform manual visual search in parallel with AI text search, duplicating effort
- Competitive disadvantage as multimodal AI capabilities become standard in enterprise knowledge platforms
3. Context
When to Apply
- Knowledge corpora where more than 10% of content value is in non-text form (images, diagrams, charts, tables)
- Technical documentation with engineering diagrams, schematics, or photographs
- Financial document Q&A where charts and tables are primary information carriers
- Healthcare document Q&A over radiology reports, clinical diagrams, or pharmaceutical product photographs
- Legal contracts with tabular terms, fee schedules, and signature pages
When NOT to Apply
- Text-only corpora where all visual content is purely decorative (logos, page backgrounds)
- Deployments with strict latency requirements where multimodal embedding and vision-language model inference adds unacceptable overhead
- Organisations without a mature multimodal data ingestion and storage capability; these should deploy text-only RAG first
Prerequisites
- A multimodal embedding model capable of embedding both text queries and image content in the same vector space (CLIP, ColPali, Nomic Embed Vision)
- A vision-language model capable of grounded visual question answering (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet all accept image input)
- An image/figure extraction pipeline (extracts figures and tables from PDFs and documents)
- Object storage for image assets (referenced from the vector index by URL or object key)
- A table understanding component (converts tables to structured JSON or markdown for LLM consumption)
Industry Applicability
| Industry | Modality | Use Case |
| --- | --- | --- |
| Engineering / Manufacturing | CAD diagrams, P&ID schematics, assembly photographs | "Show the assembly procedure for component X" |
| Financial Services | Revenue charts, balance sheet tables, trend graphs | "What does the FCF trend chart in the Q3 report show?" |
| Healthcare | Anatomical diagrams, procedure illustrations, drug formulary tables | "What does the surgical approach diagram for procedure Y look like?" |
| Legal | Contract tables (fee schedules, milestone lists), signature pages | "What are the termination fees in Table 3 of the MSA?" |
| Retail / E-commerce | Product photographs, size charts, packaging diagrams | "Show me the product dimensions chart for SKU X" |
| Architecture / Construction | Floor plans, elevation drawings, material schedules | "Show the floor plan for Level 3 of Building B" |
4. Architecture Overview
Multimodal RAG introduces two new ingestion paths and one new retrieval path alongside the standard text pipeline. Understanding the distinct characteristics of each modality is essential to designing an effective architecture.
Multimodal Document Parsing
The ingestion pipeline must first extract non-text elements from documents. For PDFs, this requires a document parsing step that identifies figures, tables, and images and extracts them as separate assets. Tools such as Apache Tika, AWS Textract, Azure Document Intelligence, or Google Document AI can extract figures and tables from PDFs with varying quality. Each extracted asset is assigned a unique asset ID, a reference to its parent document, a page number, a caption (if present), and surrounding context text (the paragraphs immediately before and after the figure in the source document).
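The per-asset record described above can be sketched as a small data structure. This is a minimal illustration, not the output format of any particular parser: the names `ExtractedAsset`, `surrounding_context`, and `make_asset` are hypothetical, and it assumes the parser has already supplied the document's paragraphs in reading order together with the asset's position among them.

```python
from dataclasses import dataclass
from typing import List, Optional
import uuid


@dataclass(frozen=True)
class ExtractedAsset:
    """One figure or table pulled out of a source document (hypothetical schema)."""
    asset_id: str
    doc_id: str
    page: int
    kind: str                 # "figure" or "table"
    caption: Optional[str]    # None when the parser found no caption
    context_text: str         # paragraphs immediately around the asset


def surrounding_context(paragraphs: List[str], position: int, window: int = 1) -> str:
    """Join up to `window` paragraphs on each side of the asset.

    `position` is the number of paragraphs that precede the asset, so the
    asset sits between paragraphs[position - 1] and paragraphs[position].
    """
    start = max(0, position - window)
    end = min(len(paragraphs), position + window)
    return "\n".join(paragraphs[start:end])


def make_asset(doc_id: str, page: int, kind: str, caption: Optional[str],
               paragraphs: List[str], position: int) -> ExtractedAsset:
    """Assign a unique asset ID and attach parent-document metadata."""
    return ExtractedAsset(
        asset_id=uuid.uuid4().hex,
        doc_id=doc_id,
        page=page,
        kind=kind,
        caption=caption,
        context_text=surrounding_context(paragraphs, position),
    )
```

Keeping the context text on the record means a caption-less figure still carries some searchable description.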
Image Embedding Path
Each extracted image is embedded using a multimodal embedding model that produces vectors in the same semantic space as text embeddings. CLIP (Contrastive Language-Image Pre-training) is the canonical architecture: trained on image-text pairs, it produces comparable embeddings for text queries and images, enabling cross-modal retrieval ("text query → retrieve relevant images"). ColPali is an emerging alternative that produces multi-vector patch-level embeddings for higher-resolution document understanding.
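The two scoring styles mentioned above can be compared on toy vectors. The sketch below assumes the embeddings have already been produced by a real model; it does not call CLIP or ColPali. `cosine` and `rank_images` illustrate single-vector retrieval in a shared space, and `maxsim` illustrates the late-interaction scoring used by multi-vector models, where each query token is matched to its best image patch. All function names and vectors are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def rank_images(query_vec: Vector, image_vecs: Dict[str, Vector],
                top_k: int = 3) -> List[Tuple[str, float]]:
    """Single-vector retrieval: one embedding per image, ranked by cosine."""
    scored = [(image_id, cosine(query_vec, vec)) for image_id, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def maxsim(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Multi-vector (late-interaction) score: for each query token take the
    best-matching patch similarity, then sum over query tokens."""
    if not page_patches:
        return 0.0
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)
```

A brute-force loop like this is only suitable for illustration; a production index would use approximate nearest-neighbour search.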
Images are stored in object storage (S3, Azure Blob, GCS) and referenced by URL in the vector index. The vector index entry contains the image URL, the image's embedding, the parent document ID, the caption, and surrounding context text. Store only references in the vector database, not the full image bytes.
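A reference-only index entry might look like the following sketch. The entry layout (`id`, `vector`, `payload`) and the bucket name `example-bucket` are assumptions for illustration and do not correspond to a specific vector database's schema; the object key follows the `{doc_id}/{image_id}.png` layout used in the data flow later in this pattern.

```python
from typing import Any, Dict, Sequence


def object_key(doc_id: str, image_id: str, ext: str = "png") -> str:
    """Object-storage key for an extracted image."""
    return f"{doc_id}/{image_id}.{ext}"


def build_image_entry(image_id: str, doc_id: str, embedding: Sequence[float],
                      caption: str, context_text: str,
                      bucket: str = "example-bucket") -> Dict[str, Any]:
    """Build a vector-index entry that points at the image instead of embedding its bytes."""
    return {
        "id": image_id,
        "vector": list(embedding),
        "payload": {
            # Only a reference is stored; the bytes stay in object storage.
            "image_url": f"s3://{bucket}/{object_key(doc_id, image_id)}",
            "doc_id": doc_id,
            "caption": caption,
            "context_text": context_text,
        },
    }
```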
Table Understanding Path
Tables require a distinct handling path because they are structured data, not natural language. Table extraction (via document intelligence services) produces structured table representations. These are then converted to either Markdown table format (for LLM consumption) or JSON (for structured query interfaces). The table as a whole is embedded as a text embedding (of its Markdown representation) for retrieval, not as an image.
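The conversion to Markdown can be sketched as below, assuming the table extractor has already returned a header row and data rows as lists of cell values. `table_to_markdown` is a hypothetical helper; the sample cells in the usage note are invented.

```python
from typing import List, Sequence


def _escape(cell: object) -> str:
    """Make a cell safe for a single-line Markdown table row."""
    return str(cell).replace("|", "\\|").replace("\n", " ").strip()


def table_to_markdown(header: Sequence[object], rows: List[Sequence[object]]) -> str:
    """Render an extracted table as a Markdown pipe table.

    Short rows are padded with empty cells and long rows are truncated so
    every row has the same number of columns as the header.
    """
    width = len(header)
    lines = [
        "| " + " | ".join(_escape(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        cells = (list(row) + [""] * width)[:width]
        lines.append("| " + " | ".join(_escape(c) for c in cells) + " |")
    return "\n".join(lines)
```

For example, `table_to_markdown(["Milestone", "Fee"], [["Termination", "10%"]])` yields a three-line table whose text can then be passed to the text embedding model and, at query time, to the LLM.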
Cross-Modal Retrieval
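At query time, the user's text query is embedded and used to search the text chunk index and the image/table indexes in parallel. Each index must be queried with a vector from the same model that built it: the multimodal embedding model for the image index, and the text embedding model for the text chunk and table indexes where those are embedded separately. The ranked results from each index are then merged, using Reciprocal Rank Fusion (RRF) or a weighted score combination, and the top-K results across all modalities are selected. The context assembler then constructs a multimodal prompt that includes both text chunks and images (base64-encoded or URL-referenced, depending on the VLM API).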
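The RRF merge step can be sketched as follows. Each index contributes an ordered list of item IDs, and an item's fused score is the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is a commonly used default rather than a value taken from this pattern, and `rrf_merge` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_merge(rankings: Dict[str, List[str]], k: int = 60,
              top_k: int = 5) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion over per-modality ranked ID lists.

    `rankings` maps an index name (e.g. "text", "image", "table") to item IDs
    ordered best-first. Ties are broken by ID so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:top_k]
```

Because RRF uses ranks rather than raw similarity scores, it avoids comparing scores produced by different embedding models. When the indexes hold disjoint IDs, the result is effectively a rank-wise interleave; an item gains an advantage only when it is returned by more than one index, for example a figure found via both its image embedding and its caption text.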
Vision-Language Model for Grounded Visual QA
The generation step uses a vision-language model (VLM) that accepts both text and images in its context window. The VLM is instructed to answer the user's question based on the provided text and visual evidence, citing both text sources and image sources. The prompt structure places the visual evidence (images, tables) alongside the text chunks and explicitly asks the model to ground its response in the visual content when relevant.
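One way to lay out such a prompt is sketched below. The part structure (`type`, `text`, `media_type`, `data_base64`) is a vendor-neutral placeholder, not the request format of any specific VLM API, and would need to be mapped onto the chosen provider's message schema. The instruction wording and the `[T1]` / `[I1]` citation labels are likewise assumptions.

```python
import base64
from typing import Any, Dict, List

INSTRUCTIONS = (
    "Answer using only the evidence below. Cite text as [T1], [T2] and "
    "images as [I1], [I2]. Describe only what is visible in the images. "
    "If the evidence is insufficient, say so."
)


def build_multimodal_prompt(question: str,
                            text_chunks: List[Dict[str, str]],
                            images: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Assemble an ordered list of prompt parts: instructions, text evidence,
    labelled images, then the user's question."""
    parts: List[Dict[str, str]] = [{"type": "text", "text": INSTRUCTIONS}]
    for i, chunk in enumerate(text_chunks, start=1):
        parts.append({"type": "text",
                      "text": f"[T{i}] (source: {chunk['source']})\n{chunk['text']}"})
    for i, image in enumerate(images, start=1):
        # A text label precedes each image so the model can cite it by ID.
        parts.append({"type": "text", "text": f"[I{i}] caption: {image['caption']}"})
        parts.append({"type": "image",
                      "media_type": image["media_type"],
                      "data_base64": base64.b64encode(image["bytes"]).decode("ascii")})
    parts.append({"type": "text", "text": f"Question: {question}"})
    return parts
```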
5. Architecture Diagram
flowchart TD
subgraph Ingestion["Multimodal Ingestion"]
A[Source Documents]
B[Document Parser]
C[Text Vector Index]
D[Image + Table Indexes]
end
subgraph Query["Cross-Modal Retrieval"]
E[User Query]
F[Multimodal Embedder]
G[Cross-Modal Merger]
end
subgraph Generation["VLM Generation"]
H[Multimodal Context]
I[Vision-Language Model]
end
A --> B
B -->|text chunks| C
B -->|images + tables| D
E --> F
F --> C
F --> D
C --> G
D --> G
G --> H --> I --> E
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#fef9c3,stroke:#eab308
style D fill:#fef9c3,stroke:#eab308
style E fill:#dbeafe,stroke:#3b82f6
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#f0fdf4,stroke:#22c55e
style H fill:#f0fdf4,stroke:#22c55e
style I fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Document Intelligence / Parser | Data Processing | Extract text, figures, and tables from documents | AWS Textract, Azure Document Intelligence, Google Document AI, Unstructured.io | Critical |
| Image/Figure Extractor | Data Processing | Isolate figure regions from PDFs; extract captions and context | PyMuPDF, pdfplumber, Apache Tika, Document AI | High |
| Object Storage | Storage | Store extracted image assets referenced by vector index | Amazon S3, Azure Blob Storage, Google Cloud Storage | Critical |
| Multimodal Embedding Model | ML Inference | Embed images and text in shared vector space | OpenAI CLIP, ColPali, Nomic Embed Vision, Google multimodal embedding | Critical |
| Table Extractor | Data Processing | Extract table data as structured representation | AWS Textract (table mode), Azure DI table extraction, camelot-py | High |
| Image Vector Index | Storage | ANN index over image embeddings with metadata | Weaviate (multi-vector), Qdrant, Pinecone | Critical |
| Table Vector Index | Storage | Index of table embeddings (as text) with structured metadata | Same vector DB; separate namespace/collection | High |
| Cross-Modal Retrieval Orchestrator | Retrieval | Query all modality indexes; merge results | Custom Python; LangChain multi-retriever | High |
| Multimodal Context Assembler | Orchestration | Construct VLM prompt with text + images + tables | Custom; LangChain; LlamaIndex multimodal retriever | High |
| Vision-Language Model | ML Inference | Generate grounded answer from multimodal context | GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro | Critical |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Document Intelligence | Parse PDF; extract text blocks, figure regions, table regions | Separated text, image bytes, table data per document |
| 2 | Image Extractor | Crop figure regions; extract caption and surrounding text context | {image_id, image_bytes, caption, context_text, page, doc_id} |
| 3 | Object Storage | Store image_bytes at s3://bucket/{doc_id}/{image_id}.png | Image URL |
| 4 | Multimodal Embedding | Embed image using CLIP/ColPali | Image embedding vector |
| 5 | Image Vector Index | Upsert {image_id, embedding, image_url, caption, context_text, doc_id} | Indexed image entry |
| 6 | Table Extractor | Extract table as Markdown/JSON | {table_id, markdown, page, doc_id} |
| 7 | Table Vector Index | Embed Markdown representation; upsert | Indexed table entry |
| 8 | User | Submit natural language query | Query string |
| 9 | Multimodal Query Embedder | Embed query using the same multimodal model | Query vector (comparable to image and text vectors) |
| 10 | Cross-Modal Retrieval | ANN search across text, image, and table indexes | Candidates from each modality |
| 11 | Modal Result Merger | Apply RRF across modalities | Unified ranked candidate list |
| 12 | Context Assembler | Fetch text chunks; fetch image bytes (base64) or URLs; fetch table Markdown | Multimodal prompt: text + images + tables |
| 13 | VLM | Generate answer grounded in visual + textual evidence | Response with image and text citations |
Error Flow
| Error Condition | Detection | Recovery |
| --- | --- | --- |
| Figure extraction fails (complex PDF layout) | Extraction error log; empty image count | Ingest text-only for failed pages; flag document for manual review |
| Multimodal embedding model unavailable | API health check | Fall back to caption-text embedding (text-only retrieval for images); surface quality degradation |
| VLM image token limit exceeded | Token count validation before VLM call | Reduce number of images in context; summarise image captions instead |
| Image URL expired (object storage pre-signed URL) | HTTP 403 on VLM image fetch | Use long-lived URLs or regenerate pre-signed URL at query time |
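The recovery for an exceeded image token limit can be sketched as a pre-call budget check. This is a minimal illustration that assumes each candidate image already carries an estimated token count (how that estimate is obtained depends on the VLM and is not shown) and that candidates arrive sorted best-first. `fit_images_to_budget` and its field names are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_images_to_budget(images: List[Dict], max_image_tokens: int) -> Tuple[List[str], List[str]]:
    """Greedily keep the highest-ranked images that fit the token budget.

    Returns (IDs of images to send as pixels, captions to send as text
    for the images that were dropped).
    """
    kept_ids: List[str] = []
    caption_fallbacks: List[str] = []
    used = 0
    for image in images:  # assumed sorted by descending retrieval score
        if used + image["tokens"] <= max_image_tokens:
            kept_ids.append(image["id"])
            used += image["tokens"]
        else:
            # Over budget: degrade to caption text rather than dropping silently.
            caption_fallbacks.append(image["caption"])
    return kept_ids, caption_fallbacks
```

Running this check before the VLM call turns a hard request failure into a graceful reduction in visual evidence.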
8. Security Considerations
Image Content Classification
Images in enterprise documents may contain sensitive content (facial photographs, handwritten signatures, confidential diagrams). Image classification must be applied at ingestion to flag sensitive images and enforce the same ACL-based access controls as text documents (EAAPL-RAG003). A document with a PROTECTED classification propagates that classification to all extracted images.
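Classification propagation and retrieval-time filtering can be sketched as below. The three-level ladder in `LEVELS` is a hypothetical scheme chosen for illustration (only PROTECTED is named in this pattern), and the function names are assumptions. The image inherits the parent document's level and can be raised, but never lowered, by an image-level sensitivity flag.

```python
from typing import Dict, List, Optional

# Hypothetical classification ladder, ordered from least to most sensitive.
LEVELS = ["PUBLIC", "INTERNAL", "PROTECTED"]


def effective_classification(doc_level: str, image_level: Optional[str] = None) -> str:
    """Return the stricter of the parent document's level and the image's own flag."""
    if image_level is None:
        return doc_level
    # LEVELS.index raises ValueError on an unknown level, which fails closed.
    return max(doc_level, image_level, key=LEVELS.index)


def filter_by_clearance(candidates: List[Dict], user_clearance: str) -> List[Dict]:
    """Drop retrieved images whose classification exceeds the user's clearance."""
    limit = LEVELS.index(user_clearance)
    return [c for c in candidates if LEVELS.index(c["classification"]) <= limit]
```

Applying the filter to retrieval candidates, before context assembly, keeps restricted images out of the VLM prompt entirely.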
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Multimodal-Specific Concern | Mitigation |
| --- | --- | --- |
| LLM01: Prompt Injection | Visual prompt injection: adversarial content embedded in images (invisible text in images) | Image content safety scanner before indexing and before VLM context assembly |
| LLM06: Sensitive Information Disclosure | Image contains PII (photograph, signature, handwritten notes) visible to unauthorised users | ACL enforcement on image retrieval; image-level classification tagging |
| LLM02: Insecure Output Handling | VLM describes confidential diagram content verbatim | Output classification labelling; no verbatim reproduction of classified visual content |
9. Governance Considerations
Visual Content Governance
All extracted images must be inventoried as part of the corpus inventory (EAAPL-KNW003). Images containing personal information (photographs, handwritten documents) require explicit privacy assessment. The image extraction pipeline must be reviewed by the Privacy Officer before processing HR or healthcare documents.
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Multimodal Corpus Inventory | Knowledge Manager | Continuous | Track all images, tables, and their source documents |
| Image Classification Audit | Privacy Officer | Quarterly | Review images containing personal information |
| Visual Retrieval Quality Report | AI Operations | Monthly | Benchmark cross-modal retrieval recall on a test set |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Notes |
| --- | --- | --- |
| Image extraction success rate | < 90% | Document Intelligence API quality issue |
| Multimodal embedding API latency | > 500ms | Affects ingestion throughput |
| VLM image token cost per session | > $0.50 | High image count in context; optimise retrieval K for images |
| Cross-modal retrieval recall (benchmark) | < 0.70 | Multimodal embedding quality degradation |
Service Level Objectives
| SLO | Target | Notes |
| --- | --- | --- |
| Multimodal query P95 latency | ≤ 5 seconds | Longer than text-only due to image embedding and VLM processing |
| Image extraction coverage | ≥ 90% of documents with figures | Measured monthly |
| Visual retrieval recall@5 | ≥ 0.70 on benchmark | Measured monthly |
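The recall@5 objective can be computed from a labelled benchmark as sketched here. It assumes each benchmark query comes with the set of image IDs judged relevant and the ranked IDs the system returned; the function names and sample IDs are illustrative.

```python
from typing import List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("benchmark query has no relevant items")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(benchmark: List[Tuple[List[str], Set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in benchmark) / len(benchmark)
```

A monthly job could compare `mean_recall_at_k(benchmark, k=5)` with the 0.70 target and raise an alert when it falls below.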
11. Cost Considerations
Cost Drivers
| Cost Driver | Notes | Optimisation |
| --- | --- | --- |
| Document Intelligence (image extraction) | $10–$30 per 1,000 pages | Batch processing; cache extraction results |
| Multimodal embedding | Higher cost than text embedding; CLIP APIs ~$0.05–0.15/1K images | Self-host CLIP for large corpora |
| VLM with image input | Images are billed as input tokens and a single image can consume hundreds of tokens or more, so image-heavy contexts cost far more than text-only ones (GPT-4o: $2.50/1M input tokens) | Limit images per context window; use lower-resolution images when detail is not required |
| Object storage (images) | $0.02–$0.025/GB/month | Lifecycle policies to move old images to cheaper storage tiers |
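The VLM image cost line can be turned into a rough estimate with simple arithmetic: images × tokens per image × price per million tokens. In the sketch below the $2.50 per million figure is the GPT-4o rate quoted above, while the 1,000 tokens-per-image value in the usage note is an illustrative placeholder; the real count depends on the model, image resolution, and detail setting.

```python
def image_cost_usd(num_images: int, tokens_per_image: int,
                   usd_per_million_tokens: float) -> float:
    """Estimated input cost of sending `num_images` images to a VLM."""
    return num_images * tokens_per_image * usd_per_million_tokens / 1_000_000
```

Under those assumptions, three images per query cost `image_cost_usd(3, 1000, 2.50)`, about $0.0075, so a session would need roughly 67 such queries before reaching the $0.50 alert threshold listed under Monitoring.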
Indicative Cost Range
| Deployment Scale | Monthly Cost (Multimodal) | Notes |
| --- | --- | --- |
| Small pilot (< 100K images) | $500 – $2,000 | Primarily extraction and embedding setup cost |
| Medium (100K–1M images) | $2,000 – $10,000 | VLM query cost becomes dominant |
| Large (> 1M images) | $10,000 – $50,000 | Self-hosted CLIP; VLM batching |
12. Trade-Off Analysis
Multimodal Embedding Approach
| Approach | Cross-Modal Quality | Cost | Complexity | Recommendation |
| --- | --- | --- | --- | --- |
| CLIP (ViT-B/32 or ViT-L) | Good | Low (self-hostable) | Low | Default for most deployments |
| ColPali (multi-vector patch) | Higher for document images | Higher compute | Medium | For document-heavy corpora (PDFs with diagrams) |
| Caption-only embedding | Low cross-modal quality | Very Low | Very Low | Fallback only; not recommended for visual retrieval |
Table Handling Strategy
| Strategy | Retrieval Quality | Structured Query Support | Complexity |
| --- | --- | --- | --- |
| Markdown text embedding | Good | None | Low |
| JSON structured representation | Good | SQL-like queries possible | Medium |
| Table as image (render table as PNG) | Moderate | None | Low |
Architectural Tensions
| Tension | Trade-off | Recommendation |
| --- | --- | --- |
| Context window image count vs. VLM cost | More images: better visual grounding; higher token cost | Cap images at 3 per query; prioritise highest-scored image retrieval |
| Image resolution vs. processing speed | High resolution: better VLM understanding; higher token cost and latency | Use 512px thumbnails for retrieval context; offer "view full resolution" link |
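The thumbnail recommendation can be sketched as a dimension calculation. It assumes "512px" means the longest side is capped at 512 pixels with the aspect ratio preserved, which is one reasonable reading and not something this pattern states. The function computes only the target size; the actual resampling would be done by an imaging library.

```python
from typing import Tuple


def thumbnail_size(width: int, height: int, max_side: int = 512) -> Tuple[int, int]:
    """Target dimensions with the longest side capped at `max_side`.

    Images already within the limit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))
```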
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| VLM hallucinates about image content not present in context | Medium | High | Citation validation; visual grounding check | Explicit prompt instruction to describe only visible content; confidence scoring |
| Figure extraction misses complex multi-column layouts | High | Medium | Extraction coverage monitoring | Manual review queue for documents with < 80% figure coverage |
| Cross-modal embedding model version drift | Low | High | Retrieval quality benchmark | Atomic re-embedding on model upgrade (same process as text-only RAG) |
| Object storage image URL expiry causing VLM 403 | Medium | High | VLM error log | Use long-lived signed URLs; regenerate at query time |
14. Regulatory Considerations
| Regulation | Requirement | Multimodal RAG Response |
| --- | --- | --- |
| Privacy Act 1988 (Australia), APP 11 | Sensitive personal information (photographs) must be protected | Facial photograph detection at ingestion; restricted access for documents containing photographs |
| EU AI Act Article 10 | Training and operational data quality across all modalities | Image extraction quality metrics; multimodal benchmark on representative corpus |
| GDPR Article 9 | Processing of special categories of data (medical images, biometrics) is prohibited unless an Article 9(2) condition, such as explicit consent, applies | Healthcare and biometric images require a documented Article 9(2) basis (for example, explicit consent) and a separate access control tier |
15. Reference Implementations
AWS
- Document Intelligence: Amazon Textract (figure + table extraction)
- Image storage: Amazon S3
- Multimodal embedding: Amazon Titan Multimodal Embeddings G1 or self-hosted CLIP on SageMaker
- Image vector index: Amazon OpenSearch Service with k-NN
- VLM: Amazon Bedrock (Claude 3.5 Sonnet or Nova)
Azure
- Document Intelligence: Azure AI Document Intelligence (figure + table extraction)
- Image storage: Azure Blob Storage
- Multimodal embedding: Azure AI Vision multimodal embeddings, or self-hosted CLIP on Azure Machine Learning
- Image vector index: Azure AI Search (vector mode)
- VLM: Azure OpenAI GPT-4o (native image input)
GCP
- Document Intelligence: Google Document AI
- Image storage: Google Cloud Storage
- Multimodal embedding: Vertex AI Multimodal Embeddings
- Image vector index: Vertex AI Vector Search
- VLM: Vertex AI Gemini 1.5 Pro (native multimodal)
16. Related Patterns
| Pattern ID | Pattern Name | Relationship |
| --- | --- | --- |
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG008 extends text retrieval with cross-modal capability |
| EAAPL-RAG005 | Hybrid RAG | Hybrid retrieval applied to text path; image path uses cross-modal embedding only |
| EAAPL-RAG009 | Graph RAG | Diagram elements can be modelled as knowledge graph entities; complementary |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Corpus management must include visual asset lifecycle |
17. Maturity Assessment
Overall Maturity: Emerging. Multimodal embedding models and VLMs with image input are production-grade (GPT-4o, Gemini 1.5 Pro), and document intelligence for figure extraction is mature. End-to-end multimodal RAG pipelines are in early production at leading enterprises, but tooling is less standardised than for text-only RAG.
| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Technology Readiness | 3 | VLMs are GA; multimodal embedding models are evolving rapidly; figure extraction quality varies |
| Tooling Ecosystem | 2 | No turnkey multimodal RAG framework; significant custom development required |
| Operational Guidance | 2 | Limited production guidance; benchmark and evaluation standards for visual retrieval are nascent |
| Security & Compliance | 2 | Image ACL enforcement and visual PII detection are less mature than text equivalents |
| Scalability Evidence | 2 | Limited large-scale production evidence; cost at scale not fully characterised |
| Cost Predictability | 2 | VLM image token costs are high and variable; optimisation strategies are still evolving |
18. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-01-10 | EAAPL Working Group | Initial publication; ColPali and GPT-4o multimodal integrated |