[EAAPL-PLT005] Prompt Version Control
Category: Platform Engineering
Sub-category: Prompt Engineering / MLOps
Version: 1.1
Maturity: Proven
Tags: prompt-engineering, version-control, prompt-registry, a-b-testing, promotion-workflow, rollback, benchmarking, prompt-governance
Regulatory Relevance: EU AI Act Article 17 (Quality Management), ISO 42001 Clause 7, NIST AI RMF MANAGE 3.1
1. Executive Summary
Prompts are the primary programming interface for LLM-based systems, yet most organisations treat them as informal text strings embedded in application code, left unversioned, untested, and unreviewed. This creates a silent risk: a change to a system prompt that reaches production can catastrophically alter AI behaviour at scale before anyone notices, and the only way to roll it back is a full code deployment.
The Prompt Version Control pattern establishes prompts as first-class software engineering artefacts with all the discipline that entails: a centralised registry, semantic versioning, automated performance benchmarking, a formalised promotion workflow from development through staging to production, and atomic rollback capability. The pattern also defines prompt ownership and review processes, integrating with existing code review workflows. Organisations that implement this pattern gain reproducible AI behaviour (the same prompt version, run against the same model and sampling settings, produces consistent outputs), auditability (every AI output can be traced to the exact prompt version that produced it), and confidence to iterate prompts without fear of undetected regressions.
2. Problem Statement
Business Problem
AI product quality degrades silently when prompts change without controlled testing. Customer-facing AI features produce unexpected outputs after "minor" prompt edits, with no visibility into what changed or when. Regulatory auditors request evidence of what instructions were given to the AI model at the time of a specific decision; there is no answer.
Technical Problem
Prompts are stored as string literals in application code, environment variables, or ad hoc databases. There is no version history, no test suite for prompts, no staging environment for prompt changes, no automated comparison between prompt versions, and no mechanism to roll back a prompt change without a full code deployment.
Symptoms
- AI output quality issues traced post-facto to undocumented prompt changes in a commit buried in application code
- Multiple teams maintaining near-identical prompts independently with no shared library
- No ability to answer the audit question: "what prompt instructions were in effect for this AI decision on this date?"
- Prompt changes deployed to production alongside unrelated code changes, making rollback disproportionately disruptive
- No systematic evaluation comparing the before/after quality of a prompt change
Cost of Inaction
- Undetected prompt regressions causing AI product quality incidents that erode user trust
- Regulatory non-compliance due to inability to reconstruct AI decision context
- Duplicated prompt engineering effort across teams
- Slow, risky prompt iteration cycle discouraging AI feature experimentation
3. Context
When to Apply
- Organisation has AI features in production with prompts that change over time
- Multiple teams or individuals author and modify prompts for AI systems
- Regulatory or audit requirements mandate traceability of AI behaviour to its instructions
- Prompt performance optimisation is an ongoing engineering activity
- AI output quality incidents have occurred due to uncontrolled prompt changes
When NOT to Apply
- Single static prompt for a single feature that never changes: version control overhead not warranted
- Highly dynamic prompt construction where the prompt is entirely generated at runtime from structured data: the data pipeline, not the prompt template, is the engineering artefact
- Proof-of-concept phase: establish prompts in code first; migrate to registry when moving to production
Prerequisites
- Git repository or equivalent version control system as the backend
- Evaluation dataset (golden dataset of representative inputs and expected outputs) for benchmarking
- CI/CD pipeline for automated evaluation on pull requests
- AI API Gateway (PLT002) with prompt registry integration for runtime prompt loading
- Defined prompt ownership policy (who is accountable for each prompt's quality)
Industry Applicability
| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services | Very High | Regulatory traceability; consistent customer-facing AI outputs |
| Healthcare | Very High | Clinical AI reproducibility; regulatory approval requires controlled instructions |
| Technology / SaaS | High | Quality at scale; frequent iteration; multi-team prompt authorship |
| Legal / Professional Services | Very High | Professional responsibility for AI-assisted advice; exact instruction tracking |
| Government | High | Public accountability; audit requirements |
| Retail / E-commerce | Medium-High | Brand-consistent AI outputs; product description quality |
4. Architecture Overview
The Prompt Version Control system is architecturally similar to a software artefact registry (such as npm, Docker Hub, or Maven Central) but specialised for prompt management. It provides storage, versioning, metadata, evaluation, and deployment workflow for prompt artefacts.
Prompt Structure and Storage defines what a prompt artefact is. A prompt in the registry is not just a string; it is a structured document containing: the prompt text (system prompt, user message template, assistant prefill if applicable), metadata (name, description, use case tag, model family compatibility, author, creation date, last modified date, changelog), evaluation configuration (benchmark dataset reference, quality metrics and thresholds), and deployment state (current version in each environment: dev/staging/production). Prompts are stored in a Git-backed repository as structured YAML/JSON files, giving the registry the full history, diff, and branching capabilities of Git for free.
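As a rough illustration of the structured document described above, the following sketch models a prompt artefact as a Python dataclass and serialises it to JSON for storage in Git. The class name, field names, and example values are hypothetical; the pattern does not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PromptArtefact:
    """Hypothetical schema for one prompt file in the registry."""
    name: str
    version: str                      # semantic version, e.g. "1.2.0"
    system_prompt: str
    user_template: str                # template with request-specific variables
    description: str = ""
    use_case: str = ""
    model_families: list[str] = field(default_factory=list)
    author: str = ""
    benchmark_dataset: str = ""       # reference to the golden dataset
    thresholds: dict[str, float] = field(default_factory=dict)
    deployed: dict[str, str] = field(default_factory=dict)  # environment -> version


artefact = PromptArtefact(
    name="support-triage",
    version="1.2.0",
    system_prompt="You are a support triage assistant.",
    user_template="Classify this ticket: $ticket_text",
    use_case="customer-support",
    model_families=["example-model-family"],
    thresholds={"accuracy": 0.90},
    deployed={"dev": "1.2.0", "staging": "1.1.0", "production": "1.1.0"},
)

# Sorted keys keep Git diffs stable between versions of the same file.
serialised = json.dumps(asdict(artefact), indent=2, sort_keys=True)
```

Keeping deployment state in the same document is one option; a team could equally hold it in the metadata store so that promotions do not create Git commits.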
Semantic Versioning for Prompts adapts software versioning conventions for prompts. A MAJOR version bump (e.g., 1.0.0 → 2.0.0) indicates a breaking change in the prompt's output format or semantics—consuming applications may need to be updated. A MINOR version bump (1.0.0 → 1.1.0) indicates an improvement that is backward-compatible: better quality, additional instructions, clarified guidance. A PATCH version bump (1.0.0 → 1.0.1) indicates a non-functional change: typo correction, formatting improvement, comment addition with no output impact. This versioning discipline enables consuming applications to pin to a major version and receive safe improvements automatically.
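The versioning rules above can be sketched in a few lines. The helper names (`parse`, `bump`, `resolve_pinned`) are illustrative assumptions, not part of any registry API; the sketch only shows how a bump type maps to a new version and how a consumer pinned to a major version would pick the newest compatible release.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse(version)
    if change == "major":      # breaking change in output format or semantics
        return f"{major + 1}.0.0"
    if change == "minor":      # backward-compatible improvement
        return f"{major}.{minor + 1}.0"
    if change == "patch":      # non-functional change
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


def resolve_pinned(available: list[str], pinned_major: int) -> str:
    """Pick the highest available version within the pinned major version."""
    candidates = [v for v in available if parse(v)[0] == pinned_major]
    if not candidates:
        raise LookupError(f"no version available for major {pinned_major}")
    return max(candidates, key=parse)
```

Comparing parsed tuples rather than raw strings matters: as strings, "1.9.3" would sort above "1.10.0".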
Evaluation Framework Integration is what distinguishes this pattern from simple version control. When a prompt change is submitted (via pull request), the CI pipeline automatically executes the prompt against the registered evaluation dataset, computes quality metrics (accuracy, factuality, format compliance, brand voice score, latency), and compares results to the baseline (the current production version). The evaluation results are posted as a pull request comment with a pass/fail decision based on the configured quality thresholds. A prompt change that degrades a tracked metric below the threshold fails the check and cannot be merged without explicit override and documented justification.
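A minimal sketch of the pass/fail decision described above, assuming every tracked metric is higher-is-better and expressed as a fraction between 0 and 1 (a lower-is-better metric such as latency would need the comparison inverted). The function name and report format are hypothetical.

```python
def quality_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    thresholds: dict[str, float],
) -> tuple[bool, list[str]]:
    """Compare candidate metrics to thresholds; return (passed, report lines)."""
    passed = True
    report = []
    for metric, minimum in sorted(thresholds.items()):
        score = candidate.get(metric)
        if score is None:
            # A metric that was not computed cannot be shown to meet its threshold.
            passed = False
            report.append(f"{metric}: missing from candidate results -> FAIL")
            continue
        status = "PASS" if score >= minimum else "FAIL"
        if status == "FAIL":
            passed = False
        line = f"{metric}: candidate {score:.2f}, threshold {minimum:.2f}"
        if metric in baseline:
            line += f", baseline {baseline[metric]:.2f}"
        report.append(f"{line} -> {status}")
    return passed, report


ok, lines = quality_gate(
    candidate={"accuracy": 0.94},
    baseline={"accuracy": 0.92},
    thresholds={"accuracy": 0.90},
)
```

The report lines are what a CI step could post as the pull request comment; the override path with documented justification sits outside this function.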
Promotion Workflow defines the lifecycle stages. A new prompt version is created in the draft state in the development environment. After automated evaluation passes, it can be promoted to staging, where it is served to a percentage of staging traffic for real-world validation. After a configurable soak period (typically 48–72 hours) with no quality regression signals, it can be promoted to production. Promotion between stages requires a review and approval: from the prompt author to merge a draft, from the prompt owner (engineering manager or product manager) to promote to staging, from the AI Governance Board for high-risk use case prompts promoting to production.
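The lifecycle above behaves like a small state machine. The sketch below is one possible encoding under stated assumptions: role names are hypothetical, the minimum soak is fixed at 48 hours, and production promotion of a prompt that is not high-risk is assumed to need the prompt owner's approval, which the text does not specify.

```python
from datetime import datetime, timedelta

NEXT_STAGE = {"draft": "staging", "staging": "production"}
MIN_SOAK = timedelta(hours=48)  # lower bound of the typical 48-72 hour soak


def required_approver(target: str, high_risk: bool) -> str:
    """Return the (hypothetical) role that must approve entry into `target`."""
    if target == "production" and high_risk:
        return "ai_governance_board"
    return "prompt_owner"  # assumption for non-high-risk production promotions


def promote(stage: str, approver_role: str, high_risk: bool,
            entered_stage_at: datetime, now: datetime) -> str:
    """Validate a promotion request and return the new stage."""
    target = NEXT_STAGE.get(stage)
    if target is None:
        raise ValueError(f"no promotion defined from stage {stage!r}")
    if approver_role != required_approver(target, high_risk):
        raise PermissionError(f"{approver_role} cannot approve promotion to {target}")
    if target == "production" and now - entered_stage_at < MIN_SOAK:
        raise RuntimeError("staging soak period has not elapsed")
    return target
```

In practice the soak check would also consult quality regression signals from staging; here it only checks elapsed time.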
Rollback Mechanism is atomic and does not require a code deployment. Any environment's active prompt version can be reverted to a previous version via the registry API or portal UI with a single action. The rollback takes effect for all new requests within the gateway's prompt cache TTL (typically <60 seconds). The rollback event is logged with actor, timestamp, and reason code for the audit trail.
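To make the rollback behaviour concrete, here is an in-memory sketch in which the active version per environment is a single pointer, so a rollback is one assignment plus one audit record. The class and method names are assumptions for illustration; a real registry would persist both the pointer and the log.

```python
from datetime import datetime, timezone


class PromptRegistry:
    """Toy registry: tracks the active version per (prompt, environment)."""

    def __init__(self) -> None:
        self._active: dict[tuple[str, str], str] = {}
        self._known: dict[tuple[str, str], set[str]] = {}
        self.audit_log: list[dict[str, str]] = []

    def activate(self, prompt: str, env: str, version: str) -> None:
        key = (prompt, env)
        self._known.setdefault(key, set()).add(version)
        self._active[key] = version

    def active(self, prompt: str, env: str) -> str:
        return self._active[(prompt, env)]

    def rollback(self, prompt: str, env: str, target_version: str,
                 actor: str, reason_code: str) -> str:
        """Revert to a previously active version; return the version replaced."""
        key = (prompt, env)
        if target_version not in self._known.get(key, set()):
            raise LookupError(f"{target_version} was never active in {env}")
        replaced = self._active[key]
        self._active[key] = target_version  # the single pointer swap
        self.audit_log.append({
            "event": "rollback",
            "prompt": prompt,
            "environment": env,
            "from_version": replaced,
            "to_version": target_version,
            "actor": actor,
            "reason_code": reason_code,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return replaced
```

Restricting the target to versions that were previously active in that environment is a design choice in this sketch: it prevents a rollback from activating a version that never went through promotion.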
Runtime Prompt Loading integrates the registry with the AI API Gateway. The gateway is configured to load prompts by name and either an exact version or the latest version within a pinned major version. At request time, the gateway resolves the current production version of the named prompt from the registry (using a short-TTL cache so that the registry does not become a latency bottleneck) and assembles the final prompt by combining the template with request-specific variables. Prompt changes are therefore decoupled from application deployments: a prompt can be updated without any application code change or deployment pipeline execution.
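A minimal sketch of resolve-then-assemble with a TTL cache follows. The `fetch` callable stands in for a registry client and is an assumption, as are the record keys `version` and `template`; variables are substituted with the standard library's `string.Template`, so placeholders use the `$name` form.

```python
import time
from string import Template
from typing import Callable


class PromptLoader:
    """Resolves prompts through a TTL cache and fills in request variables."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical registry client call
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict[str, tuple[float, dict]] = {}

    def resolve(self, name: str) -> dict:
        """Return the cached record if it is fresh, otherwise refetch it."""
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        record = self._fetch(name)
        self._cache[name] = (now, record)
        return record

    def assemble(self, name: str, variables: dict[str, str]) -> tuple[str, str]:
        """Return (version, final prompt text) for one request."""
        record = self.resolve(name)
        # substitute() raises KeyError if a template variable is missing,
        # which surfaces a mismatch between caller and template early.
        text = Template(record["template"]).substitute(variables)
        return record["version"], text
```

Returning the version alongside the text lets the caller record exactly which prompt version served the request.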
5. Architecture Diagram
flowchart TD
subgraph Authoring["Authoring"]
A[Prompt Author]
B[Pull Request]
end
subgraph Registry["Prompt Registry"]
C[Prompt Store]
D{Quality Gate}
E[Version Manager]
end
subgraph Delivery["Delivery"]
F[Prompt Loader]
G[LLM Inference]
H[Audit Log]
end
A -->|submit change| B
B -->|evaluate vs baseline| D
D -->|fail| B
D -->|pass| C
C --> E
E -->|promote to prod| F
F -->|assembled prompt| G
G --> H
H -->|version record| C
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#fef9c3,stroke:#eab308
style D fill:#f3e8ff,stroke:#a855f7
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#fef9c3,stroke:#eab308
style H fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Prompt Store | Service | Store versioned prompt artefacts; Git history as version log | Git repository (GitHub/GitLab/Bitbucket) + metadata DB | Critical |
| Version Manager | Service | Assign semver; manage state machine (draft→staging→prod) | Custom service, LangSmith, Promptflow | Critical |
| Metadata Store | Service | Store prompt metadata (owner, use case, model compat, benchmark config) | PostgreSQL, SQLite-backed service | High |
| CI Evaluation Runner | Service | Execute prompts against golden dataset on PR; compute metrics | Custom harness, Ragas, DeepEval, GitHub Actions | High |
| Benchmark Dataset Store | Service | Store golden dataset of representative inputs + expected outputs | S3, DVC, Git LFS | High |
| Quality Gate | Service | Compare evaluation metrics to thresholds; pass/fail PR | Custom CI step, GitHub Status API | High |
| Promotion Workflow | Service | Manage approvals and environment transitions | GitHub PR approvals, custom workflow, Jira | High |
| Runtime Prompt Loader | Service | Resolve and cache prompt for gateway consumption | Custom, LangChain prompt hub client | Critical |
| Prompt Cache | Service | Cache resolved prompts at gateway with TTL | Redis, in-memory cache | High |
| Audit Log | Service | Record which prompt version served each request | OpenTelemetry → S3/Kafka | Critical |
| Rollback API | Service | Atomic reversion of environment's active prompt version | Custom REST API + version manager | High |
| Developer Portal Integration | Service | Surface prompt catalogue, version history, evaluation results | Backstage plugin, custom portal page | Medium |
7. Data Flow
Primary Flow — Prompt Change and Promotion
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Prompt Author | Create feature branch; edit prompt YAML; open pull request | PR with prompt diff |
| 2 | CI Pipeline | Detect prompt change; load evaluation config from prompt metadata | Evaluation job triggered |
| 3 | Benchmark Runner | Execute current version and new version against golden dataset (100–500 examples) | Quality metrics for both versions |
| 4 | Quality Gate | Compare metrics: new version accuracy 94% vs. baseline 92%; threshold 90% → PASS | PR check marked passing |
| 5 | Prompt Owner | Review PR; approve and merge | Prompt promoted to Draft in development registry |
| 6 | Promotion Approver | Review evaluation results; approve promotion to staging | Prompt version active in staging |
| 7 | Governance Review (high-risk only) | Review evaluation report and staging soak data; sign off for production | Approval record in GRC system |
| 8 | Platform Team | After the staging soak period passes without regression, promote to production | New version active in production registry |
| 9 | Gateway | On the first request after the cached entry expires, prompt loader resolves the new version; loads and caches | New prompt version serving production traffic |
Error Flow
| Error | Detection | Response |
|---|---|---|
| Quality gate failure (metric below threshold) | CI evaluation at step 4 | PR blocked; author receives metric comparison report |
| Benchmark dataset unavailable | CI pipeline setup | Evaluation skipped; PR requires manual quality review override |
| Rollback required (production incident) | Incident declared | Rollback API called; previous version active within 60 seconds |
| Prompt loader registry unavailable | Gateway health check | Serve cached prompt version; alert platform team |
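The last row of the error flow (serve the cached version when the registry is unreachable) can be sketched as a wrapper around the registry call. The function and parameter names are hypothetical, and a registry outage is assumed to surface as `ConnectionError`.

```python
from typing import Callable


def load_with_fallback(
    name: str,
    fetch: Callable[[str], dict],
    last_known_good: dict[str, dict],
    alert: Callable[[str], None],
) -> dict:
    """Fetch a prompt record, falling back to the last successfully loaded one."""
    try:
        record = fetch(name)
    except ConnectionError as exc:
        if name not in last_known_good:
            # Nothing cached for this prompt: there is no safe fallback.
            raise
        alert(f"registry unavailable, serving cached '{name}': {exc}")
        return last_known_good[name]
    # Successful fetch: refresh the fallback copy for future outages.
    last_known_good[name] = record
    return record
```

Re-raising when no cached copy exists is deliberate here; the caller then decides whether to fail the request or use a hard-coded default.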
8. Security Considerations
- Prompt content may contain system instructions that constitute IP; access to the prompt store is restricted to authorised team members via RBAC
- Prompts are evaluated for injection risk (containing instructions that could be exploited by user input concatenation) as part of the review process
- Evaluation results are stored separately from prompt content; evaluation datasets may contain sensitive representative inputs requiring the same access controls as production data
OWASP LLM Top 10 Controls
| OWASP LLM Risk | Prompt Registry Control |
|---|---|
| LLM01 Prompt Injection | Evaluation suite includes injection test cases; prompts with injection vulnerabilities fail quality gate |
| LLM02 Insecure Output Handling | Evaluation includes output format validation; format regressions caught before production |
| LLM09 Overreliance | Evaluation benchmarks track factuality and hallucination rate; regressions blocked |
9. Governance Considerations
Responsible AI
- Each prompt must declare its intended use case in metadata; this is the authoritative source for the gateway's use-case-based policy enforcement
- Prompts for high-risk AI use cases must pass an expanded evaluation suite including bias and fairness metrics
Traceability
- The runtime prompt loader records the exact prompt version (name + version + Git commit hash) in the request's audit log entry; this enables post-hoc reconstruction of what instructions were in effect for any historical AI output
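As an illustration of the traceability record described above, the sketch below writes one JSON audit entry per request and answers the audit question by request ID. The field names and helper functions are assumptions; the only elements taken from the text are the prompt name, version, and Git commit hash.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_entry(request_id: str, prompt_name: str, version: str,
                git_commit: str, environment: str) -> str:
    """Serialise the prompt identity that served one request."""
    return json.dumps({
        "request_id": request_id,
        "prompt_name": prompt_name,
        "prompt_version": version,
        "git_commit": git_commit,
        "environment": environment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }, sort_keys=True)


def prompt_in_effect(entries: list[str], request_id: str) -> Optional[dict]:
    """Return the prompt identity recorded for a request, or None if absent."""
    for raw in entries:
        entry = json.loads(raw)
        if entry["request_id"] == request_id:
            return {key: entry[key] for key in
                    ("prompt_name", "prompt_version", "git_commit")}
    return None


# "abc1234" is a placeholder commit hash for illustration only.
log = [audit_entry("req-001", "support-triage", "1.1.0", "abc1234", "production")]
```

With the commit hash in the entry, the exact prompt text can be recovered from the Git-backed store even if version labels are later reorganised.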
Governance Artefacts
| Artefact | Owner | Cadence | Location |
|---|---|---|---|
| Prompt store (all versions) | Platform Team | Continuous | Git repository |
| Evaluation benchmark datasets | Model Owner + Data Team | Per prompt; updated when representative inputs change | S3 / DVC |
| Quality thresholds configuration | Model Owner + Platform Team | Per model/use-case; reviewed quarterly | Prompt metadata |
| Promotion approval records | Promotion Approver | Per promotion | Git PR records + GRC system |
| Rollback log | Platform Team | Per rollback | Audit log |
10. Operational Considerations
Monitoring
| Signal | Source | Alert | Owner |
|---|---|---|---|
| Prompt loader cache miss rate | Gateway metrics | >20% sustained miss (registry connectivity issue) | Platform On-Call |
| Production quality metric regression | Evaluation on production sample | Statistically significant drop from benchmark | Prompt Owner |
| Prompt registry service availability | Health check | <99.9% availability | Platform On-Call |
| Failed promotion attempts | Promotion workflow | Repeated failures may indicate broken CI | Platform Team |
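The "statistically significant drop from benchmark" alert in the table above needs a concrete test. One common choice, offered here as an assumption rather than the pattern's prescribed method, is a one-sided two-proportion z-test on pass rates, implemented below with only the standard library.

```python
import math


def regression_p_value(bench_pass: int, bench_n: int,
                       prod_pass: int, prod_n: int) -> float:
    """One-sided p-value that the production pass rate is below the benchmark's.

    Uses the pooled two-proportion z-test; small values suggest a real drop.
    """
    bench_rate = bench_pass / bench_n
    prod_rate = prod_pass / prod_n
    pooled = (bench_pass + prod_pass) / (bench_n + prod_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / bench_n + 1 / prod_n))
    if std_err == 0:
        # Both samples are all-pass or all-fail: no evidence of a difference.
        return 1.0
    z = (bench_rate - prod_rate) / std_err
    # P(Z >= z) for a standard normal variable, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))


def should_alert(bench_pass: int, bench_n: int, prod_pass: int, prod_n: int,
                 alpha: float = 0.05) -> bool:
    """Alert when the drop is significant at the chosen level (default 5%)."""
    return regression_p_value(bench_pass, bench_n, prod_pass, prod_n) < alpha
```

The normal approximation is reasonable for samples in the hundreds; for very small production samples an exact test would be a safer choice.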
SLOs
| SLO | Target | Window |
|---|---|---|
| Prompt resolution latency (cached) | <5ms | Rolling 7 days |
| Prompt registry availability | 99.9% | Rolling 30 days |
| Evaluation pipeline completion time | <15 minutes for standard evaluation suite | Per run |
| Rollback execution time | <60 seconds end-to-end | Per event |
Disaster Recovery
| Component | RPO | RTO | Strategy |
|---|---|---|---|
| Prompt store (Git) | 0 | 5 min | Git replication; restore from remote |
| Metadata DB | 5 min | 15 min | Database replication |
| Prompt loader cache | 0 | 5 min | Rebuild from registry on cache miss |
| Evaluation results | 1 hour | 30 min | Recomputable from stored datasets |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
|---|---|---|
| Evaluation API calls | Running golden dataset through model on each PR | Medium; scales with dataset size and PR frequency |
| Registry hosting | Git repository + metadata DB | Very Low |
| CI compute | Evaluation pipeline execution | Low |
| Prompt loader cache | Redis or in-memory | Very Low |
Optimisations
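- Use the cheapest capable model for routine evaluation runs (not necessarily the model the prompt will run against in production) to reduce evaluation API cost; because prompt behaviour is model-specific, confirm the result against the production model before promoting to production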
- Cache evaluation results for prompts where the diff is non-functional (whitespace, comments); skip re-evaluation
- Limit golden dataset to 100–200 representative examples for routine evaluation; run extended 500+ example suite only for major version changes
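The second optimisation above (skipping re-evaluation for non-functional diffs) can be approximated by hashing a normalised form of the prompt. This sketch assumes comments are whole lines starting with a configurable prefix (default `#`) and that whitespace differences do not affect output; both assumptions should be checked per prompt, since a `#` line may be a Markdown heading and some models are sensitive to whitespace.

```python
import hashlib
import re


def functional_fingerprint(prompt_text: str, comment_prefix: str = "#") -> str:
    """Hash the prompt with comment lines removed and whitespace collapsed."""
    kept_lines = [
        line for line in prompt_text.splitlines()
        if not line.lstrip().startswith(comment_prefix)
    ]
    normalised = re.sub(r"\s+", " ", " ".join(kept_lines)).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def needs_evaluation(old_text: str, new_text: str) -> bool:
    """Return True when the change is more than whitespace or comments."""
    return functional_fingerprint(old_text) != functional_fingerprint(new_text)
```

A CI step could use the fingerprint as a cache key for stored evaluation results, so that an unchanged fingerprint reuses the previous run.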
Indicative Cost Range
| Scale | Monthly Prompt Registry Infra + Evaluation Cost |
|---|---|
| Small (5–20 prompts, low PR volume) | $200–$600 (mostly evaluation API calls) |
| Medium (50–200 prompts, active development) | $1,000–$4,000 |
| Large (500+ prompts, many teams) | $5,000–$15,000 (evaluation at scale) |
12. Trade-Off Analysis
Registry Backend Options
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Git-backed Store | Prompts as YAML/JSON files in Git repository | Full version history; PR workflow; diff tooling | Less structured; no native querying; metadata in comments | Teams with strong Git discipline; open source preference |
| Purpose-Built Prompt Registry | Dedicated tool (LangSmith, Promptflow, custom) | UI-first; evaluation integration; model comparison | Vendor dependency; may duplicate Git | Large teams; PromptOps as a discipline |
| Feature Flag System | Store prompts as feature flag values | Familiar to engineering teams; easy A/B | No evaluation integration; limited metadata; not designed for structured text | Simple use cases; teams with existing feature flag investment |
Evaluation Strategy Options
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Automated Benchmarking Only | Evaluate against golden dataset in CI | Scalable; consistent; no human bottleneck | Only as good as the golden dataset; may miss nuanced quality changes | Structured output tasks; classification; extraction |
| Human Evaluation Gate | Require human review of AI output samples for each change | Highest quality assurance | Slow; doesn't scale; expensive | Critical customer-facing prompts; high-risk use cases |
| Combined Automated + Sampling | Auto-eval for CI gate; human review of sample on promotion to prod | Balance of speed and quality | Requires evaluation team capacity | Production AI systems with quality risk |
Architectural Tensions
| Tension | Option A | Option B | Resolution |
|---|---|---|---|
| Strict evaluation gate vs. development velocity | Block all PRs failing any metric | Advisory only; never block | Block on regression (metric drops below threshold); never block on no-change; allow override with documented justification |
| Centralised registry vs. team autonomy | One registry for all prompts | Team-owned registries | One registry with team namespaces; platform team manages infrastructure, teams own their namespace |
| Prompt as code vs. prompt as configuration | Prompts in application code | Prompts in registry | Prompts in registry; application code loads by reference; decouples prompt lifecycle from code deployment |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Registry unavailable (prompt loader fails) | Low | High: all AI requests use stale cached version | Registry health check; prompt loader error rate | Extend cache TTL to last-known-good; alert platform |
| Evaluation dataset drifts from production distribution | Medium | Medium: evaluation passes but production quality declines | Production quality monitoring | Update golden dataset; trigger re-evaluation of current production version |
| Accidental production version rollback to bad prompt | Low | High: regression deployed without evaluation | Production quality drop; user feedback spike | Re-promote correct version; post-incident update of rollback approval process |
| Golden dataset contains PII | Low | High: PII exposed in CI logs and evaluation results | Data classification scan on dataset upload | Anonymise dataset; purge from CI logs; incident process |
| Evaluation quality threshold too strict (blocks all changes) | Medium | Medium: development blocked | Repeated PR failures on valid improvements | Review and recalibrate thresholds; consider per-metric thresholds vs. composite |
14. Regulatory Considerations
EU AI Act Article 17 (Quality Management System)
- The prompt versioning and promotion workflow supports the quality management system that Article 17 requires, covering the AI instructions portion of that system rather than the whole of it
- Evaluation records and promotion approvals provide the documented evidence of quality control required for high-risk AI systems
ISO 42001 Clause 7 (Support — Documented Information)
- Prompt store version history + metadata supports the Clause 7.5 requirements for controlled documented information within the AI management system
- Evaluation reports constitute performance evidence required by Clause 9.1
NIST AI RMF MANAGE 3.1
- The rollback mechanism, with its logged actor, timestamp, and reason code, provides a documented response procedure for AI behaviour changes in support of MANAGE 3.1
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Prompt store | GitHub/GitLab + AWS CodeCommit mirror |
| Metadata store | Amazon RDS PostgreSQL |
| Evaluation pipeline | AWS CodeBuild + custom evaluation harness |
| Evaluation API calls | Amazon Bedrock batch inference |
| Prompt loader cache | Amazon ElastiCache Redis |
| Audit log | CloudWatch Logs + S3 |
Azure
| Component | Azure Service |
|---|---|
| Prompt store | Azure Repos (Git) or GitHub |
| Evaluation pipeline | Azure Pipelines + PromptFlow evaluation |
| Metadata store | Azure SQL |
| Prompt loader | Azure API Management policy loading from Key Vault |
SaaS
| Component | Technology |
|---|---|
| Prompt store + evaluation + registry | LangSmith (LangChain), Promptflow (Azure AI Studio), Pezzo |
On-Premises
| Component | Technology |
|---|---|
| Prompt store | GitLab self-hosted |
| Evaluation pipeline | GitLab CI + custom Python harness |
| Metadata store | PostgreSQL |
| Prompt loader cache | Redis |
16. Related Patterns
| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-PLT001 | Enterprise AI Platform | Parent: Prompt Registry is Layer 4 of the platform |
| EAAPL-PLT002 | AI API Gateway | Consumer: gateway loads prompts from registry at runtime |
| EAAPL-PLT008 | AI Experiment Tracking | Complementary: evaluation results feed experiment tracking |
| EAAPL-PLT003 | Model Routing | Complementary: prompt version may specify model family preference |
17. Maturity Assessment
Overall Maturity: Proven
Prompt version control as a practice is established in leading AI-native organisations and large enterprises. Git-backed storage is simple and immediately deployable. Purpose-built prompt registry tools are maturing rapidly.
Scoring Matrix
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | Widely adopted in principle; tooling varies; some orgs still in ad hoc phase |
| Tooling Maturity | 3 | Git-backed: mature; purpose-built registry tools: emerging |
| Regulatory Alignment | 5 | Strong mapping to EU AI Act Article 17 and ISO 42001 |
| Operational Complexity | Low-Medium | Git workflow familiar; evaluation pipeline setup is main effort |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-05-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-06-12 | EAAPL Working Group | Semver prompt versioning section expanded; runtime prompt loading via gateway added; rollback SLO updated |