EAAPL-MDL001 — Model Versioning and Lineage Tracking
| Attribute | Value |
|---|---|
| Pattern ID | EAAPL-MDL001 |
| Name | Model Versioning and Lineage Tracking |
| Maturity | Mature |
| Complexity | Low |
| Tags | model-register traceability accountability low-complexity |
| Last Reviewed | 2026-06-12 |
| Owner | Enterprise AI Architecture Practice |
1. Executive Summary
AI models are long-lived, continuously evolving artefacts. Without systematic versioning, organisations cannot answer the two questions regulators and boards ask first: which model made that decision, and who approved it? EAAPL-MDL001 establishes a disciplined versioning schema adapted from semantic versioning principles, a co-located artefact bundle (weights, configuration, tokeniser, preprocessing code, evaluation results, model card, provenance manifest), and an approval workflow that gates every version before it reaches production. By registering every version in the enterprise Model Register (EAAPL-GOV001), the organisation achieves full lineage, from training data provenance through evaluation evidence to production retirement. The pattern is deliberately low-complexity: it imposes process rigour, not infrastructure cost. CIOs gain a defensible audit trail for internal audit and external regulatory inquiries. CTOs gain a reproducible deployment pipeline and the ability to perform rapid rollback. Risk officers gain the model card evidence needed for model risk management frameworks. Any organisation deploying one or more AI models to production should treat this pattern as a foundation: it is a prerequisite for shadow deployment (EAAPL-MDL002), canary release (EAAPL-MDL003), and rollback (EAAPL-MDL004).
2. Problem Statement
2.1 Business Problem
Product and risk teams cannot determine which model version is serving a given business function. When a customer complaint or regulatory inquiry arrives, the response is "we think it was the model from Q3" — an answer that destroys regulator confidence and exposes the organisation to material censure.
2.2 Technical Problem
Model artefacts (weights, configuration, evaluation results) are stored in ad-hoc locations — cloud storage buckets without naming conventions, local developer machines, CI/CD pipeline caches. Versions are identified by timestamps or commit hashes that carry no semantic meaning. There is no link between a deployed model and the training run, dataset, or evaluation that produced it.
2.3 Symptoms
- Incident post-mortems cannot identify which model version caused the issue.
- Re-running training with "the same" code and data produces a different model with no explanation.
- Multiple teams deploy models with the same name but different capabilities, causing downstream consumer confusion.
- Evaluation results exist in spreadsheets disconnected from the deployed artefact.
- Deprecation is ad-hoc: consumers discover a model has been retired when their API calls start failing.
2.4 Cost of Inaction
| Category | Indicative Impact |
|---|---|
| Regulatory | Inability to respond to EU AI Act Article 13 transparency requests or APRA CPS 234 audit requests, risking the imposition of licence or operating conditions |
| Operational | 4–8 hour incident resolution time (vs <30 minutes with lineage) when a model defect causes a production incident |
| Reputational | Customer or media exposure of uncontrolled AI deployment practices |
| Financial | Re-training cost when original training run is unrecoverable — typically $10K–$500K for large foundation fine-tunes |
3. Context
3.1 When to Apply
- Any organisation deploying one or more AI/ML models to production systems.
- Teams preparing for regulatory examination under the EU AI Act, APRA CPS 234, or equivalent.
- Organisations implementing a Model Risk Management framework.
- Prior to implementing shadow deployment, canary release, or ensemble patterns.
3.2 When NOT to Apply
- Pure research/experimentation environments with no production deployment path (apply lightweight notebook versioning instead).
- Models trained and discarded within a single batch pipeline with no reuse intent.
3.3 Prerequisites
| Prerequisite | Detail |
|---|---|
| Artefact storage infrastructure | Immutable object storage (S3/GCS/Azure Blob) with versioning enabled |
| Model Register (EAAPL-GOV001) | Central registry to receive model version registrations |
| CI/CD pipeline | Automated pipeline that can invoke version registration on successful training/eval |
| Model Card template | Organisational template for model documentation (per-version) |
3.4 Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | APRA CPS 234, ASIC guidance on algorithmic accountability |
| Healthcare | Critical | TGA/FDA AI guidance, patient safety traceability |
| Government | High | APS AI Ethics Principles, FOI obligations |
| Retail / E-commerce | Medium | Product recommendation auditability, A/B governance |
| Technology | Medium | SOC 2 Type II, enterprise customer audit requirements |
| Manufacturing | High | ISO 9001 change control adapted to AI components |
4. Architecture Overview
4.1 Versioning Schema
Model versioning in an enterprise context adapts semantic versioning (SemVer) to the unique characteristics of machine learning models. A version takes the form MAJOR.MINOR.PATCH[-suffix].
A MAJOR version increment signals an architecture change that breaks backward compatibility for consumers. Examples: replacing a transformer encoder with a decoder-only architecture; switching from a BERT-class model to a GPT-class model; changing the output schema (e.g., adding a confidence distribution to a previously point-estimate model). Consumers must update their integration when MAJOR changes.
A MINOR version increment signals a significant capability change that is backward-compatible at the API level. Examples: fine-tuning on a new domain dataset that meaningfully shifts output distribution; retraining with 12 additional months of data; updating the base model (e.g., from GPT-4o to GPT-4.1) while preserving the same prompt interface. Consumers should validate that downstream quality expectations are still met.
A PATCH version increment signals a performance or efficiency change that does not materially alter model behaviour. Examples: INT8 quantisation of weights; ONNX export optimisation; vocabulary trimming; prompt template correction for a minor wording issue. Consumers are expected to accept PATCH upgrades without re-validation.
A suffix carries pre-release or environment signals: -alpha, -beta, -rc.1, -shadow (currently in shadow testing per EAAPL-MDL002), -deprecated.
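
The schema above can be illustrated with a minimal sketch. The `ModelVersion` class, its regular expression, and the `bump` method are hypothetical names introduced for illustration, not part of any registry API. The sketch assumes suffixes contain only letters, digits, and dots, and that an increment resets lower components to zero and drops the suffix, as SemVer does.

```python
import re
from dataclasses import dataclass
from typing import Optional

# MAJOR.MINOR.PATCH with an optional suffix such as -rc.1 or -shadow.
# The allowed suffix characters are an assumption of this sketch.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.]+))?")


@dataclass(frozen=True)
class ModelVersion:
    major: int
    minor: int
    patch: int
    suffix: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        """Parse a string such as '2.3.1' or '2.3.1-rc.1'."""
        match = _VERSION_RE.fullmatch(text)
        if match is None:
            raise ValueError(f"not a valid model version: {text!r}")
        major, minor, patch, suffix = match.groups()
        return cls(int(major), int(minor), int(patch), suffix)

    def bump(self, change: str) -> "ModelVersion":
        """Return the next version for a 'major', 'minor', or 'patch' change."""
        if change == "major":
            return ModelVersion(self.major + 1, 0, 0)
        if change == "minor":
            return ModelVersion(self.major, self.minor + 1, 0)
        if change == "patch":
            return ModelVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        core = f"{self.major}.{self.minor}.{self.patch}"
        return f"{core}-{self.suffix}" if self.suffix else core
```

For example, `ModelVersion.parse("1.4.2-shadow").bump("major")` yields a version that prints as `2.0.0`. Deciding which change type applies remains a human classification step.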
4.2 Artefact Bundle
Every model version consists of a versioned artefact bundle — a single, immutable unit of content that travels together through the pipeline. The bundle contains:
- Weights: serialised model weights in a standard format (SafeTensors preferred; ONNX for cross-framework deployments).
- Configuration: model architecture config, hyperparameters, tokeniser vocabulary, and special tokens map.
- Preprocessing code: the exact pre/post-processing code used during training and inference, pinned by commit hash.
- Evaluation results: structured JSON/YAML file containing all evaluation metric values, benchmark dataset references, and evaluation date.
- Model card: structured Markdown document (see Section 4.3) authored and reviewed for this version.
- Provenance manifest: training run ID, training data references (dataset name + version + hash), compute environment (hardware type, framework versions, OS).
The bundle is stored as an immutable archive in object storage with a content-addressable hash. The hash is registered in EAAPL-GOV001 — any tampering is detectable.
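
One way to derive a content hash is sketched below. It assumes the bundle is available as an unpacked directory and uses SHA-256; the function name `bundle_digest` and the path-plus-content scheme are illustrative choices, not a prescribed format. If the bundle is stored as a single archive file, hashing the archive bytes directly is an equally valid option.

```python
import hashlib
from pathlib import Path


def bundle_digest(bundle_dir: Path) -> str:
    """Return a SHA-256 hex digest covering every file path and file content."""
    outer = hashlib.sha256()
    files = [p for p in bundle_dir.rglob("*") if p.is_file()]
    # Sort by relative POSIX path so the digest does not depend on directory listing order.
    files.sort(key=lambda p: p.relative_to(bundle_dir).as_posix())
    for path in files:
        file_hash = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large weight files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                file_hash.update(chunk)
        rel = path.relative_to(bundle_dir).as_posix().encode("utf-8")
        # Include the relative path so that renaming a file also changes the digest.
        outer.update(rel + b"\0" + file_hash.digest())
    return outer.hexdigest()
```

Recomputing the digest at deploy time and comparing it with the registered value is what makes tampering detectable.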
4.3 Model Card Standards
Each version requires a model card authored by the team and reviewed by the AI governance function. The model card captures: intended use (primary use case, out-of-scope uses explicitly listed); training data description (provenance, size, temporal range, known biases); evaluation results (quantitative metrics on held-out test set, fairness metrics by demographic subgroup where applicable); known limitations; ethical considerations and risk rating; approval status and approver identity.
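
A completeness check for the model card could look like the following sketch. The section keys in `REQUIRED_SECTIONS` are hypothetical field names derived from the list above; an organisation's own template would define the real ones.

```python
from typing import Dict, List, Optional

# Hypothetical keys mirroring the model card content listed in this section.
REQUIRED_SECTIONS = (
    "intended_use",
    "out_of_scope_uses",
    "training_data",
    "evaluation_results",
    "known_limitations",
    "ethical_considerations",
    "risk_rating",
    "approval_status",
    "approver_identity",
)


def missing_sections(card: Dict[str, Optional[str]]) -> List[str]:
    """Return required sections that are absent or blank, in checklist order."""
    missing = []
    for name in REQUIRED_SECTIONS:
        value = card.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

A draft that is still awaiting sign-off would report the two approval fields as missing until the approval workflow completes.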
4.4 Approval Workflow
No model version may be promoted to a production environment without approval. The approval workflow is triggered by CI/CD on successful evaluation. Low-risk models require sign-off from the model owner and a peer reviewer. High-risk or regulated models additionally require sign-off from the AI Governance function and, for APRA-regulated entities, the Chief Risk Officer or delegate. Approval is recorded in EAAPL-GOV001 with timestamp and approver identity.
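
The routing rule can be sketched as below. The tier labels (`"low"`, `"high"`) and role names are assumptions made for illustration, with `"high"` standing for both high-risk and regulated models; they are not identifiers from EAAPL-GOV001.

```python
from typing import List, Set


def required_approvers(risk_tier: str, apra_regulated: bool = False) -> List[str]:
    """Return the roles that must sign off before production promotion."""
    if risk_tier not in ("low", "high"):
        raise ValueError(f"unknown risk tier: {risk_tier!r}")
    # Every model needs the model owner and a peer reviewer.
    approvers = ["model_owner", "peer_reviewer"]
    if risk_tier == "high":
        approvers.append("ai_governance")
        if apra_regulated:
            approvers.append("cro_or_delegate")
    return approvers


def outstanding_signoffs(
    risk_tier: str, apra_regulated: bool, received: Set[str]
) -> List[str]:
    """Return required roles that have not yet signed off."""
    return [
        role
        for role in required_approvers(risk_tier, apra_regulated)
        if role not in received
    ]
```

A version is promotable only when `outstanding_signoffs` returns an empty list.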
4.5 Deprecation Management
Deprecation follows a structured lifecycle: (1) Deprecation notice issued to all registered consumers at least 90 days before sunset for production models, 30 days for non-production. (2) Traffic migration plan agreed with consumers, including a canary ramp schedule per EAAPL-MDL003. (3) Sunset date recorded in the model register. (4) On sunset, the model is removed from serving infrastructure but the artefact bundle remains in cold storage for the regulatory retention period (minimum 7 years for regulated decisions).
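
The notice periods in step (1) reduce to simple date arithmetic. The sketch below uses hypothetical names (`MIN_NOTICE_DAYS`, `earliest_sunset`, `notice_is_sufficient`) and assumes the notice period is counted in calendar days.

```python
from datetime import date, timedelta

# Minimum notice in calendar days, taken from the lifecycle described above.
MIN_NOTICE_DAYS = {"production": 90, "non-production": 30}


def earliest_sunset(notice_date: date, environment: str) -> date:
    """Return the earliest permissible sunset date for a notice issued on notice_date."""
    return notice_date + timedelta(days=MIN_NOTICE_DAYS[environment])


def notice_is_sufficient(notice_date: date, sunset_date: date, environment: str) -> bool:
    """Check that the planned sunset respects the minimum notice period."""
    return sunset_date >= earliest_sunset(notice_date, environment)
```

A notice issued on 2026-01-01 for a production model therefore permits a sunset no earlier than 2026-04-01.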
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Artefact Store | Infrastructure | Immutable, versioned storage for model bundles | AWS S3 (Object Lock), Azure Blob (Immutable), GCS | Critical |
| Model Register | Platform Service | Central registry indexing all model versions, metadata, approval status | MLflow Model Registry, Weights & Biases, custom DB | Critical |
| CI/CD Pipeline | Automation | Orchestrates training, evaluation, bundling, and registration | GitHub Actions, GitLab CI, Jenkins, Argo Workflows | High |
| Evaluation Harness | Platform Service | Runs benchmark suite, computes metrics, gates registration on pass/fail | Pytest + custom harness, EleutherAI LM Evaluation Harness | Critical |
| Approval Workflow | Governance Service | Routes approval requests, captures sign-offs, enforces role-based approvals | Jira, ServiceNow, custom approval microservice | Critical |
| Model Card Generator | Tooling | Produces structured model card from training metadata + eval results | Custom template renderer, Hugging Face model cards | High |
| Deprecation Notifier | Automation | Sends deprecation notices to registered consumers on schedule | Email/Slack webhook, event bus | Medium |
| Version Comparison UI | Tooling | Side-by-side view of metrics, architecture diff, model card diff between versions | MLflow UI, custom dashboard, Grafana | Medium |
7. Data Flow
7.1 Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Training Pipeline | Executes training run with pinned dataset and code commit | Trained model weights + training run metadata |
| 2 | Evaluation Harness | Runs benchmark suite against held-out test set | Evaluation results JSON |
| 3 | CI/CD Pipeline | Assembles artefact bundle (weights + config + preprocessing + eval) | Versioned artefact bundle with content hash |
| 4 | CI/CD Pipeline | Uploads bundle to immutable object store | Storage URI + content-addressable hash |
| 5 | Registration Agent | Registers version in Model Register with metadata and artefact reference | Model Register entry with version ID |
| 6 | Model Card Service | Generates draft model card from training + eval metadata | Draft model card document |
| 7 | Model Owner | Reviews and completes model card, submits for approval | Completed model card + approval request |
| 8 | Approval Workflow | Routes to appropriate approvers based on model risk tier | Approved or rejected version record |
| 9 | Deployment Service | On approval, makes version available in serving infrastructure | Version active in target environment |
7.2 Error Flow
| Error Scenario | Detection | Recovery Action |
|---|---|---|
| Evaluation below threshold | Evaluation harness fail gate | CI/CD pipeline rejects bundle; version not registered; team alerted |
| Artefact upload failure | Storage client exception | Pipeline retries 3x then raises incident; training run marked failed |
| Content hash mismatch on verify | Model Register integrity check | Version quarantined; security event raised; bundle re-uploaded |
| Approval workflow timeout | Workflow SLA monitor (72h) | Escalation to model owner's manager; version stays in pending state |
| Deployment to wrong environment | Environment tag validation pre-deploy | Deployment blocked; operator notified; approval must be re-confirmed |
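
The upload recovery action in the table above might be implemented along these lines. This is a sketch under assumptions: "retries 3x" is read as one initial attempt plus up to three retries, the exponential backoff is an addition not stated in the table, and `upload_with_retries` and `UploadFailed` are hypothetical names. A real pipeline would catch only the storage client's transient error types rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class UploadFailed(RuntimeError):
    """Raised when the upload still fails after all retries; the caller raises an incident."""


def upload_with_retries(
    upload: Callable[[], T],
    retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call upload(), retrying on failure with exponential backoff."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    attempts = retries + 1  # one initial attempt plus the retries
    for attempt in range(attempts):
        try:
            return upload()
        except Exception as exc:  # sketch only: narrow this in a real pipeline
            if attempt == attempts - 1:
                raise UploadFailed(f"upload failed after {attempts} attempts") from exc
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that the backoff can be replaced with a no-op in unit tests.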
8. Security Considerations
8.1 Controls Summary
| Domain | Control |
|---|---|
| Authentication | Pipeline service accounts use short-lived OIDC tokens; no long-lived credentials in CI |
| Authorisation | Model Register write access limited to CI/CD service accounts; human read-only by default |
| Secrets | Model vendor API keys in secrets manager (AWS Secrets Manager / Azure Key Vault); not in artefact bundle |
| Classification | Model artefacts classified at minimum INTERNAL; fine-tuned-on-sensitive-data models classified CONFIDENTIAL |
| Encryption | Artefacts encrypted at rest (AES-256); in-transit TLS 1.3; encryption keys in KMS |
| Auditability | All Model Register mutations logged to immutable audit log; log retention 7 years |
8.2 OWASP LLM Top 10 Relevance
| OWASP LLM Risk | Relevance to This Pattern | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Low | Versioning layer does not process prompts; mitigation is upstream in inference layer |
| LLM02 Insecure Output Handling | Low | Model cards are rendered in controlled internal tooling; output escaping applied |
| LLM03 Training Data Poisoning | High | Provenance manifest in bundle enables detection of poisoned dataset versions; hash check |
| LLM04 Model Denial of Service | Low | Versioning layer is metadata; DoS risk is in serving layer (EAAPL-INF001) |
| LLM05 Supply Chain Vulnerabilities | High | All base models registered with provenance; third-party model licence and source verified |
| LLM06 Sensitive Information Disclosure | Medium | Model cards must NOT contain training data samples; PII must be absent from evaluation results |
| LLM07 Insecure Plugin Design | Low | Not applicable to versioning pipeline |
| LLM08 Excessive Agency | Low | Approval workflow enforces human-in-loop before any production promotion |
| LLM09 Overreliance | Medium | Version history and the model card limitations section counter overreliance risk by documenting known failure modes |
| LLM10 Model Theft | High | Artefact store access logging; export controls on CONFIDENTIAL models; access review quarterly |
9. Governance Considerations
9.1 Responsible AI
Every model version requires a completed model card that addresses: fairness evaluation results (disaggregated by relevant subgroups), intended and prohibited use cases, known biases, and human oversight requirements. The AI Governance function reviews model cards for high-risk models before approval.
9.2 Model Risk Management
Model versioning is the foundation of the organisation's Model Risk Management framework. Each version is a distinct model for MRM purposes. MAJOR version increments trigger full model validation. MINOR increments trigger targeted validation against the changed capability. PATCH increments require regression testing evidence only.
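
The mapping from increment type to validation scope can be derived mechanically from two version numbers, as in this sketch. The function name `validation_scope` and the returned labels are illustrative; versions are represented as plain `(major, minor, patch)` tuples for brevity.

```python
from typing import Tuple

Version = Tuple[int, int, int]  # (major, minor, patch)


def validation_scope(previous: Version, candidate: Version) -> str:
    """Return the validation required when moving from previous to candidate."""
    # Tuples compare component by component, which matches version ordering.
    if candidate <= previous:
        raise ValueError("candidate must be newer than previous")
    if candidate[0] != previous[0]:
        return "full model validation"
    if candidate[1] != previous[1]:
        return "targeted validation of changed capability"
    return "regression testing evidence"
```

For instance, `validation_scope((1, 4, 2), (1, 5, 0))` returns the targeted-validation label because only the MINOR component changed.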
9.3 Human Approval Gates
No automated process may promote a model to production without a recorded human approval. The approval record must include: approver identity (not a service account), approval date, version being approved, risk tier assessment, and any conditions or exceptions noted.
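
A gate that checks the approval record before promotion could be sketched as follows. The `ApprovalRecord` fields are hypothetical names for the items listed above, and the sketch assumes the identity provider can flag service accounts.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ApprovalRecord:
    approver_id: str
    approver_is_service_account: bool  # assumed to come from the identity provider
    approved_on: Optional[date]
    version: str
    risk_tier: str
    conditions: str = ""  # conditions or exceptions noted; may legitimately be empty


def record_problems(record: ApprovalRecord) -> List[str]:
    """Return the reasons this record cannot authorise a production promotion."""
    problems = []
    if not record.approver_id.strip():
        problems.append("approver identity missing")
    if record.approver_is_service_account:
        problems.append("approver is a service account")
    if record.approved_on is None:
        problems.append("approval date missing")
    if not record.version.strip():
        problems.append("version missing")
    if not record.risk_tier.strip():
        problems.append("risk tier assessment missing")
    return problems
```

Promotion proceeds only when `record_problems` returns an empty list.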
9.4 Governance Artefacts
| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Model Card | Model Owner | Per version | Model Register + artefact bundle |
| Approval Record | Approval Workflow | Per version | Governance Management System |
| Deprecation Notice | Model Owner | Per deprecation | Model Register + consumer notification |
| Artefact Integrity Report | Security Operations | Monthly | SIEM / audit log |
| Model Inventory (all active) | AI Governance | Quarterly | EAAPL-GOV001 register export |
10. Operational Considerations
10.1 SLOs
| SLO | Target | Measurement Method |
|---|---|---|
| Version registration latency | < 5 minutes | CI/CD pipeline timing from bundle upload to registry entry |
| Approval workflow response time | < 72 hours | Workflow system timestamp from submission to decision |
| Artefact integrity verification | 100% on deploy | Hash verification in deployment pre-check |
| Deprecation notice lead time | ≥ 90 days (prod) | Model Register deprecation date minus notice date |
10.2 Monitoring and Logging
The Model Register emits structured events on: version registration, approval state change, deployment, deprecation notice, and sunset. Events are forwarded to the organisation's SIEM. Alerts are configured for: unapproved versions deployed to production (P1), artefact hash mismatch (P1), overdue approval (P2).
10.3 Incident Response
A model-version incident (wrong version in production, unapproved version detected) triggers the AI Incident Response playbook. The immediate action is traffic shift to the last known-good approved version per EAAPL-MDL004. Root cause analysis is mandatory and must be completed within 5 business days.
10.4 Disaster Recovery
| Scenario | RPO | RTO | Recovery Procedure |
|---|---|---|---|
| Model Register database failure | 1 hour | 4 hours | Restore from hourly snapshot; cross-region replica failover |
| Artefact store unavailability | 0 (immutable) | 1 hour | Failover to secondary region replica; artefacts pre-replicated |
| Approval workflow system down | N/A | 2 hours | Emergency approval via documented out-of-band process |
10.5 Capacity Planning
Model artefact bundles range from 500 MB (small fine-tuned models) to 150 GB (large foundation model fine-tunes). Storage growth is predictable: estimate 3–5 new versions per model per quarter. Retention policy: active versions indefinitely; deprecated versions 7 years cold storage. Plan for 2–5 TB/year of new artefact storage for a 10-model portfolio.
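
The planning figure can be checked with simple arithmetic. Ten models at 3–5 versions per model per quarter gives 120–200 new versions per year, so 2–5 TB/year corresponds to an average bundle size of very roughly 10–42 GB (2 TB / 200 and 5 TB / 120). A portfolio dominated by bundles near the 150 GB upper bound would need far more. The helper names below are hypothetical, and 1 TB is taken as 1000 GB.

```python
QUARTERS_PER_YEAR = 4


def versions_per_year(models: int, versions_per_model_per_quarter: int) -> int:
    """Return the number of new versions registered per year across the portfolio."""
    return models * versions_per_model_per_quarter * QUARTERS_PER_YEAR


def annual_storage_tb(
    models: int, versions_per_model_per_quarter: int, avg_bundle_gb: float
) -> float:
    """Return new artefact storage per year in decimal TB (1 TB = 1000 GB)."""
    count = versions_per_year(models, versions_per_model_per_quarter)
    return count * avg_bundle_gb / 1000
```

With these assumptions, `annual_storage_tb(10, 3, 150)` evaluates to 18 TB, which shows why the average bundle size should be measured before the 2–5 TB/year figure is adopted.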
11. Cost Considerations
11.1 Cost Drivers
| Driver | Description | Relative Impact |
|---|---|---|
| Artefact storage | Object storage for versioned bundles; cold tier for deprecated versions | Medium |
| Model Register infrastructure | Database + API hosting for registry service | Low |
| Evaluation compute | GPU/CPU time to run benchmark suite on each new version | Medium-High |
| Engineering time | Model card authoring, approval coordination, pipeline maintenance | High |
| Governance tooling | Workflow system licences or custom development | Low-Medium |
11.2 Scaling Risks
Evaluation compute is the primary scaling risk. As model size grows, full benchmark evaluation can take hours and require expensive GPU instances. Organisations must plan evaluation infrastructure as part of training budget, not as an afterthought.
11.3 Optimisations
- Use spot/preemptible instances for evaluation runs (evaluation is restartable).
- Tier evaluation depth by risk: PATCH versions run a reduced evaluation suite (subset of benchmarks); MAJOR versions run the full suite including adversarial tests.
- Implement incremental evaluation: cache results for unchanged benchmark subsets.
- Cold-tier deprecated artefacts immediately on sunset (typically 80% cost reduction vs hot storage).
11.4 Indicative Cost Range
| Organisation Scale | Monthly Cost Range | Key Assumptions |
|---|---|---|
| Small (1–5 models, <10 versions/month) | $500–$2,000 | 2 TB storage, modest evaluation compute, SaaS registry |
| Medium (5–20 models, 10–50 versions/month) | $2,000–$10,000 | 10 TB storage, dedicated eval GPU nodes, custom registry |
| Large (20+ models, 50+ versions/month) | $10,000–$50,000 | 50+ TB storage, continuous eval infrastructure, enterprise tooling |
12. Trade-Off Analysis
12.1 Versioning Schema Options
| Option | Pros | Cons | Best For |
|---|---|---|---|
| Semantic versioning (this pattern) | Communicates semantic meaning; widely understood; compatible with existing tooling | Requires team discipline to classify changes correctly | Organisations with model governance maturity |
| Date-based versioning (YYYY.MM.DD) | Simple; automatically sortable; no classification burden | Carries no semantic meaning; cannot infer compatibility | Rapid experimentation environments |
| Hash-only (content-addressable) | Guarantees uniqueness; no classification bias | Completely opaque to humans; poor for communication | Internal pipeline references only |
| Monotonic integer (v1, v2…) | Simple; no ambiguity about ordering | No semantic meaning; requires changelog for all changes | Small teams, low model count |
12.2 Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Rigour vs Speed | Approval workflows slow deployment velocity; teams under pressure skip steps | Tiered approval: PATCH = 1 approver, 24h SLA; MAJOR = board-level, 5-day SLA |
| Completeness vs Storage Cost | Storing full model bundles for all versions is expensive | Tiered storage: recent versions hot, deprecated cold; incremental weight storage where feasible |
| Standardisation vs Flexibility | Different model types (classical ML, LLMs, vision) have different artefact structures | Core bundle schema is extensible; type-specific fields added as extensions |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Unapproved version deployed | Low | Critical | Model Register audit log alert | Immediate traffic rollback; incident declared; post-mortem |
| Artefact bundle corruption | Very Low | High | Hash verification on deploy | Redeploy prior known-good version; re-upload from backup |
| Evaluation harness misconfiguration | Medium | High | Metrics drift alert; human review | Quarantine affected versions; re-evaluate with corrected harness |
| Model Register unavailability | Low | Medium | Health check monitor | Emergency access to last-known artefact reference; restore from backup |
| Version classification error (wrong MAJOR/MINOR/PATCH) | Medium | Medium | Consumer integration test failure | Increment version correctly; re-issue; notify consumers |
| Model card not updated for version | Medium | Medium | Governance review checklist gate | Block promotion until model card complete and approved |
13.1 Cascading Failure Scenarios
If the Model Register becomes unavailable, deployment pipelines lose the ability to verify approved status of model versions. Without a circuit breaker, pipelines may fall back to deploying unverified versions. Mitigation: deployment pipelines cache the last-known approved version manifest locally with a 24-hour TTL; any deploy during register outage uses the cached manifest and triggers a P2 incident for human review.
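
The mitigation can be sketched as a fail-closed check against a cached manifest. `CachedManifest` and `may_deploy_during_outage` are hypothetical names; the sketch assumes the manifest is a set of approved version strings stamped with the time it was fetched.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

MANIFEST_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL from the mitigation above


@dataclass(frozen=True)
class CachedManifest:
    approved_versions: FrozenSet[str]
    fetched_at: float  # seconds since the epoch when the manifest was fetched


def may_deploy_during_outage(
    version: str, cache: Optional[CachedManifest], now: float
) -> bool:
    """Allow a deploy during a register outage only from a fresh cached manifest."""
    if cache is None:
        return False  # no cache: fail closed rather than deploy unverified
    if now - cache.fetched_at > MANIFEST_TTL_SECONDS:
        return False  # cache expired: fail closed
    return version in cache.approved_versions
```

Any deploy this check permits should still raise the P2 incident described above so that a human reviews it once the register is restored.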
14. Regulatory Considerations
| Regulation / Framework | Relevant Clause / Requirement | How This Pattern Addresses It |
|---|---|---|
| EU AI Act (2024/1689) | Article 9 (Risk Management System), Article 13 (Transparency), Article 17 (Quality Management) | Versioned model cards provide Article 13 documentation; approval workflow is Article 9 risk control; artefact bundle supports Article 17 quality records |
| ISO/IEC 42001:2023 | Annex A.6 (AI system life cycle), Clause 9.1 (Monitoring, measurement, analysis and evaluation) | Version schema maps to life cycle stages; evaluation results in bundle support Clause 9.1 |
| NIST AI RMF (2023) | GOVERN 1.1, MANAGE 1.1, MEASURE 2.5 | Governance artefacts satisfy GOVERN; version history enables MANAGE rollback; eval results satisfy MEASURE |
| APRA CPS 234 (2019) | Paragraph 15 (Information security policy), Paragraph 36 (Notification) | Artefact integrity controls and audit log satisfy Paragraph 15; incident notification process covers Paragraph 36 |
| APRA CPS 230 (2025) | Paragraph 52 (Change management) | Approval workflow and versioning schema constitute a formal change management process for AI models |
| Privacy Act 1988 (Cth) | APP 11 (Security of personal information) | Provenance manifest records that training data has undergone PII removal; supports APP 11 accountability |
15. Reference Implementations
15.1 AWS
- Artefact Store: S3 with Object Lock (COMPLIANCE mode) + S3 Intelligent-Tiering for cost.
- Model Register: Amazon SageMaker Model Registry (native versioning + approval workflow).
- Evaluation: SageMaker Processing Jobs for benchmark runs on managed GPU instances.
- Approval Workflow: SageMaker Model Registry approval + SNS notifications to approvers.
- Audit Log: CloudTrail + S3 (immutable log bucket).
15.2 Azure
- Artefact Store: Azure Blob Storage with Immutability Policies + Lifecycle Management.
- Model Register: Azure Machine Learning Model Registry.
- Evaluation: Azure ML Pipelines on GPU compute clusters.
- Approval Workflow: Azure ML + Power Automate for approval routing.
- Audit Log: Azure Monitor + Log Analytics Workspace.
15.3 GCP
- Artefact Store: Cloud Storage with Object Retention + Nearline/Coldline tiers.
- Model Register: Vertex AI Model Registry.
- Evaluation: Vertex AI Pipelines on A2/T4 instances.
- Approval Workflow: Vertex AI Model Registry approval states + Cloud Functions for routing.
- Audit Log: Cloud Audit Logs + BigQuery export.
15.4 On-Premises / Hybrid
- Artefact Store: MinIO (S3-compatible) with WORM buckets or NetApp StorageGRID.
- Model Register: MLflow Tracking Server (self-hosted, PostgreSQL backend).
- Evaluation: Kubernetes Jobs on GPU node pools.
- Approval Workflow: Jira Service Management or custom workflow microservice.
- Audit Log: Elasticsearch + Kibana or Splunk.
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Description |
|---|---|---|---|
| EAAPL-MDL002 | Shadow Model Deployment | Depends On | Shadow deployment consumes versioned model bundles and records comparison against a specific version ID |
| EAAPL-MDL003 | Canary Model Release | Depends On | Canary release requires semantic version to determine promotion/rollback decisions |
| EAAPL-MDL004 | Model Rollback | Depends On | Rollback procedure targets a specific previous version by version ID |
| EAAPL-MDL006 | Fine-Tuning Pipeline | Produces | Fine-tuning pipeline is the primary producer of new MINOR version increments |
| EAAPL-MDL007 | Model Compression and Optimisation | Produces | Compression/optimisation produces new PATCH version increments |
| EAAPL-GOV001 | Model Register | Foundational | This pattern registers into and reads from the Model Register |
17. Maturity Assessment
Overall Maturity: Mature
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Industry Adoption | 5 | Semantic versioning for ML models is well-established industry practice |
| Tooling Availability | 5 | Native support in SageMaker, Azure ML, Vertex AI, MLflow |
| Standards Alignment | 4 | Aligns with ISO 42001 and NIST AI RMF; model-card standards still evolving |
| Implementation Complexity | 1 (low) | Low-complexity process pattern; tooling is mature and well-documented |
| Regulatory Acceptance | 4 | Consistent with APRA expectations and EU AI Act compliance approaches; specific clauses still being interpreted |
18. Revision History
| Version | Date | Author | Summary of Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication |