[EAAPL-DAT004] Synthetic Data Generation
Category: Data Architecture
Sub-category: Synthetic Data / Privacy-Preserving AI
Version: 1.2
Maturity: Proven
Tags: synthetic-data, GAN, VAE, differential-privacy, k-anonymity, privacy-validation, utility-validation
Regulatory Relevance: GDPR Article 5, Privacy Act Australia APP 3/11, EU AI Act Article 10, APRA CPS 234, ISO 42001 §8.4
1. Executive Summary
Many of the most valuable enterprise AI use cases — fraud detection, clinical risk prediction, credit underwriting — require training on data that organisations cannot freely share: patient records, account transactions, personally identifiable information. Synthetic data generation creates statistically faithful, privacy-preserving datasets that can be used for AI training, testing, and sharing without exposing real individuals.
This pattern defines a production synthetic data generation pipeline covering three generation techniques (GAN/VAE/statistical), privacy validation (differential privacy, k-anonymity, membership inference testing), utility validation (statistical fidelity, downstream model performance parity), and the regulatory framework for accepting synthetic data in lieu of real data for AI training and testing.
Organisations that implement this pattern have unlocked AI use cases previously blocked by privacy constraints, shortened model development cycles by 40–60% through unrestricted test data availability, and reduced privacy incident risk in development and testing environments.
Target audience: Chief Privacy Officers, Chief Data Officers, ML Platform leads, Data Science leads.
2. Problem Statement
Business Problem
AI programmes are blocked or slowed by the inability to use real data outside production environments. Development and test environments cannot receive production data because of privacy regulation; third-party data science partners cannot access customer data; and cross-border data transfer restrictions prevent AI development across global teams.
Technical Problem
- Real patient/customer/financial records cannot legally be used in development, test, or partner-facing environments.
- Data anonymisation (masking, tokenisation) degrades statistical relationships, destroying the signal AI models need.
- Small datasets for rare events (fraud, rare diseases) are insufficient for model training.
- Class imbalance in real data requires augmentation techniques that preserve statistical properties.
- Testing AI edge cases with real data creates production data exposure in test environments.
Symptoms
- AI development cycle stalled waiting for "data access approval" that may never come.
- Test environments use obviously fake data that does not reflect real statistical patterns, so models that pass testing fail in production.
- Third-party data science partners blocked from receiving any data.
- Rare event classes (fraud, rare diseases) under-represented in training data, degrading model recall.
- Privacy incidents caused by real production data in development environments.
Cost of Inaction
| Dimension | Impact |
|---|---|
| Velocity | AI use cases delayed 6–18 months waiting for data access approval |
| Privacy risk | Real PII in dev/test environments creates regulatory exposure and breach risk |
| Model quality | Artificially balanced or anonymised datasets produce worse models |
| Competitive | Data-rich competitors accelerate AI while your organisation waits for approvals |
3. Context
When to Apply
- AI training or testing requires data that cannot be shared due to privacy, regulatory, or contractual restrictions.
- Small training datasets need augmentation (class imbalance; rare events).
- Development/test environments need representative data without PII.
- Third-party partners (model vendors, data scientists) need data to work with.
- Cross-border transfer restrictions prevent sharing real data across jurisdictions.
When NOT to Apply
- Real data is freely available and shareable (no privacy constraint) — synthetic data adds cost and validation overhead with no benefit.
- The AI use case requires exact real-world records (e.g., training on specific known fraud patterns that must be preserved exactly).
- Synthetic data utility validation cannot be performed (no access to real data even for validation).
- The risk that generated data contains memorised real records is unacceptable (for example, when the original dataset is very small).
Prerequisites
| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Source data access | Sample of real data for training generator | Full production dataset with proper access controls |
| Generation tooling | SDV (Synthetic Data Vault), Faker + statistical | Dedicated GAN/VAE pipeline; enterprise synthetic data platform |
| Privacy validation | k-anonymity check | Differential privacy budget tracking + membership inference testing |
| Utility validation | Basic statistical comparison | Downstream model performance parity testing |
| Legal sign-off | Privacy team review | External privacy counsel opinion for regulatory-sensitive use cases |
Industry Applicability
| Industry | Applicability | Driver |
|---|---|---|
| Healthcare | Critical | Patient privacy; clinical AI training data scarcity |
| Financial Services | High | PCI DSS; APRA; customer data privacy; fraud model training |
| Insurance | High | Actuarial data privacy; claims data restrictions |
| Government | High | Privacy Act; sensitive citizen data |
| Retail | Medium | Customer purchase history; personalisation model testing |
| Telecommunications | Medium | Call records; network data; churn model development |
4. Architecture Overview
Design Philosophy
Synthetic data generation is not a single technique; it is a pipeline with four stages: generation, privacy validation, utility validation, and certified publication. Skipping any stage creates either privacy risk (synthetic data that is insufficiently private) or utility failure (synthetic data that does not produce models performing on par with models trained on real data).
Generation Techniques. The pattern supports three generation approaches, selected based on data type and privacy requirements:
Statistical/parametric synthesis uses marginal and joint distributions estimated from real data (SDV's GaussianCopula, CTGAN-light). It is computationally cheap and interpretable but may not capture complex non-linear feature dependencies. Best for tabular data with moderate complexity.
Variational Autoencoders (VAE) learn a continuous latent space representation of the data and sample from it. VAEs are effective for tabular data with complex correlations and naturally support conditional generation (generate samples with a specific class label). They are faster to train than GANs and produce more stable outputs.
Generative Adversarial Networks (GAN) — specifically CTGAN and TabularGAN — train a generator/discriminator pair to produce synthetic records indistinguishable from real. GANs produce the highest-fidelity synthetic data but are prone to mode collapse (under-representing some data regions) and are computationally expensive. Best for high-fidelity requirements where computational budget is available.
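The statistical/parametric approach described above can be illustrated with a minimal two-column Gaussian copula sketch. It uses only the Python standard library and is not SDV's implementation; the function names (`normal_scores`, `sample_gaussian_copula`) are hypothetical. Note that this sketch resamples observed marginal values and therefore carries no privacy guarantee on its own.

```python
import random
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def normal_scores(column):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF: the observed value at quantile u in [0, 1]."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Sample pairs that keep each marginal and the rank dependence of the inputs."""
    rng = random.Random(seed)
    rho = correlation(normal_scores(col_a), normal_scores(col_b))
    residual = max(0.0, 1.0 - rho ** 2) ** 0.5  # clamp guards against rounding
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    synthetic = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual * rng.gauss(0.0, 1.0)  # correlated normal pair
        u1, u2 = _STD_NORMAL.cdf(z1), _STD_NORMAL.cdf(z2)
        synthetic.append(
            (empirical_quantile(sorted_a, u1), empirical_quantile(sorted_b, u2))
        )
    return synthetic
```

A production generator would model many columns jointly, handle categorical fields, and smooth or perturb the marginals rather than replaying observed values.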
Privacy Validation — Three Layers. No single privacy metric is sufficient:
- k-anonymity and l-diversity check that every combination of quasi-identifier values is shared by at least k records, and that each such group contains at least l distinct sensitive values, so that no synthetic record singles out an identifiable individual in the original dataset. These checks are necessary but not sufficient.
- Differential privacy (DP) provides a mathematical privacy guarantee: the probability of any generator output changes by at most a factor of e^ε (up to a small failure probability δ) whether or not any single individual's record is included in the training data. DP is applied during GAN/VAE training via DP-SGD, which clips per-example gradients and adds calibrated noise to each gradient update. The privacy budget (ε) is tracked and reported; typical production thresholds range from ε ≤ 1 (high privacy) to ε ≤ 10 (moderate privacy).
- Membership inference attack testing trains an adversarial classifier to determine whether a specific real record was in the training data. On a balanced set of member and non-member records, attack accuracy near 50% (random chance) indicates that the synthetic data reveals little about membership. Attack accuracy significantly above 50% indicates that the synthetic data leaks real record membership.
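The k-anonymity and l-diversity layer above can be sketched in a few lines. This is a minimal illustration using the standard library, not the ARX implementation; the column names in the usage example are hypothetical, and both functions assume a non-empty list of dict-like records.

```python
from collections import defaultdict


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    group_sizes = defaultdict(int)
    for row in records:
        group_sizes[tuple(row[q] for q in quasi_identifiers)] += 1
    return min(group_sizes.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any quasi-identifier group."""
    group_values = defaultdict(set)
    for row in records:
        group_values[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(values) for values in group_values.values())


# Hypothetical example records
rows = [
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "A"},
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "B"},
    {"age_band": "40-49", "postcode": "3000", "diagnosis": "A"},
    {"age_band": "40-49", "postcode": "3000", "diagnosis": "A"},
]
k = k_anonymity(rows, ["age_band", "postcode"])
l = l_diversity(rows, ["age_band", "postcode"], "diagnosis")
```

In this example every quasi-identifier group has two records, but the second group holds only one distinct diagnosis, which is the kind of gap l-diversity is meant to expose.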
Utility Validation. Privacy and utility trade off: more privacy noise reduces data fidelity. Utility validation measures this trade-off across three dimensions:
- Statistical fidelity: Compare marginal distributions (KS test), pairwise correlations, and higher-order statistics between real and synthetic datasets.
- Train-on-Synthetic, Test-on-Real (TSTR): Train an AI model on synthetic data; evaluate on real data. Compare AUC/F1 against a model trained on real data. A TSTR performance ratio ≥ 0.90 indicates high utility.
- Train-on-Real, Test-on-Synthetic (TRTS): Train on real, test on synthetic — validates that synthetic data represents the same distribution as real.
A synthetic dataset is certified for AI training only when both privacy validation and utility validation pass their thresholds.
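The statistical fidelity and TSTR measures described above could be computed along the following lines. This is a standard-library sketch of the two-sample Kolmogorov-Smirnov statistic and the TSTR ratio; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used, and the function names here are hypothetical.

```python
def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    i = j = 0
    max_gap = 0.0
    while i < n and j < m:
        x = min(real_sorted[i], synth_sorted[j])
        # Advance both CDFs past every value <= x, then compare them at x.
        while i < n and real_sorted[i] <= x:
            i += 1
        while j < m and synth_sorted[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / n - j / m))
    return max_gap


def tstr_ratio(metric_trained_on_synthetic, metric_trained_on_real):
    """Ratio of a model metric (e.g. AUC or F1), both models evaluated on real data."""
    return metric_trained_on_synthetic / metric_trained_on_real
```

A KS statistic of 0 means the two marginals are indistinguishable on the samples given, and 1 means they do not overlap at all. The ratio is compared against the 0.90 threshold stated above.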
5. Architecture Diagram
```mermaid
flowchart TD
    subgraph Input["Source Data"]
        A[Real Restricted Dataset]
        B{Generator Selection}
    end
    subgraph Validation["Privacy and Utility Validation"]
        C[Privacy Validation]
        D{Privacy Gate}
        E[Utility Validation]
        F{Utility Gate}
    end
    subgraph Output["Certified Publication"]
        G[(Synthetic Data Catalogue)]
        H[Approved Consumers]
    end
    A --> B
    B --> C
    C --> D
    D -->|fail| B
    D -->|pass| E
    E --> F
    F -->|fail| B
    F -->|pass| G
    G --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
```
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Data Profiler | Processing | Analyses source data distributions, correlations, and data types to configure generator | YData Profiling, Pandas Profiling, custom | High |
| Statistical Generator | ML Model | Parametric synthesis using marginal + joint distributions | SDV (GaussianCopula, CopulaGAN), Faker | Medium |
| VAE Generator | ML Model | Latent space sampling for complex tabular data | Custom PyTorch VAE, SDV TVAE | High |
| GAN Generator (with DP-SGD) | ML Model | High-fidelity synthesis with differential privacy training | CTGAN + Opacus DP-SGD, Gretel.ai, YData | High |
| k-Anonymity / l-Diversity Checker | Processing | Tests that no synthetic record is uniquely re-identifiable | Custom Python + ARX library | Critical |
| Differential Privacy Budget Tracker | Processing | Accounts for total privacy cost (ε); validates DP-SGD parameters | Opacus, Google DP library, custom tracker | Critical |
| Membership Inference Attack Tester | Processing | Adversarial attack simulation to test privacy leakage | Adversarial Robustness Toolbox (ART), custom | High |
| Statistical Fidelity Validator | Processing | KS test, chi-squared, correlation comparison between real and synthetic | scipy, custom Python, SDV metrics | High |
| TSTR / TRTS Validator | Processing | Downstream model performance comparison | Custom ML evaluation harness | Critical |
| Synthetic Data Certificate | Artefact | Machine-readable certificate with privacy + utility scores, usage policy | JSON schema, stored in data catalogue | High |
| Synthetic Data Catalogue | Storage + Discovery | Governs synthetic dataset publication, access control, expiry | DataHub, Atlan, custom catalogue | High |
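The Synthetic Data Certificate component is described as a machine-readable artefact carrying privacy and utility scores plus a usage policy. One possible shape is sketched below. Every field name is an assumption made for illustration, not a schema defined by this pattern.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SyntheticDatasetCertificate:
    """Hypothetical certificate layout; field names are illustrative only."""
    dataset_id: str
    source_dataset_version: str
    generator_type: str               # e.g. "statistical", "vae", "gan"
    epsilon: float                    # cumulative DP budget spent in training
    k_anonymity: int
    membership_attack_accuracy: float
    tstr_ratio: float
    permitted_uses: List[str] = field(default_factory=list)
    expiry_date: str = ""             # ISO 8601 date string

    def to_json(self) -> str:
        """Serialise with sorted keys so stored certificates diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


cert = SyntheticDatasetCertificate(
    dataset_id="claims-synth-v3",
    source_dataset_version="2025-02-28",
    generator_type="gan",
    epsilon=3.0,
    k_anonymity=8,
    membership_attack_accuracy=0.52,
    tstr_ratio=0.93,
    permitted_uses=["model-training", "integration-testing"],
    expiry_date="2026-03-01",
)
certificate_json = cert.to_json()
```

Storing the serialised certificate alongside the dataset version in the catalogue keeps the privacy and utility evidence attached to the data it describes.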
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Data Profiler | Analyses real dataset; extracts statistical profile | Data profile (distributions, correlations, data types) |
| 2 | Generator Selection | Evaluates data profile + privacy requirements → selects generation approach | Generator configuration |
| 3 | Synthetic Generator | Trains on real data (with DP-SGD if required); generates synthetic dataset | Raw synthetic dataset |
| 4 | k-Anonymity Checker | Tests uniqueness of synthetic records against real dataset | k-anonymity score; l-diversity score |
| 5 | DP Budget Tracker | Verifies DP-SGD parameters; computes cumulative ε budget | Privacy budget report (ε value) |
| 6 | Membership Inference Tester | Trains attack classifier; measures attack accuracy | Attack accuracy score (target: ≤55%) |
| 7 | Privacy Gate | Evaluates all three privacy checks; passes or rejects | Pass/fail + privacy validation report |
| 8 | Statistical Fidelity Validator | Compares distributions, correlations between real and synthetic | Statistical fidelity scores per feature |
| 9 | TSTR / TRTS Validator | Trains models on synthetic/real; evaluates cross-performance | TSTR ratio; TRTS ratio |
| 10 | Utility Gate | Evaluates utility metrics; passes or triggers regeneration | Pass/fail + utility validation report |
| 11 | Certification | Generates Synthetic Dataset Certificate; publishes to catalogue | Certified synthetic dataset with usage policy |
| 12 | Approved Consumer | Accesses synthetic dataset via catalogue; uses for AI training/testing | AI model trained on synthetic data |
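Steps 7, 10, and 11 of the flow above amount to threshold checks followed by a certification decision. A minimal sketch follows. The 55% attack accuracy and 0.90 TSTR thresholds come from this pattern; the default `min_k`, `max_epsilon`, and `max_ks` values are assumptions that each organisation would set for itself.

```python
from typing import Dict, List, Tuple


def privacy_gate(k_anonymity: int, epsilon: float, attack_accuracy: float,
                 min_k: int = 5, max_epsilon: float = 10.0,
                 max_attack_accuracy: float = 0.55) -> Tuple[bool, List[str]]:
    """Evaluate all three privacy checks; return (passed, failure reasons)."""
    failures = []
    if k_anonymity < min_k:
        failures.append(f"k-anonymity {k_anonymity} below minimum {min_k}")
    if epsilon > max_epsilon:
        failures.append(f"epsilon {epsilon} exceeds budget {max_epsilon}")
    if attack_accuracy > max_attack_accuracy:
        failures.append(
            f"membership attack accuracy {attack_accuracy} above {max_attack_accuracy}"
        )
    return (not failures, failures)


def utility_gate(tstr_ratio: float, worst_ks: float,
                 min_tstr: float = 0.90, max_ks: float = 0.10) -> Tuple[bool, List[str]]:
    """Evaluate utility metrics; worst_ks is the largest per-feature KS statistic."""
    failures = []
    if tstr_ratio < min_tstr:
        failures.append(f"TSTR ratio {tstr_ratio} below {min_tstr}")
    if worst_ks > max_ks:
        failures.append(f"KS statistic {worst_ks} above {max_ks}")
    return (not failures, failures)


def certification_decision(privacy: Tuple[bool, List[str]],
                           utility: Tuple[bool, List[str]]) -> Dict[str, object]:
    """A dataset is certified only when both gates pass."""
    return {
        "certified": privacy[0] and utility[0],
        "failures": privacy[1] + utility[1],
    }


decision = certification_decision(
    privacy_gate(k_anonymity=8, epsilon=3.0, attack_accuracy=0.52),
    utility_gate(tstr_ratio=0.93, worst_ks=0.04),
)
```

Returning the failure reasons alongside the boolean lets the same output feed both the validation reports and the regeneration step.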
Error Flow
| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Privacy gate failure (membership inference >55%) | Attack accuracy too high | Synthetic dataset rejected; regenerate with higher DP noise (lower ε) | Increase DP-SGD noise multiplier; regenerate; re-run privacy validation |
| Utility gate failure (TSTR ratio <0.90) | Synthetic data too noisy for useful AI training | Synthetic dataset rejected; privacy-utility trade-off re-evaluated | Increase training epochs; adjust ε budget; consider less strict privacy target |
| Mode collapse (GAN produces limited variety) | Generator produces repetitive records | GAN training failure detected by diversity metric | Switch to VAE generator; adjust GAN hyperparameters |
| Source data access revoked before validation | Real data access removed mid-pipeline | Pipeline paused; cannot complete utility validation | Resume with new data access grant; or use previously validated synthetic version |
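The recovery path for a privacy gate failure (raise the DP-SGD noise multiplier, regenerate, re-validate) can be expressed as a bounded retry loop. In this sketch `generate` and `attack_accuracy_of` are hypothetical callables standing in for the generator training job and the membership inference tester; the step size and attempt limit are assumptions.

```python
def regenerate_until_private(generate, attack_accuracy_of,
                             noise_multiplier=1.0, step=0.5,
                             max_attempts=5, max_attack_accuracy=0.55):
    """Retry generation with more DP noise until the privacy check passes.

    Returns (dataset, noise_multiplier_used, attempts). Raises RuntimeError
    when the attempt limit is reached, so the failure is escalated rather
    than silently looping.
    """
    for attempt in range(1, max_attempts + 1):
        dataset = generate(noise_multiplier)
        if attack_accuracy_of(dataset) <= max_attack_accuracy:
            return dataset, noise_multiplier, attempt
        noise_multiplier += step  # more noise means a lower epsilon
    raise RuntimeError(
        f"privacy gate still failing after {max_attempts} attempts"
    )


# Stub callables for illustration: accuracy falls as noise rises.
_stub_accuracy = {1.0: 0.62, 1.5: 0.58, 2.0: 0.53}
dataset, noise_used, attempts = regenerate_until_private(
    generate=lambda noise: {"noise": noise},
    attack_accuracy_of=lambda ds: _stub_accuracy[ds["noise"]],
)
```

Each regenerated dataset would still need to pass the utility gate afterwards, since added noise tends to lower the TSTR ratio.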
8. Security Considerations
Authentication & Authorisation
- Real source data access for generator training is highly restricted; access logged and time-limited.
- Synthetic dataset access controlled by usage policy in catalogue; different tiers for internal/partner/public.
Secrets Management
- Source data credentials for generator training stored in secrets manager; not retained beyond training session.
- Generator model artefacts access-controlled; a trained generator can be used to generate more synthetic data and must be treated as sensitive.
Data Classification
- Generator model (trained on real data) classified at least as Confidential — it encodes statistical properties of real data.
- Certified synthetic datasets classified per usage policy; may be Internal or Shareable depending on privacy validation.
Encryption
- Source data encrypted at rest during generator training; access keys in KMS.
- Synthetic datasets encrypted at rest; encryption may be relaxed for low-sensitivity certified datasets per policy.
Auditability
- All access to source data for generation logged.
- Synthetic dataset access logged per usage policy.
- Privacy and utility validation results stored immutably with dataset version.
OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM06: Sensitive Information Disclosure | Generator memorises and reproduces real records | Membership inference attack testing; DP-SGD limits memorisation |
| LLM03: Training Data Poisoning | Synthetic data with adversarial patterns used to poison AI model | Statistical fidelity validation; TSTR validation catches adversarial deviations |
| LLM04: Model Denial of Service | Generator attacked to produce malformed synthetic data | Input validation on generation requests; rate limiting |
9. Governance Considerations
Responsible AI
- Synthetic data must preserve demographic representation; if real data is biased, synthetic data may amplify bias.
- Bias audit required as part of utility validation: compare demographic distributions in real vs. synthetic.
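The bias audit above compares demographic distributions in real and synthetic data. One simple way to quantify the difference is the total variation distance between category shares, sketched below with the standard library. The choice of metric and the example values are assumptions; the pattern does not prescribe a specific measure.

```python
from collections import Counter


def category_shares(values):
    """Proportion of records in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute share differences; 0 = identical, 1 = disjoint."""
    real_shares = category_shares(real_values)
    synth_shares = category_shares(synthetic_values)
    categories = set(real_shares) | set(synth_shares)
    return 0.5 * sum(
        abs(real_shares.get(c, 0.0) - synth_shares.get(c, 0.0)) for c in categories
    )


# Hypothetical demographic column
real_gender = ["F"] * 50 + ["M"] * 50
synthetic_gender = ["F"] * 30 + ["M"] * 70
shift = total_variation_distance(real_gender, synthetic_gender)
```

Here the synthetic data moves 20% of the population from one group to the other, which a bias audit would flag for review before certification.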
Model Risk Management
- Models trained on synthetic data must be validated on real data before production deployment.
- TSTR ratio ≥ 0.90 is minimum bar; risk committee may require higher threshold for high-risk AI.
Human Approval Checkpoints
- Privacy Officer must approve Synthetic Dataset Certificate before publication to external partners.
- Legal counsel review required for cross-border synthetic data sharing.
- Risk committee approval required for synthetic data used in high-risk AI (EU AI Act Annex III).
Governance Artefacts
| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Synthetic Dataset Certificate | Privacy / ML Platform | Per generation run | Privacy + utility scores; ε budget; usage policy; expiry date |
| Privacy Validation Report | Privacy Team | Per generation run | k-anonymity, DP budget, membership inference test results |
| Utility Validation Report | ML Platform | Per generation run | Statistical fidelity; TSTR/TRTS ratios |
| Usage Policy Record | Privacy Officer | Per publication | Permitted use cases; sharing permissions; expiry; approved consumers |
| Generator Model Audit Log | ML Platform | Continuous | Who trained/used which generator; source data access log |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Tooling |
|---|---|---|
| Membership inference attack accuracy | >55% | Validation pipeline output |
| TSTR ratio | <0.90 | Validation pipeline output |
| DP budget cumulative ε | >configured threshold | Budget tracker |
| Synthetic dataset expiry | 30 days before expiry | Catalogue alert |
| Generator training compute cost | >budget threshold | Cloud cost alert |
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Synthetic dataset generation + validation | <24 hours end-to-end | Pipeline execution time |
| Synthetic dataset catalogue availability | 99.9% | Availability monitor |
| Privacy validation completion | <4 hours | Validation pipeline time |
Logging
- Every generation run is logged with its source dataset version, generator type, privacy parameters, and validation results.
- Logs are retained for 7 years for regulatory compliance.
Incident Management
- Privacy gate failure with external partner data → P1; Privacy Officer notified immediately.
- Unexpected source data access to generate synthetic data → P1 security incident.
Disaster Recovery
| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Synthetic Data Catalogue | 4 hours | 24 hours | Database backup; synthetic datasets re-generatable |
| Generator Model Artefacts | 8 hours | 24 hours | Artefact store backup; can retrain if lost |
| Validation Pipeline | 2 hours | N/A | Stateless; redeploy from IaC |
11. Cost Considerations
Cost Drivers
| Cost Driver | Typical Range | Notes |
|---|---|---|
| GAN/VAE training compute | $100–$5,000 per run | GPU compute; scales with dataset size; amortised across many generation runs |
| Privacy validation compute | $50–$500 per run | Membership inference attack training |
| Synthetic data storage | $10–$200/month | Modest; synthetic datasets typically smaller than real |
| Enterprise platform licence | $2,000–$20,000/month | Gretel.ai, Mostly AI, YData enterprise |
| Legal / privacy review | $5,000–$20,000 per use case | One-time for new use case type; ongoing for regulatory changes |
Optimisations
- Use open-source SDV or CTGAN for initial synthetic data; move to enterprise platform only when scale demands.
- Cache trained generators; regenerate synthetic data without retraining if source distribution unchanged.
- Run membership inference testing on a sample rather than full synthetic dataset.
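The sampling optimisation above can be illustrated with a simple distance-based membership inference attack, in which a real record is guessed to be a training member when it lies unusually close to some synthetic record. This is a standard-library sketch of one attack family, not the ART implementation; the function names and the median threshold are assumptions, and records are assumed to be numeric tuples with equal numbers of members and non-members.

```python
import random


def nearest_distance(record, synthetic):
    """Euclidean distance from a real record to its closest synthetic record."""
    return min(
        sum((a - b) ** 2 for a, b in zip(record, s)) ** 0.5 for s in synthetic
    )


def membership_attack_accuracy(members, non_members, synthetic,
                               synthetic_sample_size=None, seed=0):
    """Accuracy of a 'closer than the median distance means member' attack.

    members: real records used to train the generator.
    non_members: real holdout records never seen by the generator.
    synthetic_sample_size: if set, attack a random sample of the synthetic
    data instead of the full dataset to reduce compute.
    """
    rng = random.Random(seed)
    if synthetic_sample_size is not None and synthetic_sample_size < len(synthetic):
        synthetic = rng.sample(synthetic, synthetic_sample_size)
    scored = [(nearest_distance(r, synthetic), True) for r in members]
    scored += [(nearest_distance(r, synthetic), False) for r in non_members]
    distances = sorted(d for d, _ in scored)
    threshold = distances[len(distances) // 2]  # median split
    correct = sum((d < threshold) == is_member for d, is_member in scored)
    return correct / len(scored)


# A generator that leaks: synthetic data is an exact copy of the training set.
members = [(float(i), float(i)) for i in range(20)]
non_members = [(float(i) + 0.5, float(i) + 100.0) for i in range(20)]
leaky_accuracy = membership_attack_accuracy(members, non_members, list(members))
```

Sampling lowers cost but also lowers the attack's sensitivity, so a passing result on a small sample is weaker evidence than a passing result on the full dataset.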
Indicative Cost Range
| Scale | Monthly Cost | Basis |
|---|---|---|
| Small (1–3 use cases, monthly generation) | $500–$3,000 | SDV OSS + custom validation + light storage |
| Medium (5–10 use cases, weekly generation) | $3,000–$15,000 | CTGAN + validation pipeline + Gretel.ai OSS |
| Large (20+ use cases, daily generation, external sharing) | $15,000–$60,000 | Enterprise platform + legal + comprehensive validation |
12. Trade-Off Analysis
Option Comparison
| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Full privacy-validated synthetic data pipeline (this pattern) | Mathematically sound privacy; high utility; regulatory-acceptable | High setup cost; DP reduces data utility; requires real data for generator training | Regulated industry; external data sharing; high-risk AI training |
| B: Statistical anonymisation (masking/tokenisation) | Simple; no generator training needed | Destroys statistical relationships; models trained on anonymised data perform poorly | Low-complexity AI; non-statistical test data |
| C: Rule-based test data generation (Faker) | Zero privacy risk; instant | No statistical fidelity; useless for ML model training | Functional software testing only; not ML |
| D: Commercial synthetic data platform (Mostly AI, Gretel) | Best-in-class fidelity and privacy; legal opinion packages | High cost; vendor dependency | Enterprise at scale; legal opinion needed; limited internal ML capacity |
Architectural Tensions
| Tension | Trade-Off | Resolution |
|---|---|---|
| Privacy (low ε) vs. Utility (high TSTR ratio) | More DP noise → better privacy → worse utility | Tune ε per use case risk level; accept lower TSTR ratio for high-risk cases |
| Generation fidelity vs. training speed | GANs produce best synthetic data but are slow and unstable | Use VAE by default; GAN only when TSTR ratio requirement is very high |
| Internal generation vs. external platform | Internal = control and cost; external = better fidelity and legal opinion | Use internal for mature use cases; external for new/sensitive use cases |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Generator memorises outlier real records | Medium | High: privacy breach | Membership inference test | Retrain with higher DP noise; or exclude outliers from training |
| GAN mode collapse: synthetic data under-represents minority class | High | Medium: model trained on synthetic data misses minority class | Statistical fidelity check on class distribution | Switch to conditional VAE; oversample minority class in real data before generation |
| Synthetic data used beyond approved use case | Medium | High: privacy and legal violation | Usage policy enforcement in catalogue | Usage policy automated enforcement; access revocation on violation |
| Utility degrades after real data distribution shift | Medium | Medium: TSTR ratio drops; old synthetic data used for new models | Periodic re-validation of existing synthetic datasets | Trigger regeneration on source distribution drift detection |
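The last failure mode relies on detecting drift in the source distribution and triggering regeneration. One common drift measure is the Population Stability Index (PSI), sketched here for a single numeric feature. The pattern does not name a drift metric, so PSI, the bin count, and the 0.2 threshold (a widely used rule of thumb) are all assumptions.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """PSI between a baseline sample and a current sample of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if baseline is constant

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bin_shares(baseline), bin_shares(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def needs_regeneration(baseline, current, threshold=0.2):
    """True when drift exceeds the threshold and synthetic data should be rebuilt."""
    return population_stability_index(baseline, current) > threshold


baseline_amounts = [float(v) for v in range(100)]
shifted_amounts = [v + 50.0 for v in baseline_amounts]
```

In a full pipeline the same check would run per feature against the profile captured when the generator was trained, with any breach queuing a new generation run.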
14. Regulatory Considerations
| Regulation | Requirement | Pattern Response |
|---|---|---|
| GDPR Article 5(1)(b) | Purpose limitation: data used only for specified purposes | Synthetic dataset usage policy enforces purpose limitation |
| GDPR Recital 26 | Synthetic data from which individuals can be re-identified is not anonymous | Membership inference testing + k-anonymity provide evidence of anonymisation |
| Privacy Act (Australia) APP 3 | Collection of personal information limitation | Synthetic data reduces real data collection requirements in AI development |
| Privacy Act (Australia) APP 11 | Security of personal information | DP-SGD limits memorisation; generator model access controlled |
| EU AI Act Article 10(3) | Examine data for biases | Bias distribution comparison in utility validation |
| EU AI Act Article 10(5) | Sensitive attribute processing for bias detection/correction | Utility validation includes demographic distribution comparison |
| APRA CPS 234 | Data integrity | Privacy + utility validation certificates provide attestation of synthetic data integrity |
| ISO 42001 §8.4 | Data governance for AI | Synthetic Dataset Certificate is a documented governance artefact |
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Generator training compute | SageMaker Training Jobs (GPU) |
| Generator type | CTGAN on SageMaker + Opacus DP-SGD |
| Privacy validation | SageMaker Processing Jobs |
| Synthetic data storage | S3 |
| Catalogue | AWS Glue Data Catalog + custom certificate store in DynamoDB |
Azure
| Component | Azure Service |
|---|---|
| Generator training | Azure ML Compute (GPU) |
| DP framework | Opacus or SmartNoise on Azure ML |
| Privacy / utility validation | Azure ML Pipelines |
| Synthetic data storage | ADLS Gen2 |
| Catalogue | Microsoft Purview (formerly Azure Purview) |
GCP
| Component | GCP Service |
|---|---|
| Generator training | Vertex AI Custom Training (GPU) |
| DP framework | Google DP library + Opacus |
| Validation | Vertex AI Pipelines |
| Storage | GCS |
| Catalogue | Google Dataplex |
On-Premises
| Component | Technology |
|---|---|
| Generator | CTGAN + Opacus on GPU Kubernetes |
| Validation | Custom Python pipeline on Kubernetes |
| Storage | MinIO |
| Catalogue | OpenMetadata or DataHub |
16. Related Patterns
| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| Privacy by Design for AI Data | EAAPL-DAT005 | Complements | Synthetic data is a privacy-by-design technique |
| AI Training Data Governance | EAAPL-DAT007 | Depends on | Synthetic datasets must be governed in training data registry |
| Data Quality for AI | EAAPL-DAT002 | Complements | Utility validation aligns with quality dimension of training data |
| Active Learning Loop | EAAPL-HIL002 | Complements | Synthetic data augments rare-class samples for annotation |
| Fine-Tuning Pipeline | EAAPL-MDL006 | Enables | Synthetic data enables fine-tuning where real data is restricted |
17. Maturity Assessment
Overall Maturity: Proven — Core synthetic data generation techniques (CTGAN, VAE, SDV) are mature and production-proven. Differential privacy integration (Opacus) is mature. Regulatory acceptance of DP-validated synthetic data is growing but jurisdiction-specific.
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 4 | Generation pipeline well-defined; DP parameter tuning remains specialist skill |
| Tooling maturity | 4 | CTGAN/VAE/SDV mature; enterprise platforms (Mostly AI) mature |
| Regulatory alignment | 4 | Strong GDPR alignment; EU AI Act acceptance emerging |
| Operational complexity | 3 | DP parameter tuning requires expertise; GAN training unstable |
| Cost efficiency | 4 | OSS stack cost-effective; amortised across many use cases |
| Security | 4 | DP-SGD limits memorisation; generator access controls required |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2023-10-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-04-15 | EAAPL Working Group | Added DP-SGD framework; membership inference testing detail |
| 1.2 | 2025-03-01 | EAAPL Working Group | Added EU AI Act Article 10(5) alignment; updated enterprise platform options |