Adversarial Input Defence
[EAAPL-SEC010] Adversarial Input Defence
Category: Security / Model Robustness
Sub-category: Adversarial ML Defence
Version: 1.0
Maturity: Emerging
Tags: adversarial-ml model-robustness anomaly-detection certified-defence ensemble-verification adversarial-examples
Regulatory Relevance: EU AI Act Art. 15 (Robustness), NIST AI RMF MANAGE 1.3, APRA CPS234 §21, ISO 42001 §8.4
1. Executive Summary
Adversarial Input Defence addresses a class of attacks specifically designed to manipulate AI model behaviour through carefully crafted inputs — not through natural language manipulation (prompt injection, addressed by EAAPL-SEC002) but through mathematical perturbations and statistical exploitation of model decision boundaries. These attacks can cause image classifiers to misidentify objects, speech recognition systems to transcribe incorrect commands, and ML-based fraud detection systems to approve fraudulent transactions.
While adversarial examples have been studied primarily in computer vision and NLP classification contexts, they represent a growing concern for any AI system whose outputs influence high-stakes decisions. A fraud detection model, a document classification system, or an AI-powered identity verification system can all be targeted with adversarial inputs crafted to produce specific incorrect outputs.
This pattern defines the defence architecture: anomaly detection on inputs (detecting statistically unusual inputs that may be adversarial), certified defences (mathematical guarantees of model prediction within a perturbation radius), ensemble verification (using multiple models to detect inconsistency that indicates adversarial manipulation), model robustness testing (continuous adversarial red-teaming), and human escalation for high-uncertainty decisions. The maturity rating is Emerging — the field is active with strong academic foundations but limited production deployment at scale.
2. Problem Statement
Business Problem
Organisations deploying AI in high-stakes decision-making contexts (fraud detection, identity verification, content moderation, medical imaging, access control) face a targeted attack vector: adversaries who understand the AI system can craft inputs specifically designed to obtain a desired incorrect output. Unlike random errors, adversarial examples are not caught by accuracy metrics — the model performs well on normal data but can be reliably fooled on specially crafted inputs.
Documented adversarial attack scenarios in production AI:
- Adversarial patches on physical objects causing object detectors to misclassify items.
- Adversarial perturbations in audio causing speech systems to recognise hidden commands.
- Adversarial text modifications causing spam or phishing filters to classify malicious content as benign.
- Model inversion attacks recovering training data (including personal data) from model outputs.
Technical Problem
Neural networks learn complex non-linear decision boundaries. These boundaries have a property that makes them vulnerable: small, carefully calculated perturbations to an input can move it across a decision boundary, changing the model's prediction dramatically while the input remains perceptually identical to a human. The perturbation is not random noise — it is calculated to maximally exploit the model's specific decision boundary.
Defence is mathematically challenging because:
- The adversary has (or can approximate) access to the model's decision function.
- Certified defences that provide mathematical guarantees typically come with accuracy/performance tradeoffs.
- Adversarial training (training on adversarial examples) improves robustness but does not provide guarantees.
- Transfer attacks allow adversarial examples crafted against one model to fool related models.
Symptoms
- AI system producing unexpected outputs on edge-case inputs.
- Model performance is significantly better on clean test data than on production data.
- Identical content classified differently based on subtle metadata variations.
- Suspiciously correlated false negatives in fraud detection or content moderation.
Cost of Inaction
| Dimension | Impact |
|---|---|
| Financial | Adversarial fraud detection bypass enabling financial crime |
| Security | Adversarial identity verification bypass enabling unauthorised access |
| Regulatory | EU AI Act Art. 15 requires robustness for high-risk AI systems — ungoverned adversarial risk is a compliance gap |
| Safety | Medical AI systems producing incorrect outputs on adversarially crafted medical images |
| Reputational | Demonstrated adversarial bypass of AI security control causes significant trust erosion |
3. Context
When to Apply
- AI systems making high-stakes decisions in adversarial environments (fraud detection, access control, content moderation).
- AI systems where adversaries have financial or strategic incentive to obtain specific incorrect outputs.
- Image/audio classification systems in physical security contexts.
- AI-based authentication or identity verification systems.
- Regulated high-risk AI systems where EU AI Act Art. 15 robustness requirements apply.
When NOT to Apply
- AI systems in low-stakes, non-adversarial environments where adversarial input construction is not a plausible threat model.
- Generative LLM text completion where adversarial example attacks are less directly applicable (prompt injection, addressed by SEC002, is the more relevant vector).
- Proof-of-concept systems not processing real adversarial inputs.
Prerequisites
| Prerequisite | Detail |
|---|---|
| Threat Model | Documented threat model identifying adversarial input construction as a plausible attack vector |
| Model Architecture Access | Ability to run inference through multiple models for ensemble verification |
| Input Distribution Understanding | Baseline statistical distribution of legitimate inputs for anomaly detection |
| Red Team Capability | Ability to run adversarial robustness tests against production models |
Industry Applicability
| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services (Fraud Detection) | High | Adversarial fraud bypass; significant financial incentive for attackers |
| Physical Security / Access Control | High | Adversarial image/facial recognition bypass |
| Healthcare (Medical Imaging AI) | High | Safety-critical; EU AI Act high-risk classification |
| Content Moderation | High | Evading moderation filters to distribute harmful content |
| Autonomous Systems | Critical | Physical-world adversarial attacks (traffic sign manipulation) |
| Government / National Security | Critical | Adversarial attacks by sophisticated threat actors |
4. Architecture Overview
Adversarial input defence is implemented as a defence-in-depth architecture where no single control is relied upon exclusively — adversarial defence is inherently probabilistic, and the combination of controls provides significantly better coverage than any individual technique.
Layer 1: Statistical Input Anomaly Detection
The first layer measures the statistical distance of an incoming input from the distribution of legitimate inputs seen during training and operation. Adversarial examples often have detectable statistical properties: unnaturally smooth perturbation patterns, unusual pixel value distributions, or feature representations that cluster near decision boundaries rather than in the interior of class regions. Detection techniques:
- Feature squeezing: Apply smoothing filters to the input; if the model's prediction changes significantly after smoothing, the input may be adversarial (adversarial perturbations are sensitive to smoothing).
- Input distribution monitoring: Compute the Mahalanobis distance from the input to the training distribution in feature space; flag inputs with unusually large distance.
- Boundary proximity detection: Measure how close the input's feature representation is to the model's decision boundary; inputs deliberately close to the boundary (but not crossing it) may be adversarial probes.
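The feature-squeezing check above can be sketched as follows. This is a minimal illustration, not a production detector: `toy_model`, the median-filter window, and the 0.5 flagging threshold are all hypothetical choices, and a real deployment would call the actual model and calibrate the threshold on clean traffic.

```python
import math
from statistics import median
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], list[float]]  # returns class probabilities


def median_filter(x: Sequence[float], window: int = 3) -> list[float]:
    """Replace each value with the median of its neighbourhood (edges truncated)."""
    half = window // 2
    return [median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]


def instability_score(model: Model, x: Sequence[float]) -> float:
    """L1 distance between the predictions on the raw and the squeezed input."""
    p_raw = model(x)
    p_squeezed = model(median_filter(x))
    return sum(abs(a - b) for a, b in zip(p_raw, p_squeezed))


def is_suspicious(model: Model, x: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the input when smoothing changes the prediction substantially.

    The threshold is an assumed value and must be tuned per model.
    """
    return instability_score(model, x) > threshold


def toy_model(x: Sequence[float]) -> list[float]:
    """Hypothetical stand-in model: class 1 becomes likely when any value exceeds 0.8."""
    p1 = 1.0 / (1.0 + math.exp(-20.0 * (max(x) - 0.8)))
    return [1.0 - p1, p1]


clean = [0.2, 0.25, 0.3, 0.25, 0.2]
spiked = [0.2, 0.25, 0.95, 0.25, 0.2]  # single-point perturbation flips the class
print(is_suspicious(toy_model, clean), is_suspicious(toy_model, spiked))
```

The smooth signal keeps its prediction after filtering, while the isolated spike is removed by the median filter and the prediction flips, which is the instability the Feature Squeezer is meant to surface.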
Layer 2: Certified Defences
Certified defences provide mathematical guarantees: for any input within a specified perturbation radius (e.g., L∞ norm ≤ 0.01) of the original, the model's prediction is certified not to change. The current state of practice:
- Randomised smoothing: The model is queried on multiple copies of the input with added Gaussian noise; the majority vote provides a certified prediction under L2 norm perturbations. This is the most practical certified defence for large models.
- Lipschitz-constrained networks: Models trained with constraints on their Lipschitz constant provide formal stability guarantees but with accuracy/capacity tradeoffs.
- Certified defences are computationally expensive (randomised smoothing requires N model evaluations per prediction) and are applied selectively to high-risk decision contexts.
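A minimal sketch of randomised smoothing certification, loosely following the structure described by Cohen et al. Two simplifications are assumptions of this sketch: the lower confidence bound uses Hoeffding's inequality (more conservative than the Clopper-Pearson bound used in the original work, but available without third-party libraries), and the classifier, `sigma`, `n0`, `n`, and `alpha` values are illustrative only.

```python
import math
import random
from collections import Counter
from statistics import NormalDist
from typing import Callable, Optional, Sequence

Classifier = Callable[[Sequence[float]], int]  # returns a class label


def sample_votes(f: Classifier, x: Sequence[float], sigma: float,
                 n: int, rng: random.Random) -> Counter:
    """Classify n copies of x, each with independent Gaussian noise added."""
    votes: Counter = Counter()
    for _ in range(n):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        votes[f(noisy)] += 1
    return votes


def certify(f: Classifier, x: Sequence[float], sigma: float = 0.25,
            n0: int = 100, n: int = 1000, alpha: float = 0.001,
            rng: Optional[random.Random] = None) -> tuple[Optional[int], float]:
    """Return (label, certified L2 radius), or (None, 0.0) to abstain."""
    rng = rng or random.Random(0)
    # Step 1: pick the candidate class from a small selection batch.
    guess = sample_votes(f, x, sigma, n0, rng).most_common(1)[0][0]
    # Step 2: estimate its probability on a fresh, independent batch.
    p_hat = sample_votes(f, x, sigma, n, rng)[guess] / n
    # Step 3: one-sided Hoeffding lower bound, valid with probability >= 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # cannot certify: escalate instead of predicting
    # Step 4: certified radius R = sigma * Phi^{-1}(p_lower).
    return guess, sigma * NormalDist().inv_cdf(p_lower)


def toy_classifier(x: Sequence[float]) -> int:
    """Hypothetical model: class 1 when the first feature is positive."""
    return 1 if x[0] > 0 else 0


print(certify(toy_classifier, [2.0, 0.0]))   # far from the boundary: certified
print(certify(toy_classifier, [0.0, 0.0]))   # on the boundary: expected to abstain
```

The abstain branch corresponds to the "cannot-certify" outcome that this pattern escalates to human review. Note that each call costs `n0 + n` model evaluations, which is why certification is applied selectively.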
Layer 3: Ensemble Verification
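An adversarial example crafted against one model is less likely to fool an independent model with a different architecture, training data order, or random initialisation. Transfer attacks (see Technical Problem) mean that closely related models can share vulnerabilities, so ensemble members should be as diverse as practical. Ensemble verification queries multiple independent models and compares their outputs: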
- High agreement across models → high confidence prediction.
- Low agreement → possible adversarial input or genuine ambiguity; flag for human review or apply certified defence.
- This technique detects both adversarial examples and genuine edge cases where model confidence should be low.
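The agreement check can be sketched as a majority vote with an agreement ratio. The function name, the `"accept"`/`"escalate"` labels, and the default requirement of unanimous agreement are assumptions of this sketch; the appropriate threshold depends on ensemble size and risk appetite.

```python
from collections import Counter
from typing import Hashable, Sequence


def ensemble_verdict(predictions: Sequence[Hashable],
                     min_agreement: float = 1.0) -> tuple[Hashable, float, str]:
    """Return (majority label, agreement ratio, action) for one input.

    predictions holds one label per ensemble member. min_agreement is an
    assumed threshold: 1.0 means every model must agree.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    action = "accept" if agreement >= min_agreement else "escalate"
    return label, agreement, action


print(ensemble_verdict(["fraud", "fraud", "fraud"]))
print(ensemble_verdict(["fraud", "legit", "fraud"]))
```

With three models, a single dissenting vote drops agreement to two thirds and triggers escalation under the default threshold.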
Layer 4: Continuous Adversarial Red-Teaming
The adversarial landscape evolves. Static defences become less effective as attackers adapt to known defence techniques. Continuous red-teaming:
- Automated adversarial example generation (FGSM, PGD, and Carlini-Wagner (C&W) attacks) against production models on a weekly schedule.
- Robustness metrics tracked over time: clean accuracy, adversarial accuracy, certified accuracy.
- Regression gates: model deployment blocked if adversarial accuracy drops below threshold relative to baseline.
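To make the clean-versus-adversarial accuracy metrics concrete, here is a small sketch of FGSM against a logistic-regression model, written without external libraries. The weights, data points, and `eps` are invented for illustration; a production red-team suite would use a maintained toolkit such as those listed in the Components table rather than hand-written attacks.

```python
import math
from typing import Sequence


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w: Sequence[float], b: float, x: Sequence[float],
         y: int, eps: float) -> list[float]:
    """FGSM: x_adv = x + eps * sign(dLoss/dx), with binary cross-entropy loss.

    For logistic regression the input gradient is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def accuracy(w, b, xs, ys) -> float:
    """Fraction of inputs whose thresholded prediction matches the label."""
    correct = sum(int(predict_proba(w, b, x) >= 0.5) == y for x, y in zip(xs, ys))
    return correct / len(xs)


def adversarial_accuracy(w, b, xs, ys, eps: float) -> float:
    """Accuracy after every input is perturbed by FGSM within an L-inf budget eps."""
    adv = [fgsm(w, b, x, y, eps) for x, y in zip(xs, ys)]
    return accuracy(w, b, adv, ys)


# Hypothetical model and evaluation set.
w, b = [2.0, -1.0], 0.0
xs = [[1.0, 0.0], [0.2, 0.0], [-1.0, 0.0], [-0.1, 0.1]]
ys = [1, 1, 0, 0]

print("clean accuracy:", accuracy(w, b, xs, ys))
print("adversarial accuracy:", adversarial_accuracy(w, b, xs, ys, eps=0.2))
```

The two points closest to the decision boundary are flipped by the perturbation while the two confident points survive, which is the gap between clean and adversarial accuracy that the robustness metrics track over time.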
Human Escalation
For inputs that trigger anomaly detection or show low ensemble agreement, escalation to human review is the last line of defence. The escalation mechanism:
- High-risk decisions with adversarial indicators are queued for human review rather than automated action.
- The human review result is fed back into the anomaly detection baseline to improve future detection.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Feature Squeezer | Anomaly Detection | Applies smoothing transformations; detects prediction instability | Custom pre-processing (median filter, bit-depth reduction) | High |
| Distribution Monitor | Anomaly Detection | Measures Mahalanobis distance from input to training distribution in feature space | Custom PyTorch/TF layer; Alibi Detect | High |
| Boundary Proximity Detector | Anomaly Detection | Estimates distance to decision boundary; flags inputs near boundary | Custom; requires access to model gradients or surrogate model | Medium |
| Randomised Smoothing Service | Certified Defence | Runs N noisy model evaluations; computes certified prediction | Smoothed classifiers (Cohen et al.); custom implementation | Medium |
| Ensemble Inference | Ensemble | Queries multiple independent models; computes agreement | Multiple model endpoints (same framework, different architecture/seed) | High |
| Human Review Queue | Escalation | Queues uncertain/suspicious inputs for human decision | ServiceNow, Jira, custom review UI | High |
| Adversarial Attack Generator | Red-Teaming | Generates adversarial examples for automated robustness testing | Adversarial Robustness Toolbox (IBM ART), Foolbox, CleverHans | High |
| Robustness Dashboard | Observability | Tracks clean accuracy, adversarial accuracy, certified accuracy over time | Grafana + Prometheus; MLflow metrics | High |
| Feedback Loop | ML Operations | Incorporates human review decisions into anomaly detection baseline | Custom pipeline; ML feature store | Medium |
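The Distribution Monitor's core computation, the Mahalanobis distance, can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions: the helper names are hypothetical, the linear solve is a basic Gaussian elimination, and in practice the vectors would be feature-space embeddings from the model, with a numerical library handling the linear algebra.

```python
import math
from typing import Sequence

Matrix = list[list[float]]


def mean_and_cov(rows: Sequence[Sequence[float]]) -> tuple[list[float], Matrix]:
    """Sample mean and unbiased sample covariance of the reference features."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def solve(a: Matrix, b: Sequence[float]) -> list[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x: Sequence[float], mu: Sequence[float], cov: Matrix) -> float:
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed without forming the inverse."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return math.sqrt(sum(d * y for d, y in zip(diff, solve(cov, diff))))


reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu, cov = mean_and_cov(reference)
print(mahalanobis([1.0, 1.0], mu, cov), mahalanobis([6.0, 6.0], mu, cov))
```

An input at the centre of the reference data scores zero, while one far outside it scores high; the flagging threshold is deployment-specific and would be set from the distribution of distances observed on legitimate inputs.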
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Input | Arrives at adversarial defence pipeline | Raw input (image/audio/text/structured) |
| 2 | Feature Squeezer | Applies smoothing; compares predictions before/after | Instability score |
| 3 | Distribution Monitor | Computes feature-space distance from training distribution | Mahalanobis distance score |
| 4 | Boundary Proximity | Estimates proximity to decision boundary | Proximity score |
| 5 | Anomaly Aggregator | Combines scores; determines routing | Low/Medium/High anomaly score |
| 6a | Anomaly Aggregator (Low score) | Routes directly to primary model | Standard inference result |
| 6b | Anomaly Aggregator (Medium score) | Routes to ensemble verification (3+ models) | Agreement level + predictions |
| 6c | Anomaly Aggregator (High score) | Routes to certified defence (randomised smoothing) | Certified prediction + radius OR cannot-certify |
| 7a | Ensemble Inference (high agreement) | Passes to decision aggregator | Confident prediction |
| 7b | Ensemble Inference (low agreement) | Escalates to human review | Pending human decision |
| 7c | Randomised Smoothing Service (cannot certify) | Escalates to human review | Pending human decision |
| 8 | Decision Aggregator | Combines prediction with adversarial risk score | Final decision + metadata |
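The routing logic in steps 5 to 7 of the table can be sketched as below. The weights, the 0.3 and 0.7 band boundaries, and the assumption that all three scores are normalised to [0, 1] are hypothetical; so is sending a successfully certified prediction to the decision aggregator, which the table implies but does not state.

```python
from enum import Enum


class Route(Enum):
    PRIMARY_MODEL = "primary_model"
    ENSEMBLE = "ensemble_verification"
    CERTIFIED_DEFENCE = "certified_defence"
    HUMAN_REVIEW = "human_review"
    DECISION_AGGREGATOR = "decision_aggregator"


def anomaly_level(instability: float, distance: float, proximity: float,
                  weights: tuple[float, float, float] = (0.4, 0.4, 0.2),
                  medium_at: float = 0.3, high_at: float = 0.7) -> str:
    """Step 5: combine the three detector scores (assumed in [0, 1]) into a band."""
    score = (weights[0] * instability + weights[1] * distance
             + weights[2] * proximity)
    if score >= high_at:
        return "high"
    if score >= medium_at:
        return "medium"
    return "low"


def first_hop(level: str) -> Route:
    """Steps 6a-6c: choose the inference path from the anomaly band."""
    return {"low": Route.PRIMARY_MODEL,
            "medium": Route.ENSEMBLE,
            "high": Route.CERTIFIED_DEFENCE}[level]


def after_ensemble(agreement: float, min_agreement: float = 1.0) -> Route:
    """Steps 7a-7b: high agreement proceeds, low agreement is escalated."""
    return (Route.DECISION_AGGREGATOR if agreement >= min_agreement
            else Route.HUMAN_REVIEW)


def after_certification(certified: bool) -> Route:
    """Step 7c: a cannot-certify result is escalated to human review."""
    return Route.DECISION_AGGREGATOR if certified else Route.HUMAN_REVIEW


print(first_hop(anomaly_level(0.0, 0.1, 0.0)))
print(first_hop(anomaly_level(0.9, 0.9, 0.9)))
```

A weighted sum is only one way to aggregate the scores; taking the maximum of the three is a stricter alternative when any single strong indicator should be enough to change the route.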
8. Security Considerations
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Adversarial Input Defence Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Anomaly detection catches adversarially crafted text inputs that target model decision boundaries | Medium |
| LLM02: Insecure Output Handling | Ensemble disagreement detection can flag anomalous outputs | Medium |
| LLM03: Training Data Poisoning | Red-teaming detects degraded robustness that may indicate training data poisoning | Medium |
| LLM04: Model Denial of Service | Certified defence (N evaluations) is computationally expensive — protect with rate limiting | Low |
| LLM05: Supply Chain Vulnerabilities | Not directly applicable | None |
| LLM06: Sensitive Information Disclosure | Not directly applicable | None |
| LLM07: Insecure Plugin Design | Not directly applicable | None |
| LLM08: Excessive Agency | Not directly applicable | None |
| LLM09: Overreliance | Adversarial risk score in output provides explicit signal against overreliance for adversarial inputs | High |
| LLM10: Model Theft | Model inversion defences limit information returned per query; ensemble inconsistency detection | Medium |
Encryption
- All inter-component communication uses TLS 1.3.
- Human review queue content encrypted at rest (may contain sensitive inputs).
9. Governance Considerations
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Robustness Benchmark Report | AI Risk Team | Weekly (automated) + monthly (reviewed) | Tracks adversarial accuracy vs baseline; identifies degradation |
| Human Review Escalation Log | AI Operations | Continuous; weekly review | Record of all inputs escalated for human review; patterns |
| Red Team Exercise Report | Security Team | Quarterly | Documents adversarial attack scenarios tested; findings; remediation |
| Model Robustness Validation (EU AI Act) | AI Risk + Compliance | Pre-deployment; annually | Evidence of robustness testing for high-risk AI Act compliance |
| Anomaly Detection Calibration | AI Platform | Monthly | Reviews anomaly threshold tuning; false positive/negative rates |
10. Operational Considerations
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Adversarial detection rate (known attacks) | >90% of test attack suite flagged | Weekly automated red-team suite |
| False positive rate (clean inputs flagged) | <2% | Monthly audit of human review queue |
| Certified defence latency (p99) | <2s (N=100 evaluations) | Defence latency metric |
| Ensemble verification latency (p99) | <500ms (3 models × 100ms) | Ensemble span |
| Robustness regression gate | Block if adversarial accuracy drops >5% vs baseline | Automated gate in model deployment pipeline |
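The robustness-related SLOs above can be checked automatically in a deployment pipeline. The sketch below assumes that "drops >5%" means an absolute drop of more than five percentage points in adversarial accuracy; if a relative drop is intended, the comparison would need to change. The `RobustnessReport` type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RobustnessReport:
    detection_rate: float        # fraction of the test attack suite flagged
    false_positive_rate: float   # fraction of clean inputs flagged
    adversarial_accuracy: float  # accuracy on the adversarial test suite


def slo_violations(report: RobustnessReport, baseline_adv_accuracy: float,
                   max_drop: float = 0.05) -> list[str]:
    """Return the names of violated SLOs; an empty list means the gate passes."""
    violations = []
    if report.detection_rate <= 0.90:        # target: >90% flagged
        violations.append("detection_rate")
    if report.false_positive_rate >= 0.02:   # target: <2%
        violations.append("false_positive_rate")
    # Assumption: the 5% threshold is measured in absolute percentage points.
    if baseline_adv_accuracy - report.adversarial_accuracy > max_drop:
        violations.append("robustness_regression")
    return violations


candidate = RobustnessReport(detection_rate=0.95, false_positive_rate=0.01,
                             adversarial_accuracy=0.68)
print(slo_violations(candidate, baseline_adv_accuracy=0.70))
```

A pipeline step would block deployment whenever the returned list is non-empty and attach it to the Robustness Benchmark Report.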
Incident Management
- Adversarial attack pattern detected in production inputs → P2: Security + AI team investigation; pattern analysis; defence update.
- Adversarial accuracy below threshold → P2: Block model update deployment; investigate training pipeline.
- Human review queue backlog exceeds N hours → Alert to AI operations team.
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Relative Impact |
|---|---|---|
| Ensemble inference | N models × inference cost per request | High |
| Certified defence compute | N=100–1000 model evaluations per prediction | Very High |
| Red-teaming infrastructure | Attack generation compute; model evaluation at scale | Medium |
| Human reviewer time | Reviewing escalated uncertain inputs | Medium |
| Anomaly detection compute | Feature extraction and distance computation | Low–Medium |
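The two dominant drivers in the table scale with how often inputs are routed to the ensemble or to certified defence. A rough way to reason about this is the expected number of model evaluations per request, sketched below. The routing fractions in the example are invented for illustration, and the sketch assumes the ensemble path costs exactly `ensemble_size` evaluations and the certified path exactly `smoothing_samples`.

```python
def expected_evaluations(frac_low: float, frac_medium: float, frac_high: float,
                         ensemble_size: int = 3,
                         smoothing_samples: int = 100) -> float:
    """Expected model evaluations per request under the three routing paths.

    Low-score inputs cost one evaluation, medium-score inputs cost one per
    ensemble member, and high-score inputs cost one per smoothing sample.
    """
    if abs(frac_low + frac_medium + frac_high - 1.0) > 1e-9:
        raise ValueError("routing fractions must sum to 1")
    return (frac_low * 1
            + frac_medium * ensemble_size
            + frac_high * smoothing_samples)


# Hypothetical traffic mix: 95% low, 4% medium, 1% high anomaly score.
print(expected_evaluations(0.95, 0.04, 0.01))
```

Even when only 1% of traffic reaches certified defence, that path contributes roughly half of the total evaluations in this example, which is why the anomaly thresholds are a significant cost lever.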
Indicative Cost Range
| Scale | Monthly Cost (USD) | Notes |
|---|---|---|
| Small (< 100K predictions/day) | $500–$2,000 | 3-model ensemble; sampling-based anomaly detection |
| Medium (100K–1M predictions/day) | $3,000–$15,000 | Full ensemble; selective certified defence; GPU inference |
| Large (> 1M predictions/day) | $20,000–$80,000 | Distributed ensemble; GPU cluster for certified defence; dedicated red-team |
12. Trade-Off Analysis
Option Comparison
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Adversarial training only | Train model on adversarial examples to improve robustness | Improved robustness with no inference overhead | No mathematical guarantee; doesn't defend against all attack types; accuracy tradeoff | General robustness improvement; low-risk applications |
| B: Anomaly detection only | Statistical detection of adversarial inputs; no certified defence | Low inference overhead; practical at scale | No guarantees; may be evaded by adaptive attackers | Most production deployments; reasonable threat model |
| C: Full certified + ensemble (this pattern) | Statistical detection + ensemble verification + randomised smoothing | Strongest defence; mathematical guarantees for high-risk decisions | High inference cost (N× evaluations); complex ops | High-risk AI systems; EU AI Act high-risk classification |
| D: Human review only | All uncertain inputs reviewed by human | Not susceptible to model-targeted perturbations; strongest assurance for individual decisions | Not scalable; high latency; expensive; subject to reviewer error | Decisions too important to automate at all |
Architectural Tensions
| Tension | Trade-Off |
|---|---|
| Security vs Latency | Certified defence (for example, N=512 evaluations per prediction) adds significant latency. Resolution: apply certified defence only to high-risk decisions; use anomaly detection for real-time paths. |
| Guaranteed Robustness vs Accuracy | Certified defences typically sacrifice 5–15% clean accuracy for mathematical robustness guarantees. Resolution: evaluate the tradeoff per use case; for high-stakes decisions, the accuracy tradeoff is usually acceptable. |
| Defence Completeness vs Adaptive Attackers | Publicly known defences can be circumvented by adaptive attackers who craft attacks specifically against the defence. Resolution: defence diversity (don't publish exact defence configuration); continuous red-teaming. |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Adaptive attacker bypasses known defences | Medium | High | Red-team detects; human review catches consequence | Update defences; increase ensemble diversity; tighten anomaly thresholds |
| Certified defence unable to certify (radius too small) | Medium | Medium (escalates to human) | Cannot-certify rate metric | Acceptable if human review capacity is adequate; a sustained high rate indicates inputs lying near the decision boundary |
| False positive spike (clean inputs flagged as adversarial) | Medium | High (legitimate decisions delayed) | FPR metric; human review queue depth | Threshold tuning; model retraining |
| Human review queue overflow | Medium | High (escalated inputs unreviewed) | Queue depth metric | Scale human review team; emergency: automated fallback with logging |
| Ensemble model diverges from primary (legitimate accuracy drop) | Low | Medium (increased escalation rate) | Ensemble agreement rate drop | Retrain ensemble models; investigate training drift |
14. Regulatory Considerations
| Regulation | Requirement | Implementation |
|---|---|---|
| EU AI Act Art. 15 (Robustness and Accuracy) | High-risk AI systems must be resilient to attempts to alter outputs or performance | Adversarial defence directly implements robustness requirement; red-team evidence demonstrates testing |
| EU AI Act Art. 15 (Technical Robustness) | Robustness must be achieved by design and validated through testing | Continuous red-teaming and robustness benchmarking provide test evidence |
| NIST AI RMF MANAGE 1.3 | Responses to identified risks monitored and adjusted | Feedback loop from human review to anomaly detection implements continuous management |
| ISO/IEC 42001 §8.4 (Incident Management) | AI system incidents monitored and managed | Adversarial attack detection and escalation implements §8.4 |
| APRA CPS234 §21 | Controls commensurate with threat environment | For regulated AI systems, adversarial defences are a commensurate control for adversarial threat actors |
15. Reference Implementations
AWS
| Component | AWS Service / OSS |
|---|---|
| Anomaly detection | SageMaker Model Monitor (data drift) + custom feature-space anomaly |
| Ensemble inference | Multiple SageMaker endpoints; SageMaker inference pipeline |
| Certified defence | Custom Lambda function (randomised smoothing implementation) |
| Red-teaming | SageMaker Processing + Adversarial Robustness Toolbox (IBM ART) |
| Human review | Amazon A2I (Augmented AI) for human review workflows |
Azure
| Component | Azure Service / OSS |
|---|---|
| Anomaly detection | Azure ML data drift detection + custom Mahalanobis |
| Ensemble | Multiple Azure ML endpoints; Azure ML pipelines |
| Certified defence | Custom Azure Functions (randomised smoothing) |
| Red-teaming | Azure ML automated ML + Foolbox |
On-Premises
| Component | Technology |
|---|---|
| Anomaly detection | Alibi Detect (open source) |
| Ensemble | Multiple model servers (Triton Inference Server) |
| Certified defence | Custom PyTorch implementation (Cohen et al. randomised smoothing) |
| Red-teaming | IBM Adversarial Robustness Toolbox (ART) |
| Human review | Custom review UI + Kafka task queue |
16. Related Patterns
| Pattern | ID | Relationship |
|---|---|---|
| Prompt Firewall | EAAPL-SEC002 | SEC002 covers NLP adversarial attacks (prompt injection); SEC010 covers ML-level adversarial examples |
| Model Isolation | EAAPL-SEC003 | Isolation limits blast radius when adversarial bypass succeeds |
| AI Telemetry | EAAPL-OBS001 | Robustness metrics are collected through telemetry infrastructure |
| Model Drift Detection | EAAPL-OBS005 | Adversarial input patterns can cause observable model output drift |
| AI Performance Benchmarking | EAAPL-OBS008 | Adversarial accuracy benchmarking is an extension of SEC010's red-teaming |
17. Maturity Assessment
Overall Maturity: Emerging
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 3 | Strong academic foundation; production patterns still evolving |
| Technology availability | 3 | OSS tooling (ART, Alibi Detect) available; production-grade deployment patterns less mature |
| Industry adoption | 2 | Limited to the most security-mature organisations and specific high-risk domains |
| Certified defence practicality | 2 | Randomised smoothing practical only for specific model types; accuracy tradeoffs limit adoption |
| Regulatory alignment | 4 | EU AI Act Art. 15 creates direct regulatory driver |
| Research maturity | 5 | Deep academic literature (ICML, NeurIPS, ICLR); active research community |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-01-20 | Security Architecture Team | Initial pattern definition; reflects current state of emerging practice |