EAAPL: Enterprise AI Architecture Pattern Library

Adversarial Input Defence

🔐 AI Security · APRA CPS234 · EU AI Act · 🏭 Field-tested in AU

[EAAPL-SEC010] Adversarial Input Defence

Category: Security / Model Robustness
Sub-category: Adversarial ML Defence
Version: 1.0
Maturity: Emerging
Tags: adversarial-ml, model-robustness, anomaly-detection, certified-defence, ensemble-verification, adversarial-examples
Regulatory Relevance: EU AI Act Art. 15 (Robustness), NIST AI RMF MANAGE 1.3, APRA CPS234 §21, ISO 42001 §8.4


1. Executive Summary

Adversarial Input Defence addresses a class of attacks specifically designed to manipulate AI model behaviour through carefully crafted inputs — not through natural language manipulation (prompt injection, addressed by EAAPL-SEC002) but through mathematical perturbations and statistical exploitation of model decision boundaries. These attacks can cause image classifiers to misidentify objects, speech recognition systems to transcribe incorrect commands, and ML-based fraud detection systems to approve fraudulent transactions.

While adversarial examples have been studied primarily in computer vision and NLP classification contexts, they represent a growing concern for any AI system whose outputs influence high-stakes decisions. A fraud detection model, a document classification system, or an AI-powered identity verification system can all be targeted with adversarial inputs crafted to produce specific incorrect outputs.

This pattern defines the defence architecture: anomaly detection on inputs (detecting statistically unusual inputs that may be adversarial), certified defences (mathematical guarantees of model prediction within a perturbation radius), ensemble verification (using multiple models to detect inconsistency that indicates adversarial manipulation), model robustness testing (continuous adversarial red-teaming), and human escalation for high-uncertainty decisions. The maturity rating is Emerging — the field is active with strong academic foundations but limited production deployment at scale.


2. Problem Statement

Business Problem

Organisations deploying AI in high-stakes decision-making contexts (fraud detection, identity verification, content moderation, medical imaging, access control) face a targeted attack vector: adversaries who understand the AI system can craft inputs specifically designed to obtain a desired incorrect output. Unlike random errors, adversarial examples are not caught by accuracy metrics — the model performs well on normal data but can be reliably fooled on specially crafted inputs.

Documented adversarial attack scenarios in production AI:

  • Adversarial patches on physical objects causing object detectors to misclassify items.
  • Adversarial perturbations in audio causing speech systems to recognise hidden commands.
  • Adversarial text modifications causing spam or phishing filters to classify malicious content as benign.
  • Model inversion attacks recovering training data (including personal data) from model outputs.

Technical Problem

Neural networks learn complex non-linear decision boundaries. These boundaries have a property that makes them vulnerable: small, carefully calculated perturbations to an input can move it across a decision boundary, changing the model's prediction dramatically while the input remains perceptually identical to a human. The perturbation is not random noise — it is calculated to maximally exploit the model's specific decision boundary.

Defence is mathematically challenging because:

  • The adversary has (or can approximate) access to the model's decision function.
  • Certified defences that provide mathematical guarantees typically come with accuracy/performance tradeoffs.
  • Adversarial training (training on adversarial examples) improves robustness but does not provide guarantees.
  • Transfer attacks allow adversarial examples crafted against one model to fool related models.

Symptoms

  • AI system producing unexpected outputs on edge-case inputs.
  • Model performance significantly better on clean test data than production data.
  • Identical content classified differently based on subtle metadata variations.
  • Suspiciously correlated false negatives in fraud detection or content moderation.

Cost of Inaction

| Dimension | Impact |
|---|---|
| Financial | Adversarial fraud detection bypass enabling financial crime |
| Security | Adversarial identity verification bypass enabling unauthorised access |
| Regulatory | EU AI Act Art. 15 requires robustness for high-risk AI systems; ungoverned adversarial risk is a compliance gap |
| Safety | Medical AI systems producing incorrect outputs on adversarially crafted medical images |
| Reputational | Demonstrated adversarial bypass of AI security control causes significant trust erosion |

3. Context

When to Apply

  • AI systems making high-stakes decisions in adversarial environments (fraud detection, access control, content moderation).
  • AI systems where adversaries have financial or strategic incentive to obtain specific incorrect outputs.
  • Image/audio classification systems in physical security contexts.
  • AI-based authentication or identity verification systems.
  • Regulated high-risk AI systems where EU AI Act Art. 15 robustness requirements apply.

When NOT to Apply

  • AI systems in low-stakes, non-adversarial environments where adversarial input construction is not a plausible threat model.
  • Generative LLM text completion where adversarial example attacks are less directly applicable (prompt injection, addressed by SEC002, is the more relevant vector).
  • Proof-of-concept systems not processing real adversarial inputs.

Prerequisites

| Prerequisite | Detail |
|---|---|
| Threat Model | Documented threat model identifying adversarial input construction as a plausible attack vector |
| Model Architecture Access | Ability to run inference through multiple models for ensemble verification |
| Input Distribution Understanding | Baseline statistical distribution of legitimate inputs for anomaly detection |
| Red Team Capability | Ability to run adversarial robustness tests against production models |

Industry Applicability

| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services (Fraud Detection) | High | Adversarial fraud bypass; significant financial incentive for attackers |
| Physical Security / Access Control | High | Adversarial image/facial recognition bypass |
| Healthcare (Medical Imaging AI) | High | Safety-critical; EU AI Act high-risk classification |
| Content Moderation | High | Evading moderation filters to distribute harmful content |
| Autonomous Systems | Critical | Physical-world adversarial attacks (traffic sign manipulation) |
| Government / National Security | Critical | Adversarial attacks by sophisticated threat actors |

4. Architecture Overview

Adversarial input defence is implemented as a defence-in-depth architecture where no single control is relied upon exclusively — adversarial defence is inherently probabilistic, and the combination of controls provides significantly better coverage than any individual technique.

Layer 1: Statistical Input Anomaly Detection

The first layer measures the statistical distance of an incoming input from the distribution of legitimate inputs seen during training and operation. Adversarial examples often have detectable statistical properties: unnaturally smooth perturbation patterns, unusual pixel value distributions, or feature representations that cluster near decision boundaries rather than in the interior of class regions. Detection techniques:

  • Feature squeezing: Apply smoothing filters to the input; if the model's prediction changes significantly after smoothing, the input may be adversarial (adversarial perturbations are sensitive to smoothing).
  • Input distribution monitoring: Compute the Mahalanobis distance from the input to the training distribution in feature space; flag inputs with unusually large distance.
  • Boundary proximity detection: Measure how close the input's feature representation is to the model's decision boundary; inputs deliberately close to the boundary (but not crossing it) may be adversarial probes.
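
As a rough illustration of the feature squeezing check above, the sketch below reduces the bit depth of an input and measures how far the model's output moves. The function names, the 3-bit default, and the toy classifier in the usage note are assumptions made for illustration; they are not part of this pattern or of any particular library.

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantise values in [0, 1] to 2**bits levels (one common squeezing transform)."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


def squeeze_instability(predict_proba, pixels, bits=3):
    """L1 distance between the model's output before and after squeezing.

    predict_proba is any callable returning a list of class probabilities.
    A large value suggests the prediction depends on fine-grained detail,
    which is one signal (not proof) of an adversarial perturbation.
    """
    original = predict_proba(pixels)
    squeezed = predict_proba(reduce_bit_depth(pixels, bits))
    return sum(abs(a - b) for a, b in zip(original, squeezed))


def toy_classifier(pixels):
    """Hypothetical stand-in model: class 0 if mean brightness exceeds 0.45."""
    bright = 1.0 if sum(pixels) / len(pixels) > 0.45 else 0.0
    return [bright, 1.0 - bright]


stable = squeeze_instability(toy_classifier, [0.9, 0.9, 0.9, 0.9])
fragile = squeeze_instability(toy_classifier, [0.47, 0.47, 0.47, 0.47])
```

In this toy setup the bright input keeps its prediction after squeezing, while the input sitting just above the threshold flips class. The threshold at which an instability score counts as suspicious would need calibrating against legitimate traffic.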

Layer 2: Certified Defences

Certified defences provide mathematical guarantees: for inputs within a specified perturbation radius (e.g., L∞ norm ≤ 0.01), the model's prediction is certified to be stable. The state of the practice for certified defences:

  • Randomised smoothing: The model is queried on multiple copies of the input with added Gaussian noise; the majority vote provides a certified prediction under L2 norm perturbations. This is the most practical certified defence for large models.
  • Lipschitz-constrained networks: Models trained with constraints on their Lipschitz constant provide formal stability guarantees but with accuracy/capacity tradeoffs.
  • Certified defences are computationally expensive (randomised smoothing requires N model evaluations per prediction) and are applied selectively to high-risk decision contexts.
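
The majority-vote idea behind randomised smoothing can be sketched as follows. This is a simplified illustration under stated assumptions, not the published certification procedure: it uses a Hoeffding lower bound in place of the exact binomial bound, and it reuses the same noisy samples for selecting and for estimating the top class, which a rigorous implementation would keep separate. The function name and the defaults for `sigma`, `n`, and `alpha` are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def smoothed_predict(base_classifier, x, sigma=0.25, n=1000, alpha=0.001, rng=None):
    """Return (label, certified L2 radius), or (None, 0.0) when abstaining.

    base_classifier: callable mapping a list of floats to a class label.
    """
    rng = rng or random.Random(0)
    votes = Counter()
    for _ in range(n):
        # Query the base model on a Gaussian-perturbed copy of the input.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1

    top_label, top_count = votes.most_common(1)[0]
    # Hoeffding lower confidence bound on the top-class probability
    # (assumption: simpler, and looser, than an exact binomial bound).
    p_lower = top_count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # cannot certify: escalate instead of guessing
    return top_label, sigma * NormalDist().inv_cdf(p_lower)


def threshold_model(x):
    """Hypothetical base classifier: class 1 when the first feature is positive."""
    return 1 if x[0] > 0 else 0


label, radius = smoothed_predict(threshold_model, [2.0], rng=random.Random(0))
abstain_label, _ = smoothed_predict(threshold_model, [0.0], rng=random.Random(0))
```

An input far from the decision boundary is certified with a positive radius, while an input sitting on the boundary produces an abstention, which corresponds to the cannot-certify outcome that this pattern escalates to human review.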

Layer 3: Ensemble Verification

An adversarial example crafted against one model is less likely to fool an independent model with a different architecture, training data order, or random initialisation, although transfer attacks (see Technical Problem) mean this protection is partial rather than absolute. Ensemble verification queries multiple independent models and compares their outputs:

  • High agreement across models → high confidence prediction.
  • Low agreement → possible adversarial input or genuine ambiguity; flag for human review or apply certified defence.
  • This technique detects both adversarial examples and genuine edge cases where model confidence should be low.
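
A minimal sketch of the agreement check described above, assuming each model returns a single class label. The function name and the 0.8 agreement threshold are hypothetical and would need tuning per use case.

```python
from collections import Counter


def ensemble_verdict(predictions, min_agreement=0.8):
    """Return (majority_label, agreement_ratio, action) for a list of labels."""
    if not predictions:
        raise ValueError("at least one model prediction is required")
    label, count = Counter(predictions).most_common(1)[0]
    agreement = count / len(predictions)
    # Low agreement may mean adversarial input or genuine ambiguity;
    # either way the decision is not made automatically.
    action = "accept" if agreement >= min_agreement else "escalate"
    return label, agreement, action


unanimous = ensemble_verdict(["fraud", "fraud", "fraud"])
split = ensemble_verdict(["fraud", "legit", "fraud"])
```

With three models, a threshold of 0.8 effectively requires unanimity; a lower threshold trades fewer escalations for weaker detection.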

Layer 4: Continuous Adversarial Red-Teaming

The adversarial landscape evolves. Static defences become less effective as attackers adapt to known defence techniques. Continuous red-teaming:

  • Automated adversarial example generation (FGSM, PGD, and Carlini-Wagner (C&W) attacks) against production models on a weekly schedule.
  • Robustness metrics tracked over time: clean accuracy, adversarial accuracy, certified accuracy.
  • Regression gates: model deployment blocked if adversarial accuracy drops below threshold relative to baseline.
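
To make the metrics and the regression gate concrete, the sketch below runs an FGSM attack against a toy logistic-regression model, compares clean and adversarial accuracy, and applies a deployment gate. The model, the data, the epsilon value, and the interpretation of the threshold as an absolute drop in accuracy are all assumptions for illustration; a production suite would more likely use a toolkit such as ART or Foolbox against the real model.

```python
import math


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def fgsm(weights, bias, x, label, epsilon):
    """One-step FGSM: move each feature by epsilon along the sign of the loss gradient."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    prob = 1.0 / (1.0 + math.exp(-score))
    # Gradient of the logistic loss with respect to the input features.
    grad = [(prob - label) * w for w in weights]
    return [xi + epsilon * math.copysign(1.0, g) if g != 0 else xi
            for xi, g in zip(x, grad)]


def accuracy(weights, bias, dataset, epsilon=0.0):
    """Clean accuracy when epsilon is 0, adversarial accuracy otherwise."""
    correct = 0
    for x, label in dataset:
        if epsilon > 0:
            x = fgsm(weights, bias, x, label, epsilon)
        correct += predict(weights, bias, x) == label
    return correct / len(dataset)


def deployment_allowed(adversarial_accuracy, baseline, max_drop=0.05):
    """Regression gate: block when accuracy falls more than max_drop below baseline."""
    return baseline - adversarial_accuracy <= max_drop


# Hypothetical model and evaluation set.
weights, bias = [1.0, -1.0], 0.0
dataset = [([2.0, 0.0], 1), ([0.3, 0.0], 1), ([0.0, 2.0], 0), ([0.0, 0.2], 0)]

clean_acc = accuracy(weights, bias, dataset)
adv_acc = accuracy(weights, bias, dataset, epsilon=0.25)
```

Here the two points close to the decision boundary are flipped by the perturbation while the two distant points survive, which is the gap between clean and adversarial accuracy that the dashboard is meant to track.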

Human Escalation

For inputs that trigger anomaly detection or show low ensemble agreement, escalation to human review is the ultimate defence. The escalation mechanism:

  • High-risk decisions with adversarial indicators are queued for human review rather than automated action.
  • Human review result is fed back into the anomaly detection baseline to improve future detection.

5. Architecture Diagram

flowchart TD
    subgraph Detection["Defence Layers"]
        A[Incoming Input]
        B[Anomaly Detector]
        C{Risk Score}
        D[Ensemble Verification]
    end
    subgraph Decision["Decision + Escalation"]
        E[Primary Model]
        F[Human Review Queue]
        G[Final Decision]
    end
    subgraph Monitoring["Continuous Red-Teaming"]
        H[Attack Generator]
        I[Robustness Metrics]
    end
    A --> B --> C
    C -->|low risk| E --> G
    C -->|medium risk| D
    C -->|high risk| F
    D -->|agreement| G
    D -->|disagreement| F
    H --> I -->|regression gate| E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#fee2e2,stroke:#ef4444
    style G fill:#d1fae5,stroke:#10b981
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#fef9c3,stroke:#eab308

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Feature Squeezer | Anomaly Detection | Applies smoothing transformations; detects prediction instability | Custom pre-processing (median filter, bit-depth reduction) | High |
| Distribution Monitor | Anomaly Detection | Measures Mahalanobis distance from input to training distribution in feature space | Custom PyTorch/TF layer; Alibi Detect | High |
| Boundary Proximity Detector | Anomaly Detection | Estimates distance to decision boundary; flags inputs near boundary | Custom; requires access to model gradients or surrogate model | Medium |
| Randomised Smoothing Service | Certified Defence | Runs N noisy model evaluations; computes certified prediction | Smoothed classifiers (Cohen et al.); custom implementation | Medium |
| Ensemble Inference | Ensemble | Queries multiple independent models; computes agreement | Multiple model endpoints (same framework, different architecture/seed) | High |
| Human Review Queue | Escalation | Queues uncertain/suspicious inputs for human decision | ServiceNow, Jira, custom review UI | High |
| Adversarial Attack Generator | Red-Teaming | Generates adversarial examples for automated robustness testing | Adversarial Robustness Toolbox (IBM ART), Foolbox, CleverHans | High |
| Robustness Dashboard | Observability | Tracks clean accuracy, adversarial accuracy, certified accuracy over time | Grafana + Prometheus; MLflow metrics | High |
| Feedback Loop | ML Operations | Incorporates human review decisions into anomaly detection baseline | Custom pipeline; ML feature store | Medium |
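
The Distribution Monitor's core calculation can be sketched in plain Python as below. It assumes feature vectors have already been extracted from the model, that the reference sample is large enough for a non-singular covariance matrix, and that a small hand-written linear solver is acceptable for illustration; a real deployment would more likely rely on a numerical library or on Alibi Detect. All function names are hypothetical.

```python
import math


def _solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_distance(features, reference):
    """Distance of one feature vector from the distribution of reference vectors."""
    n, d = len(reference), len(features)
    mean = [sum(row[j] for row in reference) / n for j in range(d)]
    # Sample covariance matrix of the reference (training) features.
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in reference) / (n - 1)
            for j in range(d)] for i in range(d)]
    diff = [features[j] - mean[j] for j in range(d)]
    # Solving cov @ y = diff avoids forming the inverse covariance explicitly.
    solved = _solve(cov, diff)
    return math.sqrt(sum(a * b for a, b in zip(diff, solved)))


# Hypothetical two-dimensional feature vectors from legitimate inputs.
reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
at_centre = mahalanobis_distance([1.0, 1.0], reference)
off_centre = mahalanobis_distance([3.0, 1.0], reference)
```

The flagging threshold for "unusually large" distance is not specified by this pattern; it would typically be set from the distance distribution observed on legitimate inputs.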

7. Data Flow

Primary Flow

| Step | Actor / Condition | Action | Output |
|---|---|---|---|
| 1 | Input | Arrives at adversarial defence pipeline | Raw input (image/audio/text/structured) |
| 2 | Feature Squeezer | Applies smoothing; compares predictions before/after | Instability score |
| 3 | Distribution Monitor | Computes feature-space distance from training distribution | Mahalanobis distance score |
| 4 | Boundary Proximity | Estimates proximity to decision boundary | Proximity score |
| 5 | Anomaly Aggregator | Combines scores; determines routing | Low/Medium/High anomaly score |
| 6a | If Low score | Route directly to primary model | Standard inference result |
| 6b | If Medium score | Route to ensemble verification (3+ models) | Agreement level + predictions |
| 6c | If High score | Route to certified defence (randomised smoothing) | Certified prediction + radius OR cannot-certify |
| 7a | High ensemble agreement | Pass to decision aggregator | Confident prediction |
| 7b | Low ensemble agreement | Escalate to human review | Pending human decision |
| 7c | Cannot certify | Escalate to human review | Pending human decision |
| 8 | Decision Aggregator | Combines prediction with adversarial risk score | Final decision + metadata |
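
Steps 5 to 7 of the primary flow can be sketched as a small router. The score normalisation to [0, 1], the use of the maximum as the aggregation rule, and all threshold values are assumptions for illustration; the pattern itself does not prescribe them.

```python
from enum import Enum


class Route(Enum):
    PRIMARY_MODEL = "primary_model"
    ENSEMBLE = "ensemble_verification"
    CERTIFIED = "certified_defence"
    HUMAN_REVIEW = "human_review"
    DECISION = "decision_aggregator"


def aggregate_anomaly(instability, distance, proximity):
    """Step 5 (assumed rule): the most suspicious of three scores in [0, 1] wins."""
    return max(instability, distance, proximity)


def route_input(score, medium_threshold=0.3, high_threshold=0.7):
    """Steps 6a-6c: choose the defence path from the aggregated score."""
    if score >= high_threshold:
        return Route.CERTIFIED
    if score >= medium_threshold:
        return Route.ENSEMBLE
    return Route.PRIMARY_MODEL


def next_step(route, ensemble_agreement=None, certified_radius=None, min_agreement=0.8):
    """Steps 7a-7c: pass on to the decision aggregator or escalate to a human."""
    if route is Route.ENSEMBLE:
        return Route.DECISION if ensemble_agreement >= min_agreement else Route.HUMAN_REVIEW
    if route is Route.CERTIFIED:
        # certified_radius is None when the certified defence could not certify.
        return Route.DECISION if certified_radius is not None else Route.HUMAN_REVIEW
    return Route.DECISION


score = aggregate_anomaly(instability=0.1, distance=0.5, proximity=0.2)
path = route_input(score)
outcome = next_step(path, ensemble_agreement=0.67)
```

In this example a medium anomaly score sends the input to ensemble verification, and two-out-of-three agreement falls below the assumed threshold, so the input is escalated.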

8. Security Considerations

OWASP LLM Top 10 Coverage

| OWASP LLM Risk | Adversarial Input Defence Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Anomaly detection catches adversarially crafted text inputs that target model decision boundaries | Medium |
| LLM02: Insecure Output Handling | Ensemble disagreement detection can flag anomalous outputs | Medium |
| LLM03: Training Data Poisoning | Red-teaming detects degraded robustness that may indicate training data poisoning | Medium |
| LLM04: Model Denial of Service | Certified defence (N evaluations) is computationally expensive; protect with rate limiting | Low |
| LLM05: Supply Chain Vulnerabilities | Not directly applicable | None |
| LLM06: Sensitive Information Disclosure | Not directly applicable | None |
| LLM07: Insecure Plugin Design | Not directly applicable | None |
| LLM08: Excessive Agency | Not directly applicable | None |
| LLM09: Overreliance | Adversarial risk score in output provides explicit signal against overreliance for adversarial inputs | High |
| LLM10: Model Theft | Model inversion defences limit information returned per query; ensemble inconsistency detection | Medium |

Encryption

  • All inter-component communication TLS 1.3.
  • Human review queue content encrypted at rest (may contain sensitive inputs).
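
For components written in Python, the TLS 1.3 requirement could be enforced on the client side roughly as follows. This is a sketch using the standard library `ssl` module; the function name is hypothetical, and certificate provisioning and server-side configuration are out of scope here.

```python
import ssl


def make_tls13_client_context():
    """Client context that verifies certificates and refuses anything below TLS 1.3."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


context = make_tls13_client_context()
```

The resulting context can be passed to an HTTP or gRPC client that accepts an `ssl.SSLContext`, so that connections to peers offering only older protocol versions fail at the handshake.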

9. Governance Considerations

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Robustness Benchmark Report | AI Risk Team | Weekly (automated) + monthly (reviewed) | Tracks adversarial accuracy vs baseline; identifies degradation |
| Human Review Escalation Log | AI Operations | Continuous; weekly review | Record of all inputs escalated for human review; patterns |
| Red Team Exercise Report | Security Team | Quarterly | Documents adversarial attack scenarios tested; findings; remediation |
| Model Robustness Validation (EU AI Act) | AI Risk + Compliance | Pre-deployment; annually | Evidence of robustness testing for high-risk AI Act compliance |
| Anomaly Detection Calibration | AI Platform | Monthly | Reviews anomaly threshold tuning; false positive/negative rates |

10. Operational Considerations

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Adversarial detection rate (known attacks) | >90% of test attack suite flagged | Weekly automated red-team suite |
| False positive rate (clean inputs flagged) | <2% | Monthly audit of human review queue |
| Certified defence latency (p99) | <2s (N=100 evaluations) | Defence latency metric |
| Ensemble verification latency (p99) | <500ms (3 models × 100ms) | Ensemble span |
| Robustness regression gate | Block if adversarial accuracy drops >5% vs baseline | Automated gate in model deployment pipeline |
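
The first two SLOs reduce to simple ratios over labelled outcomes. The sketch below assumes the red-team suite and the monthly audit each yield a list of booleans indicating whether an input was flagged; the function names and sample data are hypothetical.

```python
def detection_rate(flags_on_attacks):
    """Share of known adversarial test inputs that the pipeline flagged."""
    return sum(flags_on_attacks) / len(flags_on_attacks)


def false_positive_rate(flags_on_clean):
    """Share of clean inputs that were wrongly flagged as adversarial."""
    return sum(flags_on_clean) / len(flags_on_clean)


def slo_report(flags_on_attacks, flags_on_clean, min_detection=0.90, max_fpr=0.02):
    """Compare measured rates against the SLO targets (>90% detection, <2% FPR)."""
    detected = detection_rate(flags_on_attacks)
    fpr = false_positive_rate(flags_on_clean)
    return {
        "detection_rate": detected,
        "detection_ok": detected > min_detection,
        "false_positive_rate": fpr,
        "false_positive_ok": fpr < max_fpr,
    }


# Hypothetical results: 19 of 20 attacks flagged, 3 of 100 clean inputs flagged.
report = slo_report([True] * 19 + [False], [False] * 97 + [True] * 3)
```

In this example the detection target is met but the false positive target is breached, which under this pattern would prompt a review of the anomaly thresholds.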

Incident Management

  • Adversarial attack pattern detected in production inputs → P2: Security + AI team investigation; pattern analysis; defence update.
  • Adversarial accuracy below threshold → P2: Block model update deployment; investigate training pipeline.
  • Human review queue depth > N hours backlog → Alert to AI operations team.

11. Cost Considerations

Cost Drivers

| Cost Driver | Description | Relative Impact |
|---|---|---|
| Ensemble inference | N models × inference cost per request | High |
| Certified defence compute | N=100–1000 model evaluations per prediction | Very High |
| Red-teaming infrastructure | Attack generation compute; model evaluation at scale | Medium |
| Human reviewer time | Reviewing escalated uncertain inputs | Medium |
| Anomaly detection compute | Feature extraction and distance computation | Low–Medium |
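
Because ensemble and certified paths are applied selectively, the inference multiplier depends on how traffic splits across routes. The arithmetic below is a sketch with hypothetical traffic shares; it counts model evaluations only and ignores anomaly detection, red-teaming, and reviewer costs.

```python
def expected_evaluations(share_low, share_medium, share_high,
                         ensemble_size=3, smoothing_samples=100):
    """Average model evaluations per request given the routing mix."""
    if abs(share_low + share_medium + share_high - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return (share_low * 1                      # primary model only
            + share_medium * ensemble_size     # one call per ensemble member
            + share_high * smoothing_samples)  # N noisy evaluations


# Hypothetical mix: 95% low risk, 4% medium, 1% high.
multiplier = expected_evaluations(0.95, 0.04, 0.01)
```

Under these assumed shares, the 1% of traffic routed to certified defence accounts for roughly half of all model evaluations, which is why the certified path is reserved for high-risk decisions.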

Indicative Cost Range

| Scale | Monthly Cost (USD) | Notes |
|---|---|---|
| Small (< 100K predictions/day) | $500–$2,000 | 3-model ensemble; sampling-based anomaly detection |
| Medium (100K–1M predictions/day) | $3,000–$15,000 | Full ensemble; selective certified defence; GPU inference |
| Large (> 1M predictions/day) | $20,000–$80,000 | Distributed ensemble; GPU cluster for certified defence; dedicated red-team |

12. Trade-Off Analysis

Option Comparison

| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Adversarial training only | Train model on adversarial examples to improve robustness | Improved robustness with no inference overhead | No mathematical guarantee; doesn't defend against all attack types; accuracy tradeoff | General robustness improvement; low-risk applications |
| B: Anomaly detection only | Statistical detection of adversarial inputs; no certified defence | Low inference overhead; practical at scale | No guarantees; may be evaded by adaptive attackers | Most production deployments; reasonable threat model |
| C: Full certified + ensemble (this pattern) | Statistical detection + ensemble verification + randomised smoothing | Strongest defence; mathematical guarantees for high-risk decisions | High inference cost (N× evaluations); complex ops | High-risk AI systems; EU AI Act high-risk classification |
| D: Human review only | All uncertain inputs reviewed by human | No automated decision path for an adversarial input to exploit | Not scalable; high latency; expensive | Decisions too important to automate at all |

Architectural Tensions

| Tension | Trade-Off |
|---|---|
| Security vs Latency | Certified defence (N=512 evaluations) adds significant latency. Resolution: apply certified defence only to high-risk decisions; use anomaly detection for real-time paths. |
| Guaranteed Robustness vs Accuracy | Certified defences typically sacrifice 5–15% clean accuracy for mathematical robustness guarantees. Resolution: evaluate tradeoff per use case; for high-stakes decisions, accuracy tradeoff is acceptable. |
| Defence Completeness vs Adaptive Attackers | Publicly known defences can be circumvented by adaptive attackers who craft attacks specifically against the defence. Resolution: defence diversity (don't publish exact defence configuration); continuous red-teaming. |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Adaptive attacker bypasses known defences | Medium | High | Red-team detects; human review catches consequence | Update defences; increase ensemble diversity; tighten anomaly thresholds |
| Certified defence unable to certify (radius too small) | Medium | Medium (escalates to human) | Cannot-certify rate metric | Acceptable if human review is adequate; indicates model near-boundary |
| False positive spike (clean inputs flagged as adversarial) | Medium | High (legitimate decisions delayed) | FPR metric; human review queue depth | Threshold tuning; model retraining |
| Human review queue overflow | Medium | High (escalated inputs unreviewed) | Queue depth metric | Scale human review team; emergency: automated fallback with logging |
| Ensemble model diverges from primary (legitimate accuracy drop) | Low | Medium (increased escalation rate) | Ensemble agreement rate drop | Retrain ensemble models; investigate training drift |

14. Regulatory Considerations

| Regulation | Requirement | Implementation |
|---|---|---|
| EU AI Act Art. 15 (Robustness and Accuracy) | High-risk AI systems must be resilient to attempts to alter outputs or performance | Adversarial defence directly implements robustness requirement; red-team evidence demonstrates testing |
| EU AI Act Art. 15 (Technical Robustness) | Implemented by design; validated through testing | Continuous red-teaming and robustness benchmarking provides test evidence |
| NIST AI RMF MANAGE 1.3 | Responses to identified risks monitored and adjusted | Feedback loop from human review to anomaly detection implements continuous management |
| ISO/IEC 42001 §8.4 (Incident Management) | AI system incidents monitored and managed | Adversarial attack detection and escalation implements §8.4 |
| APRA CPS234 §21 | Controls commensurate with threat environment | For regulated AI systems, adversarial defences are a commensurate control for adversarial threat actors |

15. Reference Implementations

AWS

| Component | AWS Service / OSS |
|---|---|
| Anomaly detection | SageMaker Model Monitor (data drift) + custom feature-space anomaly |
| Ensemble inference | Multiple SageMaker endpoints; SageMaker inference pipeline |
| Certified defence | Custom Lambda function (randomised smoothing implementation) |
| Red-teaming | SageMaker Processing + Adversarial Robustness Toolbox (IBM ART) |
| Human review | Amazon A2I (Augmented AI) for human review workflows |

Azure

| Component | Azure Service / OSS |
|---|---|
| Anomaly detection | Azure ML data drift detection + custom Mahalanobis |
| Ensemble | Multiple Azure ML endpoints; Azure ML pipelines |
| Certified defence | Custom Azure Functions (randomised smoothing) |
| Red-teaming | Azure ML automated ML + Foolbox |

On-Premises

| Component | Technology |
|---|---|
| Anomaly detection | Alibi Detect (open source) |
| Ensemble | Multiple model servers (Triton Inference Server) |
| Certified defence | Custom PyTorch implementation (Cohen et al. randomised smoothing) |
| Red-teaming | IBM Adversarial Robustness Toolbox (ART) |
| Human review | Custom review UI + Kafka task queue |

16. Related Patterns

| Pattern | ID | Relationship |
|---|---|---|
| Prompt Firewall | EAAPL-SEC002 | SEC002 covers NLP adversarial attacks (prompt injection); SEC010 covers ML-level adversarial examples |
| Model Isolation | EAAPL-SEC003 | Isolation limits blast radius when adversarial bypass succeeds |
| AI Telemetry | EAAPL-OBS001 | Robustness metrics are collected through telemetry infrastructure |
| Model Drift Detection | EAAPL-OBS005 | Adversarial input patterns can cause observable model output drift |
| AI Performance Benchmarking | EAAPL-OBS008 | Adversarial accuracy benchmarking is an extension of SEC010's red-teaming |

17. Maturity Assessment

Overall Maturity: Emerging

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 3 | Strong academic foundation; production patterns still evolving |
| Technology availability | 3 | OSS tooling (ART, Alibi Detect) available; production-grade deployment patterns less mature |
| Industry adoption | 2 | Limited to most security-mature organisations and specific high-risk domains |
| Certified defence practicality | 2 | Randomised smoothing practical only for specific model types; accuracy tradeoffs limit adoption |
| Regulatory alignment | 4 | EU AI Act Art. 15 creates direct regulatory driver |
| Research maturity | 5 | Deep academic literature (ICML, NeurIPS, ICLR); active research community |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-01-20 | Security Architecture Team | Initial pattern definition; reflects current state of emerging practice |