EAAPL-SEC010 · Proven

Adversarial Input Defence

AI Security · APRA CPS234 · EU AI Act · Field-tested in AU

[EAAPL-SEC010] Adversarial Input Defence

Category: Security / Model Robustness
Sub-category: Adversarial ML Defence
Version: 1.0
Maturity: Emerging
Tags: adversarial-ml model-robustness anomaly-detection certified-defence ensemble-verification adversarial-examples
Regulatory Relevance: EU AI Act Art. 15 (Robustness), NIST AI RMF MANAGE 1.3, APRA CPS234 §21, ISO 42001 §8.4

1. Executive Summary

Adversarial Input Defence addresses a class of attacks that manipulate AI model behaviour through carefully crafted inputs. Unlike prompt injection (natural language manipulation, addressed by EAAPL-SEC002), these attacks rely on mathematical perturbations and statistical exploitation of model decision boundaries. They can cause image classifiers to misidentify objects, speech recognition systems to transcribe incorrect commands, and ML-based fraud detection systems to approve fraudulent transactions.

While adversarial examples have been studied primarily in computer vision and NLP classification contexts, they represent a growing concern for any AI system whose outputs influence high-stakes decisions. A fraud detection model, a document classification system, or an AI-powered identity verification system can all be targeted with adversarial inputs crafted to produce specific incorrect outputs.

This pattern defines the defence architecture: anomaly detection on inputs (detecting statistically unusual inputs that may be adversarial), certified defences (mathematical guarantees that the model's prediction does not change within a stated perturbation radius), ensemble verification (using multiple models to detect inconsistency that indicates adversarial manipulation), model robustness testing (continuous adversarial red-teaming), and human escalation for high-uncertainty decisions. The maturity rating is Emerging: the field is active, with strong academic foundations but limited production deployment at scale.

2. Problem Statement

Business Problem

Organisations deploying AI in high-stakes decision-making contexts (fraud detection, identity verification, content moderation, medical imaging, access control) face a targeted attack vector: adversaries who understand the AI system can craft inputs specifically designed to obtain a desired incorrect output. Unlike random errors, adversarial examples are not caught by accuracy metrics — the model performs well on normal data but can be reliably fooled on specially crafted inputs.

Documented adversarial attack scenarios in production AI:

Adversarial patches on physical objects causing object detectors to misclassify items.
Adversarial perturbations in audio causing speech systems to recognise hidden commands.
Adversarial text modifications causing spam or phishing filters to classify malicious content as benign.
Model inversion attacks recovering training data (including personal data) from model outputs.

Technical Problem

Neural networks learn complex non-linear decision boundaries. These boundaries have a property that makes them vulnerable: small, carefully calculated perturbations to an input can move it across a decision boundary, changing the model's prediction dramatically while the input remains perceptually identical to a human. The perturbation is not random noise — it is calculated to maximally exploit the model's specific decision boundary.
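The mechanism described above can be illustrated on a toy model. The following is a minimal sketch, assuming a hypothetical four-feature linear classifier whose weights, bias, and input values are invented purely for illustration. It applies the fast gradient sign idea: for a linear model the gradient of the score with respect to the input is the weight vector, so shifting every feature by a small epsilon against the sign of its weight is the most damaging perturbation of that size.

```python
# Toy linear classifier (all numbers are hypothetical, for illustration only):
# score = w . x + b, predicted label is 1 when the score is positive.
w = [0.9, -0.6, 0.4, -0.8]
b = -0.05


def score(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x):
    return 1 if score(x) > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_linear(x, epsilon):
    """Shift each feature by epsilon in the direction that lowers the score.

    For a linear model the gradient of the score w.r.t. x is w itself, so the
    worst-case perturbation with max-norm epsilon is -epsilon * sign(w).
    """
    return [xi - epsilon * sign(wi) for xi, wi in zip(x, w)]


x_clean = [0.5, 0.4, 0.3, 0.2]
x_adv = fgsm_linear(x_clean, epsilon=0.05)

print("clean:", predict(x_clean), "adversarial:", predict(x_adv))
```

No feature moves by more than 0.05, yet the score drops by epsilon times the sum of the absolute weights, which is enough here to cross the decision boundary. Deep networks are not linear, but the same gradient-following logic underlies attacks against them.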

Defence is mathematically challenging because:

The adversary has (or can approximate) access to the model's decision function.
Certified defences that provide mathematical guarantees typically come with accuracy/performance tradeoffs.
Adversarial training (training on adversarial examples) improves robustness but does not provide guarantees.
Transfer attacks allow adversarial examples crafted against one model to fool related models.

Symptoms

AI system producing unexpected outputs on edge-case inputs.
Model performance significantly better on clean test data than production data.
Identical content classified differently based on subtle metadata variations.
Suspiciously correlated false negatives in fraud detection or content moderation.

Cost of Inaction

Dimension	Impact
Financial	Adversarial fraud detection bypass enabling financial crime
Security	Adversarial identity verification bypass enabling unauthorised access
Regulatory	EU AI Act Art. 15 requires robustness for high-risk AI systems — ungoverned adversarial risk is a compliance gap
Safety	Medical AI systems producing incorrect outputs on adversarially crafted medical images
Reputational	Demonstrated adversarial bypass of AI security control causes significant trust erosion

3. Context

When to Apply

AI systems making high-stakes decisions in adversarial environments (fraud detection, access control, content moderation).
AI systems where adversaries have financial or strategic incentive to obtain specific incorrect outputs.
Image/audio classification systems in physical security contexts.
AI-based authentication or identity verification systems.
Regulated high-risk AI systems where EU AI Act Art. 15 robustness requirements apply.

When NOT to Apply

AI systems in low-stakes, non-adversarial environments where adversarial input construction is not a plausible threat model.
Generative LLM text completion where adversarial example attacks are less directly applicable (prompt injection, addressed by SEC002, is the more relevant vector).
Proof-of-concept systems not processing real adversarial inputs.

Prerequisites

Prerequisite	Detail
Threat Model	Documented threat model identifying adversarial input construction as a plausible attack vector
Model Architecture Access	Ability to run inference through multiple models for ensemble verification
Input Distribution Understanding	Baseline statistical distribution of legitimate inputs for anomaly detection
Red Team Capability	Ability to run adversarial robustness tests against production models

Industry Applicability

Industry	Applicability	Key Driver
Financial Services (Fraud Detection)	High	Adversarial fraud bypass; significant financial incentive for attackers
Physical Security / Access Control	High	Adversarial image/facial recognition bypass
Healthcare (Medical Imaging AI)	High	Safety-critical; EU AI Act high-risk classification
Content Moderation	High	Evading moderation filters to distribute harmful content
Autonomous Systems	Critical	Physical-world adversarial attacks (traffic sign manipulation)
Government / National Security	Critical	Adversarial attacks by sophisticated threat actors

4. Architecture Overview

Adversarial input defence is implemented as a defence-in-depth architecture where no single control is relied upon exclusively — adversarial defence is inherently probabilistic, and the combination of controls provides significantly better coverage than any individual technique.

Layer 1: Statistical Input Anomaly Detection

The first layer measures the statistical distance of an incoming input from the distribution of legitimate inputs seen during training and operation. Adversarial examples often have detectable statistical properties: unnaturally smooth perturbation patterns, unusual pixel value distributions, or feature representations that cluster near decision boundaries rather than in the interior of class regions. Detection techniques:

Feature squeezing: Apply smoothing filters to the input; if the model's prediction changes significantly after smoothing, the input may be adversarial (adversarial perturbations are sensitive to smoothing).
Input distribution monitoring: Compute the Mahalanobis distance from the input to the training distribution in feature space; flag inputs with unusually large distance.
Boundary proximity detection: Measure how close the input's feature representation is to the model's decision boundary; inputs deliberately close to the boundary (but not crossing it) may be adversarial probes.
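Two of the detection techniques above can be sketched in a few lines. This is an illustrative sketch only: `toy_model` is a hypothetical stand-in for the real classifier, inputs are assumed to be scaled to [0, 1], and the Mahalanobis helper is restricted to two-dimensional feature vectors so that the covariance matrix can be inverted by hand. A production detector would work on the model's real feature space and would normally use a numerical library.

```python
import math


def squeeze_bit_depth(x, bits=3):
    """Feature squeezing: quantise features in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in x]


def toy_model(x):
    """Hypothetical classifier used for illustration: returns P(class 1)."""
    w = [4.0, -4.0, 4.0, -4.0]
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def instability_score(x, model=toy_model, bits=3):
    """How much the prediction moves when the input is squeezed."""
    return abs(model(x) - model(squeeze_bit_depth(x, bits)))


def fit_gaussian_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D feature vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(p, mean, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# A clean input that sits on the quantisation grid, and a perturbed copy.
clean = [3 / 7, 3 / 7, 2 / 7, 2 / 7]
perturbed = [clean[0] + 0.06, clean[1] - 0.06, clean[2] + 0.06, clean[3] - 0.06]
print(instability_score(clean), instability_score(perturbed))

mean, cov = fit_gaussian_2d([(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)])
print(mahalanobis_2d((4, 5), mean, cov))
```

Squeezing snaps the perturbed input back onto the grid, so its prediction shifts noticeably while the clean input's prediction does not. The thresholds applied to either score are a calibration decision and are not specified by this pattern.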

Layer 2: Certified Defences

Certified defences provide mathematical guarantees: for inputs within a specified perturbation radius (e.g., L∞ norm ≤ 0.01), the model's prediction is certified to be stable. The state of the practice for certified defences:

Randomised smoothing: The model is queried on multiple copies of the input with added Gaussian noise; the majority vote, combined with a statistical confidence bound on the vote share, yields a prediction that is certified against L2-norm perturbations up to a computed radius. This is the most practical certified defence for large models.
Lipschitz-constrained networks: Models trained with constraints on their Lipschitz constant provide formal stability guarantees but with accuracy/capacity tradeoffs.
Certified defences are computationally expensive (randomised smoothing requires N model evaluations per prediction) and are applied selectively to high-risk decision contexts.
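The randomised smoothing procedure can be sketched as follows. This is a simplified illustration in the spirit of Cohen et al., not their reference implementation: `base_classifier` is a hypothetical toy model, the sample sizes and `alpha` are arbitrary, and a Hoeffding bound is substituted for the Clopper-Pearson interval used in the paper (it is more conservative but needs only the standard library). The certified L2 radius is sigma times the inverse normal CDF of the lower bound on the top-class probability.

```python
import math
import random
from statistics import NormalDist


def base_classifier(x):
    """Hypothetical base model: class 1 when the feature sum is positive."""
    return 1 if sum(x) > 0 else 0


def noisy_votes(x, sigma, n, rng):
    """Classify n Gaussian-noised copies of x and count votes per class."""
    counts = {}
    for _ in range(n):
        label = base_classifier([xi + rng.gauss(0.0, sigma) for xi in x])
        counts[label] = counts.get(label, 0) + 1
    return counts


def smoothed_predict(x, sigma=0.5, n_select=100, n_estimate=1000,
                     alpha=0.001, seed=0):
    """Return (label, certified_l2_radius), or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    # Step 1: pick the candidate class from a small, separate sample.
    selection = noisy_votes(x, sigma, n_select, rng)
    candidate = max(selection, key=selection.get)
    # Step 2: estimate the candidate's vote share on fresh samples.
    estimate = noisy_votes(x, sigma, n_estimate, rng)
    p_hat = estimate.get(candidate, 0) / n_estimate
    # One-sided Hoeffding lower bound, valid with probability >= 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_estimate))
    if p_lower <= 0.5:
        return None, 0.0  # cannot certify: escalate instead of guessing
    p_lower = min(p_lower, 1.0 - 1e-9)  # keep inv_cdf inside its domain
    return candidate, sigma * NormalDist().inv_cdf(p_lower)


label, radius = smoothed_predict([1.0, 1.0])
print(label, round(radius, 3))
```

The abstain branch corresponds to the cannot-certify outcome that this pattern routes to human review. Each call costs `n_select + n_estimate` model evaluations, which is why the technique is reserved for high-risk decisions.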

Layer 3: Ensemble Verification

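An adversarial example crafted to fool one model is less likely to fool an independent model with a different architecture, training data order, or random initialisation. Transfer attacks (see Technical Problem) mean this protection is partial, so ensemble diversity reduces the risk rather than eliminating it. Ensemble verification queries multiple independent models and compares their outputs: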

High agreement across models → high confidence prediction.
Low agreement → possible adversarial input or genuine ambiguity; flag for human review or apply certified defence.
This technique detects both adversarial examples and genuine edge cases where model confidence should be low.
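The agreement check can be expressed very compactly. The sketch below is illustrative only: the function name, the returned fields, and the 0.8 agreement threshold are assumptions chosen for the example, not values defined by this pattern.

```python
from collections import Counter


def ensemble_verdict(predictions, min_agreement=0.8):
    """Compare labels from independent models and decide how to proceed.

    predictions   -- list with one predicted label per ensemble member
    min_agreement -- hypothetical threshold on the majority vote share
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(predictions)
    action = "accept" if agreement >= min_agreement else "escalate"
    # The majority label is kept even on escalation, as context for the reviewer.
    return {"label": label, "agreement": agreement, "action": action}


print(ensemble_verdict(["fraud", "fraud", "fraud"]))
print(ensemble_verdict(["fraud", "fraud", "legit"]))
```

With three models and a 0.8 threshold, only unanimous votes are accepted automatically. Larger ensembles allow a less strict threshold at higher inference cost.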

Layer 4: Continuous Adversarial Red-Teaming

The adversarial landscape evolves. Static defences become less effective as attackers adapt to known defence techniques. Continuous red-teaming:

Automated adversarial example generation (FGSM, PGD, and Carlini-Wagner (C&W) attacks) against production models on a weekly schedule.
Robustness metrics tracked over time: clean accuracy, adversarial accuracy, certified accuracy.
Regression gates: model deployment blocked if adversarial accuracy drops below threshold relative to baseline.
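A regression gate of this kind reduces to a small comparison. The following sketch assumes that the threshold is an absolute drop in adversarial accuracy measured in percentage points, and uses the 5% figure from the SLO table in section 10 as its default. The function names and the `(input, label)` pair format are hypothetical.

```python
def adversarial_accuracy(model, adversarial_examples):
    """Share of adversarial (input, true_label) pairs the model still gets right."""
    correct = sum(1 for x, y in adversarial_examples if model(x) == y)
    return correct / len(adversarial_examples)


def regression_gate(candidate_acc, baseline_acc, max_drop=0.05):
    """Block deployment when adversarial accuracy falls too far below baseline."""
    drop = baseline_acc - candidate_acc
    return {"drop": drop, "deploy_allowed": drop <= max_drop}


# Hypothetical model and attack suite, for illustration only.
def toy_model(x):
    return 1 if x > 0 else 0


suite = [(0.4, 1), (-0.2, 0), (0.1, 0), (-0.3, 1)]
print(adversarial_accuracy(toy_model, suite))
print(regression_gate(candidate_acc=0.62, baseline_acc=0.70))
```

In a deployment pipeline the gate result would typically fail the build step. Whether the drop is measured in absolute or relative terms should be fixed in the governance artefacts so that the gate and the reports agree.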

Human Escalation

For inputs that trigger anomaly detection or show low ensemble agreement, escalation to human review is the ultimate defence. The escalation mechanism:

High-risk decisions with adversarial indicators are queued for human review rather than automated action.
Human review result is fed back into the anomaly detection baseline to improve future detection.

5. Architecture Diagram

flowchart TD
    subgraph Detection["Defence Layers"]
        A[Incoming Input]
        B[Anomaly Detector]
        C{Risk Score}
        D[Ensemble Verification]
    end
    subgraph Decision["Decision + Escalation"]
        E[Primary Model]
        F[Human Review Queue]
        G[Final Decision]
    end
    subgraph Monitoring["Continuous Red-Teaming"]
        H[Attack Generator]
        I[Robustness Metrics]
    end
    A --> B --> C
    C -->|low risk| E --> G
    C -->|medium risk| D
    C -->|high risk| F
    D -->|agreement| G
    D -->|disagreement| F
    H --> I -->|regression gate| E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#fee2e2,stroke:#ef4444
    style G fill:#d1fae5,stroke:#10b981
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Feature Squeezer	Anomaly Detection	Applies smoothing transformations; detects prediction instability	Custom pre-processing (median filter, bit-depth reduction)	High
Distribution Monitor	Anomaly Detection	Measures Mahalanobis distance from input to training distribution in feature space	Custom PyTorch/TF layer; Alibi Detect	High
Boundary Proximity Detector	Anomaly Detection	Estimates distance to decision boundary; flags inputs near boundary	Custom; requires access to model gradients or surrogate model	Medium
Randomised Smoothing Service	Certified Defence	Runs N noisy model evaluations; computes certified prediction	Smoothed classifiers (Cohen et al.); custom implementation	Medium
Ensemble Inference	Ensemble	Queries multiple independent models; computes agreement	Multiple model endpoints (same framework, different architecture/seed)	High
Human Review Queue	Escalation	Queues uncertain/suspicious inputs for human decision	ServiceNow, Jira, custom review UI	High
Adversarial Attack Generator	Red-Teaming	Generates adversarial examples for automated robustness testing	Adversarial Robustness Toolbox (IBM ART), Foolbox, CleverHans	High
Robustness Dashboard	Observability	Tracks clean accuracy, adversarial accuracy, certified accuracy over time	Grafana + Prometheus; MLflow metrics	High
Feedback Loop	ML Operations	Incorporates human review decisions into anomaly detection baseline	Custom pipeline; ML feature store	Medium

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Input	Arrives at adversarial defence pipeline	Raw input (image/audio/text/structured)
2	Feature Squeezer	Applies smoothing; compares predictions before/after	Instability score
3	Distribution Monitor	Computes feature-space distance from training distribution	Mahalanobis distance score
4	Boundary Proximity	Estimates proximity to decision boundary	Proximity score
5	Anomaly Aggregator	Combines scores; determines routing	Low/Medium/High anomaly score
6a	If Low score	Route directly to primary model	Standard inference result
6b	If Medium score	Route to ensemble verification (3+ models)	Agreement level + predictions
6c	If High score	Route to certified defence (randomised smoothing)	Certified prediction + radius OR cannot-certify
7a	High ensemble agreement	Pass to decision aggregator	Confident prediction
7b	Low ensemble agreement	Escalate to human review	Pending human decision
7c	Cannot certify	Escalate to human review	Pending human decision
8	Decision Aggregator	Combines prediction with adversarial risk score	Final decision + metadata
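The routing in steps 5 to 7 can be sketched as plain control flow. Everything specific in this example is an assumption made for illustration: the three scores are taken to be normalised to [0, 1], and the weights, the 0.3 and 0.7 thresholds, and the route names are hypothetical placeholders that a real deployment would calibrate.

```python
def aggregate_risk(instability, distance, proximity, weights=(0.4, 0.4, 0.2)):
    """Step 5: combine the three detector scores into one anomaly score."""
    signals = (instability, distance, proximity)
    return sum(w * s for w, s in zip(weights, signals))


def route(risk, medium_threshold=0.3, high_threshold=0.7):
    """Steps 6a-6c: pick the processing path from the anomaly score."""
    if risk >= high_threshold:
        return "certified_defence"
    if risk >= medium_threshold:
        return "ensemble_verification"
    return "primary_model"


def resolve(route_name, ensemble_agrees=False, certified=False):
    """Steps 7a-7c: escalate when the chosen defence cannot vouch for the input."""
    if route_name == "ensemble_verification" and not ensemble_agrees:
        return "human_review"
    if route_name == "certified_defence" and not certified:
        return "human_review"
    return "decision_aggregator"


risk = aggregate_risk(instability=0.9, distance=0.8, proximity=0.5)
path = route(risk)
print(round(risk, 2), path, resolve(path, certified=False))
```

Defaulting `ensemble_agrees` and `certified` to `False` makes the sketch fail closed: an input on a defended path reaches the decision aggregator only when the defence explicitly vouches for it.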

8. Security Considerations

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Adversarial Input Defence Mitigation	Coverage
LLM01: Prompt Injection	Anomaly detection catches adversarially crafted text inputs that target model decision boundaries	Medium
LLM02: Insecure Output Handling	Ensemble disagreement detection can flag anomalous outputs	Medium
LLM03: Training Data Poisoning	Red-teaming detects degraded robustness that may indicate training data poisoning	Medium
LLM04: Model Denial of Service	Certified defence (N evaluations) is computationally expensive — protect with rate limiting	Low
LLM05: Supply Chain Vulnerabilities	Not directly applicable	None
LLM06: Sensitive Information Disclosure	Not directly applicable	None
LLM07: Insecure Plugin Design	Not directly applicable	None
LLM08: Excessive Agency	Not directly applicable	None
LLM09: Overreliance	Adversarial risk score in output provides explicit signal against overreliance for adversarial inputs	High
LLM10: Model Theft	Model inversion defences limit information returned per query; ensemble inconsistency detection	Medium

Encryption

All inter-component communication uses TLS 1.3.
Human review queue content encrypted at rest (may contain sensitive inputs).

9. Governance Considerations

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Robustness Benchmark Report	AI Risk Team	Weekly (automated) + monthly (reviewed)	Tracks adversarial accuracy vs baseline; identifies degradation
Human Review Escalation Log	AI Operations	Continuous; weekly review	Record of all inputs escalated for human review; patterns
Red Team Exercise Report	Security Team	Quarterly	Documents adversarial attack scenarios tested; findings; remediation
Model Robustness Validation (EU AI Act)	AI Risk + Compliance	Pre-deployment; annually	Evidence of robustness testing for high-risk AI Act compliance
Anomaly Detection Calibration	AI Platform	Monthly	Reviews anomaly threshold tuning; false positive/negative rates

10. Operational Considerations

SLOs

SLO	Target	Measurement
Adversarial detection rate (known attacks)	>90% of test attack suite flagged	Weekly automated red-team suite
False positive rate (clean inputs flagged)	<2%	Monthly audit of human review queue
Certified defence latency (p99)	<2s (N=100 evaluations)	Defence latency metric
Ensemble verification latency (p99)	<500ms (3 models × 100ms)	Ensemble span
Robustness regression gate	Block if adversarial accuracy drops >5% vs baseline	Automated gate in model deployment pipeline

Incident Management

Adversarial attack pattern detected in production inputs → P2: Security + AI team investigation; pattern analysis; defence update.
Adversarial accuracy below threshold → P2: Block model update deployment; investigate training pipeline.
Human review queue depth > N hours backlog → Alert to AI operations team.

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Relative Impact
Ensemble inference	N models × inference cost per request	High
Certified defence compute	N=100–1000 model evaluations per prediction	Very High
Red-teaming infrastructure	Attack generation compute; model evaluation at scale	Medium
Human reviewer time	Reviewing escalated uncertain inputs	Medium
Anomaly detection compute	Feature extraction and distance computation	Low–Medium

Indicative Cost Range

Scale	Monthly Cost (USD)	Notes
Small (< 100K predictions/day)	$500–$2,000	3-model ensemble; sampling-based anomaly detection
Medium (100K–1M predictions/day)	$3,000–$15,000	Full ensemble; selective certified defence; GPU inference
Large (> 1M predictions/day)	$20,000–$80,000	Distributed ensemble; GPU cluster for certified defence; dedicated red-team

12. Trade-Off Analysis

Option Comparison

Option	Description	Pros	Cons	Best For
A: Adversarial training only	Train model on adversarial examples to improve robustness	Improved robustness with no inference overhead	No mathematical guarantee; doesn't defend against all attack types; accuracy tradeoff	General robustness improvement; low-risk applications
B: Anomaly detection only	Statistical detection of adversarial inputs; no certified defence	Low inference overhead; practical at scale	No guarantees; may be evaded by adaptive attackers	Most production deployments; reasonable threat model
C: Full certified + ensemble (this pattern)	Statistical detection + ensemble verification + randomised smoothing	Strongest defence; mathematical guarantees for high-risk decisions	High inference cost (N× evaluations); complex ops	High-risk AI systems; EU AI Act high-risk classification
D: Human review only	All uncertain inputs reviewed by human	Not exposed to model-level adversarial bypass; coverage limited only by reviewer accuracy	Not scalable; high latency; expensive	Decisions too important to automate at all

Architectural Tensions

Tension	Trade-Off
Security vs Latency	Certified defence (N=512 evaluations) adds significant latency. Resolution: apply certified defence only to high-risk decisions; use anomaly detection for real-time paths.
Guaranteed Robustness vs Accuracy	Certified defences typically sacrifice 5–15% clean accuracy for mathematical robustness guarantees. Resolution: evaluate tradeoff per use case; for high-stakes decisions, accuracy tradeoff is acceptable.
Defence Completeness vs Adaptive Attackers	Publicly known defences can be circumvented by adaptive attackers who craft attacks specifically against the defence. Resolution: defence diversity (don't publish exact defence configuration); continuous red-teaming.

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Adaptive attacker bypasses known defences	Medium	High	Red-team detects; human review catches consequence	Update defences; increase ensemble diversity; tighten anomaly thresholds
Certified defence unable to certify (radius too small)	Medium	Medium (escalates to human)	Cannot-certify rate metric	Acceptable if human review is adequate; indicates model near-boundary
False positive spike (clean inputs flagged as adversarial)	Medium	High (legitimate decisions delayed)	FPR metric; human review queue depth	Threshold tuning; model retraining
Human review queue overflow	Medium	High (escalated inputs unreviewed)	Queue depth metric	Scale human review team; emergency: automated fallback with logging
Ensemble model diverges from primary (legitimate accuracy drop)	Low	Medium (increased escalation rate)	Ensemble agreement rate drop	Retrain ensemble models; investigate training drift

14. Regulatory Considerations

Regulation	Requirement	Implementation
EU AI Act Art. 15 (Robustness and Accuracy)	High-risk AI systems must be resilient to attempts to alter outputs or performance	Adversarial defence directly implements robustness requirement; red-team evidence demonstrates testing
EU AI Act Art. 15 (Technical Robustness)	Implemented by design; validated through testing	Continuous red-teaming and robustness benchmarking provides test evidence
NIST AI RMF MANAGE 1.3	Responses to identified risks monitored and adjusted	Feedback loop from human review to anomaly detection implements continuous management
ISO/IEC 42001 §8.4 (Incident Management)	AI system incidents monitored and managed	Adversarial attack detection and escalation implements §8.4
APRA CPS234 §21	Controls commensurate with threat environment	For regulated AI systems, adversarial defences are a commensurate control for adversarial threat actors

15. Reference Implementations

AWS

Component	AWS Service / OSS
Anomaly detection	SageMaker Model Monitor (data drift) + custom feature-space anomaly
Ensemble inference	Multiple SageMaker endpoints; SageMaker inference pipeline
Certified defence	Custom Lambda function (randomised smoothing implementation)
Red-teaming	SageMaker Processing + Adversarial Robustness Toolbox (IBM ART)
Human review	Amazon A2I (Augmented AI) for human review workflows

Azure

Component	Azure Service / OSS
Anomaly detection	Azure ML data drift detection + custom Mahalanobis
Ensemble	Multiple Azure ML endpoints; Azure ML pipelines
Certified defence	Custom Azure Functions (randomised smoothing)
Red-teaming	Azure ML automated ML + Foolbox

On-Premises

Component	Technology
Anomaly detection	Alibi Detect (open source)
Ensemble	Multiple model servers (Triton Inference Server)
Certified defence	Custom PyTorch implementation (Cohen et al. randomised smoothing)
Red-teaming	IBM Adversarial Robustness Toolbox (ART)
Human review	Custom review UI + Kafka task queue

16. Related Patterns

Pattern	ID	Relationship
Prompt Firewall	EAAPL-SEC002	SEC002 covers NLP adversarial attacks (prompt injection); SEC010 covers ML-level adversarial examples
Model Isolation	EAAPL-SEC003	Isolation limits blast radius when adversarial bypass succeeds
AI Telemetry	EAAPL-OBS001	Robustness metrics are collected through telemetry infrastructure
Model Drift Detection	EAAPL-OBS005	Adversarial input patterns can cause observable model output drift
AI Performance Benchmarking	EAAPL-OBS008	Adversarial accuracy benchmarking is an extension of SEC010's red-teaming

17. Maturity Assessment

Overall Maturity: Emerging

Dimension	Score (1–5)	Rationale
Pattern definition clarity	3	Strong academic foundation; production patterns still evolving
Technology availability	3	OSS tooling (ART, Alibi Detect) available; production-grade deployment patterns less mature
Industry adoption	2	Limited to most security-mature organisations and specific high-risk domains
Certified defence practicality	2	Randomised smoothing practical only for specific model types; accuracy tradeoffs limit adoption
Regulatory alignment	4	EU AI Act Art. 15 creates direct regulatory driver
Research maturity	5	Deep academic literature (ICML, NeurIPS, ICLR); active research community

18. Revision History

Version	Date	Author	Changes
1.0	2025-01-20	Security Architecture Team	Initial pattern definition; reflects current state of emerging practice
