[EAAPL-SEC002] Prompt Firewall
Category: Security / Threat Prevention
Sub-category: Adversarial Input Defence
Version: 1.3
Maturity: Proven
Tags: prompt-injection jailbreak input-validation content-policy nlp-security classifier defence-in-depth
Regulatory Relevance: APRA CPS234, EU AI Act Art. 9 & 15, OWASP LLM01, NIST AI RMF MANAGE 1.3
1. Executive Summary
The Prompt Firewall is an inline defensive layer that inspects every user input and system-constructed prompt before it reaches a large language model. It detects and blocks prompt injection attacks, jailbreak attempts, policy violations, and adversarial instructions that seek to override the model's intended behaviour or extract sensitive information.
For business stakeholders, the risk is concrete: a single successful prompt injection can cause an AI-powered application to ignore its system instructions, impersonate another user, exfiltrate data from its context window, or generate harmful content. Each of these outcomes carries regulatory, reputational, and financial consequences. A prompt firewall blocks the large majority of known attack patterns (the detection SLO in Section 10 targets more than 99% of the red-team test suite) and significantly reduces the success rate of novel attacks through semantic analysis.
Unlike perimeter firewalls that operate on network packets, a prompt firewall operates on natural language. This requires a combination of rule-based detection (fast, deterministic, low false-positive), semantic similarity analysis (catches paraphrased attacks), and ML classifiers (catches novel attack classes). The pattern is deployed as an inline middleware stage, typically within the AI Gateway (EAAPL-SEC001). It typically adds 20–50ms of latency on the synchronous path, rising to 50–80ms when the Layer 3 classifier also runs inline (see Section 12), in exchange for a material reduction in successful prompt injection incidents.
2. Problem Statement
Business Problem
Organisations deploying AI assistants — customer service bots, internal productivity tools, code generation assistants — face an attack vector with no analogue in traditional software: natural language manipulation of the AI's behaviour. An attacker does not need to find a SQL injection vulnerability or exploit a buffer overflow. They need only craft a message that convinces the model to ignore its instructions, impersonate another user, or disclose information it should not.
High-profile incidents have demonstrated that even production LLM deployments from major vendors are vulnerable to prompt injection. The business consequences include: leakage of system prompts (containing proprietary logic or sensitive context), data exfiltration from the context window (e.g., previous conversation turns containing other users' data), generation of policy-violating content that causes regulatory exposure, and denial of service through resource-exhausting prompts.
Technical Problem
LLMs process user input and system instructions in the same channel (the prompt). Unlike a database that cleanly separates queries from data, an LLM cannot inherently distinguish between "authorised instruction" and "adversarial instruction embedded in user data." Any user-controlled text that reaches the model's context window is potentially an attack surface.
Prompt injection attacks take multiple forms: direct injection (attacker directly sends malicious instructions), indirect injection (attacker embeds malicious instructions in documents or web pages that the AI retrieves), jailbreaking (persuasion-based attempts to bypass safety training), role-play exploitation (convincing the model it is a different, unconstrained entity), and token manipulation (using special characters, encoding tricks, or unusual spacing to bypass simple pattern matching).
Symptoms
- AI application generating content that violates its stated purpose (e.g., a coding assistant generating phishing emails).
- System prompt contents appearing in model responses.
- Users reporting that the AI "acted differently" after an unusual input.
- AI application performing actions it was not instructed to perform by the application (in agentic contexts).
- Sudden spikes in content policy violations in output filtering logs.
Cost of Inaction
| Dimension | Impact |
|---|---|
| Regulatory | Disclosure of system prompt containing proprietary logic or PII; potential Privacy Act breach if user data exfiltrated from context |
| Reputational | Public demonstration of AI jailbreak attracts media attention; erodes user trust in AI-powered product |
| Financial | Regulatory fines; remediation costs; potential liability for AI-generated harmful content |
| Security | System prompt exfiltration reveals application architecture; can be used to craft more targeted attacks |
| Operational | Model abuse through resource-exhausting prompts drives API cost spikes and degraded availability for legitimate users |
3. Context
When to Apply
- Any AI application that accepts user-generated text as input to an LLM.
- AI applications operating in adversarial environments (public-facing, customer-facing, or accessible by untrusted internal users).
- Agentic systems where LLMs can invoke tools, APIs, or execute code — the consequences of injection are significantly higher.
- Applications where the system prompt contains sensitive instructions, proprietary logic, or confidential context.
- Regulated use cases where policy-violating outputs carry compliance risk.
When NOT to Apply
- Fully internal, developer-only AI tools where all users are trusted and the threat model does not include insider adversaries.
- Batch processing pipelines where inputs come exclusively from trusted, validated internal sources with no user-controlled content.
- Scenarios where the latency overhead (20–50ms) is prohibitive and alternative controls (strong output filtering) provide acceptable coverage.
Prerequisites
| Prerequisite | Detail |
|---|---|
| AI Gateway (EAAPL-SEC001) | Firewall is ideally deployed as a stage within the gateway; can also be deployed as an application-level middleware |
| Classifier Model | A fine-tuned text classifier or embedding similarity model for semantic analysis |
| Policy Definitions | Organisation's AI Acceptable Use Policy codified into firewall rules |
| Attack Pattern Library | Maintained library of known prompt injection and jailbreak patterns |
| Observability Stack | Logging and alerting infrastructure for firewall events |
Industry Applicability
| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services | Critical | Regulatory exposure from AI-assisted advice; system prompt exfiltration risk |
| Healthcare | Critical | Protected health information in context window; safety-critical AI outputs |
| Government | Critical | Classified information protection; adversarial nation-state threat actors |
| E-commerce / Retail | High | Customer-facing AI with promotional/pricing logic in system prompt |
| Technology / SaaS | High | Public-facing AI features; developer tools vulnerable to supply chain injection |
| Education | Medium | Minor users; content policy enforcement |
4. Architecture Overview
The Prompt Firewall is a multi-stage detection pipeline that processes every prompt before it reaches the LLM. The pipeline architecture is designed around a fundamental principle: layered defence with increasing cost and decreasing false-positive rate at each layer. Fast, cheap checks run first; expensive, accurate checks run only when cheap checks are inconclusive.
Layer 1: Pattern Matching (Deterministic)
The first layer operates on character and token sequences. It applies a library of regular expressions and exact-match patterns derived from a constantly updated catalogue of known injection strings, jailbreak templates, and policy-violating phrases. This layer executes in microseconds and catches the vast majority of script-kiddie attacks and known jailbreak variants. The pattern library is maintained as a versioned configuration artefact, updated through a CI/CD pipeline that incorporates patterns from public jailbreak repositories (JailbreakChat, LLM Security research) and internal incident findings.
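A minimal sketch of the Layer 1 check, assuming a small in-memory pattern set. The pattern IDs (PI-001 and so on) and the regular expressions are hypothetical illustrations, not entries from a real library, and the standard `re` module stands in for a high-throughput engine such as Hyperscan. Input is normalised first so that compatibility characters and zero-width characters do not trivially defeat the match.

```python
import re
import unicodedata

# Hypothetical pattern library entries; a real library would be loaded from
# the versioned configuration artefact described above.
PATTERNS = {
    "PI-001": re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    "PI-002": re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.IGNORECASE),
    "JB-001": re.compile(r"you\s+are\s+now\s+(dan\b|an?\s+unrestricted)", re.IGNORECASE),
}

# Translation table that deletes common zero-width characters.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalise(text: str) -> str:
    """Fold compatibility characters, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()


def match_patterns(prompt: str) -> list:
    """Return the IDs of all patterns that match the normalised prompt."""
    cleaned = normalise(prompt)
    return [pid for pid, rx in PATTERNS.items() if rx.search(cleaned)]
```

Returning every matching ID, rather than stopping at the first hit, keeps the match details available for the audit log.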
Layer 2: Semantic Analysis (Vector Similarity)
Pattern matching is defeated by paraphrasing. An attacker who knows the patterns can rephrase an injection attack to avoid any string-match. The semantic layer addresses this by embedding the input into a vector space and computing similarity against a library of known malicious embeddings. A cosine similarity threshold (typically 0.85) triggers a block. This layer catches paraphrased attacks and novel variants that share semantic intent with known attacks. It adds approximately 10–20ms using a lightweight embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2 running on CPU). The embedding library is updated with new malicious examples whenever a new attack pattern is identified in the wild.
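The similarity check can be sketched as follows. This is an illustration only: toy 3-dimensional vectors stand in for real embeddings, a linear scan stands in for the ANN index, and the attack IDs are hypothetical. Only the 0.85 threshold comes from the text above.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # the typical threshold quoted above


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_verdict(embedding, library, threshold=SIMILARITY_THRESHOLD):
    """Compare against every known-malicious embedding (library must be non-empty)."""
    scored = ((attack_id, cosine_similarity(embedding, vec)) for attack_id, vec in library.items())
    attack_id, score = max(scored, key=lambda pair: pair[1])
    return {"blocked": score >= threshold, "nearest": attack_id, "score": score}


# Toy vectors; a real library would hold model-generated embeddings.
LIBRARY = {"ATK-17": [0.9, 0.1, 0.0], "ATK-42": [0.0, 0.2, 0.9]}
```

The threshold is a tuning parameter: lowering it catches looser paraphrases at the cost of more false positives on benign text that happens to sit near an attack in embedding space.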
Layer 3: ML Classifier (Probabilistic)
The third layer applies a fine-tuned binary classifier trained specifically to distinguish legitimate prompts from adversarial ones. Unlike the semantic layer, which measures distance to known attacks, the classifier learns decision boundaries from a labelled dataset of benign and malicious prompts, including novel attack types. This layer provides the highest accuracy but also the highest latency (30–80ms on CPU, 5–15ms on GPU). For latency-sensitive applications, this layer can run asynchronously: the request is allowed through with monitoring, and the classifier decision is stored. If the classifier assigns a high probability of injection, that decision triggers retrospective review and feeds an update to the pattern library.
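One way to picture the synchronous and asynchronous modes is the routing sketch below. It assumes the 0.60 cut-off in the data flow table applies to an upstream suspicion score such as the Layer 2 similarity; the review probability, the in-process queue, and `toy_classifier` are hypothetical placeholders for the real model and feedback pipeline.

```python
from queue import Queue
from typing import Callable, Optional, Tuple

SYNC_CUTOFF = 0.60         # assumed to apply to the upstream (Layer 2) score
REVIEW_PROBABILITY = 0.90  # hypothetical trigger for retrospective review

Classifier = Callable[[str], float]  # returns P(injection) in [0, 1]


def route_layer3(prompt: str, upstream_score: float, classifier: Classifier,
                 deferred: Queue) -> Tuple[str, Optional[float]]:
    """Classify inline for suspicious prompts; defer scoring for the rest."""
    if upstream_score > SYNC_CUTOFF:
        return "sync", classifier(prompt)
    deferred.put(prompt)  # the request proceeds; scoring happens later
    return "async", None


def drain_deferred(deferred: Queue, classifier: Classifier) -> list:
    """Score deferred prompts and return those needing retrospective review."""
    flagged = []
    while not deferred.empty():
        prompt = deferred.get()
        score = classifier(prompt)
        if score >= REVIEW_PROBABILITY:
            flagged.append((prompt, score))
    return flagged


def toy_classifier(prompt: str) -> float:
    """Placeholder scorer; a real deployment would call the fine-tuned model."""
    return 0.95 if "ignore previous instructions" in prompt.lower() else 0.05
```

In this sketch a deferred prompt that later scores high has already reached the LLM, which is why the asynchronous mode depends on monitoring and retrospective review rather than prevention.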
Policy Enforcement Layer
Beyond injection detection, the firewall enforces content policies: does this input request content that violates the organisation's AI Acceptable Use Policy? This includes checks for: requests for content involving minors, attempts to obtain detailed instructions for illegal activities, requests that target specific individuals, and use-case-specific policy violations (e.g., a financial assistant being asked to produce stock tips). Policy checks use a combination of pattern matching and classifier models trained on the specific policy domain.
Allow/Deny List Management
The firewall maintains per-application allow lists (patterns that should never be blocked regardless of classifier score — e.g., legitimate security research applications) and deny lists (patterns that should always be blocked). Allow lists are critical for preventing false positives in legitimate use cases; they require governance review before addition to prevent allow list abuse.
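A minimal sketch of how per-application overrides might be applied on top of a computed disposition. The application ID and patterns are hypothetical, and the precedence shown (deny list wins when both lists match) is an assumption; the pattern itself does not specify which list takes priority.

```python
import re

# Hypothetical per-application override lists.
OVERRIDES = {
    "sec-research-app": {
        "allow": [re.compile(r"\bexample jailbreak prompt\b", re.IGNORECASE)],
        "deny": [re.compile(r"\bdisable (the )?firewall\b", re.IGNORECASE)],
    }
}


def apply_overrides(app_id: str, prompt: str, disposition: str) -> str:
    """Return the final disposition after allow/deny overrides."""
    lists = OVERRIDES.get(app_id, {})
    # Assumption: deny entries take precedence over allow entries.
    if any(rx.search(prompt) for rx in lists.get("deny", [])):
        return "BLOCK"
    if any(rx.search(prompt) for rx in lists.get("allow", [])):
        return "ALLOW"
    return disposition
```

Because an allow entry overrides every detection layer, each addition is effectively a standing exception, which is the reason for the governance review noted above.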
Sanitisation Path
Not all suspicious inputs result in a block. For inputs that are ambiguous (e.g., a pattern match but low semantic similarity), the firewall can sanitise: it strips suspicious instruction sequences while preserving the legitimate intent of the input. Sanitisation events are logged and flagged for review to identify attack pattern evolution.
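The regex-substitution variant of sanitisation could look like the sketch below. The fragment patterns are hypothetical, and removing a matched instruction up to the next sentence terminator is one simple heuristic, not a complete approach.

```python
import re

# Hypothetical fragments; each match runs to the end of its sentence.
SUSPICIOUS_FRAGMENTS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions[^.?!]*[.?!]?", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?system\s+prompt[^.?!]*[.?!]?", re.IGNORECASE),
]


def sanitise(prompt: str):
    """Strip suspicious fragments; return (cleaned_prompt, removed_fragments)."""
    removed = []
    cleaned = prompt
    for rx in SUSPICIOUS_FRAGMENTS:
        removed.extend(m.group(0).strip() for m in rx.finditer(cleaned))
        cleaned = rx.sub("", cleaned)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, removed
```

Returning the removed fragments alongside the cleaned prompt gives the event logger what it needs to flag the sanitisation for review.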
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Pattern Matching Engine | Rule Engine | Deterministic check against library of known injection strings and regex patterns | Hyperscan, PCRE2, re2, custom trie-based matcher | High |
| Embedding Service | ML Inference | Converts input to vector representation for semantic similarity comparison | sentence-transformers, OpenAI Embeddings, Cohere Embed (local deployment preferred) | High |
| Malicious Embedding Library | Vector Store | Pre-computed embeddings of known attack prompts; indexed for ANN search | FAISS, hnswlib, Pinecone (local), ChromaDB | High |
| ML Classifier | ML Inference | Fine-tuned binary classifier for injection/jailbreak detection | DistilBERT fine-tuned, DeBERTa, custom logistic regression on embeddings | High |
| Policy Rule Engine | Rule Engine | Evaluates content policy rules against prompt content | OPA, custom rule DSL, AWS Comprehend Custom Classifier | High |
| Pattern Library | Configuration | Versioned library of known attack patterns (regex, exact match, fuzzy match) | Git-versioned YAML/JSON, updated via CI/CD | Critical |
| Allow/Deny List Manager | Configuration | Per-application overrides for firewall decisions | Key-value store (Redis), configuration service | Medium |
| Sanitisation Engine | Transformation | Strips suspicious instruction fragments while preserving legitimate intent | Custom NLP, regex substitution | Medium |
| Firewall Event Logger | Observability | Structured logging of all firewall events (blocks, allows, sanitisations) for security review | Kafka, Fluentd, CloudWatch Logs | Critical |
| Feedback Pipeline | ML Operations | Routes flagged inputs to analyst review; feeds confirmed attacks into retraining | Label Studio, Prodigy, custom review UI | Medium |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Application / AI Gateway | Submits assembled prompt (system + user turn) to firewall entry point | Prompt text submitted for inspection |
| 2 | Pattern Matching Engine | Applies all regex and exact-match patterns from pattern library; records match details if found | MATCH or NO_MATCH with match details |
| 3 | Embedding Service | Converts prompt to vector embedding using local embedding model | Embedding vector (e.g., 384 dimensions) |
| 4 | Similarity Search | Computes cosine similarity against malicious embedding library using ANN index | Nearest-neighbour distance and similarity score |
| 5 | ML Classifier | Runs fine-tuned classifier on prompt (synchronous when the upstream similarity score exceeds 0.60; asynchronous otherwise) | Probability score: P(injection), P(jailbreak), P(policy_violation) |
| 6 | Policy Rule Engine | Evaluates content policy rules against prompt; applies use-case-specific deny rules | POLICY_PASS or POLICY_VIOLATION with rule ID |
| 7 | Decision Aggregator | Combines results from all layers; determines final action (BLOCK, SANITISE, ALLOW+WATCH, ALLOW) | Final disposition with reason codes |
| 8 | Firewall Event Logger | Writes structured event record regardless of disposition | Audit log entry with: disposition, reason codes, model scores, timestamp, trace_id |
| 9 | Response to Caller | Returns disposition to AI Gateway / application | ALLOW (forward prompt), BLOCK (return 400), SANITISE (return modified prompt) |
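Step 7 (the Decision Aggregator) can be sketched as a small rule cascade. Apart from the 0.85 similarity threshold, every threshold here is hypothetical, and treating an uncorroborated pattern match as a SANITISE case follows the ambiguous-input example in the Sanitisation Path description rather than a rule the pattern states outright.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

SEMANTIC_BLOCK = 0.85    # typical Layer 2 threshold quoted in Section 4
SEMANTIC_WATCH = 0.60    # hypothetical lower bound of the watch band
CLASSIFIER_BLOCK = 0.90  # hypothetical
CLASSIFIER_WATCH = 0.50  # hypothetical


@dataclass
class LayerResults:
    pattern_ids: List[str] = field(default_factory=list)
    semantic_score: float = 0.0
    classifier_score: Optional[float] = None  # None if Layer 3 ran asynchronously
    policy_rule_id: Optional[str] = None


def aggregate(r: LayerResults) -> Tuple[str, List[str]]:
    """Combine layer outputs into a final disposition plus reason codes."""
    block = []
    if r.policy_rule_id:
        block.append(f"policy:{r.policy_rule_id}")
    if r.semantic_score >= SEMANTIC_BLOCK:
        block.append("semantic")
    if r.classifier_score is not None and r.classifier_score >= CLASSIFIER_BLOCK:
        block.append("classifier")
    if block:
        return "BLOCK", block
    if r.pattern_ids:
        # Pattern hit without corroboration: treated as ambiguous in this sketch.
        return "SANITISE", [f"pattern:{p}" for p in r.pattern_ids]
    watch = []
    if r.semantic_score >= SEMANTIC_WATCH:
        watch.append("semantic_watch")
    if r.classifier_score is not None and r.classifier_score >= CLASSIFIER_WATCH:
        watch.append("classifier_watch")
    return ("ALLOW+WATCH", watch) if watch else ("ALLOW", [])
```

A stricter deployment could instead block on any pattern match; the cascade order is where that policy choice lives.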
Error Flow
| Error Condition | Firewall Behaviour | Disposition | Alert |
|---|---|---|---|
| Embedding service unavailable | Skip Layer 2; proceed with Layers 1, 3, and policy | ALLOW with degraded confidence flag | Warning alert: Layer 2 unavailable |
| Classifier unavailable | Skip Layer 3; proceed with Layers 1 and 2 | ALLOW with degraded confidence flag | Warning alert: Layer 3 unavailable |
| Pattern library stale (>24h without update) | Continue with cached library | Stale library flag on all decisions | Alert: pattern library update required |
| Firewall latency > 200ms (SLA breach) | Log timeout; fail-open (ALLOW) to protect availability | ALLOW with timeout flag for async review | SLA breach alert |
| All detection layers unavailable | Fail closed: BLOCK all requests | BLOCK | Critical alert: firewall fully unavailable |
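The error-flow table can be read as a small decision function. The sketch below is one interpretation: the layer names, flag strings, and the `CONTINUE` sentinel (meaning "evaluate with whatever layers remain") are hypothetical; the 200ms budget and 24-hour staleness limit come from the table.

```python
LATENCY_BUDGET_MS = 200   # SLA breach threshold from the table
MAX_LIBRARY_AGE_H = 24    # staleness limit from the table


def degraded_mode(available: dict, elapsed_ms: float, library_age_h: float = 0.0):
    """Map layer health onto a (action, flags) pair."""
    if not any(available.values()):
        return "BLOCK", ["firewall_unavailable"]  # fail closed
    if elapsed_ms > LATENCY_BUDGET_MS:
        return "ALLOW", ["timeout"]               # fail open, queue for async review
    flags = []
    if not available.get("semantic", True):
        flags.append("layer2_unavailable")
    if not available.get("classifier", True):
        flags.append("layer3_unavailable")
    if library_age_h > MAX_LIBRARY_AGE_H:
        flags.append("stale_library")
    return "CONTINUE", flags
```

The flags are intended to travel with the final decision so that analysts can see which requests were judged with reduced coverage.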
8. Security Considerations
Authentication & Authorisation
- The firewall service itself must be accessible only from authorised callers (AI Gateway, application middleware). mTLS or API key authentication prevents direct access.
- Pattern library and classifier model updates are authorised through a signed artefact pipeline — an attacker who can modify the pattern library can blind the firewall to specific attacks.
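As an illustration of refusing unsigned pattern library updates, the sketch below verifies an HMAC-SHA256 tag before parsing the library. This is a simplified stand-in: a production signed-artefact pipeline would more likely use asymmetric signatures so that the firewall holds only a public key. The function names and the JSON layout are hypothetical.

```python
import hashlib
import hmac
import json


def sign_library(payload: bytes, key: bytes) -> str:
    """Produce a hex HMAC-SHA256 tag over the serialised library."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def load_verified_library(payload: bytes, signature: str, key: bytes) -> dict:
    """Parse the library only if the tag matches; otherwise refuse to load."""
    expected = sign_library(payload, key)
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("pattern library signature mismatch; refusing to load")
    return json.loads(payload)
```

On a verification failure the firewall would keep serving the previously loaded library, which matches the rollback recovery described for pattern library corruption.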
Secrets Management
- If the firewall uses a cloud embedding API (e.g., OpenAI Embeddings for the similarity layer), the API key must be managed per EAAPL-SEC008. Preferably, use a locally-deployed embedding model to avoid sending potentially sensitive prompt content to an external embedding provider.
Data Classification
- Prompts processed by the firewall may contain sensitive data (PII, confidential context). The firewall should not log full prompt content at INFO level; log only truncated indicators, hashes, or anonymised representations unless explicitly configured for full-content logging under a controlled data handling agreement.
Encryption
- All firewall service communication over TLS 1.3.
- Firewall event logs encrypted at rest.
- Classifier model weights stored in encrypted object storage; access audited.
False Positive Management
- False positives (blocking legitimate inputs) are a security misconfiguration, not a minor inconvenience. High false-positive rates cause users to route around the firewall or disable it. Maintain false-positive rate <0.5% of legitimate traffic.
Auditability
- Every firewall decision is logged with full reasoning: which layers triggered, what scores were returned, which pattern matched. This supports both security operations (investigating incidents) and model improvement (identifying false positives).
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Prompt Firewall Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Primary purpose: detect and block direct and indirect prompt injection | Critical |
| LLM02: Insecure Output Handling | Prevents injection attacks that cause unsafe outputs at the source | High (upstream of output) |
| LLM03: Training Data Poisoning | Out of scope for this pattern | None |
| LLM04: Model Denial of Service | Detects resource-exhausting prompt patterns (extremely long nested instructions) | Medium |
| LLM05: Supply Chain Vulnerabilities | Pattern library update pipeline must be secured against supply chain attack | Medium |
| LLM06: Sensitive Information Disclosure | Blocks prompts crafted to elicit disclosure of system prompt or context window contents | High |
| LLM07: Insecure Plugin Design | Blocks injection attacks targeting agentic tool call triggering | High |
| LLM08: Excessive Agency | Blocks prompts that attempt to expand model's scope of action beyond intended permissions | High |
| LLM09: Overreliance | Out of scope | None |
| LLM10: Model Theft | Blocks prompts designed to extract model training data or behaviour through systematic querying | Medium |
9. Governance Considerations
Responsible AI
- The prompt firewall enforces the organisation's AI Acceptable Use Policy at the input layer. Policy rules must be reviewed by the AI Ethics and Governance function before deployment to ensure they do not introduce discriminatory filtering (e.g., blocking inputs in non-English languages disproportionately).
Model Risk Management
- The classifier model used in Layer 3 is itself an AI model and subject to model risk management: it must be validated on representative samples of legitimate traffic before deployment, and its false-positive and false-negative rates must be documented.
Human Approval
- ALLOW+WATCH dispositions (medium-confidence suspicious inputs that were allowed through) must be reviewed by a security analyst within 24 hours. Confirmed injections trigger classifier retraining.
Traceability
- Every block event is traceable to the specific pattern, embedding similarity score, or classifier score that triggered it. This supports appeals processes (a user who believes their input was wrongly blocked can request a review) and regulatory enquiries.
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Pattern Library Release Notes | Security Team | With each library update | Documents new patterns added, patterns retired, false-positive corrections |
| Classifier Validation Report | AI Risk Team | Quarterly; with each model update | Documents FPR, FNR, precision, recall on validation dataset |
| Firewall Policy Review | AI Governance | Quarterly | Reviews policy rules for AUP alignment, discriminatory impact assessment |
| False Positive Trend Report | AI Platform Team | Monthly | Tracks FPR trend; triggers tuning if >0.5% |
| Security Incident Log | Security Operations | Continuous | Record of all BLOCK events with confirmed/unconfirmed injection classification |
10. Operational Considerations
Monitoring
- Real-time dashboard: block rate by layer (Pattern / Semantic / Classifier / Policy), false-positive rate (from analyst review), latency per layer, classifier confidence distribution.
- Alerting: block rate spike (>10× baseline) = possible coordinated attack; FPR spike = classifier degradation; layer unavailability = degraded defence posture.
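The block-rate spike alert reduces to a ratio check against a baseline. In this sketch the 10× multiplier comes from the alerting rule above, while the minimum sample size is a hypothetical guard against noisy alerts on low-traffic windows.

```python
def block_rate_spike(blocks: int, requests: int, baseline_rate: float,
                     multiplier: float = 10.0, min_requests: int = 100) -> bool:
    """True when the window's block rate exceeds multiplier x baseline."""
    if requests < min_requests:
        return False  # too few requests to judge the rate reliably
    return (blocks / requests) > multiplier * baseline_rate
```

For example, with a 0.5% baseline, 60 blocks in 1,000 requests (6%) exceeds the 5% trigger, while 40 blocks (4%) does not.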
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Firewall decision latency p99 | <80ms (synchronous path) | Span: firewall_entry → firewall_decision |
| False-positive rate | <0.5% of legitimate traffic | Monthly analyst review sample |
| Pattern library freshness | <24h since last update check | Library update timestamp metric |
| Detection rate for known attacks | >99% of test attack suite blocked | Weekly automated red-team test suite |
| Firewall availability | 99.9% (fail-open if unavailable) | Synthetic health checks |
Logging
- Structured JSON. Mandatory fields: trace_id, disposition, layer_triggered, pattern_id (if pattern match), semantic_score, classifier_score, policy_rule_id, latency_ms, input_hash, timestamp_utc.
- Full input content logged only at AUDIT level under controlled access; standard logs contain only hash and truncated prefix.
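A minimal sketch of a standard-level log record carrying the mandatory fields. The `input_prefix` field name and the 16-character prefix length are assumptions for illustration; the source specifies only that standard logs hold a hash and a truncated prefix.

```python
import hashlib
import json
from datetime import datetime, timezone

MANDATORY_FIELDS = (
    "trace_id", "disposition", "layer_triggered", "pattern_id", "semantic_score",
    "classifier_score", "policy_rule_id", "latency_ms", "input_hash", "timestamp_utc",
)


def build_event(prompt, *, trace_id, disposition, layer_triggered, latency_ms,
                pattern_id=None, semantic_score=None, classifier_score=None,
                policy_rule_id=None, prefix_chars=16) -> str:
    """Serialise a firewall event without the full prompt content."""
    event = {
        "trace_id": trace_id,
        "disposition": disposition,
        "layer_triggered": layer_triggered,
        "pattern_id": pattern_id,
        "semantic_score": semantic_score,
        "classifier_score": classifier_score,
        "policy_rule_id": policy_rule_id,
        "latency_ms": latency_ms,
        "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "input_prefix": prompt[:prefix_chars],  # hypothetical field name
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True)
```

Note that a short prefix can itself contain sensitive text, so the prefix length is worth reviewing against the data classification rules in Section 8.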
Incident Management
- Block rate spike → automated alert to Security Operations.
- Confirmed novel injection technique → Security Operations escalates to threat intelligence team; pattern library update initiated within 4 hours.
- Classifier false-positive spike → immediate escalation to AI Platform team; temporary threshold relaxation if FPR >2%.
DR
| Scenario | RTO | Recovery |
|---|---|---|
| Layer 3 classifier unavailable | 0 (fail-open without Layer 3) | Deploy classifier to backup endpoint; alert |
| Embedding service unavailable | 0 (fail-open without Layer 2) | Restore embedding service; alert |
| Pattern library corruption | 15min | Rollback to previous version via artefact registry |
| Complete firewall service failure | 0 (fail-open; alert) | Immediate recovery required; escalate to P1 |
Capacity
- Pattern matching: CPU-bound, scales linearly with rule count × request rate. 10,000 patterns at 1,000 req/s: ~2 CPU cores.
- Embedding inference: 30ms/request on a single CPU core (~33 req/s per core); 8 cores handle ~260 req/s; GPU (T4): ~5ms/request → ~200 req/s/GPU.
- Classifier inference: similar to embedding; can be batched for throughput.
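The throughput figures above follow from simple arithmetic, assuming each worker (core or GPU) handles one request at a time with no batching. The helper names below are illustrative.

```python
import math


def throughput_per_worker(latency_ms: float) -> float:
    """Requests per second one worker sustains when serving sequentially."""
    return 1000.0 / latency_ms


def workers_needed(target_rps: float, latency_ms: float) -> int:
    """Workers required for a target rate, before any headroom is added."""
    return math.ceil(target_rps * latency_ms / 1000.0)


cpu_8_cores = 8 * throughput_per_worker(30)  # about 267 req/s, quoted as ~260
gpu_t4 = throughput_per_worker(5)            # 200 req/s
```

By the same arithmetic, 1,000 req/s of embedding inference needs roughly 30 CPU cores or 5 unbatched T4 GPUs, which is why batching and shared inference services matter at scale.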
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Relative Impact |
|---|---|---|
| ML inference compute | GPU or CPU instances for embedding model + classifier | High |
| Pattern library maintenance | Security engineer time to curate, test, and release pattern updates | Medium |
| Classifier retraining | Periodic retraining on new labelled examples; GPU compute for training | Medium |
| False-positive review | Analyst time to review ALLOW+WATCH decisions | Low–Medium |
| Embedding model licensing | If using commercial embedding API (OpenAI, Cohere) | Medium (eliminated with local deployment) |
Scaling Risks
- Classifier inference becomes a bottleneck at high request rates if running on CPU. Provision GPU inference early.
- Embedding library grows with each new attack pattern added; ANN search latency increases. Prune stale embeddings and monitor search latency.
Optimisations
- Deploy embedding and classifier models as shared services (not per-application) to amortise GPU cost.
- Cache pattern matching results for identical inputs (hash-based deduplication) — many attackers repeat the same payload.
- Run Layer 3 classifier asynchronously for low-risk inputs to reduce synchronous path latency and allow CPU inference to be sufficient.
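Hash-based deduplication can be sketched as a small LRU cache keyed on a digest of the prompt. Including the pattern library version in the key is an assumption added here so that cached verdicts do not outlive a library update; the class and method names are hypothetical.

```python
import hashlib
from collections import OrderedDict


class VerdictCache:
    """LRU cache of firewall verdicts keyed by (library version, prompt) digest."""

    def __init__(self, max_entries: int = 10_000):
        self._entries = OrderedDict()
        self._max = max_entries

    @staticmethod
    def _key(prompt: str, library_version: str) -> str:
        data = f"{library_version}\x00{prompt}".encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def get(self, prompt: str, library_version: str):
        key = self._key(prompt, library_version)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        return None

    def put(self, prompt: str, library_version: str, verdict: str) -> None:
        key = self._key(prompt, library_version)
        self._entries[key] = verdict
        self._entries.move_to_end(key)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
```

Storing only digests means the cache never holds raw prompt text, which keeps it outside the full-content logging controls.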
Indicative Cost Range
| Scale | Monthly AWS Cost (USD) | Notes |
|---|---|---|
| Small (< 500K req/day) | $300–$800 | 2 CPU inference instances (c6i.2xlarge), ElastiCache for embedding cache |
| Medium (500K–10M req/day) | $1,500–$5,000 | 1–2 g4dn.xlarge GPU instances, load balanced; auto-scaling |
| Large (> 10M req/day) | $10,000–$30,000 | GPU inference cluster (g4dn.12xlarge × N); model server (Triton) |
12. Trade-Off Analysis
Option Comparison
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Rule-only firewall | Layer 1 (pattern matching) only | Extremely fast (<1ms); zero ML dependencies; deterministic | Defeated by paraphrasing; requires manual pattern maintenance; cannot detect novel attacks | Low-risk internal tools; latency-critical scenarios |
| B: Semantic + Rule firewall | Layers 1 + 2 (pattern + embedding similarity) | Catches paraphrased attacks; moderate latency (20–30ms); no classifier training cost | Does not generalise to truly novel attack classes; embedding library requires curation | Most production use cases; balanced cost/protection |
| C: Full three-layer firewall | Layers 1 + 2 + 3 (pattern + embedding + classifier) | Highest detection rate; generalises to novel attacks; continuous improvement via feedback | Highest latency (50–80ms sync); ML ops burden (classifier maintenance); GPU cost | High-risk, public-facing AI applications; regulated use cases |
| D: Cloud-native content safety | Azure AI Content Safety, AWS Bedrock Guardrails, Google Cloud DLP | Low operational burden; managed SLAs; continuously updated by provider | Limited customisation; sends prompt content to external service (data residency risk); may not cover all injection types | Cloud-committed organisations; non-sensitive content |
Architectural Tensions
| Tension | Trade-Off |
|---|---|
| Detection Rate vs Latency | More detection layers = higher accuracy but higher latency. Resolution: async Layer 3 for medium-confidence inputs; sync only for high-confidence suspects. |
| Sensitivity vs False Positives | Lowering classifier thresholds catches more attacks but blocks more legitimate inputs. Resolution: tune thresholds against organisation-specific traffic using A/B shadow mode before enforcing. |
| Centralisation vs Application Context | A shared gateway-level firewall lacks application-specific context (e.g., a coding assistant has different legitimate input patterns than a customer service bot). Resolution: per-application allow lists and policy profiles configurable in the shared firewall. |
| Local vs Cloud Embedding | Local deployment protects data residency; cloud embedding APIs are faster to deploy and continuously updated. Resolution: default to local; allow cloud only for non-sensitive use cases with contractual data processing agreements. |
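Threshold tuning in shadow mode amounts to replaying labelled traffic at candidate thresholds and measuring the resulting rates. The sketch below assumes a list of (score, is_attack) pairs collected while the firewall was observing but not enforcing; the function names are illustrative, and the 0.5% budget mirrors the false-positive target in Section 8.

```python
def rates_at_threshold(scored, threshold):
    """Return (false_positive_rate, detection_rate) for a list of (score, is_attack)."""
    benign = [s for s, is_attack in scored if not is_attack]
    attacks = [s for s, is_attack in scored if is_attack]
    fpr = sum(s >= threshold for s in benign) / len(benign) if benign else 0.0
    tpr = sum(s >= threshold for s in attacks) / len(attacks) if attacks else 0.0
    return fpr, tpr


def pick_threshold(scored, candidates, max_fpr=0.005):
    """Lowest candidate threshold whose FPR stays within budget, else None."""
    for threshold in sorted(candidates):
        fpr, _ = rates_at_threshold(scored, threshold)
        if fpr <= max_fpr:
            return threshold
    return None
```

Choosing the lowest threshold that fits the false-positive budget maximises detection within that budget, since detection rate can only fall as the threshold rises.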
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Pattern library not updated (stale patterns) | Medium | High (missed novel attack variants) | Pattern library age metric > 24h → alert | Automated CI/CD pipeline for pattern library updates; runbook for manual update |
| Classifier model drift (degraded accuracy over time) | Medium | High (increased FNR for evolved attack styles) | Weekly automated red-team test suite; FNR trend | Quarterly retraining; rollback to previous model version |
| Embedding library too large (ANN search latency spike) | Low | Medium (latency SLO breach) | ANN search latency metric | Prune stale embeddings; increase ANN index resources |
| False positive spike (legitimate inputs blocked) | Medium | High (user experience degradation; firewall bypass attempts) | FPR metric from analyst review | Threshold relaxation; allow list additions; root cause investigation |
| Layer 1 + 2 both fail simultaneously | Very Low | Critical (reliance on Layer 3 only or fail-open) | Layer health metrics | Multi-AZ deployment; independent failure domains for each layer |
| Adversarial evasion of all three layers | Low | Critical (successful injection reaching LLM) | Anomalous LLM output patterns (caught by SEC006 output filter) | Output filter provides second defence; incident response; pattern library update |
Cascading Failure
If the firewall fails open (allowing all traffic) during a targeted attack, the LLM's output filter (EAAPL-SEC006) becomes the last line of defence. Output filters are less effective at preventing injection (they can only catch the consequences, not the attack itself). Ensure output filtering is independently deployed and does not share failure domains with the input firewall.
14. Regulatory Considerations
| Regulation | Requirement | Prompt Firewall Implementation |
|---|---|---|
| APRA CPS234 §21 | Controls must be commensurate with vulnerability and threat environment | Three-layer detection architecture with continuous pattern updates matches threat-proportionate control requirement |
| EU AI Act Art. 9 (Risk Management) | High-risk AI systems must implement appropriate risk management | Prompt firewall directly implements input risk management for high-risk AI use cases |
| EU AI Act Art. 15 (Robustness & Accuracy) | High-risk AI systems must be resilient against attempts to alter outputs | Explicit jailbreak and injection defence addresses robustness requirement |
| Australian Privacy Act 1988 | Prevent unauthorised access to personal information | Blocking injection attacks that attempt to exfiltrate personal information from context window |
| NIST AI RMF MANAGE 1.3 | Responses to identified risks are monitored and adjusted | Feedback loop from analyst review to classifier retraining implements continuous risk management |
| ISO/IEC 42001 §8.4 (AI System Operation) | Monitor AI system inputs and outputs | Firewall event log provides required input monitoring artefact |
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Pattern matching | Lambda (custom Hyperscan-based filter) triggered from API Gateway |
| Embedding service | SageMaker endpoint (sentence-transformers) or Bedrock Titan Embeddings |
| Similarity search | OpenSearch k-NN index |
| Classifier | SageMaker endpoint (fine-tuned DeBERTa) |
| Policy rules | AWS Bedrock Guardrails (content filtering) + custom Lambda rules |
| Event logging | CloudWatch Logs + Kinesis Firehose → S3 |
Azure
| Component | Azure Service |
|---|---|
| Pattern + classifier | Azure AI Content Safety (prompt shield) + custom APIM policy |
| Embedding | Azure OpenAI text-embedding-ada-002 (or local via AKS) |
| Similarity search | Azure AI Search with vector search |
| Policy rules | Azure AI Content Safety content filters |
| Event logging | Azure Monitor → Log Analytics → Immutable storage |
GCP
| Component | GCP Service |
|---|---|
| Pattern matching | Cloud Functions (custom) + Sensitive Data Protection (DLP) |
| Embedding | Vertex AI Text Embeddings |
| Similarity search | Vertex AI Vector Search |
| Classifier | Vertex AI custom model endpoint |
| Event logging | Cloud Logging → BigQuery → Cloud Storage |
On-Premises
| Component | Technology |
|---|---|
| Pattern matching | Hyperscan library in Go/Rust service |
| Embedding | Sentence-transformers on GPU server (NVIDIA T4) |
| Similarity search | FAISS (Facebook AI Similarity Search) |
| Classifier | ONNX Runtime + fine-tuned DeBERTa |
| Policy rules | OPA (Open Policy Agent) with custom Rego rules |
| Event logging | Kafka → Elasticsearch |
16. Related Patterns
| Pattern | ID | Relationship |
|---|---|---|
| AI Gateway | EAAPL-SEC001 | Parent pattern: prompt firewall deployed as a stage within the AI Gateway |
| LLM Input Sanitisation | EAAPL-SEC005 | Complementary: SEC005 handles PII/schema validation; SEC002 handles adversarial intent detection |
| AI Output Filtering | EAAPL-SEC006 | Defence-in-depth pair: SEC002 blocks at input; SEC006 catches consequences at output |
| Adversarial Input Defence | EAAPL-SEC010 | Extends SEC002 to handle adversarial ML attacks beyond prompt injection |
| AI Data Classification | EAAPL-SEC009 | Classification labels inform SEC002 policy rules (higher-sensitivity data = stricter injection detection threshold) |
| Secure Tool Invocation | EAAPL-SEC004 | SEC002 blocks injection attacks targeting tool call manipulation; SEC004 enforces safe execution after the prompt passes the firewall |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 5 | Well-defined scope and detection pipeline |
| Technology availability | 4 | Strong OSS options; cloud-native solutions emerging; GPU inference recommended when the full pipeline runs synchronously |
| Industry adoption | 3 | Adopted by security-mature AI teams; not yet universal; underestimated by many organisations |
| Attack landscape coverage | 4 | Covers known attack classes well; novel attacks remain a challenge |
| Operational tooling | 3 | Pattern library management and classifier MLOps require custom tooling investment |
| Regulatory alignment | 4 | Strong alignment with EU AI Act robustness requirements; increasingly referenced in financial services guidance |
| Community knowledge | 3 | Growing body of research (OWASP LLM, academic); practitioner knowledge still developing |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-02-10 | Security Architecture Team | Initial pattern definition |
| 1.1 | 2024-05-15 | Security Architecture Team | Added indirect injection detection; expanded Layer 2 semantic analysis detail |
| 1.2 | 2024-08-20 | Security Architecture Team | Updated OWASP LLM Top 10 mapping to 2024 edition; added agentic context guidance |
| 1.3 | 2025-01-10 | Security Architecture Team | Added async Layer 3 mode; updated cost guidance; added cloud-native option (Option D) |