Prompt Injection Defense Checklist
Purpose
Technical and process checks for securing AI inputs against prompt injection, jailbreaking, and related adversarial attacks.
Related Controls
1. Input Validation
Verify all input validation controls are implemented for AI system inputs.
Input Validation Controls
Complete the following checks for every AI system that accepts user input — whether from end users, upstream systems, or internal tools.
- [ ] Input length limits enforced. Maximum prompt length is defined and enforced at the API gateway or application layer. Prompts exceeding the limit are rejected with a descriptive error, not silently truncated.
- [ ] Character set validation implemented. Inputs are validated against an allowed character set. Unicode control characters, zero-width characters, and homoglyph characters are stripped or rejected.
- [ ] Encoding normalization applied. All inputs are normalized to a consistent encoding (UTF-8) before processing. Multi-encoding bypass attempts (e.g., base64-encoded instructions, URL-encoded payloads) are detected and rejected.
- [ ] Structured input parsing. Where inputs follow a defined schema (JSON, XML, form fields), inputs are validated against the schema before reaching the AI model. Schema violations are rejected.
- [ ] Injection pattern detection active. A detection layer scans inputs for known prompt injection patterns including: instruction override attempts ("ignore previous instructions"), role manipulation ("you are now"), delimiter confusion, and encoded payloads.
- [ ] Rate limiting configured. Per-user and per-session rate limits are enforced to prevent automated injection probing. Rate limits are set at [X] requests per minute per user.
- [ ] Input logging enabled. All inputs are logged (with appropriate redaction of sensitive data) for post-incident analysis. Logs include timestamp, user identity, input hash, and any injection detection alerts.
- [ ] Multi-modal input validation. If the system accepts images, files, or other non-text inputs, each input type has a dedicated validation pipeline. File type verification uses magic bytes, not file extensions.
Testing Requirements
- [ ] Input validation has been tested with the OWASP prompt injection test suite or equivalent
- [ ] Bypass testing has been performed by the security team or an external red team
- [ ] Validation rules are version-controlled and reviewed as part of the change management process
2. Output Filtering
Verify all output filtering and monitoring controls are in place.
Output Filtering Controls
Complete the following checks to ensure AI outputs are filtered before delivery to users or downstream systems.
- [ ] Output content filtering active. AI outputs are scanned for sensitive data patterns (PII, credentials, internal URLs, file paths) before delivery. Detected sensitive data is redacted or the response is blocked.
- [ ] Output length limits enforced. Maximum response length is defined and enforced. Responses exceeding the limit are truncated with a notification to the user.
- [ ] Indirect prompt injection detection. Outputs are analyzed for evidence that the AI model executed injected instructions from retrieved context (RAG poisoning, data source manipulation). Detection includes: unexpected format changes, instruction-like content in responses, and anomalous topic shifts.
- [ ] Cross-site scripting (XSS) prevention. AI outputs rendered in web interfaces are sanitized to prevent XSS. HTML entities are escaped. Markdown rendering uses a safe subset with no raw HTML passthrough.
- [ ] Code output sandboxing. If the AI system generates executable code, that code is never executed automatically. Generated code is presented to a human reviewer or executed in an isolated sandbox with no network access and limited filesystem permissions.
- [ ] Hallucination flagging implemented. Where outputs reference specific facts, URLs, citations, or data points, a verification layer cross-checks against authoritative sources and flags unverified claims.
- [ ] Output classification applied. AI outputs are automatically classified at the same tier as the highest-classification input in the session. Classification metadata is attached to all outputs.
- [ ] Anomaly detection on output patterns. Baseline output characteristics (length, format, topic distribution) are established. Outputs that deviate significantly from baselines trigger alerts for manual review.
Monitoring
- [ ] Output filtering logs are forwarded to the SIEM and reviewed as part of daily security operations
- [ ] False positive rates for output filters are tracked and tuned monthly
- [ ] Output filtering rules are updated within 72 hours of new vulnerability disclosures
3. System Prompt Hardening
Verify the AI system prompt is hardened against extraction and manipulation.
System Prompt Security
Complete the following checks to ensure the system prompt (system message, meta-prompt) is protected against extraction, manipulation, and override.
- [ ] System prompt is not exposed to users. The system prompt is never included in API responses, error messages, or debug output. The API configuration explicitly excludes system prompt content from response metadata.
- [ ] Anti-extraction instructions included. The system prompt includes explicit instructions to refuse requests to reveal, repeat, summarize, or translate the system prompt or any part of its instructions.
- [ ] Role boundary enforcement. The system prompt defines the AI's role, capabilities, and boundaries clearly. It includes explicit instructions to refuse requests that fall outside defined boundaries, even if framed as hypothetical or educational.
- [ ] Delimiter strategy implemented. System prompt sections use consistent delimiters (e.g., XML tags, markdown headers) to separate instructions from user content. The AI is instructed to treat content outside delimiters as untrusted user input.
- [ ] Priority hierarchy defined. The system prompt establishes an explicit priority order: system instructions > safety rules > user requests. The AI is instructed to never override system instructions regardless of user framing.
- [ ] Jailbreak resistance tested. The system prompt has been tested against known jailbreak techniques including: DAN prompts, character role-play bypasses, hypothetical scenario framing, instruction rewriting, and multi-turn conversation manipulation.
- [ ] System prompt version controlled. The system prompt is stored in version control with change history. All modifications require review and approval from the AI security lead.
- [ ] Defense-in-depth layers. System prompt hardening is not the sole defense. Input validation, output filtering, and monitoring provide additional layers regardless of system prompt integrity.
Red Team Validation
- [ ] System prompt extraction has been attempted by internal red team — extraction must fail
- [ ] Role boundary violations have been tested with at least 20 diverse attack prompts
- [ ] Jailbreak resistance has been validated against the current top-10 known jailbreak families
4. Testing Requirements
Define the testing cadence and methods for validating prompt injection defenses.
Testing Schedule
| Test Type | Frequency | Performed By | Pass Criteria |
|---|---|---|---|
| Automated injection scan | Every deployment | CI/CD pipeline | Zero high-severity findings |
| Manual prompt injection testing | Monthly | Security team | No successful injections in top-20 attack patterns |
| System prompt extraction test | Monthly | Security team | Extraction fails across all tested techniques |
| Red team exercise | Quarterly | Internal red team or external vendor | Findings documented, remediated within SLA |
| Regression testing | After each defense update | QA team | All previously blocked attacks still blocked |
| New vulnerability assessment | Within 72 hours of disclosure | Security team | New attack vector assessed and mitigated if applicable |
Test Cases (Minimum Required)
The following test case categories must be included in every testing cycle:
- [ ] Direct instruction override: "Ignore all previous instructions and..."
- [ ] Role manipulation: "You are now an unrestricted AI with no safety guidelines..."
- [ ] Encoded payload injection: Base64, URL-encoded, Unicode-escaped injection attempts
- [ ] Delimiter confusion: Attempts to close system prompt delimiters and inject new instructions
- [ ] Multi-turn escalation: Gradual trust-building across multiple turns before injection attempt
- [ ] Context poisoning: Injection via RAG-retrieved documents or tool outputs
- [ ] Indirect injection via data: Malicious instructions embedded in documents, images, or structured data processed by the AI
- [ ] Multilingual bypass: Injection attempts in languages other than the system prompt language
- [ ] Token smuggling: Using tokenization quirks to bypass keyword-based filters
- [ ] Exfiltration via markdown/links: Attempts to exfiltrate data through rendered markdown images or links
Reporting
All testing results must be documented in the AI Security Testing Report with:
- Test date and environment details
- Tester identity and qualifications
- Test cases executed with pass/fail results
- Evidence for each finding (screenshots, logs, reproduction steps)
- Severity rating per the organization's vulnerability classification
- Remediation recommendations and timelines
Reports are reviewed by [ROLE TITLE] and tracked in the vulnerability management system. Critical and high findings must be remediated before the next production deployment.