Prompt Injection Defense Checklist

Checklist MODEL

Purpose

Technical and process checks for securing AI inputs against prompt injection, jailbreaking, and related adversarial attacks.

Related Controls

Adversarial Input Defense System Prompt Protection

OWASP LLM01 OWASP ASI01 NIST MP-4

1. Input Validation

Verify all input validation controls are implemented for AI system inputs.

Input Validation Controls

Complete the following checks for every AI system that accepts user input — whether from end users, upstream systems, or internal tools.

[ ] Input length limits enforced. Maximum prompt length is defined and enforced at the API gateway or application layer. Prompts exceeding the limit are rejected with a descriptive error, not silently truncated.
[ ] Character set validation implemented. Inputs are validated against an allowed character set. Unicode control characters, zero-width characters, and homoglyph characters are stripped or rejected.
[ ] Encoding normalization applied. All inputs are normalized to a consistent encoding (UTF-8) before processing. Multi-encoding bypass attempts (e.g., base64-encoded instructions, URL-encoded payloads) are detected and rejected.
[ ] Structured input parsing. Where inputs follow a defined schema (JSON, XML, form fields), inputs are validated against the schema before reaching the AI model. Schema violations are rejected.
[ ] Injection pattern detection active. A detection layer scans inputs for known prompt injection patterns including: instruction override attempts ("ignore previous instructions"), role manipulation ("you are now"), delimiter confusion, and encoded payloads.
[ ] Rate limiting configured. Per-user and per-session rate limits are enforced to prevent automated injection probing. Rate limits are set at [X] requests per minute per user.
[ ] Input logging enabled. All inputs are logged (with appropriate redaction of sensitive data) for post-incident analysis. Logs include timestamp, user identity, input hash, and any injection detection alerts.
[ ] Multi-modal input validation. If the system accepts images, files, or other non-text inputs, each input type has a dedicated validation pipeline. File type verification uses magic bytes, not file extensions.

Testing Requirements

[ ] Input validation has been tested with the OWASP prompt injection test suite or equivalent
[ ] Bypass testing has been performed by the security team or an external red team
[ ] Validation rules are version-controlled and reviewed as part of the change management process

2. Output Filtering

Verify all output filtering and monitoring controls are in place.

Output Filtering Controls

Complete the following checks to ensure AI outputs are filtered before delivery to users or downstream systems.

[ ] Output content filtering active. AI outputs are scanned for sensitive data patterns (PII, credentials, internal URLs, file paths) before delivery. Detected sensitive data is redacted or the response is blocked.
[ ] Output length limits enforced. Maximum response length is defined and enforced. Responses exceeding the limit are truncated with a notification to the user.
[ ] Indirect prompt injection detection. Outputs are analyzed for evidence that the AI model executed injected instructions from retrieved context (RAG poisoning, data source manipulation). Detection includes: unexpected format changes, instruction-like content in responses, and anomalous topic shifts.
[ ] Cross-site scripting (XSS) prevention. AI outputs rendered in web interfaces are sanitized to prevent XSS. HTML entities are escaped. Markdown rendering uses a safe subset with no raw HTML passthrough.
[ ] Code output sandboxing. If the AI system generates executable code, that code is never executed automatically. Generated code is presented to a human reviewer or executed in an isolated sandbox with no network access and limited filesystem permissions.
[ ] Hallucination flagging implemented. Where outputs reference specific facts, URLs, citations, or data points, a verification layer cross-checks against authoritative sources and flags unverified claims.
[ ] Output classification applied. AI outputs are automatically classified at the same tier as the highest-classification input in the session. Classification metadata is attached to all outputs.
[ ] Anomaly detection on output patterns. Baseline output characteristics (length, format, topic distribution) are established. Outputs that deviate significantly from baselines trigger alerts for manual review.

Monitoring

[ ] Output filtering logs are forwarded to the SIEM and reviewed as part of daily security operations
[ ] False positive rates for output filters are tracked and tuned monthly
[ ] Output filtering rules are updated within 72 hours of new vulnerability disclosures

3. System Prompt Hardening

Verify the AI system prompt is hardened against extraction and manipulation.

System Prompt Security

Complete the following checks to ensure the system prompt (system message, meta-prompt) is protected against extraction, manipulation, and override.

[ ] System prompt is not exposed to users. The system prompt is never included in API responses, error messages, or debug output. The API configuration explicitly excludes system prompt content from response metadata.
[ ] Anti-extraction instructions included. The system prompt includes explicit instructions to refuse requests to reveal, repeat, summarize, or translate the system prompt or any part of its instructions.
[ ] Role boundary enforcement. The system prompt defines the AI's role, capabilities, and boundaries clearly. It includes explicit instructions to refuse requests that fall outside defined boundaries, even if framed as hypothetical or educational.
[ ] Delimiter strategy implemented. System prompt sections use consistent delimiters (e.g., XML tags, markdown headers) to separate instructions from user content. The AI is instructed to treat content outside delimiters as untrusted user input.
[ ] Priority hierarchy defined. The system prompt establishes an explicit priority order: system instructions > safety rules > user requests. The AI is instructed to never override system instructions regardless of user framing.
[ ] Jailbreak resistance tested. The system prompt has been tested against known jailbreak techniques including: DAN prompts, character role-play bypasses, hypothetical scenario framing, instruction rewriting, and multi-turn conversation manipulation.
[ ] System prompt version controlled. The system prompt is stored in version control with change history. All modifications require review and approval from the AI security lead.
[ ] Defense-in-depth layers. System prompt hardening is not the sole defense. Input validation, output filtering, and monitoring provide additional layers regardless of system prompt integrity.

Red Team Validation

[ ] System prompt extraction has been attempted by internal red team — extraction must fail
[ ] Role boundary violations have been tested with at least 20 diverse attack prompts
[ ] Jailbreak resistance has been validated against the current top-10 known jailbreak families

4. Testing Requirements

Define the testing cadence and methods for validating prompt injection defenses.

Testing Schedule

Test Type	Frequency	Performed By	Pass Criteria
Automated injection scan	Every deployment	CI/CD pipeline	Zero high-severity findings
Manual prompt injection testing	Monthly	Security team	No successful injections in top-20 attack patterns
System prompt extraction test	Monthly	Security team	Extraction fails across all tested techniques
Red team exercise	Quarterly	Internal red team or external vendor	Findings documented, remediated within SLA
Regression testing	After each defense update	QA team	All previously blocked attacks still blocked
New vulnerability assessment	Within 72 hours of disclosure	Security team	New attack vector assessed and mitigated if applicable

Test Cases (Minimum Required)

The following test case categories must be included in every testing cycle:

[ ] Direct instruction override: "Ignore all previous instructions and..."
[ ] Role manipulation: "You are now an unrestricted AI with no safety guidelines..."
[ ] Encoded payload injection: Base64, URL-encoded, Unicode-escaped injection attempts
[ ] Delimiter confusion: Attempts to close system prompt delimiters and inject new instructions
[ ] Multi-turn escalation: Gradual trust-building across multiple turns before injection attempt
[ ] Context poisoning: Injection via RAG-retrieved documents or tool outputs
[ ] Indirect injection via data: Malicious instructions embedded in documents, images, or structured data processed by the AI
[ ] Multilingual bypass: Injection attempts in languages other than the system prompt language
[ ] Token smuggling: Using tokenization quirks to bypass keyword-based filters
[ ] Exfiltration via markdown/links: Attempts to exfiltrate data through rendered markdown images or links

Reporting

All testing results must be documented in the AI Security Testing Report with:

Test date and environment details
Tester identity and qualifications
Test cases executed with pass/fail results
Evidence for each finding (screenshots, logs, reproduction steps)
Severity rating per the organization's vulnerability classification
Remediation recommendations and timelines

Reports are reviewed by [ROLE TITLE] and tracked in the vulnerability management system. Critical and high findings must be remediated before the next production deployment.

← Back to all templates