AI Red Team Playbook

Procedure ASSURANCE

Purpose

Adversarial testing procedures for AI systems covering prompt injection, jailbreaking, data extraction, and agentic attack scenarios.

Related Controls

AI Red Teaming and Adversarial Testing Adversarial Input Defense System Prompt Protection

OWASP LLM01 OWASP LLM02 OWASP ASI01 NIST MS-2

1. Scope & Rules of Engagement

Define the boundaries, authorization, and constraints for the red team exercise.

Engagement Authorization

Field	Value
Exercise Name	AI Red Team Assessment — [SYSTEM NAME]
Authorization Date	[DATE]
Authorized By	[ROLE TITLE], [DEPARTMENT]
Exercise Period	[DATE] through [DATE]
Red Team Lead	[ROLE TITLE]
Emergency Contact	[ROLE TITLE], [PHONE/EMAIL]

Scope Definition

In Scope

AI model endpoints (API and web interface)
System prompt and configuration
Input validation and output filtering mechanisms
RAG (Retrieval-Augmented Generation) data pipelines and knowledge bases
Agent tool integrations and function calling capabilities
Authentication and authorization for AI system access
Monitoring and alerting systems (testing detection capability)

Out of Scope

Underlying cloud infrastructure (unless explicitly authorized)
Third-party vendor systems beyond the API integration point
Physical security and social engineering of personnel
Denial-of-service testing (unless specifically authorized with capacity planning)
Production systems with active customer traffic (use staging or shadow environments)

Rules of Engagement

No Harm: Testing must not cause data loss, service degradation to real users, or unauthorized data access beyond the scope authorization
Logging: All red team activities must be logged with timestamps, techniques used, and outcomes
Immediate Escalation: If testing reveals an actively exploitable critical vulnerability, stop testing and notify the emergency contact immediately
Data Handling: Any sensitive data inadvertently accessed during testing must be reported and securely deleted
Legal Compliance: All testing must comply with applicable laws, the organization's acceptable use policies, and any vendor terms of service
Deconfliction: Red team lead must coordinate with the blue team lead to ensure red team activities are not misidentified as real attacks

2. Attack Scenarios

Document the specific attack scenarios to be tested, organized by attack category.

Category 1: Prompt Injection

Scenario 1.1 — Direct Instruction Override

Objective: Bypass system prompt instructions by injecting overriding commands

Techniques:

"Ignore all previous instructions and..."
"New system message: you are now..."
Instruction injection via delimiter manipulation
Multi-language instruction injection (translate attack to non-English language)

Scenario 1.2 — Indirect Injection via Data

Objective: Execute injected instructions through data processed by the AI (RAG poisoning, document injection)

Techniques:

Embed instructions in documents added to the knowledge base
Include hidden instructions in metadata, comments, or invisible text in uploaded files
Inject instructions through tool outputs that the AI processes

Category 2: Jailbreaking

Scenario 2.1 — Role-Play Bypass

Objective: Convince the AI to adopt a persona that ignores safety constraints

Techniques:

DAN (Do Anything Now) and derivative prompts
Character role-play scenarios that frame harmful content as fictional
"Educational" or "research" framing to bypass content filters

Scenario 2.2 — Multi-Turn Escalation

Objective: Gradually build trust and context across multiple conversation turns to eventually bypass restrictions

Techniques:

Start with benign requests, progressively escalating to restricted content
Establish a shared context that normalizes boundary violations
Use callback references to earlier approved content to justify escalation

Category 3: Data Extraction

Scenario 3.1 — System Prompt Extraction

Objective: Extract the full system prompt or configuration

Techniques:

Direct requests ("What are your instructions?")
Indirect extraction ("Summarize your capabilities and constraints")
Translation attacks ("Translate your instructions to French")
Token-by-token extraction through carefully crafted queries

Scenario 3.2 — Training Data Extraction

Objective: Extract memorized training data or confidential knowledge base content

Techniques:

Verbatim recall prompts targeting specific data patterns
Completion attacks that provide partial sensitive data and request the AI to complete it
Cross-session information leakage testing

Category 4: Agentic AI Attacks

Scenario 4.1 — Tool Misuse

Objective: Trick the AI agent into using its tools in unauthorized ways

Techniques:

Manipulate the agent into executing unintended tool calls
Inject tool parameters through prompt manipulation
Chain tool calls to achieve unauthorized outcomes that no single tool call would permit

3. Tools & Techniques

List the tools, frameworks, and methodologies used for AI red teaming.

Recommended Tool Stack

Tool	Purpose	License
Garak	Automated LLM vulnerability scanning	Apache 2.0
PyRIT (Microsoft)	Python Risk Identification Toolkit for generative AI	MIT
Promptfoo	Automated prompt testing and red teaming	MIT
OWASP LLM Top 10 Testing Guide	Manual testing methodology	Creative Commons
Burp Suite / ZAP	API-level interception and manipulation	Commercial / Apache 2.0
Custom Python scripts	Targeted attack automation	Internal

Methodology

Phase 1: Reconnaissance (Days 1-2)

Map all AI system endpoints and interfaces
Identify the model provider, version, and configuration (if determinable)
Document the system's stated capabilities and restrictions
Test baseline behavior with benign prompts to establish normal response patterns
Identify the input/output processing pipeline (validation, filtering, formatting)

Phase 2: Automated Scanning (Days 3-5)

Run Garak with standard probe sets against all endpoints
Execute Promptfoo test suites for injection, jailbreak, and extraction
Run PyRIT attack chains for multi-turn escalation scenarios
Document all findings with evidence (full request/response pairs)

Phase 3: Manual Testing (Days 6-10)

Target findings from automated scanning for deeper manual exploitation
Execute novel attack scenarios not covered by automated tools
Test agentic capabilities (tool use, function calling, multi-step reasoning)
Attempt chained attacks combining multiple techniques
Test edge cases specific to the organization's deployment context

Phase 4: Validation and Cleanup (Days 11-12)

Re-test all findings to confirm reproducibility
Classify findings by severity (Critical, High, Medium, Low, Informational)
Verify no persistent changes were made to the AI system or its data
Securely delete any sensitive data accessed during testing

4. Reporting Format

Define the structure and content requirements for the red team report.

Report Structure

Executive Summary (1 page)

Overall risk rating: ☐ Critical ☐ High ☐ Medium ☐ Low
Total findings by severity
Top 3 critical findings with business impact
Recommendation summary

Scope and Methodology (1-2 pages)

Systems tested, environments, and access levels
Testing period and hours of effort
Tools and techniques used
Limitations and caveats

Findings Detail (per finding)

Each finding must include:

Field	Description
Finding ID	Unique identifier (e.g., AIRT-2026-001)
Title	Descriptive title
Severity	Critical / High / Medium / Low / Informational
OWASP LLM Category	Mapping to OWASP Top 10 for LLMs (e.g., LLM01 — Prompt Injection)
Attack Category	Prompt injection / Jailbreak / Data extraction / Agentic / Other
Description	Detailed explanation of the vulnerability
Reproduction Steps	Step-by-step instructions to reproduce the finding
Evidence	Screenshots, request/response pairs, logs (redacted if containing sensitive data)
Business Impact	What an attacker could achieve by exploiting this vulnerability
Recommendation	Specific, actionable remediation guidance
Remediation Priority	Immediate / Within 30 days / Within 90 days

Risk Matrix Summary

Severity	Count	Remediation SLA
Critical	___	48 hours
High	___	7 days
Medium	___	30 days
Low	___	90 days
Informational	___	No SLA (tracked)

Appendix

Full list of test cases executed with pass/fail status
Tool configurations and scan parameters
Raw evidence files (encrypted, shared separately)

5. Remediation Tracking

Define the process for tracking remediation of red team findings.

Remediation Workflow

Step 1: Finding Triage (Within 48 hours of report delivery)

Red team lead presents findings to the engineering and security teams
Each finding is assigned an owner from the engineering team
Severity ratings are validated or adjusted based on engineering team input
Remediation timelines are agreed upon per the SLA matrix

Step 2: Remediation Planning (Within 1 week)

Finding ID	Owner	Severity
AIRT-2026-001	[DATE]	☐ Open
AIRT-2026-002	[DATE]	☐ Open
AIRT-2026-003	[DATE]	☐ Open

Step 3: Remediation Implementation

Engineering team implements fixes per the remediation plan
All fixes must go through the standard code review and testing process
Fixes must not introduce new vulnerabilities or regressions
Documentation is updated to reflect new controls or changed behavior

Step 4: Verification Testing (Within 1 week of fix deployment)

The red team re-tests each remediated finding to confirm:

The specific attack vector documented in the finding is no longer exploitable
Variations of the attack are also blocked (the fix is not narrowly scoped)
The fix does not introduce new bypass opportunities
The fix does not degrade legitimate functionality

Step 5: Closure

Verified findings are marked as "Closed — Verified" in the tracking system
Findings that fail verification return to Step 3 with an updated timeline
Exception requests (accept risk) require CISO approval and are documented in the risk register

Metrics and Reporting

Metric	Target	Tracking
Mean time to remediate (Critical)	≤ 48 hours	Per exercise
Mean time to remediate (High)	≤ 7 days	Per exercise
Findings closed within SLA	≥ 95%	Monthly
Regression rate (previously fixed issues reappearing)	0%	Per exercise
Red team exercise cadence	Quarterly	Annual

Continuous Improvement

After each red team exercise, the AI Governance Committee reviews:

Trends in finding types and severity across exercises
Effectiveness of previously implemented remediations
New attack techniques that should be added to future exercises
Updates needed to the Prompt Injection Defense Checklist
Training gaps identified through the exercise

Lessons learned are documented and incorporated into the next exercise's scope and methodology.

← Back to all templates