System Prompt Protection

Tier 2 MODEL

Related Templates

What This Requires

Protect system-level prompts, embedded credentials, API keys, and operational instructions from extraction or disclosure through user interactions. System prompts must be treated as confidential configuration and defended against direct extraction attempts, indirect inference attacks, and side-channel leakage through model behavior.

Why It Matters

System prompts often contain business logic, behavioral constraints, persona definitions, and sometimes embedded credentials that grant the model access to internal systems. If extracted, attackers gain a roadmap for bypassing safety controls, impersonating the AI system, or accessing connected resources. System prompt leakage has been demonstrated against major commercial AI products and is a well-documented attack vector.

How To Implement

Credential Isolation

Never embed API keys, database credentials, or access tokens directly in system prompts. Use secure credential injection at runtime through environment variables or secrets managers. Ensure the model interacts with external services through authenticated middleware rather than holding credentials in its context window.

Prompt Hardening

Design system prompts with extraction-resistant techniques: include explicit instructions to refuse prompt disclosure requests, avoid placing sensitive logic in easily extractable positions, and use multi-layer prompt architectures where critical instructions are reinforced at multiple points.

Extraction Detection

Deploy output monitoring that detects when model responses contain fragments of the system prompt. Use string similarity matching and semantic comparison between outputs and system prompt content. Alert on detected extraction attempts and log the full interaction for forensic analysis.

Periodic Extraction Testing

Conduct quarterly system prompt extraction testing using published techniques (instruction override, role-play scenarios, token-by-token extraction, encoding tricks). Document successful extractions, remediate promptly, and update hardening techniques based on findings.

Evidence & Audit

  • System prompt inventory confirming no embedded credentials
  • Credential injection architecture documentation (secrets manager integration)
  • System prompt hardening guidelines and implementation records
  • Extraction detection rule configuration and alert logs
  • Quarterly extraction testing reports with findings and remediation
  • Incident records for confirmed system prompt leakage events

Related Controls