Prompt Injection Attacks in Defended Systems
Abstract
This paper investigates the effectiveness of prompt injection attacks against large language models (LLMs) that employ defensive mechanisms. We evaluate multiple attack strategies across various defended systems, analyzing success rates and identifying vulnerabilities that persist despite protective measures.
Key Findings
- Indirect and multi-turn injections bypass most defenses: While direct injection attempts were generally blocked, context manipulation, payload splitting, and multi-turn compositional attacks achieved significantly higher success rates against defended systems.
- Defenses overfitted to known attack patterns: Pattern-matching and surface-level input filters perform well against documented attacks but fail against encoding-based obfuscation and novel injection strategies, indicating poor generalization.
- Compositional attacks are particularly effective: Sequences of individually benign-seeming prompts that build adversarial context across turns exploit conversational coherence to make the final injection appear natural rather than malicious.
- No single defense provides comprehensive protection: Instruction-tuned models with safety alignment showed the strongest baseline resistance but remained vulnerable to role-playing and context-switching attacks; external guardrails added latency while still being bypassable.
- Defense-in-depth is the only viable strategy: Effective protection requires combining input analysis, model-level alignment, output classification, and architectural blast-radius limiting, alongside continuous adversarial testing with evolving methodologies.
What are prompt injection attacks and how do LLMs defend against them?
Prompt injection involves crafting inputs that override or subvert a model's intended instructions, causing it to produce outputs that violate its operational constraints. Current defenses include system prompt hardening, input sanitization filters, output classifiers that detect policy violations, and instruction hierarchy approaches that give system-level instructions higher priority than user inputs -- applied either at the model level through fine-tuning and RLHF, or as external guardrails wrapping the model's API.
However, the effectiveness of these defenses under sustained, adversarial pressure remains poorly understood. Most defense evaluations test against a narrow set of known attack patterns, leaving open the question of how well they generalize. As LLMs have moved from research prototypes into production systems handling sensitive tasks -- customer service, code generation, document summarization -- the stakes of successful injection attacks have grown considerably. This paper addresses that gap by systematically evaluating a broad spectrum of prompt injection techniques against models equipped with multiple layers of defense.
How were prompt injection attacks tested against defended LLM systems?
We constructed an evaluation framework that pairs diverse prompt injection strategies with several categories of defended LLM systems. The attack strategies ranged from simple direct injections (e.g., "ignore previous instructions and...") to more sophisticated approaches including payload splitting, context manipulation, role-playing exploits, and encoding-based obfuscation techniques. Each attack was formulated in multiple variants to account for surface-level pattern matching by defenses.
On the defense side, we evaluated systems employing input preprocessing filters, instruction-tuned models with safety alignment, models augmented with external classifier guardrails, and systems using structured prompting techniques designed to isolate user input from system instructions. The defended systems were tested as black boxes, reflecting real-world deployment conditions where attackers do not have access to model weights or defense configurations.
Success was measured along multiple dimensions: whether the attack caused the model to deviate from its instructions, whether it produced explicitly prohibited content, and whether the deviation was detectable by the defense layer itself. This multi-dimensional evaluation provides a more nuanced picture than simple binary success/failure metrics.
Which prompt injection techniques bypass LLM defenses?
- Defense bypass rates: Certain prompt injection techniques achieve significant success even against defended models, with indirect and multi-turn injection strategies proving particularly difficult to defend against
- Attack taxonomy: Classification of injection methods by their effectiveness against specific defenses, revealing that no single defense strategy provides comprehensive protection
- Defense-attack asymmetry: Defenses that perform well against direct injection often fail against context manipulation and encoding-based attacks, suggesting that current approaches are overfitted to known attack patterns
- Layered defense gaps: Even systems combining multiple defensive mechanisms exhibit exploitable weaknesses when attacks are composed in multi-step sequences
- Recommendations: Guidelines for improving LLM security based on identified weaknesses, including the need for adversarial evaluation during defense development
How well do current LLM defenses resist prompt injection attacks?
Pattern-matching and surface-level input defenses are substantially less robust than semantic-level defenses, and no single defense category provides comprehensive protection. While direct injection attempts were generally well-handled, attacks that embedded payloads within seemingly benign context, split instructions across multiple turns, or used encoding tricks to bypass input filters achieved markedly higher success rates against all defended systems tested.
Among the defense categories tested, instruction-tuned models with safety alignment showed the strongest baseline resistance, but were still vulnerable to role-playing and context-switching attacks. External classifier guardrails caught many policy-violating outputs but introduced latency and could themselves be bypassed when the model's output was crafted to appear compliant while still achieving the attacker's objective.
A particularly notable finding was the effectiveness of compositional attacks -- sequences of individually benign-seeming prompts that, taken together, induce policy-violating behavior. These attacks exploit the model's context window and its tendency to maintain conversational coherence, effectively building up a context that makes the final injection appear natural rather than adversarial.
Why is defense-in-depth essential for LLM security?
No single defensive mechanism should be treated as sufficient for LLM security. A defense-in-depth approach -- combining input analysis, model-level alignment, output classification, and careful system architecture that limits the blast radius of successful injections -- is the only viable strategy, alongside continuous adversarial testing using evolving attack methodologies rather than static benchmark evaluations.
The fundamental challenge is that the same flexibility and contextual understanding that make language models useful also make them difficult to constrain. Defenses must contend with an enormous space of possible inputs, while attackers need only find a single path through. This asymmetry is compounded by the fact that many defenses are developed against known attack patterns and do not generalize well to novel strategies.
What does this research mean for production LLM deployments?
Current defensive mechanisms for LLMs require significant improvement before they can be trusted in production environments. Developers deploying LLMs need standardized evaluation frameworks that test defenses against a comprehensive and evolving set of adversarial techniques, rather than relying on static benchmarks that give a false sense of security.
Related Topics
LLM-as-a-Judge Vulnerabilities · Adversarial Attacks on LLM Judges · Trojan Detection in LLMs
Cite as
@inproceedings{khomsky2024prompt,
title={Prompt Injection Attacks in Defended Systems},
author={Khomsky, Dmitry and Maloyan, Narek and Nutfullin, Bulat},
booktitle={DCCN},
year={2024}
}