Prompt Injection Attacks in Defended Systems

Authors: D. Khomsky, N. Maloyan, B. Nutfullin
Published: International Conference on Distributed Computer and Communication Networks (DCCN), 2024
Prompt Injection LLM Security Adversarial ML


Abstract

This paper investigates the effectiveness of prompt injection attacks against large language models (LLMs) that employ defensive mechanisms. We evaluate multiple attack strategies across various defended systems, analyzing success rates and identifying vulnerabilities that persist despite protective measures.

Key Findings

Diagram showing how multi-step compositional attacks bypass layered LLM defenses including input filters, safety alignment, and output classifiers by spreading benign-seeming payloads across multiple turns
Multi-step compositional attacks bypass layered defenses by spreading individually benign payloads across turns.

What are prompt injection attacks and how do LLMs defend against them?

Prompt injection involves crafting inputs that override or subvert a model's intended instructions, causing it to produce outputs that violate its operational constraints. Current defenses include system prompt hardening, input sanitization filters, output classifiers that detect policy violations, and instruction hierarchy approaches that give system-level instructions higher priority than user inputs -- applied either at the model level through fine-tuning and RLHF, or as external guardrails wrapping the model's API.

However, the effectiveness of these defenses under sustained, adversarial pressure remains poorly understood. Most defense evaluations test against a narrow set of known attack patterns, leaving open the question of how well they generalize. As LLMs have moved from research prototypes into production systems handling sensitive tasks -- customer service, code generation, document summarization -- the stakes of successful injection attacks have grown considerably. This paper addresses that gap by systematically evaluating a broad spectrum of prompt injection techniques against models equipped with multiple layers of defense.

How were prompt injection attacks tested against defended LLM systems?

We constructed an evaluation framework that pairs diverse prompt injection strategies with several categories of defended LLM systems. The attack strategies ranged from simple direct injections (e.g., "ignore previous instructions and...") to more sophisticated approaches including payload splitting, context manipulation, role-playing exploits, and encoding-based obfuscation techniques. Each attack was formulated in multiple variants to account for surface-level pattern matching by defenses.

On the defense side, we evaluated systems employing input preprocessing filters, instruction-tuned models with safety alignment, models augmented with external classifier guardrails, and systems using structured prompting techniques designed to isolate user input from system instructions. The defended systems were tested as black boxes, reflecting real-world deployment conditions where attackers do not have access to model weights or defense configurations.

Success was measured along multiple dimensions: whether the attack caused the model to deviate from its instructions, whether it produced explicitly prohibited content, and whether the deviation was detectable by the defense layer itself. This multi-dimensional evaluation provides a more nuanced picture than simple binary success/failure metrics.

Which prompt injection techniques bypass LLM defenses?

How well do current LLM defenses resist prompt injection attacks?

Pattern-matching and surface-level input defenses are substantially less robust than semantic-level defenses, and no single defense category provides comprehensive protection. While direct injection attempts were generally well-handled, attacks that embedded payloads within seemingly benign context, split instructions across multiple turns, or used encoding tricks to bypass input filters achieved markedly higher success rates against all defended systems tested.

Among the defense categories tested, instruction-tuned models with safety alignment showed the strongest baseline resistance, but were still vulnerable to role-playing and context-switching attacks. External classifier guardrails caught many policy-violating outputs but introduced latency and could themselves be bypassed when the model's output was crafted to appear compliant while still achieving the attacker's objective.

A particularly notable finding was the effectiveness of compositional attacks -- sequences of individually benign-seeming prompts that, taken together, induce policy-violating behavior. These attacks exploit the model's context window and its tendency to maintain conversational coherence, effectively building up a context that makes the final injection appear natural rather than adversarial.

Why is defense-in-depth essential for LLM security?

No single defensive mechanism should be treated as sufficient for LLM security. A defense-in-depth approach -- combining input analysis, model-level alignment, output classification, and careful system architecture that limits the blast radius of successful injections -- is the only viable strategy, alongside continuous adversarial testing using evolving attack methodologies rather than static benchmark evaluations.

The fundamental challenge is that the same flexibility and contextual understanding that make language models useful also make them difficult to constrain. Defenses must contend with an enormous space of possible inputs, while attackers need only find a single path through. This asymmetry is compounded by the fact that many defenses are developed against known attack patterns and do not generalize well to novel strategies.

What does this research mean for production LLM deployments?

Current defensive mechanisms for LLMs require significant improvement before they can be trusted in production environments. Developers deploying LLMs need standardized evaluation frameworks that test defenses against a comprehensive and evolving set of adversarial techniques, rather than relying on static benchmarks that give a false sense of security.

Related Topics

LLM-as-a-Judge Vulnerabilities · Adversarial Attacks on LLM Judges · Trojan Detection in LLMs


Cite as

@inproceedings{khomsky2024prompt,
  title={Prompt Injection Attacks in Defended Systems},
  author={Khomsky, Dmitry and Maloyan, Narek and Nutfullin, Bulat},
  booktitle={DCCN},
  year={2024}
}


Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning. Learn more