Prompt Injection Attacks in Defended Systems
Abstract
This paper investigates the effectiveness of prompt injection attacks against large language models (LLMs) that employ defensive mechanisms. We evaluate multiple attack strategies across various defended systems, analyzing success rates and identifying vulnerabilities that persist despite protective measures.
Key Findings
- Defense bypass rates: Certain prompt injection techniques achieve significant success even against defended models
- Attack taxonomy: Classification of injection methods by their effectiveness against specific defenses
- Recommendations: Guidelines for improving LLM security based on identified weaknesses
Research Impact
This work contributes to the growing body of AI safety research by demonstrating that current defensive mechanisms for LLMs require significant improvement. The findings have implications for developers deploying LLMs in production environments.
Related Topics
LLM-as-a-Judge Vulnerabilities · Adversarial Attacks on LLM Judges · Trojan Detection in LLMs
📄 Access the Paper:
View on Google Scholar