Adversarial Attacks on LLM-as-a-Judge Systems
Abstract
This paper presents a comprehensive analysis of adversarial prompt injection attacks against LLM-as-a-Judge evaluation systems. We demonstrate multiple attack vectors and quantify their effectiveness across different judge architectures.
Key Findings
- Attack taxonomy: Classification of injection strategies by effectiveness
- Vulnerability analysis: Weak points in common judge architectures
- Defense evaluation: Testing existing protective measures
- Recommendations: Guidelines for building robust evaluation systems
Related Topics
📄 Access:
arXiv:2504.18333