Investigating LLM-as-a-Judge Vulnerability to Prompt Injection
Abstract
LLM-as-a-Judge architectures are increasingly used to evaluate AI system outputs. This paper investigates their susceptibility to prompt injection attacks, demonstrating how adversarial inputs can manipulate evaluation scores and compromise the integrity of automated assessment pipelines.
Key Findings
- Attack vectors: Multiple injection techniques that can bias LLM judges
- Success rates: Quantified vulnerability across different judge architectures
- Implications: Risks for automated evaluation in production systems
- Mitigations: Proposed defenses for more robust LLM-based evaluation
Why This Matters
LLM-as-a-Judge systems are used to evaluate chatbots, content moderation systems, and AI alignment. If these judges can be manipulated, it undermines trust in automated evaluation and creates opportunities for gaming AI systems.
Related Topics
Adversarial Attacks on LLM Judges · Prompt Injection in Defended Systems · Trojan Detection in LLMs
📄 Access the Paper:
arXiv:2505.13348 ·
Google Scholar