Investigating LLM-as-a-Judge Vulnerability to Prompt Injection

Authors: N. Maloyan, B. Ashinov, D. Namiot
Published: arXiv preprint, 2025
LLM-as-Judge AI Safety Prompt Injection

Abstract

LLM-as-a-Judge architectures are increasingly used to evaluate AI system outputs. This paper investigates their susceptibility to prompt injection attacks, demonstrating how adversarial inputs can manipulate evaluation scores and compromise the integrity of automated assessment pipelines.

Key Findings

Attack vectors: Multiple injection techniques that can bias LLM judges
Success rates: Quantified vulnerability across different judge architectures
Implications: Risks for automated evaluation in production systems
Mitigations: Proposed defenses for more robust LLM-based evaluation

Why This Matters

LLM-as-a-Judge systems are used to evaluate chatbots, content moderation systems, and AI alignment. If these judges can be manipulated, it undermines trust in automated evaluation and creates opportunities for gaming AI systems.

Abstract

Key Findings

Why This Matters

Related Topics