Adversarial Attacks on LLM-as-a-Judge Systems

Authors: N. Maloyan, D. Namiot
Published: arXiv preprint, 2025
Adversarial Attacks LLM Evaluation Security

Abstract

This paper presents a comprehensive analysis of adversarial prompt injection attacks against LLM-as-a-Judge evaluation systems. We demonstrate multiple attack vectors and quantify their effectiveness across different judge architectures.

Background

Automated evaluation using LLMs has become a cornerstone of modern AI development. From ranking chatbot responses in arena-style competitions to generating reward signals for reinforcement learning from human feedback, LLM judges are now embedded in critical decision-making loops. Their appeal is clear: they scale far beyond human annotation capacity while producing evaluations that correlate reasonably well with human preferences.

Yet this growing reliance creates a high-value target for adversarial manipulation. Unlike traditional evaluation metrics (BLEU, ROUGE, perplexity), LLM judges process natural language and are therefore susceptible to the same prompt injection attacks that affect any language model. An adversary who can influence a judge's evaluation gains leverage over model rankings, training signals, and content moderation decisions -- all without directly compromising the underlying model being evaluated.

This paper provides a comprehensive treatment of adversarial attacks specifically targeting LLM-as-a-Judge systems. We go beyond demonstrating that attacks are possible and instead focus on characterizing which attack strategies are most effective, which architectural choices make judges more or less vulnerable, and what practical defenses can be deployed.

Methodology

Our study systematically explores the intersection of attack strategies and judge architectures. On the attack side, we developed a taxonomy of prompt injection techniques organized along two axes: directness (whether the injection explicitly requests a favorable score or subtly biases the evaluation) and visibility (whether the injection is easily detectable by a human reviewer). This taxonomy includes direct instruction injection, authority impersonation, criteria anchoring, response padding with evaluative language, and steganographic approaches that hide payloads in formatting or whitespace.

On the architecture side, we evaluated several common judge configurations: single-model pointwise scoring, pairwise preference selection, reference-based evaluation (where the judge compares against a gold answer), and chain-of-thought evaluation (where the judge is instructed to reason step-by-step before scoring). Each configuration was tested with multiple underlying models and system prompts to separate architectural vulnerabilities from model-specific ones.

We measured attack effectiveness using both score displacement (the change in evaluation score caused by the injection) and decision flip rate (the proportion of cases where the injection changes which response is preferred). These metrics capture both the magnitude and practical impact of successful attacks.

Key Findings

Results

Our experiments revealed several important patterns. First, the effectiveness of injection attacks varies substantially across judge architectures, but no architecture we tested was immune. Pointwise scoring systems were the most vulnerable, as the judge has no comparative anchor and its score can be shifted by evaluative language embedded in the response. Pairwise comparison systems showed moderate vulnerability -- the binary nature of the decision means that even small biases can flip outcomes, but the presence of a comparison response provides some grounding.

Chain-of-thought evaluation, where the judge reasons through its assessment before assigning a score, provided the most interesting results. In some cases, the explicit reasoning step helped the judge resist manipulation by forcing it to articulate criteria that conflicted with the injected content. However, in other cases, the injection was incorporated into the chain of thought itself, leading to rationalized but manipulated scores. This suggests that chain-of-thought evaluation is not a reliable defense on its own but may be a useful component of a broader defense strategy.

Among defense strategies, we found that combining output sanitization (removing suspicious patterns from evaluated content) with judge-specific fine-tuning (training the judge model to resist common injection patterns) yielded the strongest protection. However, adaptive attackers who are aware of the sanitization rules can craft payloads that evade detection, highlighting the ongoing arms race dynamic inherent in adversarial security.

Discussion

This work reveals that the security of LLM-as-a-Judge systems cannot be treated as an afterthought. As these systems become more deeply integrated into AI development and deployment pipelines, their vulnerabilities propagate into downstream decisions -- model rankings, training data curation, content moderation, and alignment evaluation. A compromised judge does not merely produce incorrect scores; it corrupts the feedback loops that shape future model behavior.

Our taxonomy and evaluation framework provide a structured basis for ongoing security assessment. We recommend that organizations deploying LLM judges adopt adversarial evaluation as a standard practice, regularly testing their judge configurations against evolving attack techniques. The field would also benefit from shared benchmarks for judge robustness, analogous to existing benchmarks for model capability, to enable systematic comparison and improvement of evaluation security.

Related Topics

LLM-as-a-Judge Vulnerabilities · Prompt Injection Attacks

📄 Access: arXiv:2504.18333