Adversarial Attacks on LLM-as-a-Judge Systems

Authors: N. Maloyan, D. Namiot
Published: arXiv preprint, 2025
Tags: Adversarial Attacks, LLM Evaluation, Security


Abstract

This paper presents a comprehensive analysis of adversarial prompt injection attacks against LLM-as-a-Judge evaluation systems. We demonstrate multiple attack vectors and quantify their effectiveness across different judge architectures.

Key Findings

Figure: Normal vs. adversarial evaluation. In normal evaluation the LLM judge produces fair scores; in adversarial evaluation, payloads injected into model responses manipulate the judge into producing biased scores, potentially corrupting downstream RLHF training signals.

Background

LLM judges are now embedded in critical decision-making loops -- from ranking chatbot responses in arena-style competitions to generating reward signals for RLHF -- and this growing reliance creates a high-value target for adversarial manipulation. Unlike traditional evaluation metrics (BLEU, ROUGE, perplexity), LLM judges process natural language and are therefore susceptible to the same prompt injection attacks that affect any language model.

An adversary who can influence a judge's evaluation gains leverage over model rankings, training signals, and content moderation decisions -- all without directly compromising the underlying model being evaluated. LLM judges are attractive because they scale far beyond human annotation capacity while producing evaluations that correlate reasonably well with human preferences, but that utility comes with inherent security tradeoffs.

This paper provides a comprehensive treatment of adversarial attacks specifically targeting LLM-as-a-Judge systems. We go beyond demonstrating that attacks are possible and instead focus on characterizing which attack strategies are most effective, which architectural choices make judges more or less vulnerable, and what practical defenses can be deployed.

Methodology

Our study systematically explores the intersection of attack strategies and judge architectures. On the attack side, we developed a taxonomy of prompt injection techniques organized along two axes: directness (whether the injection explicitly requests a favorable score or subtly biases the evaluation) and visibility (whether the injection is easily detectable by a human reviewer). This taxonomy includes direct instruction injection, authority impersonation, criteria anchoring, response padding with evaluative language, and steganographic approaches that hide payloads in formatting or whitespace.
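As a rough illustration of how the two-axis taxonomy could be represented in code, the sketch below tags each technique named above with a directness and a visibility value. The class names, the `select` helper, and above all the axis placement of each technique are assumptions made for illustration; the summary states only that steganographic payloads are hidden and that direct injection is explicit, so the other placements should not be read as the paper's classification.

```python
from dataclasses import dataclass
from enum import Enum


class Directness(Enum):
    EXPLICIT = "explicitly requests a favorable score"
    SUBTLE = "subtly biases the evaluation"


class Visibility(Enum):
    VISIBLE = "easily detectable by a human reviewer"
    HIDDEN = "hard for a human reviewer to spot"


@dataclass(frozen=True)
class AttackTechnique:
    name: str
    directness: Directness
    visibility: Visibility


# ASSUMPTION: these axis placements are illustrative guesses,
# not the classification reported in the paper.
TAXONOMY = [
    AttackTechnique("direct instruction injection", Directness.EXPLICIT, Visibility.VISIBLE),
    AttackTechnique("authority impersonation", Directness.EXPLICIT, Visibility.VISIBLE),
    AttackTechnique("criteria anchoring", Directness.SUBTLE, Visibility.VISIBLE),
    AttackTechnique("response padding with evaluative language", Directness.SUBTLE, Visibility.VISIBLE),
    AttackTechnique("steganographic payloads", Directness.EXPLICIT, Visibility.HIDDEN),
]


def select(directness=None, visibility=None):
    """Return the techniques matching the given axis values (None matches any)."""
    return [
        t for t in TAXONOMY
        if (directness is None or t.directness is directness)
        and (visibility is None or t.visibility is visibility)
    ]
```

Calling `select(visibility=Visibility.HIDDEN)` would return only the steganographic entry under these assumed placements.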

On the architecture side, we evaluated several common judge configurations: single-model pointwise scoring, pairwise preference selection, reference-based evaluation (where the judge compares against a gold answer), and chain-of-thought evaluation (where the judge is instructed to reason step-by-step before scoring). Each configuration was tested with multiple underlying models and system prompts to separate architectural vulnerabilities from model-specific ones.
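To make the four configurations concrete, here is a minimal sketch of how their judge prompts might be assembled. The prompt wording, the 1 to 10 scale, and the names `JudgeMode` and `build_judge_prompt` are hypothetical and are not taken from the paper. The point it illustrates is that the evaluated response is interpolated verbatim in every mode, which is where an injected payload reaches the judge.

```python
from enum import Enum
from typing import Optional


class JudgeMode(Enum):
    POINTWISE = "pointwise"
    PAIRWISE = "pairwise"
    REFERENCE = "reference"
    CHAIN_OF_THOUGHT = "chain_of_thought"


def build_judge_prompt(
    mode: JudgeMode,
    question: str,
    response: str,
    other_response: Optional[str] = None,
    reference: Optional[str] = None,
) -> str:
    """Assemble a judge prompt; the wording here is illustrative only."""
    # The untrusted response text is inserted as-is in every configuration.
    parts = [f"Question:\n{question}", f"Response A:\n{response}"]
    if mode is JudgeMode.PAIRWISE:
        if other_response is None:
            raise ValueError("pairwise mode needs other_response")
        parts.append(f"Response B:\n{other_response}")
        task = "Reply with the single letter, A or B, of the better response."
    elif mode is JudgeMode.REFERENCE:
        if reference is None:
            raise ValueError("reference mode needs a gold answer")
        parts.append(f"Gold answer:\n{reference}")
        task = "Score Response A from 1 to 10 by how well it matches the gold answer."
    elif mode is JudgeMode.CHAIN_OF_THOUGHT:
        task = "Reason step by step about Response A, then give a score from 1 to 10 on the final line."
    else:
        task = "Score Response A from 1 to 10."
    return "\n\n".join(parts + [f"Task:\n{task}"])
```

Keeping the prompt builder separate from the model call makes it straightforward to hold the underlying model fixed while varying only the configuration, or the reverse.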

We measured attack effectiveness using both score displacement (the change in evaluation score caused by the injection) and decision flip rate (the proportion of cases where the injection changes which response is preferred). These metrics capture both the magnitude and practical impact of successful attacks.
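The two metrics can be sketched as follows. This assumes that score displacement is averaged as attacked score minus clean score over paired runs, and that preferences are recorded as labels such as "A" or "B"; the summary does not give the exact aggregation, so both choices and the function names are assumptions.

```python
from statistics import mean


def score_displacement(clean_scores, attacked_scores):
    """Mean change in judge score caused by the injection (attacked minus clean)."""
    if len(clean_scores) != len(attacked_scores):
        raise ValueError("score lists must have the same length")
    if not clean_scores:
        raise ValueError("need at least one scored pair")
    return mean(a - c for c, a in zip(clean_scores, attacked_scores))


def decision_flip_rate(clean_prefs, attacked_prefs):
    """Fraction of comparisons whose preferred response changes under attack."""
    if len(clean_prefs) != len(attacked_prefs):
        raise ValueError("preference lists must have the same length")
    if not clean_prefs:
        raise ValueError("need at least one comparison")
    flips = sum(1 for c, a in zip(clean_prefs, attacked_prefs) if c != a)
    return flips / len(clean_prefs)


# Made-up numbers, purely to show the calls.
print(score_displacement([6, 7, 5, 8], [8, 7, 9, 8]))  # 1.5
print(decision_flip_rate(["A", "B", "A", "A"], ["B", "B", "A", "B"]))  # 0.5
```

A signed mean lets displacements in opposite directions cancel; averaging absolute differences instead would measure how much the judge moves regardless of direction.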

Results

No judge architecture we tested was immune to adversarial manipulation. Pointwise scoring was the most vulnerable: the judge has no comparative anchor, so evaluative language embedded in the response can shift its score. Pairwise comparison systems showed moderate vulnerability -- even small injected biases can flip binary outcomes, though the presence of a comparison response provides some grounding. The most effective attacks were also the least visible to human inspection, which makes detection hardest exactly where the attacks work best.

Chain-of-thought evaluation, where the judge reasons through its assessment before assigning a score, produced mixed results. In some cases, the explicit reasoning step helped the judge resist manipulation by forcing it to articulate criteria that conflicted with the injected content. In other cases, the injection was incorporated into the chain of thought itself, leading to rationalized but manipulated scores. This suggests that chain-of-thought evaluation is not a reliable defense on its own but may be a useful component of a broader defense strategy.

Among defense strategies, we found that combining output sanitization (removing suspicious patterns from evaluated content) with judge-specific fine-tuning (training the judge model to resist common injection patterns) yielded the strongest protection. However, adaptive attackers who are aware of the sanitization rules can craft payloads that evade detection, highlighting the ongoing arms race dynamic inherent in adversarial security.
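A minimal sketch of the output sanitization step might look like the following. The regular expressions, the `[removed]` placeholder, and the `sanitize` function are hypothetical examples and not the rules used in the paper. As noted above, an attacker who knows such a fixed pattern list can rephrase a payload to slip past it.

```python
import re
import unicodedata

# ASSUMPTION: example patterns only; a real deployment would need its own list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"(rate|score|grade) this (response|answer) (as )?(a )?\d+", re.IGNORECASE),
    re.compile(r"^\s*(system|assistant)\s*:", re.IGNORECASE | re.MULTILINE),
]

# Zero-width characters that can hide or split a payload.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def sanitize(response: str) -> tuple[str, list[str]]:
    """Strip zero-width characters, normalize, and mask suspicious spans.

    Returns the cleaned text and the patterns that matched.
    """
    text = unicodedata.normalize("NFKC", response.translate(ZERO_WIDTH))
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            hits.append(pattern.pattern)
            text = pattern.sub("[removed]", text)
    return text, hits
```

Returning the matched patterns alongside the cleaned text allows a caller to log or reject flagged responses instead of silently scoring the sanitized version.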

Discussion

A compromised LLM judge does not merely produce incorrect scores -- it corrupts the feedback loops that shape future model behavior, propagating vulnerabilities into model rankings, training data curation, content moderation, and alignment evaluation. The security of LLM-as-a-Judge systems cannot be treated as an afterthought as these systems become more deeply integrated into AI development and deployment pipelines.

Our taxonomy and evaluation framework provide a structured basis for ongoing security assessment. We recommend that organizations deploying LLM judges adopt adversarial evaluation as a standard practice, regularly testing their judge configurations against evolving attack techniques. The field would also benefit from shared benchmarks for judge robustness, analogous to existing benchmarks for model capability, to enable systematic comparison and improvement of evaluation security.

Related Topics

LLM-as-a-Judge Vulnerabilities · Prompt Injection Attacks

Access: arXiv:2504.18333

Cite as

@article{maloyan2025adversarial,
  title={Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections},
  author={Maloyan, Narek and Namiot, Dmitry},
  journal={arXiv preprint arXiv:2504.18333},
  year={2025}
}


Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning.