Adversarial Attacks on LLM-as-a-Judge Systems

Narek Maloyan; Dmitry Namiot

Adversarial Attacks on LLM-as-a-Judge Systems

Authors: N. Maloyan, D. Namiot
Published: arXiv preprint, 2025
Adversarial Attacks LLM Evaluation Security
Published: April 2025
Last updated: April 15, 2026

Abstract

This paper presents a comprehensive analysis of adversarial prompt injection attacks against LLM-as-a-Judge evaluation systems. We demonstrate multiple attack vectors and quantify their effectiveness across different judge architectures.

Key Findings

Indirect injection outperforms direct: Indirect techniques (criteria anchoring, evaluative padding) prove more robust across judge configurations than direct instruction injection, making them harder to filter with simple pattern matching.
Pointwise scoring most vulnerable: Pointwise scoring is the most exploitable architecture because the judge has no comparative anchor; chain-of-thought evaluation provides partial but inconsistent protection as injections can be incorporated into the reasoning itself.
Effectiveness inversely correlated with visibility: The most effective attacks are the least visible to human inspection, creating a difficult detection-versus-effectiveness tradeoff for quality assurance teams.
Strongest defense requires combining strategies: Combining output sanitization with judge-specific fine-tuning yields the strongest protection, but adaptive attackers aware of sanitization rules can craft evasive payloads.
No architecture is immune: No tested judge architecture was fully resistant to adversarial manipulation -- defense-in-depth with continuous adversarial red-teaming is the only viable strategy.

Diagram comparing normal LLM judge evaluation producing fair scores with adversarial evaluation where injected payloads in model responses manipulate the judge into producing biased scores, corrupting RLHF training signals — Normal vs. adversarial evaluation: injected payloads in responses manipulate judge scoring, potentially corrupting downstream RLHF training.

Background

LLM judges are now embedded in critical decision-making loops -- from ranking chatbot responses in arena-style competitions to generating reward signals for RLHF -- and this growing reliance creates a high-value target for adversarial manipulation. Unlike traditional evaluation metrics (BLEU, ROUGE, perplexity), LLM judges process natural language and are therefore susceptible to the same prompt injection attacks that affect any language model.

An adversary who can influence a judge's evaluation gains leverage over model rankings, training signals, and content moderation decisions -- all without directly compromising the underlying model being evaluated. LLM judges appeal because they scale far beyond human annotation capacity while producing evaluations that correlate reasonably well with human preferences, but this utility comes with inherent security tradeoffs.

This paper provides a comprehensive treatment of adversarial attacks specifically targeting LLM-as-a-Judge systems. We go beyond demonstrating that attacks are possible and instead focus on characterizing which attack strategies are most effective, which architectural choices make judges more or less vulnerable, and what practical defenses can be deployed.

Methodology

Our study systematically explores the intersection of attack strategies and judge architectures. On the attack side, we developed a taxonomy of prompt injection techniques organized along two axes: directness (whether the injection explicitly requests a favorable score or subtly biases the evaluation) and visibility (whether the injection is easily detectable by a human reviewer). This taxonomy includes direct instruction injection, authority impersonation, criteria anchoring, response padding with evaluative language, and steganographic approaches that hide payloads in formatting or whitespace.

On the architecture side, we evaluated several common judge configurations: single-model pointwise scoring, pairwise preference selection, reference-based evaluation (where the judge compares against a gold answer), and chain-of-thought evaluation (where the judge is instructed to reason step-by-step before scoring). Each configuration was tested with multiple underlying models and system prompts to separate architectural vulnerabilities from model-specific ones.

We measured attack effectiveness using both score displacement (the change in evaluation score caused by the injection) and decision flip rate (the proportion of cases where the injection changes which response is preferred). These metrics capture both the magnitude and practical impact of successful attacks.

Key Findings

Attack taxonomy: Classification of injection strategies by effectiveness, with indirect techniques (criteria anchoring, evaluative padding) proving more robust across judge configurations than direct instruction injection
Vulnerability analysis: Weak points in common judge architectures -- pointwise scoring is most vulnerable, while chain-of-thought evaluation provides partial but inconsistent protection
Visibility tradeoff: The most effective attacks are often the least visible to human inspection, creating a difficult detection problem for quality assurance
Defense evaluation: Testing existing protective measures, including output sanitization, instruction repetition, and adversarial training of the judge model
Recommendations: Guidelines for building robust evaluation systems, emphasizing defense-in-depth and the importance of adversarial red-teaming during judge development

Results

No judge architecture we tested was immune to adversarial manipulation, with pointwise scoring being the most vulnerable because the judge has no comparative anchor and its score can be shifted by evaluative language embedded in the response. Pairwise comparison systems showed moderate vulnerability -- even small injected biases can flip binary outcomes, though the presence of a comparison response provides some grounding. The most effective attacks were the least visible to human inspection, creating a difficult detection-versus-effectiveness tradeoff.

Chain-of-thought evaluation, where the judge reasons through its assessment before assigning a score, provided the most interesting results. In some cases, the explicit reasoning step helped the judge resist manipulation by forcing it to articulate criteria that conflicted with the injected content. However, in other cases, the injection was incorporated into the chain of thought itself, leading to rationalized but manipulated scores. This suggests that chain-of-thought evaluation is not a reliable defense on its own but may be a useful component of a broader defense strategy.

Among defense strategies, we found that combining output sanitization (removing suspicious patterns from evaluated content) with judge-specific fine-tuning (training the judge model to resist common injection patterns) yielded the strongest protection. However, adaptive attackers who are aware of the sanitization rules can craft payloads that evade detection, highlighting the ongoing arms race dynamic inherent in adversarial security.

Discussion

A compromised LLM judge does not merely produce incorrect scores -- it corrupts the feedback loops that shape future model behavior, propagating vulnerabilities into model rankings, training data curation, content moderation, and alignment evaluation. The security of LLM-as-a-Judge systems cannot be treated as an afterthought as these systems become more deeply integrated into AI development and deployment pipelines.

Our taxonomy and evaluation framework provide a structured basis for ongoing security assessment. We recommend that organizations deploying LLM judges adopt adversarial evaluation as a standard practice, regularly testing their judge configurations against evolving attack techniques. The field would also benefit from shared benchmarks for judge robustness, analogous to existing benchmarks for model capability, to enable systematic comparison and improvement of evaluation security.

Cite as

@article{maloyan2025adversarial,
  title={Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections},
  author={Maloyan, Narek and Namiot, Dmitry},
  journal={arXiv preprint arXiv:2504.18333},
  year={2025}
}

Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning. Learn more

Abstract

Key Findings

Background

Methodology

Key Findings

Results

Discussion

Related Topics

Cite as

Related Research