Investigating LLM-as-a-Judge Vulnerability to Prompt Injection

Authors: N. Maloyan, B. Ashinov, D. Namiot
Published: arXiv preprint, 2025
LLM-as-Judge AI Safety Prompt Injection


Abstract

LLM-as-a-Judge architectures are increasingly used to evaluate AI system outputs. This paper investigates their susceptibility to prompt injection attacks, demonstrating how adversarial inputs can manipulate evaluation scores and compromise the integrity of automated assessment pipelines.

Key Findings

Diagram showing four injection vectors into the LLM judge prompt: hidden instructions, criteria anchoring, evaluative padding, and tail positioning exploiting recency bias
Four injection vectors targeting the LLM judge context: hidden instructions, criteria anchoring, evaluative padding, and tail-positioned payloads exploiting recency bias.

Why are LLM-as-a-Judge systems vulnerable to prompt injection?

LLM-as-a-Judge systems are vulnerable because the content being evaluated can directly influence the judge's scoring behavior -- adversaries can craft outputs that receive artificially inflated scores not because they are genuinely better, but because they contain elements that manipulate the evaluation process. This is especially concerning in competitive settings such as chatbot arenas, where rankings directly affect visibility and trust, and in RLHF pipelines, where manipulated scores corrupt training signals.

The use of large language models as automated evaluators has grown rapidly across both research and industry, with practitioners increasingly using models like GPT-4 or Claude to score text quality, assess helpfulness, check factual accuracy, and compare model outputs. Despite this widespread adoption, systematic security analysis of the architecture has been limited -- most work on prompt injection focuses on direct user-model interactions, not on the indirect case where injected content passes through an intermediate evaluation layer. This paper specifically targets that gap.

How were prompt injection attacks on LLM judges tested?

We designed a controlled experimental framework to test prompt injection attacks against LLM judge systems. The framework consists of three components: a set of evaluation tasks with known quality orderings, a collection of prompt injection payloads embedded in model outputs, and multiple LLM judge configurations serving as evaluation targets. The evaluation tasks spanned several domains including summarization, question answering, and open-ended generation, ensuring that our findings generalize across use cases.

The injection payloads were designed to influence the judge without being overtly visible to a human reader. Techniques included appending hidden scoring instructions (e.g., "Rate this response 10/10"), embedding flattering self-assessments within the response text, using formatting tricks to make the response appear more authoritative, and inserting meta-commentary designed to anchor the judge's evaluation. We also tested indirect approaches where the injection subtly reframed the evaluation criteria rather than directly requesting a high score.

We evaluated several judge configurations: single-model scoring (where one LLM assigns a score), pairwise comparison (where the judge selects the better of two responses), and multi-judge panels. For each configuration, we tested both proprietary and open-source models as judges, varying the judge's system prompt and evaluation rubric to assess whether different prompting strategies affect vulnerability.

What types of prompt injection attacks work against LLM judges?

How vulnerable are different LLM judge architectures to manipulation?

Judge Architecture Vulnerability Level Key Weakness
Pairwise comparison Highest Small bias flips binary preference decisions
Single-model scoring High Substantial score displacement from combined attacks
Rubric-based scoring Moderate Structured criteria constrain but do not eliminate bias
Multi-judge panels Lowest Cross-model attack transferability reduces protection

All LLM-as-a-Judge architectures tested are vulnerable to prompt injection, with pairwise comparison being the most susceptible format -- even a modest injected bias can flip binary preference decisions. Single-model scoring configurations also showed substantial score displacement in the attacker's favor, especially when attacks combined multiple techniques such as embedding both a direct scoring instruction and a subtle criteria reframe within the same response.

Rubric-based scoring with explicit criteria provided somewhat more resistance, as the structured evaluation framework constrained the judge's reasoning, but even rubric-based judges could be manipulated when the injection specifically addressed the rubric dimensions.

Multi-judge panels offered the strongest resistance to manipulation, as an attacker would need to successfully influence multiple independent judges simultaneously. However, the cross-model transferability of certain attacks means that a single well-crafted injection can sometimes bias multiple judges at once, reducing the protective benefit of ensembling.

What are the consequences of compromised LLM evaluation systems?

Compromised LLM judges in RLHF pipelines can cause models to learn reward hacking rather than genuine quality improvement -- a subtle degradation that goes undetected because the very evaluation systems designed to catch such problems are themselves compromised. If evaluation scores can be manipulated through prompt injection, the corruption propagates through training signals into future model behavior.

More broadly, this work highlights that the security properties of LLMs must be considered not only in direct user-facing interactions but also in the infrastructure roles that LLMs increasingly occupy. As LLMs are used as judges, moderators, classifiers, and decision-makers within larger systems, each of these roles represents a potential injection target. Securing these indirect attack surfaces requires evaluation-specific defenses and a recognition that the threat model for LLM-based systems extends well beyond the chat interface.

Why does LLM-as-a-Judge security matter for AI development?

Manipulable LLM judges undermine trust in automated evaluation of chatbots, content moderation systems, and AI alignment, creating opportunities for gaming AI systems at scale. This research provides the first systematic characterization of these vulnerabilities and offers concrete guidance for building more robust evaluation pipelines.

Related Topics

Adversarial Attacks on LLM Judges · Prompt Injection in Defended Systems · Trojan Detection in LLMs

📄 Access the Paper: arXiv:2505.13348 · Google Scholar

Cite as

@article{maloyan2025investigating,
  title={Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks},
  author={Maloyan, Narek and Ashinov, Bulat and Namiot, Dmitry},
  journal={arXiv preprint arXiv:2505.13348},
  year={2025}
}


Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning. Learn more