Investigating LLM-as-a-Judge Vulnerability to Prompt Injection

Authors: N. Maloyan, B. Ashinov, D. Namiot
Published: arXiv preprint, 2025
LLM-as-Judge AI Safety Prompt Injection

Abstract

LLM-as-a-Judge architectures are increasingly used to evaluate AI system outputs. This paper investigates their susceptibility to prompt injection attacks, demonstrating how adversarial inputs can manipulate evaluation scores and compromise the integrity of automated assessment pipelines.

Key Findings

Why This Matters

LLM-as-a-Judge systems are used to evaluate chatbots, content moderation systems, and AI alignment. If these judges can be manipulated, it undermines trust in automated evaluation and creates opportunities for gaming AI systems.

Related Topics

Adversarial Attacks on LLM Judges · Prompt Injection in Defended Systems · Trojan Detection in LLMs

📄 Access the Paper: arXiv:2505.13348 · Google Scholar