Adversarial Attacks on LLM-as-a-Judge Systems

Authors: N. Maloyan, D. Namiot
Published: arXiv preprint, 2025
Adversarial Attacks LLM Evaluation Security

Abstract

This paper presents a comprehensive analysis of adversarial prompt injection attacks against LLM-as-a-Judge evaluation systems. We demonstrate multiple attack vectors and quantify their effectiveness across different judge architectures.

Key Findings

Attack taxonomy: Classification of injection strategies by effectiveness
Vulnerability analysis: Weak points in common judge architectures
Defense evaluation: Testing existing protective measures
Recommendations: Guidelines for building robust evaluation systems

Abstract

Key Findings

Related Topics