Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

Authors: N. Maloyan, D. Namiot
Published: arXiv preprint arXiv:2605.13471, 2026
Agent Security · Prompt Injection · Always-on Agents


Abstract

Always-on AI agents (OpenClaw, Hermes Agent) run as a single persistent process under the owner's identity, folding messaging, memory, self-authored skills, scheduling, and shell into one authority boundary. This configuration opens what we call sleeper channels: an untrusted input to one surface persists as a memory, skill, scheduled job, or filesystem patch, then fires later through a different surface with no attacker present. Two independent axes define the class: persistence substrate and firing-separation. We walk a confused-deputy cron attack end-to-end through OpenClaw at a pinned commit. The defense is tiered (D1, D2, D3), and D2 carries a soundness theorem against seven named deployment invariants. D2 keys on a canonical action-instance digest with one-shot owner attestations, defeating paraphrase laundering, multi-input grant reuse, and replay. A companion artifact ships the gate, a static audit over the vendored source, and a runtime adapter realising five of the ten mediation hooks (H1, H2, H3, H6, H9) around the cron path. Empirical evaluation is preregistered as follow-on.

Key Findings


What is a sleeper channel in an always-on AI agent?

A sleeper channel is a persistent prompt-injection path that decouples ingestion from firing. Untrusted input arrives at one surface of an always-on agent — a chat message, a webhook payload, a tool response, an email body — and is written into a long-lived substrate the agent maintains: its memory store, a self-authored skill it can re-invoke, a scheduled job, or a filesystem region it later reads. The payload then sits dormant. When the firing surface activates (a scheduler tick, the model recalling a memory, the agent loading a skill, a later file read) the embedded directive is acted on, with no attacker present and no fresh consent.

Sleeper channels generalise the prompt-injection threat from "what did the model just read" to "what is the agent quietly about to do as a result of something it read days ago." We formalise the class along two independent axes: the persistence substrate — where the payload lives between ingestion and firing — and the firing-separation — how far apart the ingesting surface and the actuating surface are in time, principal, and modality. Both axes matter for defense, because mitigations that check authorisation at ingestion time miss the firing-time gap, and mitigations that gate at firing time miss the latent substrate.

Why are always-on agents like OpenClaw and Hermes Agent particularly exposed?

Always-on agents such as OpenClaw and Hermes Agent are deployed as a single persistent process running under the owner's identity. Inside that one process, five distinct authority surfaces are folded into a single principal: messaging (inbound and outbound user contact), memory (long-lived embeddings and notes), self-authored skills (the agent writes its own callable code), scheduling (the agent installs cron jobs against its own future self), and shell (file and command execution under the owner's account).

Each of these is, in classical security terms, a distinct authority. Collapsing them into one principal means there is no per-surface separation between "who can write to memory" and "who can execute shell commands at 3 a.m.": both happen under the same identity. A sleeper channel exploits exactly that collapse. It does not need to escalate privilege, because every surface it touches already runs with the same authority.

How does the OpenClaw confused-deputy cron attack work?

We walk the attack end-to-end at a pinned OpenClaw commit. An untrusted input persuades the agent to schedule a cron job whose surface description is benign — for example, "remind me every morning to check the weather." The action body, however, is parameterised at firing time on context that the attacker has also seeded into memory or skill state. When the cron fires, the scheduler treats the queued action as having been already authorised at schedule time and releases it. The owner is not present, the original message is gone, and the only evidence of consent is a stale string that no longer represents what the action actually does.

This is a textbook confused deputy: the scheduler holds the owner's authority and acts on it, but the question it answers ("was this queued?") is not the question that should be answered ("did the owner authorise this specific firing-time action?"). Scheduling-time intent and firing-time action drift apart, and the gap is where the attack lives.
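The gap between the two questions can be shown in a few lines. This is a minimal sketch, not OpenClaw's scheduler: the names `QueuedJob`, `expand`, `naive_release`, and `instance_release` are hypothetical, and the job template and memory keys are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    description: str  # benign text the owner saw at schedule time
    template: str     # action body, expanded only at firing time


def expand(job: QueuedJob, memory: dict[str, str]) -> str:
    """Build the concrete action from whatever state exists at firing time."""
    return job.template.format(**memory)


def naive_release(job: QueuedJob, queue: list[QueuedJob]) -> bool:
    """The confused-deputy check: asks only 'was this queued?'."""
    return job in queue


def instance_release(job: QueuedJob, memory: dict[str, str],
                     approved_action: str) -> bool:
    """The question that should be asked: is the action about to run
    the same one the owner approved?"""
    return expand(job, memory) == approved_action


job = QueuedJob(
    description="remind me every morning to check the weather",
    template="send_message(to={recipient!r}, body='Check the weather')",
)
queue = [job]

# State at schedule time: this is what the owner actually agreed to.
memory_at_schedule = {"recipient": "owner"}
approved = expand(job, memory_at_schedule)

# State at firing time: attacker-seeded memory changes the expansion.
memory_at_firing = {"recipient": "attacker@example.invalid"}

print(naive_release(job, queue))                          # queued, so released
print(instance_release(job, memory_at_firing, approved))  # action has drifted
```

The job object never changes between scheduling and firing, which is why a check keyed on the queue entry or its description passes. Only a comparison against the expanded firing-time action notices the drift.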

What are the persistence substrates and firing-separation axes?

The persistence substrate axis enumerates where a sleeper payload can wait. Four substrates cover the cases we observe in always-on agents: memory (embedded notes the model retrieves), skills (callable code or templates the agent has saved for itself), scheduled jobs (cron or one-shot timers), and filesystem patches (durable file edits the agent later reads). The firing-separation axis describes how the actuating surface differs from the ingesting one: temporal separation (days later), principal separation (a different user or system triggers the firing), and modality separation (ingested as text, fires as a shell command or an outbound message).

Together the two axes define a compact taxonomy. Any concrete sleeper channel can be located in this 2D space, and the defense burden differs per cell. Memory + temporal separation is the most familiar (poisoned long-term notes); skill + modality separation is the most dangerous (a self-authored shell-executing skill triggered later by a benign-looking prompt).
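One way to make the taxonomy concrete is to encode each axis as an enumeration and a channel as a point in their product. The sketch below is illustrative only: the class and function names are hypothetical, and treating each channel as having a single dominant kind of separation is a simplifying assumption, not something the paper states.

```python
from dataclasses import dataclass
from enum import Enum


class Substrate(Enum):
    """Where the payload waits between ingestion and firing."""
    MEMORY = "memory"
    SKILL = "skill"
    SCHEDULED_JOB = "scheduled job"
    FILESYSTEM_PATCH = "filesystem patch"


class Separation(Enum):
    """How the actuating surface differs from the ingesting one."""
    TEMPORAL = "temporal"
    PRINCIPAL = "principal"
    MODALITY = "modality"


@dataclass(frozen=True)
class SleeperChannel:
    substrate: Substrate
    separation: Separation  # assumption: one dominant separation per channel
    note: str = ""


def locate(channel: SleeperChannel) -> tuple[Substrate, Separation]:
    """Return the taxonomy cell a channel falls into."""
    return (channel.substrate, channel.separation)


# Every cell of the taxonomy: four substrates by three kinds of separation.
cells = [(s, f) for s in Substrate for f in Separation]

familiar = SleeperChannel(Substrate.MEMORY, Separation.TEMPORAL,
                          "poisoned long-term notes")
dangerous = SleeperChannel(Substrate.SKILL, Separation.MODALITY,
                           "self-authored shell-executing skill")

for channel in (familiar, dangerous):
    substrate, separation = locate(channel)
    print(f"{substrate.value} + {separation.value}: {channel.note}")
```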

What defenses prevent sleeper-channel exploitation?

We propose a tiered defense. D1 applies surface-level hygiene at ingestion: input sanitisation, structured rather than free-text memory writes, and explicit confirmation prompts on first-write to high-impact substrates. D1 is necessary but not sufficient, because it cannot recover authority that has already been forwarded across the firing gap.

D2 is the load-bearing layer. It interposes a provenance gate between every action and its actuation: the gate releases an action only when it carries a fresh one-shot owner attestation bound to a canonical action-instance digest. D2 carries a soundness theorem against seven named deployment invariants, and is the layer at which paraphrase laundering, multi-input grant reuse, and replay are defeated. D3 adds architectural separation — running scheduled jobs, skills, and shell under distinct principals with explicit cross-principal grants — which would eliminate the original authority collapse, at significant deployment cost.

How does the D2 provenance gate work?

D2 mediates every action just before it is released. For each action it first normalises the action's effects deterministically (the precise scheduler entry, the precise shell command after variable expansion, the precise outbound message recipient and body) and then takes a digest over that canonical form; the result is the canonical action-instance digest. The owner attestation that authorised the action is bound to this digest. At firing time the gate recomputes the digest from the action that is actually about to execute and refuses release unless a matching, unused attestation exists.

This shifts the security question from "did the owner ever consent to something like this" to "did the owner consent to this exact instance, and is the consent still fresh." Because the digest is taken over the canonical normalisation, two surface-different actions with the same effect produce the same digest; two surface-similar actions with different effects do not. Because attestations are one-shot, they cannot be reused for a second firing.
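The gate logic described here can be sketched as follows. This is not the companion artifact's code (which targets Node), and it rests on several assumptions: an action is already reduced to a dictionary of its effects, canonicalisation is approximated by sorted-key JSON, SHA-256 stands in for whatever digest the authors use, and attestation freshness (expiry) is not modelled. The names `canonical_digest` and `ProvenanceGate` are hypothetical.

```python
import hashlib
import json


def canonical_digest(action: dict) -> str:
    """Digest over a deterministic serialisation of the action's effects.

    Assumption: sorted-key, whitespace-free JSON is a good enough
    canonical form for this illustration.
    """
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProvenanceGate:
    """Releases an action only against a matching, unused attestation."""

    def __init__(self) -> None:
        self._unused: set[str] = set()

    def attest(self, action: dict) -> str:
        """Record the owner's consent to this exact action instance."""
        digest = canonical_digest(action)
        self._unused.add(digest)
        return digest

    def release(self, action: dict) -> bool:
        """Recompute the digest at firing time and consume the attestation."""
        digest = canonical_digest(action)
        if digest in self._unused:
            self._unused.remove(digest)  # one-shot: cannot be used again
            return True
        return False


gate = ProvenanceGate()
gate.attest({"kind": "message", "to": "owner", "body": "Check the weather"})

# Same effect, different surface form (key order): same digest.
same_effect = {"body": "Check the weather", "to": "owner", "kind": "message"}
first = gate.release(same_effect)

# Second firing with the same action: the attestation is already spent.
second = gate.release(same_effect)

# Similar-looking action with a different effect: no matching attestation.
drifted = gate.release(
    {"kind": "message", "to": "attacker@example.invalid",
     "body": "Check the weather"}
)

print(first, second, drifted)
```

Because the store is a set, attesting the same action twice in this sketch still yields a single grant; a real implementation would need to decide how repeated consent to a recurring action is represented.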

What attacker techniques does the canonical action digest defeat?

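Three concrete bypasses are closed by keying on the canonical digest: paraphrase laundering, multi-input grant reuse, and replay. Paraphrase laundering fails because the digest is taken over the normalised effect rather than the wording, so surface-different actions with the same effect share a digest and surface-similar actions with different effects do not. Multi-input grant reuse fails because each attestation is bound to a single action instance. Replay fails because attestations are one-shot.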

Each of these defeats a class of attack that scheduler-level or string-level checks fail to catch on their own.

What ships with the paper as a companion artifact?

The companion artifact has three components. First, the provenance-gate implementation itself: the canonical digest function, the one-shot attestation store, and the release-side checking. Second, a static audit over the vendored OpenClaw source at the pinned commit, mapping ten mediation hooks (H1–H10) to the call sites where actions are about to be released. Third, a runtime adapter that wires the gate into five of those hooks (H1, H2, H3, H6, H9) around the cron path, with 42 tests covering the mediation flow. The artifact targets Node 20 or newer. Empirical evaluation across the remaining hooks (H4, H5, H7, H8, H10) is preregistered as follow-on work.

Related Topics

Prompt Injection in Agentic Coding Assistants · Breaking the Protocol: MCP Security Analysis · Prompt Injection in Defended Systems · LLM-as-a-Judge Vulnerabilities


Cite as

@article{maloyan2026sleeper,
  title={Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents},
  author={Maloyan, Narek and Namiot, Dmitry},
  journal={arXiv preprint arXiv:2605.13471},
  year={2026}
}


Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning.