Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
Abstract
Always-on AI agents (OpenClaw, Hermes Agent) run as a single persistent process under the owner's identity, folding messaging, memory, self-authored skills, scheduling, and shell into one authority boundary. This configuration opens what we call sleeper channels: an untrusted input to one surface persists as a memory, skill, scheduled job, or filesystem patch, then fires later through a different surface with no attacker present. Two independent axes define the class: persistence substrate and firing-separation. We walk a confused-deputy cron attack end-to-end through OpenClaw at a pinned commit. The defense is tiered (D1, D2, D3), and D2 carries a soundness theorem against seven named deployment invariants. D2 keys on a canonical action-instance digest with one-shot owner attestations, defeating paraphrase laundering, multi-input grant reuse, and replay. A companion artifact ships the gate, a static audit over the vendored source, and a runtime adapter realising five of the ten mediation hooks (H1, H2, H3, H6, H9) around the cron path. Empirical evaluation is preregistered as follow-on.
Key Findings
- Always-on agents collapse five authority surfaces into one principal: Messaging, memory, self-authored skills, scheduling, and shell all execute under the owner's identity inside a single persistent process. Any surface that can write to a persistent substrate writes to the same trust boundary that later executes the action.
- Sleeper channels are defined by two axes: a persistence substrate (memory, skill, scheduled job, filesystem patch) and a firing-separation between the surface that ingests the payload and the surface that later actuates it. The class is exhaustive along these two axes.
- End-to-end confused-deputy cron attack on OpenClaw: demonstrated at a pinned commit. An untrusted message schedules a benign-looking cron whose action expands at firing time, and the scheduler treats it as already authorised.
- D2 provenance gate carries a soundness theorem: against seven named deployment invariants. D2 keys on a canonical action-instance digest plus one-shot owner attestations, defeating paraphrase laundering, multi-input grant reuse, and replay.
- Companion artifact realises 5/10 mediation hooks: H1, H2, H3, H6, H9 around the cron path. Runtime adapter, static audit over vendored source, 42 tests on Node 20+. Empirical evaluation over the remaining hooks is preregistered.
What is a sleeper channel in an always-on AI agent?
A sleeper channel is a persistent prompt-injection path that decouples ingestion from firing. Untrusted input arrives at one surface of an always-on agent — a chat message, a webhook payload, a tool response, an email body — and is written into a long-lived substrate the agent maintains: its memory store, a self-authored skill it can re-invoke, a scheduled job, or a filesystem region it later reads. The payload then sits dormant. When the firing surface activates (a scheduler tick, the model recalling a memory, the agent loading a skill, a later file read) the embedded directive is acted on, with no attacker present and no fresh consent.
Sleeper channels generalise the prompt-injection threat from "what did the model just read" to "what is the agent quietly about to do as a result of something it read days ago." We formalise the class along two independent axes: the persistence substrate — where the payload lives between ingestion and firing — and the firing-separation — how far apart the ingesting surface and the actuating surface are in time, principal, and modality. Both axes matter for defense, because mitigations that check authorisation at ingestion time miss the firing-time gap, and mitigations that gate at firing time miss the latent substrate.
Why are always-on agents like OpenClaw and Hermes Agent particularly exposed?
Always-on agents such as OpenClaw and Hermes Agent are deployed as a single persistent process running under the owner's identity. Inside that one process, five distinct authority surfaces are folded into a single principal: messaging (inbound and outbound user contact), memory (long-lived embeddings and notes), self-authored skills (the agent writes its own callable code), scheduling (the agent installs cron jobs against its own future self), and shell (file and command execution under the owner's account).
Each of these is, in classical security terms, a distinct authority. Collapsing them into one principal means there is no per-surface separation between "who can write to memory" and "who can execute shell commands at 3 a.m." — both happen under the same identity. A sleeper channel exploits exactly that collapse: it does not need to escalate privilege because all of the surfaces it touches already share the same one.
How does the OpenClaw confused-deputy cron attack work?
We walk the attack end-to-end at a pinned OpenClaw commit. An untrusted input persuades the agent to schedule a cron job whose surface description is benign — for example, "remind me every morning to check the weather." The action body, however, is parameterised at firing time on context that the attacker has also seeded into memory or skill state. When the cron fires, the scheduler treats the queued action as having been already authorised at schedule time and releases it. The owner is not present, the original message is gone, and the only evidence of consent is a stale string that no longer represents what the action actually does.
This is a textbook confused deputy: the scheduler holds the owner's authority and acts on it, but the question it answers ("was this queued?") is not the question that should be answered ("did the owner authorise this specific firing-time action?"). Scheduling-time intent and firing-time action drift apart, and the gap is where the attack lives.
What are the persistence substrates and firing-separation axes?
The persistence substrate axis enumerates where a sleeper payload can wait. Four substrates cover the cases we observe in always-on agents: memory (embedded notes the model retrieves), skills (callable code or templates the agent has saved for itself), scheduled jobs (cron or one-shot timers), and filesystem patches (durable file edits the agent later reads). The firing-separation axis describes how the actuating surface differs from the ingesting one: temporal separation (days later), principal separation (a different user or system triggers the firing), and modality separation (ingested as text, fires as a shell command or an outbound message).
Together the two axes define a compact taxonomy. Any concrete sleeper channel can be located in this 2D space, and the defense burden differs per cell. Memory + temporal separation is the most familiar (poisoned long-term notes); skill + modality separation is the most dangerous (a self-authored shell-executing skill triggered later by a benign-looking prompt).
What defenses prevent sleeper-channel exploitation?
We propose a tiered defense. D1 applies surface-level hygiene at ingestion: input sanitisation, structured rather than free-text memory writes, and explicit confirmation prompts on first-write to high-impact substrates. D1 is necessary but not sufficient, because it cannot recover authority that has already been forwarded across the firing gap.
D2 is the load-bearing layer. It interposes a provenance gate between every action and its actuation: the gate releases an action only when it carries a fresh one-shot owner attestation bound to a canonical action-instance digest. D2 carries a soundness theorem against seven named deployment invariants, and is the layer at which paraphrase laundering, multi-input grant reuse, and replay are defeated. D3 adds architectural separation — running scheduled jobs, skills, and shell under distinct principals with explicit cross-principal grants — which would eliminate the original authority collapse, at significant deployment cost.
How does the D2 provenance gate work?
D2 mediates every action just before it is released. For each action it computes a canonical action-instance digest: a deterministic normalisation of the action's effects (the precise scheduler entry, the precise shell command after variable expansion, the precise outbound message recipient and body) over which the digest is taken. The owner attestation that authorised the action is bound to this digest. At firing time the gate recomputes the digest from the action that is actually about to execute and refuses release unless a matching, unused attestation exists.
This shifts the security question from "did the owner ever consent to something like this" to "did the owner consent to this exact instance, and is the consent still fresh." Because the digest is taken over the canonical normalisation, two surface-different actions with the same effect produce the same digest; two surface-similar actions with different effects do not. Because attestations are one-shot, they cannot be reused for a second firing.
What attacker techniques does the canonical action digest defeat?
Three concrete bypasses are closed by keying on the canonical digest:
- Paraphrase laundering: an attacker rewords the scheduled action's surface form to slip past string-matching scheduler checks. The canonical normalisation collapses phrasing variations, so the digest is invariant to wording.
- Multi-input grant reuse: an owner approval for action A is silently reapplied to a semantically different action B that shares some surface text with A. Because the attestation is bound to A's digest, it does not validate B.
- Replay: a captured attestation is fired a second time for the same action. Attestations are one-shot, so the gate refuses release on the replay.
Each of these defeats a class of attack that scheduler-level or string-level checks fail to catch on their own.
What ships with the paper as a companion artifact?
The companion artifact has three components. First, the provenance-gate implementation itself: the canonical digest function, the one-shot attestation store, and the release-side checking. Second, a static audit over the vendored OpenClaw source at the pinned commit, mapping ten mediation hooks (H1–H10) to the call sites where actions are about to be released. Third, a runtime adapter that wires the gate into five of those hooks (H1, H2, H3, H6, H9) around the cron path, with 42 tests covering the mediation flow. The artifact targets Node 20 or newer. Empirical evaluation across the remaining hooks (H4, H5, H7, H8, H10) is preregistered as follow-on work.
Related Topics
Prompt Injection in Agentic Coding Assistants · Breaking the Protocol: MCP Security Analysis · Prompt Injection in Defended Systems · LLM-as-a-Judge Vulnerabilities
Cite as
@article{maloyan2026sleeper,
title={Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents},
author={Maloyan, Narek and Namiot, Dmitry},
journal={arXiv preprint arXiv:2605.13471},
year={2026}
}