2026-06-15
The seen-store — give your agent a memory so it stops repeating itself
An autonomous pipeline that can't remember what it already did will, eventually, do it again. The fix is twenty lines, not a database migration.
The moment you put an agent on a schedule, the question stops being "can it do the task" and becomes "will it do the task it already did." A pipeline that runs every morning and can't remember yesterday will re-post the same story, re-render the same video, re-email the same person — confidently, on schedule, forever.
The fix is the cheapest reliability primitive in an autonomous system, and almost everyone adds it after the agent embarrasses them. Add it first.
The seen-store
A seen-store is a persistent set of keys you've already acted on. The rule is two lines of English: check before you act, record after.
import hashlib
import json
import pathlib

SEEN = pathlib.Path("seen.json")

def _load() -> set[str]:
    # A missing file means nothing has been seen yet.
    return set(json.loads(SEEN.read_text())) if SEEN.exists() else set()

def _save(keys: set[str]) -> None:
    # Sorted so the file stays stable between runs.
    SEEN.write_text(json.dumps(sorted(keys)))

def key_for(item: dict) -> str:
    # Stable across runs: canonical URL beats title, because titles get edited.
    basis = item.get("url") or item["title"]
    return hashlib.sha256(basis.strip().lower().encode()).hexdigest()[:16]

def unseen(items: list[dict]) -> list[dict]:
    # Check before you act.
    seen = _load()
    return [i for i in items if key_for(i) not in seen]

def mark(items: list[dict]) -> None:
    # Record after.
    seen = _load()
    seen.update(key_for(i) for i in items)
    _save(seen)
That's the whole idea. A JSON file is fine until it isn't; swap it for SQLite or Redis when you outgrow it and the interface stays the same.
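As a rough illustration of that swap, here is a minimal SQLite sketch. The table name `seen`, the `open_store` helper, and passing the connection explicitly are assumptions made for this example, not part of the original code; `key_for` is repeated so the block runs on its own.

```python
import hashlib
import sqlite3

def open_store(path: str = "seen.db") -> sqlite3.Connection:
    """Open (or create) the store: one row per key already acted on."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")
    return conn

def key_for(item: dict) -> str:
    # Same keying rule as the JSON version.
    basis = item.get("url") or item["title"]
    return hashlib.sha256(basis.strip().lower().encode()).hexdigest()[:16]

def unseen(conn: sqlite3.Connection, items: list[dict]) -> list[dict]:
    # Keep only the items whose key has no row yet.
    fresh = []
    for item in items:
        row = conn.execute(
            "SELECT 1 FROM seen WHERE key = ?", (key_for(item),)
        ).fetchone()
        if row is None:
            fresh.append(item)
    return fresh

def mark(conn: sqlite3.Connection, items: list[dict]) -> None:
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)",
            [(key_for(i),) for i in items],
        )
```

Callers still only touch `unseen()` and `mark()`; the one visible difference in this sketch is the extra connection argument.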
The key is the actual hard part
unseen() is trivial. key_for() is where the bodies are buried. The key has to be stable: the same logical item must hash to the same key on every run.
- Don't key on the title. Titles get re-edited; a one-character change and the item looks brand new.
- Canonicalize URLs before hashing: strip utm_ params, trailing slashes, fragments. example.com/post and example.com/post?utm_source=rss are the same story.
- Content-hash when there's no stable id: for near-duplicates (the same story from two outlets), hash a normalized chunk of the body, not the headline.
A bad key gives you one of two failures: too loose and you re-do work, too tight and you silently drop new items. Spend your time here.
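The keying rules above could be sketched roughly as follows, using only the standard library. The helper names `canonical_url` and `content_key`, the `body` field, and the 500-character window are assumptions for illustration. This sketch also leaves the path's case alone, on the assumption that paths can be case-sensitive.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    parts = urlsplit(url.strip())
    # Drop utm_ tracking params; sort the rest so parameter order doesn't matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_")
    )
    path = parts.path.rstrip("/")  # trailing slash is noise
    # Scheme and host are case-insensitive; the fragment is dropped entirely.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

def content_key(body: str, chars: int = 500) -> str:
    # Normalize: lowercase, strip punctuation, collapse whitespace,
    # then hash a fixed-size chunk from the start of the text.
    text = re.sub(r"[^\w\s]", "", body.lower())
    text = " ".join(text.split())
    return hashlib.sha256(text[:chars].encode()).hexdigest()[:16]

def key_for(item: dict) -> str:
    # Prefer the canonical URL; fall back to a content hash.
    # "body" is a hypothetical field name for the item's text.
    if item.get("url"):
        return hashlib.sha256(canonical_url(item["url"]).encode()).hexdigest()[:16]
    return content_key(item["body"])
```

One caveat: an exact hash of a normalized chunk only matches copies that share the same opening text once normalized. Two outlets that rewrite the lead paragraph will still produce different keys.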
Record before or after?
This is the question that decides whether your seen-store is an optimization or a correctness guarantee.
- Record after success and a crash between acting and recording gives you a duplicate next run.
- Record before acting and a crash between recording and acting drops the item forever.
For reversible work (rendering a file you'll overwrite anyway), record after and accept the occasional repeat. For irreversible side effects (sending an email, posting a video, charging a card), the seen-store is your correctness boundary: guard the irreversible call, and lean toward at-least-once with a downstream dedupe, such as an idempotency key the receiving side checks, rather than at-most-once that can silently drop. A duplicate is embarrassing; a dropped payment is a bug report.
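One way to keep the duplicate window small under record-after is to record per item rather than once per batch. The sketch below assumes a hypothetical `run_batch` helper; `act` and `persist` are placeholders for your side effect and your store's save function. The key is passed to `act` so the receiving side can use it to drop a repeat.

```python
from typing import Callable

def run_batch(
    items: list[dict],
    seen: set[str],
    key_for: Callable[[dict], str],
    act: Callable[[dict, str], None],
    persist: Callable[[set[str]], None],
) -> list[str]:
    """Act on each unseen item, recording it right after its own success."""
    acted = []
    for item in items:
        key = key_for(item)
        if key in seen:
            continue  # check before you act
        # If act() raises, nothing is recorded and the item is retried next run.
        # A crash after act() but before persist() repeats at most this one item.
        act(item, key)
        seen.add(key)
        persist(seen)  # record after, per item
        acted.append(key)
    return acted
```

With this shape, a crash mid-batch repeats at most the single item that was in flight, not the whole batch.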
Keep it bounded
A seen-store that only grows is a slow leak. Cap it: a TTL (drop keys older than N days), or a max size with FIFO eviction. For most content pipelines, "anything older than 30 days won't resurface anyway" is a fine pruning rule.
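Both caps can be sketched in a few lines, assuming the store keeps a first-seen timestamp per key instead of a bare list of keys. That is a change from the JSON layout shown earlier, and the `prune` helper and its defaults are illustrative.

```python
import time
from typing import Optional

def prune(
    seen: dict[str, float],
    max_age_days: float = 30.0,
    max_size: int = 10_000,
    now: Optional[float] = None,
) -> dict[str, float]:
    """Return a copy of `seen` (key -> first-seen epoch seconds) with caps applied."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86_400
    # TTL: drop keys first seen before the cutoff.
    fresh = {k: t for k, t in seen.items() if t >= cutoff}
    # Size cap: keep the newest max_size keys, so the oldest are evicted first.
    if len(fresh) > max_size:
        newest = sorted(fresh.items(), key=lambda kv: kv[1], reverse=True)
        fresh = dict(newest[:max_size])
    return fresh
```

Calling something like this once per run, just before saving, keeps the store's size roughly constant.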
Build the seen-store before the agent does something twice in public. It's twenty lines, it's the difference between "runs unattended" and "runs unattended until it doesn't," and it's the foundation everything else in a production pipeline stands on.