2026-06-14

Use Claude as a scorer, not just a generator

The most underrated way to put an LLM in a pipeline is as a ranking function, not a writer. How to do it cheaply, stably, and without slop.


Everyone's first instinct with an LLM is to make it write something. Generate the post, generate the summary, generate the reply. That's the flashy half. The half that actually keeps an automated pipeline from shipping garbage is the boring one:

Use the model as a scorer. A function from an item to a number. f(item) → score. Then sort, filter, and only spend your expensive generation step on the things that earned it.

Why scoring beats rules

You already know how to filter with code. Keyword lists, regexes, recency windows, source allowlists. They're fast and free and you should still use them as a first pass.
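A first pass of that kind might look like the sketch below. The field names (title, source, published), the allowlist, the blocked pattern, and the 48-hour window are all made up for illustration; swap in whatever your feed actually carries.

```python
import re
from datetime import datetime, timedelta

# Hypothetical values: tune these for your own feed.
ALLOWED_SOURCES = {"arxiv.org", "github.com"}
BLOCKED = re.compile(r"\b(sponsored|giveaway|webinar)\b", re.IGNORECASE)
MAX_AGE = timedelta(hours=48)

def prefilter(items: list[dict], now: datetime) -> list[dict]:
    """Cheap rule-based pass: drop anything old, off-list, or obviously promotional."""
    kept = []
    for item in items:
        if now - item["published"] > MAX_AGE:
            continue  # outside the recency window
        if item["source"] not in ALLOWED_SOURCES:
            continue  # source not on the allowlist
        if BLOCKED.search(item["title"]):
            continue  # keyword/regex hit
        kept.append(item)
    return kept
```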

But the judgments that matter in a content pipeline aren't keyword-shaped. Is this story actually newsworthy to a working engineer, or is it recycled hype with a new headline? No regex answers that. A model can — that's exactly the fuzzy, context-heavy call LLMs are good at. The trick is to stop asking it to write about the story and start asking it to judge the story.

The pattern

Score every candidate, keep the top N. Here's the whole thing in Python with the Anthropic SDK:

import json

import anthropic

client = anthropic.Anthropic()

RUBRIC = """You score stories for an audience of working AI engineers, 1-5:
5 = they'd stop scrolling and read it today
4 = solid, relevant, shippable
3 = mildly interesting, not urgent
2 = recycled or thin
1 = noise / pure hype

Return ONLY JSON: {"score": <int 1-5>, "reason": "<one line>"}"""

def score(item: dict) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5",   # small, fast, cheap — use the current Haiku id
        max_tokens=100,
        temperature=0,              # scoring is not a place for creativity
        system=RUBRIC,
        messages=[{"role": "user", "content": f"{item['title']}\n\n{item['summary']}"}],
    )
    return json.loads(msg.content[0].text)

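# candidates: your pre-filtered list of dicts with "title" and "summary" keys
scored = [{**item, **score(item)} for item in candidates]  # keep the reason next to the score
ranked = sorted(scored, key=lambda i: i["score"], reverse=True)
top = ranked[:5]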

That's it. The model is now a ranking function, and the rest of your pipeline only ever sees the best five things instead of the firehose.

Making it cheap

Scoring runs on every candidate, so cost adds up fast if you're careless.

  • Use a small model. Scoring is a Haiku job, not an Opus job. You're asking for a number, not an essay.
  • Cap max_tokens hard. A score and a one-line reason fit in ~100 tokens. Don't pay for a paragraph you'll throw away.
  • Score in parallel. These calls are independent — fan them out, don't loop.
  • Pre-filter with code first. Don't spend a model call ranking something a recency window already killed.
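One way to fan the calls out is a thread pool, since each call spends most of its time waiting on the network. This is a minimal sketch, not the only option: score_fn stands in for a function like score above, and the default of 8 workers is an arbitrary number you would tune against your own rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def score_all(
    items: list[dict],
    score_fn: Callable[[dict], dict],
    max_workers: int = 8,  # arbitrary default; keep it under your rate limit
) -> list[dict]:
    """Score items concurrently and merge each item with its score and reason."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with items.
        results = list(pool.map(score_fn, items))
    return [{**item, **result} for item, result in zip(items, results)]

def top_n(scored: list[dict], n: int = 5) -> list[dict]:
    """Highest scores first; sorted() is stable, so ties keep their input order."""
    return sorted(scored, key=lambda i: i["score"], reverse=True)[:n]
```

Usage would be something like top_n(score_all(candidates, score)), with the pre-filter run on candidates before any model call is made.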

Making it stable

A scorer that returns 3 today and 5 tomorrow for the same input is worse than useless.

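  • temperature=0. This makes the same input far more likely to land in the same bucket every time. It is not a hard guarantee: output can still vary slightly between calls even at zero, which is one more reason to keep the scale coarse.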
  • Anchor the scale in the prompt. "Score 1-5" alone invites drift. Spell out what each number means, like the rubric above. The anchors are what make the scores comparable across runs.
  • Don't ask for 1-100. That's false precision — the model can't reliably tell a 73 from a 76, and now neither can you. A tight 1-5 with explicit anchors is honest about the resolution you actually have.
  • Force structured output. Parsing free text as JSON works until the day the model wraps it in a code fence or adds a preamble. For something sturdier, define a tool with a typed schema, force the model to call it, and read the arguments it filled in, instead of hoping json.loads succeeds.
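The tool-based version can be sketched like this. The tool name record_score and the helper score_with_tool are hypothetical; client is assumed to be an anthropic.Anthropic() instance and rubric a system prompt like RUBRIC above. The schema steers the model, but the code still re-checks the score rather than trusting it blindly.

```python
SCORE_TOOL = {
    "name": "record_score",  # hypothetical tool name
    "description": "Record the relevance score for one story.",
    "input_schema": {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
            "reason": {"type": "string"},
        },
        "required": ["score", "reason"],
    },
}

def score_with_tool(client, item: dict, rubric: str, model: str = "claude-haiku-4-5") -> dict:
    """Score one item by forcing a tool call instead of parsing free text."""
    msg = client.messages.create(
        model=model,
        max_tokens=200,  # a little headroom so the tool arguments aren't cut off
        temperature=0,
        system=rubric,
        tools=[SCORE_TOOL],
        # Force the model to call this tool rather than reply in prose.
        tool_choice={"type": "tool", "name": "record_score"},
        messages=[{"role": "user", "content": f"{item['title']}\n\n{item['summary']}"}],
    )
    # The tool arguments arrive as an already-parsed dict on the tool_use block.
    block = next(b for b in msg.content if b.type == "tool_use")
    result = dict(block.input)
    score = result.get("score")
    if type(score) is not int or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {result!r}")
    return result
```

With this shape the rubric no longer needs its "Return ONLY JSON" line, since the schema carries the output format.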

Making it honest (the anti-slop part)

Here's the failure mode that bites people: the model is confident, the JSON is clean, and you start trusting the scores without ever checking them. Confidence is not correctness.

  • Always require a reason. One line, per item. It costs almost nothing and it's the only way you'll catch the rubric being misread.
  • Log every score with its reason. When the pipeline picks something dumb, you want the receipt, not a mystery.
  • Spot-check the boundary. Read the things that scored a 3. That's where the model's judgment is fuzziest and where your rubric needs sharpening.
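The receipts can be as simple as one JSON line per scored item. A minimal sketch, assuming each scored item is a dict carrying title, score, and reason keys; the helper names and the JSONL layout are illustrative choices, not a required format.

```python
import json

def log_scores(scored: list[dict], path: str) -> None:
    """Append one JSON line per item: title, score, and the model's reason."""
    with open(path, "a", encoding="utf-8") as f:
        for item in scored:
            record = {k: item[k] for k in ("title", "score", "reason")}
            f.write(json.dumps(record) + "\n")

def boundary_items(scored: list[dict], boundary: int = 3) -> list[dict]:
    """Items sitting on the fuzzy middle score, queued up for a human to read."""
    return [item for item in scored if item["score"] == boundary]
```

Reading the reasons on the boundary items is usually the fastest way to see which rubric anchor the model is interpreting differently than you meant.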

If you can't explain why item A beat item B, your rubric is the problem — not the model.

This is the agent-loop skeleton

Score-then-act is most of what "agentic" actually means in production. Generation is the part that demos well; scoring and filtering is the part that makes the output not embarrass you. And it's completely transferable — the same f(item) → score shows up as re-ranking retrieved chunks in RAG, triaging a flood of PRs, prioritizing leads, picking which alert actually pages a human.

Reach for the model as a judge before you reach for it as a writer. Your pipeline gets cheaper, more predictable, and a lot less sloppy.