
2026-06-16

What "production" actually means for an LLM pipeline

The demo works on your laptop. Production is everything the demo skipped — retries, idempotency, output validation, cost tracking, and what to do when the model returns garbage at 3am.


The gap between a working demo and a system you can leave running is not the model. The model is the easy part — it works in the demo. The gap is all the plumbing the demo got to skip because you were standing right there when it ran.

Here's the field guide to the boring parts, ordered by reliability bought per line of code.

1. Validate the model's output — never trust free text

The model will return garbage eventually: a missing field, prose wrapped around your JSON, an empty string. If your pipeline assumes well-formed output, it dies on the first bad one — at 3am, unattended.

Parse, validate against a schema, and retry on failure. Better, force structured output with a typed tool so malformed responses become rare instead of routine. Keep the validation anyway: a tool call can still be cut off at the token limit or come back missing a field. The rule: the boundary between the model and the rest of your code is untrusted input. Treat it like any other.
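A minimal sketch of that parse, validate, retry loop, using only the standard library. The REQUIRED schema, the InvalidOutput exception, and the call_model callable are all hypothetical names for illustration; in a real pipeline you'd likely reach for a schema library instead of hand-rolled checks.

```python
import json

# Hypothetical schema for illustration: field name -> required type.
REQUIRED = {"label": str, "score": int}


class InvalidOutput(ValueError):
    """Raised when model output fails parsing or validation."""


def parse_output(raw: str) -> dict:
    """Extract and validate a JSON object from raw model text."""
    # Tolerate prose wrapped around the JSON: take the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise InvalidOutput("no JSON object found")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as e:
        raise InvalidOutput(f"bad JSON: {e}") from e
    for field, typ in REQUIRED.items():
        # Exact type match, so a bool does not pass as an int.
        if type(data.get(field)) is not typ:
            raise InvalidOutput(f"field {field!r} missing or not {typ.__name__}")
    return data


def call_validated(call_model, *, attempts=3) -> dict:
    """Call the model until its output validates, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_output(call_model())
        except InvalidOutput as e:
            last_error = e
    raise InvalidOutput(f"gave up after {attempts} attempts: {last_error}")
```

Here call_model is any zero-argument function that returns the model's text. Nothing past call_validated ever sees an unvalidated string, which is the whole point of the boundary.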

2. Retries with backoff

Model APIs rate-limit and blip. A single 429 should not kill a run.

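import random
import time

from anthropic import APIStatusError  # assumption: Anthropic SDK; openai exports the same name

RETRYABLE = {408, 409, 429, 500, 502, 503, 529}

def with_retry(fn, *, attempts=5, base=1.0):
    """Call fn(), retrying retryable HTTP errors with exponential backoff."""
    for n in range(attempts):
        try:
            return fn()
        except APIStatusError as e:
            # Fatal status, or out of attempts: surface the error to the caller.
            if e.status_code not in RETRYABLE or n == attempts - 1:
                raise
            time.sleep(base * 2 ** n + random.random())  # exponential backoff + jitter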

Distinguish retryable (rate limits, overloads, timeouts) from fatal (a 400 — your request is malformed and will be malformed every time). Retrying a fatal error just burns time and money.

3. Idempotency — assume every run can die halfway

A scheduled job that crashes gets retried, which means any step may run twice. Design so repeating a step is safe: guard irreversible actions with a seen-store, use idempotency keys, write to temp files and rename them into place. "Exactly once" is a distributed-systems fantasy; "at least once, safely" is achievable and enough.
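One way the seen-store and the temp-file-and-rename trick might look together. SeenStore, atomic_write, and run_once are hypothetical names, and a JSON file stands in for whatever durable store you actually have.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write(path, text):
    """Write to a temp file in the same directory, then rename over the target."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


class SeenStore:
    """File-backed set of idempotency keys for actions that already ran."""

    def __init__(self, path):
        self.path = Path(path)
        self.keys = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def __contains__(self, key):
        return key in self.keys

    def add(self, key):
        self.keys.add(key)
        atomic_write(self.path, json.dumps(sorted(self.keys)))


def run_once(store, key, action):
    """Run `action` unless `key` is already recorded; return True if it ran."""
    if key in store:
        return False
    action()
    store.add(key)  # recorded only after the action succeeds
    return True
```

A crash between action() and store.add(key) still repeats the action on the next run. That is the "at least once" part, so anything truly irreversible should also pass the key downstream where the receiving service supports idempotency keys.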

4. Cost tracking

If you're not logging tokens and dollars per run, you find out what your pipeline costs from the bill. It's a counter:

# IN_RATE and OUT_RATE are dollars per token for your model; run_cost starts at 0.0 each run
usage = resp.usage  # input_tokens, output_tokens
run_cost += usage.input_tokens * IN_RATE + usage.output_tokens * OUT_RATE

Log it per run with the run id. The day a prompt change quietly triples your token use, this is how you catch it in hours instead of at the end of the month.
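If it helps, here is the counter grown into something you can log, as a sketch. RunCost is a hypothetical name and the rates are placeholders, not real prices; substitute your model's actual per-token rates.

```python
import json
from dataclasses import dataclass

# Placeholder rates in dollars per token. NOT real prices.
IN_RATE = 1.00 / 1_000_000
OUT_RATE = 2.00 / 1_000_000


@dataclass
class RunCost:
    """Accumulates token usage for one run and renders it as a log line."""
    run_id: str
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, usage):
        """Add one response's usage (any object with input_tokens/output_tokens)."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    @property
    def dollars(self):
        return self.input_tokens * IN_RATE + self.output_tokens * OUT_RATE

    def log_line(self):
        """One JSON object per run, keyed by run id."""
        return json.dumps({
            "event": "run_cost",
            "run_id": self.run_id,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "dollars": round(self.dollars, 6),
        })
```

Call add(resp.usage) after every model response and emit log_line() once when the run ends. Comparing those lines across runs is what surfaces the quiet tripling.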

5. A kill switch

A runaway loop — a retry storm, a pagination bug, a model that keeps asking for "one more step" — can spend real money fast. Put a ceiling on it: a per-run budget cap, a max-iteration count, a circuit breaker that halts when the error rate spikes. Cheap insurance against the 3am surprise.
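All three ceilings fit in one small guard object. This is a sketch under assumptions: RunGuard and KillSwitch are hypothetical names, and the default error-rate threshold and window size are arbitrary numbers you'd tune.

```python
from collections import deque


class KillSwitch(RuntimeError):
    """Raised to halt a run that exceeded one of its limits."""


class RunGuard:
    """Budget cap, step cap, and error-rate circuit breaker for one run."""

    def __init__(self, *, max_dollars, max_steps, max_error_rate=0.5, window=20):
        self.max_dollars = max_dollars
        self.max_steps = max_steps
        self.max_error_rate = max_error_rate
        self.spent = 0.0
        self.steps = 0
        # Outcomes of the most recent `window` steps; True means the step failed.
        self.outcomes = deque(maxlen=window)

    def record(self, *, dollars=0.0, error=False):
        """Record one finished step; raise KillSwitch if any ceiling is exceeded."""
        self.steps += 1
        self.spent += dollars
        self.outcomes.append(error)
        if self.spent > self.max_dollars:
            raise KillSwitch(f"budget exceeded: ${self.spent:.2f} > ${self.max_dollars:.2f}")
        if self.steps > self.max_steps:
            raise KillSwitch(f"step limit exceeded: {self.steps} > {self.max_steps}")
        # Only trip the breaker once the window is full, so one early error can't halt the run.
        if len(self.outcomes) == self.outcomes.maxlen:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.max_error_rate:
                raise KillSwitch(f"error rate {rate:.0%} over last {len(self.outcomes)} steps")
```

Call record() after every model call or loop iteration and let KillSwitch propagate to the top of the job. Because the check runs after each step, a run can overshoot a ceiling by at most one step.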

6. Observability you can grep

When it breaks, you get logs and nothing else. Make them worth having: structured lines with a run id, the inputs, the decision, the output. "Something failed" is useless; "run a3f scored item X at 2, below threshold, skipped" tells you whether the bug is the model or your code.
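A sketch of what one of those structured lines could look like with the standard logging module. The log_event helper, the field names, and the threshold value are hypothetical; the only real requirement is one JSON object per line with the run id in it.

```python
import json
import logging
import sys
import time


def log_event(logger, run_id, event, **fields):
    """Emit one JSON object per line so a plain grep on the run id works."""
    record = {"ts": time.time(), "run_id": run_id, "event": event, **fields}
    # default=str keeps logging from crashing on values json can't serialize.
    line = json.dumps(record, sort_keys=True, default=str)
    logger.info(line)
    return line


logging.basicConfig(stream=sys.stderr, level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

# The example from the text: run a3f scored item X at 2, below threshold, skipped.
log_event(log, "a3f", "item_skipped", item="X", score=2, threshold=3,
          reason="below_threshold")
```

With lines like that, grep '"run_id": "a3f"' on the log file pulls up the whole story of one run, inputs and decisions included.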


The mindset under all six: assume every external call fails, every model output is suspect, and every run can be interrupted and resumed. Build for that and you can actually walk away from the thing.

"Production" isn't a deploy target. It's just the set of failures you've already handled — and the demo has handled none of them. Start with output validation and retries; they buy the most reliability per line. The rest you add the first time each one bites you.