The problem
I shipped a multi-stage Claude pipeline before I had any way to tell whether it was working.
The pipeline crawls a website, classifies every page with Claude Haiku 4.5, ranks the top five by a computed conversion-importance score, takes a screenshot of each, and runs a vision audit with Claude Sonnet 4. The audit output gets synthesized into a report and compiled into a customer-facing PDF. It runs on Cloudflare Queues with D1 for state and KV for artifact storage. Real users, real reports.
What I had for quality control was reading the outputs and squinting. If a report looked decent, it shipped. There was no way to tell whether a prompt edit improved the audit or quietly regressed it on the long tail of edge cases. There was no way to detect, after a model swap, whether the audit had started inventing page elements that weren’t there. Spot-checking five pages per report by eye meant I was sampling a vanishingly small fraction of the output space.
This is a writeup of the eval system I’m building to fix that. It uses two scorer types (exact-match for classification, LLM-as-judge for the audit) because the two stages have different output shapes and different failure modes, and a single eval would obscure both.
Why one eval type wasn’t enough
Classification output is a small fixed shape: primaryType drawn from an enum of about 25 values (homepage, service, pricing, blog-post, contact, …), funnelPosition from an enum of five (top, middle, bottom, post, support), and a boolean hasEmbeddedForm. For any given page there’s a right answer. Exact match is the appropriate scorer.
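The output shape described above could be written down as TypeScript types. This is a sketch under assumptions: the `PrimaryType` union lists only the five values named in the text (the production enum has about 25), and `isFunnelPosition` is a hypothetical helper, not something taken from the pipeline.

```typescript
// Only the values named in the text; the production enum has roughly 25.
type PrimaryType = 'homepage' | 'service' | 'pricing' | 'blog-post' | 'contact';

const FUNNEL_POSITIONS = ['top', 'middle', 'bottom', 'post', 'support'] as const;
type FunnelPosition = (typeof FUNNEL_POSITIONS)[number];

// The three fields the exact-match eval grades.
interface ClassificationLabel {
  primaryType: PrimaryType;
  funnelPosition: FunnelPosition;
  hasEmbeddedForm: boolean;
}

// Hypothetical guard: rejects values outside the enum before they reach a scorer.
function isFunnelPosition(value: unknown): value is FunnelPosition {
  return typeof value === 'string' && (FUNNEL_POSITIONS as readonly string[]).includes(value);
}

// Example label for a pricing page with a lead form.
const exampleLabel: ClassificationLabel = {
  primaryType: 'pricing',
  funnelPosition: 'bottom',
  hasEmbeddedForm: true,
};
```

Because every field is an enum or a boolean, two labels are either identical or not, which is what makes exact match a sensible scorer here.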
Audit output is open-ended. The model returns a JSON object with five category scores, each with three to five findings and three to five recommendations. A recommendation like “move the primary CTA above the fold and increase its font size to 16px” has no ground truth. It’s a free-form string. Exact match can’t grade it. Fuzzy matching won’t either; two well-formed recommendations can use entirely different language while saying the same thing, and two badly-formed recommendations can share half their words while disagreeing on the substance.
Three failure modes worth detecting separately:
- Hallucination. The audit refers to a green Subscribe button when no Subscribe button exists on the page. This is a vision failure, and the audit is the only stage where it can occur.
- Specificity collapse. The audit is technically correct but vague (“improve visual hierarchy”). A designer reading this can’t act on it without further interpretation.
- Actionability collapse. The audit is specific but unactionable for the operator who’s reading the report (“rebuild your checkout in React”).
A single composite “quality” score hides which of these is failing. If the audit prompt regresses, I want to know which dimension fell, not that a number ticked down by 0.04. So classification gets three exact-match scorers (one per output field), and the audit gets three independent LLM-judge rubrics. Six scorers total, never combined.
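A minimal sketch of what “never combined” could look like at reporting time: results are averaged per scorer name and nothing is averaged across scorers. The `ScoreResult` shape here is a simplified stand-in, and `summarizeByScorer` is a hypothetical helper name.

```typescript
// Simplified stand-in for the eval's result type.
interface ScoreResult {
  name: string;
  score: number | null; // null means the row could not be scored
}

// Mean score per scorer name. Null scores are skipped rather than counted as 0,
// and no cross-scorer composite is ever computed.
function summarizeByScorer(results: ScoreResult[]): Record<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const { name, score } of results) {
    if (score === null) continue;
    const entry = sums.get(name) ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    sums.set(name, entry);
  }
  const means: Record<string, number> = {};
  sums.forEach((entry, name) => {
    means[name] = entry.total / entry.count;
  });
  return means;
}
```

The output is one number per scorer, so a regression report names the dimension that fell.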
Classification with exact-match
The classify phase calls Haiku 4.5 with the URL, title, meta description, outbound internal links, and first 2KB of HTML for each page, and gets back a JSON array of {url, primaryType, secondaryType, funnelPosition, hasEmbeddedForm, confidence}.
The load-bearing change before any of this could be evaluated was hoisting the prompt and model ID into a file the eval can import:
// worker/src/prompts.ts
export const CLASSIFY_MODEL = 'claude-haiku-4-5-20251001';
export const AUDIT_MODEL = 'claude-sonnet-4-20250514';
export const CLASSIFICATION_PROMPT = `You are a CRO expert classifying web pages. ...`;
export const AUDIT_PROMPT = `You are an expert CRO analyst performing a deep audit ...`;
And extracting the per-batch LLM call into an exported function. The worker still owns batching, KV reads, and D1 updates; it just delegates the call:
// worker/src/phases/classify.ts
export async function classifyBatch(
  pages: ClassifyInputPage[],
  apiKey: string,
): Promise<ClassifyOutputRow[]> { /* fetch /v1/messages with CLASSIFICATION_PROMPT */ }
This matters more than it looks. If the eval re-implemented the prompt or the request shape, every result would be suspect. I’d be grading a parallel copy of production, not production. The whole point is that the eval invokes the exact string and code path the worker runs.
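To illustrate the point, here is a rough sketch of an eval loop that takes the production function as a dependency instead of re-implementing it. The field lists on `ClassifyInputPage` and `ClassifyOutputRow` are trimmed-down assumptions, and `DatasetRow` and `runClassifyEval` are hypothetical names. In the real eval, the `classify` argument would be the `classifyBatch` imported from `worker/src/phases/classify.ts`.

```typescript
// Trimmed-down stand-ins for the worker's types (assumed fields).
interface ClassifyInputPage { url: string; title: string }
interface ClassifyOutputRow { url: string; primaryType: string }
interface DatasetRow { input: ClassifyInputPage; expected: ClassifyOutputRow }

type ClassifyFn = (pages: ClassifyInputPage[], apiKey: string) => Promise<ClassifyOutputRow[]>;

// Runs the injected production classifier over the dataset and returns
// primaryType accuracy in [0, 1]. Pages missing from the output count as wrong.
async function runClassifyEval(
  rows: DatasetRow[],
  classify: ClassifyFn,
  apiKey: string,
): Promise<number> {
  if (rows.length === 0) return 0;
  const outputs = await classify(rows.map((r) => r.input), apiKey);
  const byUrl = new Map<string, ClassifyOutputRow>();
  for (const out of outputs) byUrl.set(out.url, out);

  let correct = 0;
  for (const row of rows) {
    const out = byUrl.get(row.input.url);
    if (out !== undefined && out.primaryType === row.expected.primaryType) correct += 1;
  }
  return correct / rows.length;
}
```

Passing the function in also makes the loop testable with a stub, without touching the prompt or the request shape.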
The scorer itself is intentionally small:
// evals/src/scorers/exactMatch.ts
export function primaryTypeMatch({ output, expected }: ScorerArgs): ScoreResult {
  return {
    name: 'primary_type_match',
    score: output.primaryType === expected.primaryType ? 1 : 0,
    metadata: { actual: output.primaryType, expected: expected.primaryType },
  };
}
Three of those, one for each output field. No fuzzy matching, no string normalization. The prompt either picks the right enum value or it doesn’t; partial credit hides regressions.
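The other two scorers could be generated from one small factory, roughly as below. This is a sketch, not the file as it exists: the `fieldMatch` helper, the simplified types, and the scorer names `funnel_position_match` and `has_embedded_form_match` are assumptions modeled on `primary_type_match`.

```typescript
// Simplified stand-ins for the eval's types.
interface ClassificationFields {
  primaryType: string;
  funnelPosition: string;
  hasEmbeddedForm: boolean;
}
interface ScorerArgs { output: ClassificationFields; expected: ClassificationFields }
interface ScoreResult {
  name: string;
  score: number;
  metadata: { actual: unknown; expected: unknown };
}

// Builds a strict-equality scorer for one field: 1 on an exact match, 0 otherwise.
function fieldMatch(field: keyof ClassificationFields, name: string) {
  return ({ output, expected }: ScorerArgs): ScoreResult => ({
    name,
    score: output[field] === expected[field] ? 1 : 0,
    metadata: { actual: output[field], expected: expected[field] },
  });
}

const funnelPositionMatch = fieldMatch('funnelPosition', 'funnel_position_match');
const hasEmbeddedFormMatch = fieldMatch('hasEmbeddedForm', 'has_embedded_form_match');
```

The comparison stays `===` with no normalization, so a near miss such as `blog` versus `blog-post` still scores 0.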
The dataset is hand-labeled rows sourced from the {reportId}:crawl KV blobs the pipeline already persists. Target is 50 rows, with at least two examples per primaryType enum value and five per funnelPosition. Past 50, marginal labeling time outweighs the marginal information gain. Further growth happens through the production-trace promotion loop described below, not through more upfront labeling.
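The coverage targets above are easy to check mechanically. A small sketch, assuming each labeled row carries `primaryType` and `funnelPosition` as strings; `coverageGaps` and `countBy` are hypothetical helpers, and the enum value lists are passed in instead of hard-coded.

```typescript
interface LabeledRow { primaryType: string; funnelPosition: string }

// Counts rows per distinct value of the given field.
function countBy(rows: LabeledRow[], key: keyof LabeledRow): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) counts.set(row[key], (counts.get(row[key]) ?? 0) + 1);
  return counts;
}

// Returns human-readable shortfalls against the targets from the text:
// at least 2 rows per primaryType and 5 per funnelPosition.
// An empty array means the targets are met.
function coverageGaps(
  rows: LabeledRow[],
  primaryTypes: string[],
  funnelPositions: string[],
): string[] {
  const byType = countBy(rows, 'primaryType');
  const byFunnel = countBy(rows, 'funnelPosition');
  const gaps: string[] = [];
  for (const t of primaryTypes) {
    const n = byType.get(t) ?? 0;
    if (n < 2) gaps.push(`primaryType ${t}: ${n}/2`);
  }
  for (const f of funnelPositions) {
    const n = byFunnel.get(f) ?? 0;
    if (n < 5) gaps.push(`funnelPosition ${f}: ${n}/5`);
  }
  return gaps;
}
```

Running this before each labeling session shows which enum values still need examples.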
LLM-as-judge with separate rubrics
The audit can’t be graded by exact match, so it gets three judges, each a Sonnet 4 call with a different rubric. The most instructive one is specificity:
// evals/src/scorers/judge/specificity.ts
const SPECIFICITY_RUBRIC = `You are evaluating CRO audit recommendations for SPECIFICITY.
For each recommendation, rate 1–5:
5 — A designer could implement this without further interpretation. Names a specific element, a specific change, a measurable target.
4 — Specific but missing one detail (target value, exact element, or placement).
3 — Direction is clear but multiple reasonable implementations exist.
2 — Vague direction; a designer would need to guess.
1 — Generic best-practice with no anchor to the page (e.g., "improve visual hierarchy").
Return JSON ONLY: {"per_recommendation":[{"text":"...","score":<1-5>,"reasoning":"..."}],"overall":<1-5>}`;
export async function specificityScorer({ output }: { output: AuditOutput }): Promise<ScoreResult> {
  // Flatten every category's recommendations into one list for the judge.
  const recs = output.categories.flatMap((c) => c.recommendations.map((r) => r.text));
  if (recs.length === 0) {
    return { name: 'specificity', score: null, metadata: { reason: 'no recommendations' } };
  }
  const text = await callAnthropic({
    model: AUDIT_MODEL,
    system: SPECIFICITY_RUBRIC,
    messages: [{ role: 'user', content: `Evaluate these recommendations:\n\n${recs.map((r, i) => `${i + 1}. ${r}`).join('\n')}` }],
  });
  const parsed = parseJsonObject<JudgeOutput>(text);
  // A malformed judge reply would otherwise become NaN or an out-of-range score.
  if (!Number.isFinite(parsed.overall) || parsed.overall < 1 || parsed.overall > 5) {
    return { name: 'specificity', score: null, metadata: { reason: 'invalid judge output', raw: text } };
  }
  return {
    name: 'specificity',
    score: (parsed.overall - 1) / 4, // normalize 1-5 → 0-1
    metadata: { per_recommendation: parsed.per_recommendation, raw_overall: parsed.overall },
  };
}
Actionability is the same shape with a different rubric: can the operator who bought the report implement this with the engineering capacity they have? A recommendation can be perfectly specific and still ask for a six-month rebuild.
Hallucination is the most interesting of the three because it’s a vision call. The judge gets the recommendation text and the original screenshot, and decides whether the page elements each recommendation references are actually visible in the image. If the audit says “move the green Subscribe button above the fold” and there is no Subscribe button anywhere on the page, that’s an ungrounded recommendation. It’s a real production failure mode for vision audits, because the model is operating on an image and can fail silently in ways pure-text generation can’t.
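For reference, a vision judge request to the Anthropic Messages API carries the screenshot as a base64 `image` content block next to a `text` block. The sketch below only builds the request body. `buildGroundingRequest` is a hypothetical helper, the rubric is passed in, not reproduced, and the PNG media type and the `max_tokens` value are assumptions.

```typescript
interface ImageBlock {
  type: 'image';
  source: { type: 'base64'; media_type: 'image/png' | 'image/jpeg'; data: string };
}
interface TextBlock { type: 'text'; text: string }
interface JudgeRequest {
  model: string;
  max_tokens: number;
  system: string;
  messages: { role: 'user'; content: (ImageBlock | TextBlock)[] }[];
}

// Builds the body for POST /v1/messages: screenshot first, then the numbered
// recommendations the judge should check against it.
function buildGroundingRequest(
  model: string,
  rubric: string,
  screenshotBase64: string,
  recommendations: string[],
): JudgeRequest {
  const numbered = recommendations.map((r, i) => `${i + 1}. ${r}`).join('\n');
  return {
    model,
    max_tokens: 2048, // assumed budget for the judge's JSON reply
    system: rubric,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshotBase64 } },
          { type: 'text', text: `Check each recommendation against the screenshot:\n\n${numbered}` },
        ],
      },
    ],
  };
}
```

Keeping request construction separate from the network call makes it possible to unit-test that the screenshot is attached at all.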
A few design choices worth flagging:
Judge model equals audit model. Both are Sonnet 4. The canonical “judge no weaker than generator” rule of thumb is the reason. A weaker judge tends to rubber-stamp the audit’s own biases. I’m not using Opus for the judge because the per-row cost is roughly 3× and the rubrics are structured enough that Sonnet is sufficient. If that turns out to be wrong, the regression checks I run on the seed dataset should surface it.
All rubrics scored in the same direction. Higher is better for all three, including hallucination, which I score as grounded_fraction rather than “hallucination rate.” CI threshold logic and dashboards don’t have to special-case any dimension.
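As a small illustration, `grounded_fraction` can be computed from per-recommendation verdicts like this. The verdict shape (one boolean per recommendation) is an assumption about the judge’s output, not something the writeup specifies.

```typescript
// Assumed judge output: one verdict per recommendation.
interface GroundingVerdict { text: string; grounded: boolean }

// Fraction of recommendations whose referenced elements were found in the
// screenshot. Higher is better, matching the other two rubrics.
// Returns null when there is nothing to grade.
function groundedFraction(verdicts: GroundingVerdict[]): number | null {
  if (verdicts.length === 0) return null;
  const grounded = verdicts.filter((v) => v.grounded).length;
  return grounded / verdicts.length;
}
```

A report with three grounded recommendations out of four scores 0.75, and a threshold of the form “fail if the score drops” works unchanged.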
No composite score. A single “quality” number is the seductively wrong move. If specificity falls and actionability rises by the same amount, the composite doesn’t budge, and I’ve lost the signal I most cared about.
What’s broken, or hasn’t been tested yet
The classification eval is in a good place: small, dull, the kind of thing a reader can scan in 30 seconds and trust. The audit-judge eval is more of a discipline than a system, and I want to be upfront about its weak points.
Judge drift. Sonnet 4 is the judge. When Sonnet 4 itself updates (or when the audit moves to a future model), the judge’s ratings will shift even if nothing in the audit prompt changes. The eval needs a held-back set of human-rated examples, kept out of rubric tuning, that the judge is periodically re-scored against so I can detect when the judge itself starts grading differently. That set doesn’t exist yet.
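Once such a set exists, one simple drift signal would be the mean absolute gap between human and judge scores, tracked over time. The sketch below is an assumption about how that could work: `AnchorRow`, `meanAbsoluteGap`, `judgeHasDrifted`, and the tolerance parameter are all hypothetical.

```typescript
// One held-back example: a fixed human rating and the judge's current rating,
// both on the rubric's 1-5 scale.
interface AnchorRow { id: string; humanScore: number; judgeScore: number }

// Average absolute disagreement between judge and human across the anchor set.
function meanAbsoluteGap(rows: AnchorRow[]): number {
  if (rows.length === 0) throw new Error('anchor set is empty');
  const total = rows.reduce((sum, r) => sum + Math.abs(r.humanScore - r.judgeScore), 0);
  return total / rows.length;
}

// Flags drift when the gap has grown by more than `tolerance` since the
// baseline run. The human scores never change, so any movement is the judge.
function judgeHasDrifted(baselineGap: number, currentGap: number, tolerance: number): boolean {
  return currentGap - baselineGap > tolerance;
}
```

The human ratings are the fixed point here, which is why the set has to stay out of rubric tuning.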
Rubric and prompt co-evolution. Imagine I add a “mobile-specific” category to the audit prompt but don’t update the judge dataset to include mobile-specific rows. Scores would look stable; mobile quality would actually be drifting. The eval would produce false confidence. The intended discipline against this is a quarterly manual review of the rubric against the prompt, with a written changelog that ties every audit-prompt edit to the corresponding judge update. That discipline is on paper; it hasn’t been tested over time.
Dataset size. Fifty classification rows and twenty-five audit rows are enough to catch obvious regressions and not much else. The plan is to grow the audit set automatically by promoting production traces: every report adds five audit blobs to KV; a curation script pulls them, sanitizes them (hashes domains, drops anything with identifying screenshot content), and appends to audit.jsonl. The bottleneck is sanitization, not infrastructure.
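The domain-hashing step could look roughly like this. It is a sketch under assumptions: the curation script is taken to run in Node (hence `node:crypto`), the salt and the 16-character truncation are illustrative choices, and `hashDomain` and `sanitizeUrl` are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

// Replaces a hostname with a salted SHA-256 digest so rows from the same site
// still group together without naming the site.
function hashDomain(url: string, salt: string): string {
  const host = new URL(url).hostname.toLowerCase();
  return createHash('sha256').update(salt + host).digest('hex').slice(0, 16);
}

// Rewrites a URL to use the hashed host under the reserved .example TLD.
// The path is kept because it often carries the classification signal;
// paths that identify the customer would need separate handling.
function sanitizeUrl(url: string, salt: string): string {
  const parsed = new URL(url);
  return `https://${hashDomain(url, salt)}.example${parsed.pathname}`;
}
```

Hashing only covers text fields. Deciding whether a screenshot is identifying is the harder part, and this sketch does not address it.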
No longitudinal data. This is being written before the eval has been live across enough prompt edits to have a track record. The engineering decisions stand on their own; the maintenance story is aspirational until the rubric has survived a real prompt revision without me having to bend it to fit.
What I’d build next
The first thing not in this writeup that I’d add: a GitHub Actions workflow that runs both evals on every PR touching the audit prompt, the classify prompt, or any phase under worker/src/phases/. Threshold gate: if any classification scorer drops more than 5%, or any audit rubric drops more than 0.3 absolute against the main baseline, the PR fails. Per-PR cost is small enough that running on every relevant change is fine; the value is that prompt edits stop being “hopefully fine” and start being checked.
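The gate logic itself is small. A sketch, with the thresholds left as parameters because the text does not pin down their units: here the 5% is read as a 0.05 absolute drop in accuracy, and the audit threshold applies in whatever scale the audit scores are stored. `gateFailures` and `GateConfig` are hypothetical names.

```typescript
type Scores = Record<string, number>;

interface GateConfig {
  classificationMaxDrop: number; // e.g. 0.05, assuming accuracy in [0, 1]
  auditMaxDrop: number;          // e.g. 0.3, in the audit scores' own units
  auditScorers: string[];        // names treated as audit rubrics
}

// Compares each baseline scorer with the candidate run and returns one
// message per violation. An empty array means the PR passes.
function gateFailures(baseline: Scores, candidate: Scores, config: GateConfig): string[] {
  const failures: string[] = [];
  for (const name of Object.keys(baseline)) {
    const before = baseline[name];
    const after = candidate[name];
    if (before === undefined) continue;
    if (after === undefined) {
      failures.push(`${name}: missing from candidate run`);
      continue;
    }
    const isAudit = config.auditScorers.indexOf(name) !== -1;
    const maxDrop = isAudit ? config.auditMaxDrop : config.classificationMaxDrop;
    if (before - after > maxDrop) {
      failures.push(`${name}: ${before} -> ${after} (max allowed drop ${maxDrop})`);
    }
  }
  return failures;
}
```

Each scorer is checked on its own, so the failure message names the dimension that regressed. A CI step would exit non-zero when the returned array is non-empty.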
After that, a self-consistency check (intra-rater rather than inter-rater, since it’s a single judge): each judge rates the same row twice, the two scores are compared, and inconsistent rows are surfaced. It’s useful for catching cases where the judge itself is the unstable part of the loop, especially in the hallucination dimension, where small differences in attention to the screenshot can flip a grounding judgement.
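A sketch of that repeat-rating check, with the judge injected as a function so the loop can be tested with a stub. `findUnstableRows`, the `Judge` type, and the tolerance parameter are hypothetical. In practice the judge would wrap one of the rubric scorers.

```typescript
// Any async function that maps a row to a numeric score.
type Judge<Row> = (row: Row) => Promise<number>;

interface UnstableRow<Row> { row: Row; first: number; second: number }

// Rates every row twice with the same judge and returns the rows whose two
// scores differ by more than `tolerance`.
async function findUnstableRows<Row>(
  rows: Row[],
  judge: Judge<Row>,
  tolerance: number,
): Promise<UnstableRow<Row>[]> {
  const unstable: UnstableRow<Row>[] = [];
  for (const row of rows) {
    const first = await judge(row);
    const second = await judge(row);
    if (Math.abs(first - second) > tolerance) unstable.push({ row, first, second });
  }
  return unstable;
}
```

Rows that come back from this check are candidates for rubric tightening or for removal from the regression set, since a score that moves between identical calls cannot distinguish a real regression from noise.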
The longer-term thing I haven’t written down anywhere yet is the most important one. A dashboard that shows the eval’s score history alongside production signals: specifically, the cases where users contest a recommendation as wrong. If the eval says specificity is up and complaints about vague recommendations also go up, the rubric is wrong and needs to be sharpened. The eval grading itself is the artifact that needs the most discipline; that’s where I’d spend the next stretch of engineering time.