Multi-rubric LLM scoring over a noisy data stream

The problem

Upwork’s job feed is a firehose. A single week of search queries across the consultancy’s service lines pulls back around 700 postings, and maybe 5% are real fits. The rest are off-topic, off-budget, off-geography, posted by clients with a zero hire rate, or three days old and already buried under 80 applicants. Reading them all by hand is the failure mode this pipeline exists to avoid.

This started as an internal tool. We’d found a bottleneck in our own intake process — too many postings, too little time to triage them, and the good ones aging out before anyone looked — and built a system to clear it. So the constraints are an operator’s, not a product’s: it has to be cheap to run, honest about what it discards, and auditable when it tells me a week was empty.

The thing I want to be precise about: this is not a search problem. Keyword filters stop helping after the first cut, because the words that separate “audit I’d take” from “looks like an audit, isn’t” are the same words. Conversion rate, audit, funnel, Shopify — both kinds of post are dense with them. The signal that matters lives in what the client is actually asking for, which is the kind of judgment a fast classifier handles well and a regex doesn’t handle at all.

So it’s a grading problem. And it’s the same grading problem I wrote up in the CRO audit pipeline: an open-ended text input, a structured verdict wanted over it, and “ask Claude” as the right answer — as long as you don’t ask it just one question. This post is the second instance of that pattern in a different domain: a pipeline where Claude isn’t the product, it’s a graded judgment layer over messy real-world input. The architecture below is what falls out when you take that pattern seriously.

Why multi-rubric instead of one score

The temptation, every time, is to ask one Claude call for a single “fit score” from 0 to 10 and call it done. I built that version first. It ran fine and was useless inside a week, for the same reason a single composite is useless in the audit pipeline: when the score moves, you can’t tell which dimension moved.

A posting that’s a perfect topical match but comes from a client who’s never hired anyone, offers half the budget the engagement costs, and forbids US applicants is not the same “no” as a poor topical match with otherwise good signals. The first belongs in a “tune the queries” bucket; the second in a “broaden the rubric” bucket. A 4/10 composite collapses both into one number, and you spend the next month chasing the wrong thing.

The pipeline has four scorers, separated on purpose, each answering a different question:

A classifier routes the posting to exactly one service line, or no_fit. Is this in scope at all?
A topical-fit scorer grades the posting against the routed line’s rubric. Given that it’s in scope, is it the kind of work worth taking?
A deterministic winnability scorer grades client and competition signals. Could we plausibly win this contract if we applied?
A requirements extractor pulls the explicit asks out of the posting — numbered “include in your reply” items, anti-AI shibboleths, hard disqualifiers like “no agencies / EU-only,” format constraints. What would a compliant reply have to literally contain?

The four scores never combine. The output is a gate: a job is worth acting on only when topical fit clears the line’s threshold AND winnability clears its threshold AND no extracted disqualifier applies to us AND we haven’t already engaged the same client this week. Each gate is independent, and each failure produces a distinct status — failed_both, passed_topical_only, passed_winnability_only, disqualified, client_recently_proposed, unscored — so when the funnel is mostly empty, I can tell why it’s empty without re-reading the rows.
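A minimal sketch of that gate, assuming each upstream stage has already produced its verdict. The failure status names come from the list above. The `GateInput` fields, the threshold parameters, the `actionable` status name, and the order in which failures are checked are assumptions for illustration, not the pipeline's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateInput:
    """Verdicts from the upstream stages (field names are hypothetical)."""
    topical_score: Optional[int]      # None when the job was never scored
    winnability_score: Optional[int]
    disqualified: bool                # some extracted disqualifier applies to us
    client_recently_proposed: bool    # same client already engaged this week


def gate_status(job: GateInput, topical_min: int, winnability_min: int) -> str:
    """Return one distinct status per failure mode; the scores are never summed."""
    if job.topical_score is None or job.winnability_score is None:
        return "unscored"
    if job.disqualified:
        return "disqualified"
    if job.client_recently_proposed:
        return "client_recently_proposed"
    topical_ok = job.topical_score >= topical_min
    winnable_ok = job.winnability_score >= winnability_min
    if topical_ok and winnable_ok:
        return "actionable"  # hypothetical name for the passing status
    if topical_ok:
        return "passed_topical_only"
    if winnable_ok:
        return "passed_winnability_only"
    return "failed_both"
```

Counting rows per status is then enough to see which gate is emptying the funnel.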

This is the “no composite score” argument from the audit-grading post, for the same reason: a single number is operationally a coin flip, because you can’t act on it without unpacking it into the parts it should have stayed as.

The ingestion stage

The ingestion stage is the wide end of the funnel, and most of its design comes from the fact that it has to be cheap. It does pooled GraphQL discovery across every enabled service line (pipeline/discovery.py). Each distinct query runs once even when several lines share it; hits are deduped by job id; and the (service_line_id, query) pairs that matched get recorded as routing hints for the classifier downstream.
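The pooling step can be sketched roughly as follows. This is not the code in pipeline/discovery.py: `run_query` is a hypothetical callable standing in for the GraphQL search, and the assumption that each hit is a dict with an `id` key is mine.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Job = Dict[str, object]
Hint = Tuple[str, str]  # (service_line_id, query)


def pooled_discovery(
    line_queries: Dict[str, List[str]],
    run_query: Callable[[str], Iterable[Job]],
) -> Tuple[Dict[str, Job], Dict[str, List[Hint]]]:
    """Run each distinct query once, dedupe hits by job id, record routing hints."""
    # Invert {service_line_id: [queries]} so a shared query maps to all its lines.
    lines_by_query: Dict[str, List[str]] = defaultdict(list)
    for line_id, queries in line_queries.items():
        for query in queries:
            if line_id not in lines_by_query[query]:
                lines_by_query[query].append(line_id)

    jobs: Dict[str, Job] = {}
    hints: Dict[str, List[Hint]] = defaultdict(list)
    for query, line_ids in lines_by_query.items():
        for job in run_query(query):  # one call per distinct query
            job_id = str(job["id"])
            jobs.setdefault(job_id, job)  # first copy wins; later hits only add hints
            for line_id in line_ids:
                pair = (line_id, query)
                if pair not in hints[job_id]:
                    hints[job_id].append(pair)
    return jobs, dict(hints)
```

The second return value is what a `matched_queries` list per job would be built from.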

Two structural quirks of the Upwork API are worth a sentence each, because they’re the kind of thing that only shows up under real use:

The daysPosted_eq filter is silently ignored, so the date cutoff is enforced client-side against createdDateTime, and pagination stops early once results run past it.
Two fields the search schema declares non-null — client.companyOrgUid and amount — actually come back null for some postings (amount on hourly-only jobs, companyOrgUid for certain clients), and GraphQL bubbles the violation up to the nearest nullable ancestor, nulling the entire edges[N].node. Both fields are dropped from the search query and recovered in a per-job supplement call against the single-job endpoint, where both are nullable.

Both quirks were debugged from logs, not docs. The second one was costing me a noticeable fraction of every page before I figured out which two fields were poisoning the edges.
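A sketch of the client-side cutoff with the early stop. It assumes results arrive newest-first (the early stop implies some ordering, but that is my reading), that `createdDateTime` is an ISO-8601 string `datetime.fromisoformat` can parse, and that `fetch_page` is a hypothetical callable returning a page of nodes plus a has-next flag. Skipping `None` nodes is a defensive choice of mine, not something the pipeline is known to do.

```python
from datetime import datetime
from typing import Callable, Dict, List, Optional, Tuple

Node = Dict[str, str]
Page = Tuple[List[Optional[Node]], bool]  # (nodes, has_next_page)


def fetch_until_cutoff(fetch_page: Callable[[int], Page], cutoff: datetime) -> List[Node]:
    """Collect nodes created at or after `cutoff`, stopping once results run past it."""
    kept: List[Node] = []
    page_index = 0
    while True:
        nodes, has_next = fetch_page(page_index)
        for node in nodes:
            if node is None:
                continue  # a nulled edge carries nothing to filter on
            created = datetime.fromisoformat(node["createdDateTime"])
            if created < cutoff:
                return kept  # newest-first assumption: everything after is older
            kept.append(node)
        if not has_next:
            return kept
        page_index += 1
```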

Discovery is incremental: it walks every results file written to date, finds the most recent created timestamp, and stops paginating once it hits a job older than that. Re-runs pull only genuinely new postings, which keeps per-run cost flat over time instead of growing with the corpus.
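The high-water mark can be computed along these lines. The file layout is an assumption: this sketch treats each results file as a JSON list of job objects with a `createdDateTime` field, all sitting in one directory and matching `*.json`.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def latest_created(results_dir: Path) -> Optional[datetime]:
    """Return the newest createdDateTime across all results files, or None if there are none."""
    newest: Optional[datetime] = None
    for path in sorted(results_dir.glob("*.json")):
        for job in json.loads(path.read_text(encoding="utf-8")):
            created = datetime.fromisoformat(job["createdDateTime"])
            if newest is None or created > newest:
                newest = created
    return newest
```

On a first run the function returns `None`, and the caller would fall back to whatever fixed date window it normally uses.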

Why filter before scoring

The first filter is location: only postings whose client is in the US are fetched at all. The second is the classifier — the cheapest call in the pipeline — and it runs before any scoring. The reason matters: about 92% of fetched postings classify to no_fit, and no_fit short-circuits everything downstream. No topical-fit call, no requirements extraction, no draft attempt. If I let the topical scorer — heavier per-job rubric, adaptive thinking on — see everything, I’d be paying Opus to tell me, ninety-two times in a hundred, that a posting for a UGC video editor isn’t a CRO audit. The classifier exists so the expensive scorer only sees jobs that are plausibly in scope, and the budget for careful grading gets spent on the 8% where careful grading earns something. It’s the same principle as “don’t make the model do work cheap code can do” — except here the cheap code is itself a model, just answering a smaller question against a tighter rubric.
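The short-circuit itself is small. In this sketch `classify` and `score_topical` are hypothetical callables standing in for the two Claude calls, and jobs are handled one at a time for clarity, whereas the real classifier is batched.

```python
from typing import Callable, Dict, Iterable, List

NO_FIT_ID = "no_fit"


def triage(
    jobs: Iterable[Dict[str, str]],
    classify: Callable[[Dict[str, str]], str],
    score_topical: Callable[[Dict[str, str], str], int],
) -> List[Dict[str, object]]:
    """Classify every job, but spend a topical-fit call only on in-scope ones."""
    results: List[Dict[str, object]] = []
    for job in jobs:
        line_id = classify(job)  # cheap routing call, runs on everything
        if line_id == NO_FIT_ID:
            # Short-circuit: no topical score, no extraction, no draft.
            results.append({"id": job["id"], "line": NO_FIT_ID, "topical": None})
            continue
        results.append(
            {"id": job["id"], "line": line_id, "topical": score_topical(job, line_id)}
        )
    return results
```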

The scoring stage

This is the technical heart: three Claude calls, one deterministic Python scorer, every output structured.

Classifier

pipeline/classification.py. Opus 4.6, batched at 30 jobs per call, cached system prompt, messages.parse with a Pydantic ClassificationBatch model so the SDK does the structured-output parse and I never touch raw text. No thinking budget: this is fast routing, not multi-step reasoning, and adaptive thinking on this call starves the structured output of tokens. (I learned that the hard way; the note is in the code now.)

The system prompt is assembled per call from the enabled service lines:

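def _build_system_prompt() -> str:
    """Assemble the classifier system prompt from the enabled service lines."""
    # One section per enabled line: its id, display name, and classification hint.
    sections = []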
    for sl in active():
        sections.append(f"### {sl.id} — {sl.name}\n{sl.classification_hint}")
    lines_block = "\n\n".join(sections)
    valid_ids = ", ".join(f"`{sid}`" for sid in active_ids()) + f", `{NO_FIT_ID}`"
    return f"""You are a job-routing classifier for an Upwork proposal pipeline.

Your job: read a freelance job posting and assign it to exactly ONE service
line, or `{NO_FIT_ID}` if the job doesn't cleanly belong to any line.
...
Valid `service_line_id` values: {valid_ids}.
"""

The output schema is enum-typed: service_line_id validated against active_ids() | {NO_FIT_ID}, a one-sentence reasoning, and confidence as a Literal["high","medium","low"]. Unknown service-line IDs are coerced to no_fit with a warning — the same “treat the model as an unreliable upstream service” rule from the Split Test Pro post, expressed through Pydantic and a whitelist coerce instead of a hand-rolled JSON parser.
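The whitelist coerce is the part worth showing. The pipeline does it inside a Pydantic model; the sketch below swaps that for a plain function so it has no dependencies, and the function name and log message are hypothetical.

```python
import logging
from typing import Set

logger = logging.getLogger(__name__)
NO_FIT_ID = "no_fit"


def coerce_service_line_id(raw: str, valid_ids: Set[str]) -> str:
    """Return `raw` if it is whitelisted, otherwise fall back to NO_FIT_ID with a warning."""
    if raw in valid_ids or raw == NO_FIT_ID:
        return raw
    # Treat the model as an unreliable upstream: never let an unknown id through.
    logger.warning("Unknown service_line_id %r coerced to %s", raw, NO_FIT_ID)
    return NO_FIT_ID
```

Failing closed to `no_fit` means a hallucinated id costs one missed job at worst, not a crash or a job scored against the wrong rubric.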

One detail matters more than it looks: query matches are passed to the classifier as a hint, not a verdict. A job surfaced by both “shopify migration” and “cro audit” carries both pairs in its matched_queries list. The classifier reads them as evidence but isn’t bound by them, and the prompt says so: “Base the decision on the job content, not on which queries surfaced it — query matches are hints, not verdicts.” The query that surfaces a job is usually broader than the work it’s actually asking for.

Topical-fit scorer (the per-line rubric)

pipeline/scoring.py. Opus 4.6, adaptive thinking on, the per-line rubric cached as the system prompt, messages.parse with a ScoreBatch model. Each service line owns its rubric verbatim — the rubric is the system prompt for that line. There’s no shared scoring code path and no parametric “rubric template,” and that’s deliberate: I tried the single mega-prompt with per-line conditionals first, and it was worse. One prompt trying to hold four lines’ definitions in its head blurred the boundaries between them exactly where they needed to be sharp.

The CRO audit rubric, in its actual prompt-shaped form, looks roughly like:

You are evaluating freelance job postings on Upwork for a Conversion Rate
Optimization (CRO) audit consultant.

The consultant offers CRO audits ONLY — analytical review of an existing
site or funnel ... They do NOT execute the changes themselves under this
service line. Implementation work ... belongs to a separate service line
(cro_implementation) and must NOT score as a fit here. Their typical
engagement is a fixed-price audit around $2,500 (sweet spot ~$1,500–$5,000;
hourly equivalent ~$75–$150/hr).

# Topical fit

## Strong topical fit
- Explicit CRO/conversion rate audit requests
- Landing page audits or reviews focused on conversion
- Funnel audits / funnel optimization analysis — any industry
- ...

## Not relevant
- Implementation / development of CRO changes — that is cro_implementation
- Conversion-focused redesign work where the deliverable is a built site
- ...

# Budget weighting
- In-range budgets ($1,500–$5,000, or $75–$150/hr) — no penalty
- Stretch but workable — slight downward nudge (-1)
- Clearly off-target (fixed under $500 or over $15,000) — meaningful nudge (-2 to -3)
- Missing budget — judge on topical fit alone, no adjustment

Never let budget alone push a topically irrelevant job above 4. Never push
a topically perfect job below 5 just because budget is mid-range.

Be strict. Only score 8+ when the job is clearly an audit AND the budget
is plausible for a $2,500-range engagement.

Three things in that rubric do more work than they look like.

The “Not relevant” block is bigger than the “Strong topical fit” block. The hardest call this scorer makes isn’t “is this CRO” — it’s “is this audit-CRO or implementation-CRO?” The two are identical at keyword distance, and the consultant takes only the first. The rubric leans harder on the negative examples than the positive ones because the negatives are the failure mode; without them, the scorer happily hands 8s to dev work.

The numeric bounds in the budget block are clamps disguised as instructions. “Never let budget alone push a topically irrelevant job above 4. Never push a topically perfect job below 5 just because budget is mid-range.” Same shape as the “Do NOT flag minor daily variation under 10%” line in the Split Test Pro anomaly prompt — a soft signal that preserves the model’s judgment on edge cases while ruling out the failure mode I kept seeing. A perfectly-worded $100 audit gets capped near 4 instead of 7; a topically perfect $10K engagement that’s slightly off-budget gets a 7 instead of a 3.

A one-sentence reasoning field is required beside every score, enforced by the schema. It isn’t decoration — it’s the only artifact I can scan when I disagree with a score. “Score: 6 — explicitly requests a full funnel audit but expects hands-on Shopify implementation, mixed audit/implementation role” tells me immediately whether the 6 was right or whether the rubric is muddling its own definition. Per-score reasoning is the cheapest interpretability you can buy.
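A dependency-free sketch of what "enforced by the schema" amounts to. The real model is Pydantic; this uses a dataclass instead, and the class name, field names, and the 0 to 10 range for topical scores are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobScore:
    """One topical-fit verdict; construction fails if the reasoning is missing."""
    job_id: str
    score: int
    reasoning: str

    def __post_init__(self) -> None:
        if not 0 <= self.score <= 10:
            raise ValueError(f"score {self.score} is outside 0-10")
        if not self.reasoning.strip():
            raise ValueError("reasoning is required beside every score")
```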

Deterministic winnability scorer

pipeline/winnability.py. Pure Python, no Claude. Base score of 5, plus the sum of additive deltas from a frozen winnability_signals dict captured at discovery time, clamped to 0–10. One hard override (preferred_location_blocks_us = true → 0), one soft cap (already_hired > 0 → max 2 regardless of everything else).

It’s deterministic on purpose, for three reasons that are all in the docstring:

Re-scores are stable. Tune a weight tomorrow and every historical job re-scores identically against the new weights.
A reviewer can audit exactly why a job landed where it did. The output carries a components list ranked by |delta|, each with its raw value and a human note: “31 applicants — competitive: −2; client has never hired or posted: −1; $0 spent — first-time buyer risk: −1.”
Tuning is a one-file change — no prompt iteration, no judge re-roll, no rebuild.
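A minimal sketch of a scorer with that shape. The base of 5, the 0 to 10 clamp, the `preferred_location_blocks_us` override, the `already_hired` cap, and the ranking by |delta| come from the description above. The other signal keys, the thresholds, and the rule functions are illustrative assumptions, not the tuned values in pipeline/winnability.py.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Signals = Dict[str, Any]
Hit = Optional[Tuple[Any, int, str]]  # (raw value, delta, human note)


@dataclass(frozen=True)
class Component:
    name: str
    raw: Any
    delta: int
    note: str


def _applicants(s: Signals) -> Hit:
    n = s.get("applicants")
    if n is None:
        return None
    if n >= 30:
        return n, -2, f"{n} applicants, competitive"
    if n < 5:
        return n, 2, f"{n} applicants, wide-open competition"
    return None


def _client_hires(s: Signals) -> Hit:
    if s.get("client_total_hires") == 0:
        return 0, -1, "client has never hired or posted"
    return None


def _client_spend(s: Signals) -> Hit:
    if s.get("client_total_spent") == 0:
        return 0, -1, "$0 spent, first-time buyer risk"
    return None


# Tuning means editing this table and the rule functions, nothing else.
RULES: Dict[str, Callable[[Signals], Hit]] = {
    "applicants": _applicants,
    "client_hires": _client_hires,
    "client_spend": _client_spend,
}


def score_winnability(signals: Signals, base: int = 5) -> Tuple[int, List[Component]]:
    """Base score plus additive deltas, clamped to 0-10, with one override and one cap."""
    if signals.get("preferred_location_blocks_us"):
        return 0, []  # hard override: nothing else matters
    components: List[Component] = []
    for name, rule in RULES.items():
        hit = rule(signals)
        if hit is not None:
            components.append(Component(name, *hit))
    components.sort(key=lambda c: abs(c.delta), reverse=True)  # biggest movers first
    score = max(0, min(10, base + sum(c.delta for c in components)))
    if signals.get("already_hired", 0) > 0:
        score = min(score, 2)  # soft cap regardless of everything else
    return score, components
```

With the three signals from the example note (31 applicants, no hires, $0 spent), the deltas sum to −4 and the job scores 1.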

This is the one judgment in the system that does not want an LLM, and the line between “use Claude” and “don’t” is exactly here: when the input is structured (integers, enums, bounded ranges) and the policy is something I’m willing to write down, deterministic Python is faster, cheaper, replayable, and auditable. Claude earns its place on the unstructured judgments. Tools should fight for their place.

Requirements extractor

pipeline/requirements.py. Opus 4.6, messages.parse with a RequirementsReport schema covering five fields per job: inclusion_requirements, shibboleths (anti-AI traps, each with a literal_match and a placement enum), disqualifiers (each evaluated against a WHO WE ARE block describing the consultancy, carrying applies_to_us: bool and a single-sentence reasoning that cites the deciding fact), format_constraints, and a has_separate_screening_questions_hint flag.

Two patterns here are worth flagging.

The extractor, not a downstream rule, decides applies_to_us for each disqualifier. Take “No agencies. EU timezone only. No AI-generated proposals.” Each is a verbatim string the extractor pulls, plus a boolean it sets after checking the criterion against the consultancy’s stated identity. Doing that at extraction time keeps the gate logic a one-liner, any(d.applies_to_us for d in disqualifiers), instead of a growing pile of policy code that has to know what “EU timezone” or “agency” means.
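In sketch form, with a dataclass standing in for the Pydantic schema (the `text` field name and the helper name are assumptions; `applies_to_us` and `reasoning` are the fields described above):

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Disqualifier:
    text: str            # verbatim string pulled from the posting
    applies_to_us: bool  # set by the extractor against the WHO WE ARE block
    reasoning: str       # one sentence citing the deciding fact


def is_disqualified(disqualifiers: Iterable[Disqualifier]) -> bool:
    """The gate stays policy-free: it only reads the extractor's booleans."""
    return any(d.applies_to_us for d in disqualifiers)
```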

Shibboleths are intentionally pass/fail, no partial credit. The schema’s literal_match field is “the exact string the reply must contain to comply, character-for-character.” A separate verification step checks a draft against the report and is told explicitly: “Shibboleths are pass/fail. Partial credit does not exist — the trap is binary.” The whole point of a shibboleth is that it’s a literal-string test the client uses to filter machine-generated replies; modeling it as a fuzzy score would defeat exactly what makes it work.
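The binary nature is easiest to see as a literal substring check. This is a deterministic analog I am using for illustration, not the prompted verification step described above, and the two `placement` values are hypothetical stand-ins for whatever the real enum contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shibboleth:
    literal_match: str  # exact string the reply must contain
    placement: str      # hypothetical values: "anywhere" or "first_line"


def shibboleth_passes(draft: str, item: Shibboleth) -> bool:
    """Binary check: the literal string is present where required, or it is not."""
    if item.placement == "first_line":
        lines = draft.strip().splitlines()
        return bool(lines) and item.literal_match in lines[0]
    # Any other placement is treated as "anywhere" in this sketch.
    return item.literal_match in draft  # case-sensitive, character-for-character


def missed_shibboleths(draft: str, items: List[Shibboleth]) -> List[Shibboleth]:
    """Return every shibboleth the draft fails; an empty list means compliant."""
    return [item for item in items if not shibboleth_passes(draft, item)]
```

There is deliberately no normalization step: lowercasing or fuzzy matching would turn the trap back into a score.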

The downstream consumer assembles a draft from the structured requirements report and runs one verify-and-regenerate-on-miss pass against it. I mention it only to close the loop — the interesting engineering is the scoring spine; the consumer inherits its verdicts.

How I know the pipeline is doing useful work

As of writing, the pipeline has scored 710 jobs across 18 discovery runs, and the operational payoff is concrete. Throughput is up: I’ve gone from sending roughly 4 carefully tailored proposals per day to 15+ without spending more time at the keyboard. And the client reply rate on those proposals is up ~25% since the pipeline came online. Same hours, nearly four times the volume, better hit rate per proposal. The triage layer is what bought both: the classifier alone discards ~92% of the firehose as no_fit before any of the heavier scoring fires, and the per-line rubrics rank what’s left so the postings I actually read are pre-sorted by how worth-reading they are.

That sounds like validation. Of the aggregate layer, yes — something is working. Of the individual scores, not really. The 25% reply-rate lift could be the layer surfacing better jobs, or the extra reading-time the triage frees up going into better-tailored proposals, or both, or something I’m not measuring. And even within “the layer is working,” I can’t grade any individual score: a job that scored 8, that I acted on, that didn’t land, could be a wrong score, a right score with bad luck, a right score against someone who priced lower, or a right score where the prospect ghosted the thread. Outcomes are too far downstream from the score to grade the score.

The next thing on the list — and the reason I wrote up the classification eval in such detail for the other pipeline — is to build the same kind of small hand-labeled set here: maybe 50 postings I’ve read carefully, each with its correct classification and a defensible target topical-fit score, run on every prompt edit. The same arguments hold: small, dull, boring to read, the kind of thing a reviewer trusts. Until it exists, the honest position is that the scoring is eyeballed, the thresholds are tuned to taste, and the only reason I keep them is that the jobs surviving the gate look right when I read them. Confidently wrong validation is worse than none.
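A sketch of what that harness could look like once the labels exist. Nothing here is built yet: the `LabeledPosting` fields, the `predict` callable (a stand-in for classifier plus topical scorer), and the choice of routing accuracy and mean absolute error as the two metrics are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

NO_FIT_ID = "no_fit"


@dataclass(frozen=True)
class LabeledPosting:
    posting: str
    expected_line: str           # hand-labeled service line id, or NO_FIT_ID
    target_score: Optional[int]  # target topical-fit score; None for no_fit


Prediction = Tuple[str, Optional[int]]  # (service_line_id, topical score or None)


def run_eval(
    examples: List[LabeledPosting],
    predict: Callable[[str], Prediction],
) -> Dict[str, Optional[float]]:
    """Report routing accuracy and mean absolute topical-score error."""
    if not examples:
        raise ValueError("labeled set is empty")
    correct = 0
    errors: List[int] = []
    for ex in examples:
        line, score = predict(ex.posting)
        if line != ex.expected_line:
            continue  # a misrouted job has no comparable topical score
        correct += 1
        if ex.target_score is not None and score is not None:
            errors.append(abs(score - ex.target_score))
    return {
        "routing_accuracy": correct / len(examples),
        "score_mae": sum(errors) / len(errors) if errors else None,
    }
```

Run on every prompt edit, the two numbers would at least show whether a rubric change moved routing, scoring, or neither.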

The same pattern, a second time

Two pipelines now, two domains, the same scoring architecture: a structured input pipeline ending in a small handful of independent Claude calls that return structured verdicts, each grading one dimension, no composite, with a deterministic gate at the end. In the CRO audit pipeline the input was a website and the dimensions were classification accuracy, specificity, actionability, grounding. Here the input is a job posting and the dimensions are routing, topical fit, winnability, requirements compliance. The inputs and dimensions differ. The shape of the judgment is the same: small targeted Claude calls behind a typed boundary, scored separately, combined by code I can read.

I think this is the right default architecture for “LLM as a judgment layer over noisy real-world input.” It generalizes — what changes between domains is the rubrics and the input plumbing, not the spine. Composite scores hide the dimension that’s actually drifting; big single-prompt judges hide what they’re doing. The discipline that survives is: name the dimensions you actually care about, write a separate scorer for each, keep them independent, and resist wrapping them in a model-of-models call that summarizes them away. The summary is the gate, and the gate is code.

That second-instance point is the thing I’d most want a reader to take from this. Reach for Claude as a judge once and have it work, and it’s a feature. Reach for it twice in different domains and have it take the same shape both times, and it’s an architecture. This is the second time.

What’s broken, or hasn’t been tested yet

This is the most important section for anyone evaluating the engineering, because the pipeline is mid-iteration and the weak points are real. Writing them down forces me to defend the ones I’m comfortable with and fix the ones I’m not.

No formal eval. The single largest gap. I have the data to build one — the past proposals and scored jobs are sitting in data/ — and I haven’t done the labeling. Everything else on this list is recoverable; this is the bottleneck that gates how seriously I can iterate on any of it.
Per-line rubrics can drift independently. Each service line owns its rubric verbatim. When two lines could both lay claim to a borderline job — is this a CRO audit or a CRO implementation engagement? — the classifier is the tie-breaker, but the quality of that tie-break depends on both rubrics’ negative-example blocks staying coherent with each other. Nothing fires when two definitions quietly stop being mutually exclusive. I haven’t actually hit this yet; it’s a structural risk I can see, not a bug I’ve debugged. But there’s no test guarding it, and at some point there should be.
The classifier and topical scorer both run on Opus 4.6. That’s overkill for the classifier — a one-of-N routing call with a small enum output and no thinking budget. Haiku 4.5 would almost certainly handle it. I haven’t tried, because the per-job cost there is small and shipping the surrounding work mattered more. Same shape as the prompt-caching gap I flagged in the Split Test Pro post — except here caching is on (system prompts are cache_control: ephemeral throughout, and the requirements extractor batches by service line so the WHO WE ARE block stays cache-stable inside each batch).
Adaptive thinking on the topical scorer is load-bearing but unmeasured. I turned it on after watching the scorer make worse calls without it on borderline cases. I have no A/B between thinking-on and thinking-off against a labeled set, because the labeled set doesn’t exist. So “adaptive thinking helps here” is, right now, a vibe.
The shibboleth verifier trusts the extractor. If the extractor misses a shibboleth in a posting — “we filter for the word ‘banana’ anywhere in your reply” — the verifier never knows to check for it, and the draft sails through marked compliant. The verifier only audits items the extractor surfaced; there’s no second pass to find the ones it missed. That’s the current architecture, and it’s a real hole.
The winnability weights are tuned by intuition, not data. I can tell when the gate is mis-calibrated by reading the survivors. I can’t tell from outcome data whether +2 on “wide-open competition” should really be +1 or +3.
No multi-line routing at the classifier. A job surfacing under three lines’ queries gets classified once, which is correct as far as it goes: the classifier picks one line. But if the job genuinely fits two lines, I’ve told it to pick anyway. That call is fast, cheap, and probably right, but it’s a one-way door, and I haven’t audited what gets discarded.

The one I’d fix first is the eval set. Everything else is a known, bounded weakness I can reason about; “no eval set” is the thing standing between me and being able to iterate on the rest with any confidence.