I'm Nicholas, an engineer in Los Angeles. I build production systems on top of frontier LLM APIs: currently a Bayesian A/B testing app and a multi-stage Claude vision pipeline behind a CRO audit. I write here about the engineering, statistics, and design decisions behind what I ship.
Posts
- Multi-rubric LLM scoring over a noisy data stream
How I split the judgment problem inside a job-discovery pipeline into a deterministic gate, a routing classifier, a per-line rubric, and a structured requirements extractor, and why one composite score would have hidden everything that matters.
- Grading a Claude pipeline I'd already shipped: two scorer types, one vision audit
Grading a multi-stage pipeline after the fact: exact-match scorers for classification, three independent LLM-as-judge rubrics for the vision audit, and why a single composite quality score would have hidden every failure I cared about.
- One A/B testing product, two very different worlds: building for Shopify and the open web
What it actually takes to run one A/B testing product on both Shopify and arbitrary HTML sites: auth, event ingestion, available data, and why a half-dozen switch statements beat the interface I almost wrote.
- Three boring Claude features inside a stats app, and the patterns that made them ship
LLM features that sit quietly inside a SaaS product: a pre-launch reviewer, an async anomaly detector that returns strict JSON, and a cached post-experiment analyzer. The wrapper is 100 lines.
- Bayesian A/B testing in 200 lines of Go: what 5,000 samples actually buys you
A walkthrough of a production Bayesian A/B testing engine: Beta-Binomial for conversion, Normal for revenue, Monte Carlo sampling, and the LiftDistribution trick that makes credible intervals on dollars interpretable.
- Read-back is the part I don't generate
I generate most of the code for a NetSuite-to-Shopify migration with Claude. The part I write by hand is the read-back: the code that knows what 'wrong' looks like in this specific catalog, where a bad write looks exactly like a good one.
- The parts of E2E tests Claude can't write for you
I generated most of a 6,000-line Playwright suite for a Shopify app with Claude. The parts that mattered, the regression assertions and the comment about not trusting non-deterministic ingestion, are the parts I had to put back by hand.