The parts of E2E tests Claude can't write for you

I generated most of a 6,000-line Playwright suite for a Shopify app with Claude. The parts that mattered, the regression assertions and the comment about not trusting non-deterministic ingestion, are the parts I had to put back by hand.

Split Test Pro is a Shopify app that assigns visitors to experiment variants and counts conversions out of an event stream that lands in InfluxDB v3 a few seconds later. By hand, an E2E test for it is mostly waiting and guessing.

The variant a visitor gets lives in a single cookie value, pipe-delimited if they’re in more than one experiment at once. Conversions land in InfluxDB with a server-stamped timestamp, so a test that wants to backdate anything has to write Line Protocol straight to the bucket. No experiment will start until a subscription exists, so any setup that goes through the billing API is slow and adds another moving part.

I generated most of the Playwright suite with Claude. Here is what worked and where it was wrong.

The shape of the problem

The backend serves two clients with different payload shapes. The Shopify app posts events to one analytics endpoint with a checkout object on the payload. The HTML extension, used on non-Shopify sites, posts to a different endpoint with a session_id. Both end up in the same database table, but the fields differ on the way in. A test that asserts against the Shopify endpoint and forgets the HTML one will pass and ship broken behavior.
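One way to keep the second client from being forgotten is to run every captured payload through a single table of per-client rules, and to fail when a client has no captured payloads at all. The sketch below is a minimal illustration, not the suite's actual code: only the checkout and session_id field names come from the description above, while the client labels, the function names, and the shape of the captured object are assumptions made for the example.

```javascript
// Hypothetical per-client rules. Only `checkout` and `session_id` come from
// the text above; the client labels are invented for this sketch.
const CLIENT_RULES = {
  shopify: {
    required: 'checkout',
    isValid: (v) => v !== null && typeof v === 'object' && !Array.isArray(v),
  },
  html: {
    required: 'session_id',
    isValid: (v) => typeof v === 'string' && v.length > 0,
  },
};

// Returns a list of problems; an empty list means the payload has the right shape.
function checkPayload(client, payload) {
  const rule = CLIENT_RULES[client];
  if (!rule) return [`unknown client: ${client}`];
  if (payload === null || typeof payload !== 'object') {
    return [`${client}: payload is not an object`];
  }
  return rule.isValid(payload[rule.required])
    ? []
    : [`${client}: missing or invalid ${rule.required}`];
}

// Checks every client in the table. A client with zero captured payloads is
// reported as a problem, which is the "forgot the HTML endpoint" case.
function checkAllClients(captured) {
  const problems = [];
  for (const client of Object.keys(CLIENT_RULES)) {
    const payloads = captured[client] || [];
    if (payloads.length === 0) problems.push(`${client}: no payloads captured`);
    for (const p of payloads) problems.push(...checkPayload(client, p));
  }
  return problems;
}
```

A test would pass in something like { shopify: [...], html: [...] }, built from the POST bodies it intercepted, and assert that the returned list is empty.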

Variant assignment is encoded into a single string. One experiment looks like expA-uuid:varA-uuid. Two looks like expA-uuid:varA-uuid|expB-uuid:varB-uuid. The frontend has to parse this on every conversion and fan one POST out per experiment without confusing the delimiters. The bug that comes from getting this wrong is silent: one experiment gets its conversion logged, the other does not.
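The parsing order is what matters here: split on | first to get one pair per experiment, then split each pair on its first : only. The sketch below assumes the format exactly as described above; the function names and the experiment_id key are hypothetical, and the real tracker may differ.

```javascript
// Parses "expA:varA|expB:varB" into [{ experimentId, variantId }, ...].
// Malformed pairs (no ":", empty experiment ID, or empty variant ID) are skipped.
function parseAssignments(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') return [];
  const assignments = [];
  for (const part of raw.split('|')) {
    const pair = part.trim();
    const idx = pair.indexOf(':'); // first ":" separates experiment from variant
    if (idx <= 0 || idx === pair.length - 1) continue;
    assignments.push({
      experimentId: pair.slice(0, idx),
      variantId: pair.slice(idx + 1),
    });
  }
  return assignments;
}

// Builds one POST body per experiment. `experiment_id` is an assumed key name;
// the rest of the event is copied through unchanged.
function buildConversionPosts(raw, event) {
  return parseAssignments(raw).map(({ experimentId, variantId }) => ({
    ...event,
    experiments: [{ experiment_id: experimentId, variant_id: variantId }],
  }));
}
```

For the two-experiment string above, parseAssignments returns two entries and neither variantId contains a |.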

InfluxDB v3 ingestion is the part that makes naive E2E tests flaky. The analytics endpoint stamps every event with time.Now() on the server. There is no API for backdating. So if a test wants to seed historical data (for a date-range filter, say), it has to write Line Protocol directly to the InfluxDB bucket’s write endpoint. A dedicated backfill script exists for exactly this reason.

The subscription gate is the last one. Experiment start is gated on an active subscription, and the seed scripts provision that state directly rather than going through the billing API. Going through the billing provider’s sandbox works, but it’s slow, and it’s another moving piece in the generated code.

How I actually used Claude

The first thing that mattered was a written context dump. The login flow, the dev OTP bypass, where the auth token lands, where the assignment string lives: none of this is documented. It lives across the auth handler, the experiment extension, and a comment in one seed script. I put the relevant pieces in the prompt every time.

The workflow that produced usable tests was one new test file at a time. Feed Claude one existing test as a template (the canonical login plus create-experiment scaffold), plus the file under test, plus one assertion target. Ask for that, nothing more. “Write tests for the new traffic-allocation feature” produced a 400-line file that asserted on six things at once and was wrong in three of them. “Write a test that verifies a 50% traffic allocation persists across reload” produced 30 lines that worked.

Two structural patterns are worth showing because Claude reuses them well once the first one exists in the suite. The first is page.route('**/*', ...) to capture POSTs into an array without booting the backend. This is how a self-contained test of the HTML extension’s trackConversion works, with no server in the loop:

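// Excerpt from inside the page.route('**/*', ...) handler;
// `request` is route.request() and `url` is request.url().
if (url.endsWith('/api/html/experiment/analytics') && request.method() === 'POST') {
  let body = null;
  // A malformed body is recorded as null instead of failing the route handler.
  try { body = JSON.parse(request.postData() || '{}'); } catch (_) {}
  analyticsPosts.push(body);
  return route.fulfill({ status: 204, body: '' });
}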

The second is page.waitForRequest to assert on what the frontend sent, not on what the backend rendered. The date-range filter test uses it to verify that clicking the “Last 7 days” preset produces a /results request with from and to query params spanning seven days:

const requestPromise = page.waitForRequest((req) =>
  req.url().includes(`/api/experiments/${expId}/results`) && req.url().includes('from='),
  { timeout: 10000 },
);
await page.click('#date-range-popover [data-preset="7d"]');
const req7d = await requestPromise;
const u = new URL(req7d.url());
const from = u.searchParams.get('from');
const to = u.searchParams.get('to');
const span = (new Date(to) - new Date(from)) / 86400e3;
assert(span > 6.9 && span < 7.1, `range span of about 7 days`);

Both patterns avoid the InfluxDB flush entirely. That is the point.

The seed scripts followed a different rhythm. The main user-seeding script is mostly loop work: create user, create workspace, create goals, create experiments, generate synthetic sessions, fan out events. Claude wrote almost all of it. The backfill script, which writes Line Protocol directly to InfluxDB v3, was the opposite. I wrote the first draft of the encoder by hand and used Claude to fill in the loop. The Line Protocol details (escaping commas, equals signs, and spaces in tag values, plus the nanosecond-precision timestamp) are the kind of thing Claude gets 80% right and 20% silently wrong.
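To make those details concrete, here is a minimal encoder sketch. It is not the backfill script's actual code. The escaping follows the published line protocol rules: commas and spaces in measurement names; commas, equals signs, and spaces in tag keys, tag values, and field keys; double quotes and backslashes in string field values. The function names, and the choice to treat BigInt as an integer field and a plain number as a float field, are assumptions made for the example.

```javascript
// Backslash-escape the characters that are special in each position.
const escapeMeasurement = (s) => String(s).replace(/[, ]/g, '\\$&');
const escapeKey = (s) => String(s).replace(/[,= ]/g, '\\$&'); // tag keys/values, field keys
const escapeString = (s) => String(s).replace(/["\\]/g, '\\$&'); // string field values

// Field typing is explicit so a field never flips between integer and float:
// BigInt -> integer ("1i"), number -> float, boolean -> true/false, string -> quoted.
function formatField(v) {
  if (typeof v === 'bigint') return `${v}i`;
  if (typeof v === 'number' && Number.isFinite(v)) return String(v);
  if (typeof v === 'boolean') return v ? 'true' : 'false';
  if (typeof v === 'string') return `"${escapeString(v)}"`;
  throw new Error(`unsupported field value: ${String(v)}`);
}

// Encodes one point. The timestamp is computed with BigInt because epoch
// nanoseconds exceed Number.MAX_SAFE_INTEGER.
function toLine(measurement, tags, fields, date) {
  const tagPart = Object.keys(tags)
    .sort()
    .map((k) => `,${escapeKey(k)}=${escapeKey(tags[k])}`)
    .join('');
  const fieldPart = Object.entries(fields)
    .map(([k, v]) => `${escapeKey(k)}=${formatField(v)}`)
    .join(',');
  if (fieldPart === '') throw new Error('line protocol requires at least one field');
  const ns = BigInt(date.getTime()) * 1000000n;
  return `${escapeMeasurement(measurement)}${tagPart} ${fieldPart} ${ns}`;
}
```

The timestamp line is the easiest one to get silently wrong: multiplying milliseconds by 1e6 as a plain number loses precision, and the point still ingests.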

Where it got it wrong

Four things stand out.

The first was the | bug in multi-experiment variant assignment. Claude’s first pass at the conversion-tracker split the assignment string on : only. For a visitor in two experiments at once, the first POST got varA-uuid|expB-uuid as its variant ID, and the second experiment’s conversion was dropped. The test that catches this now lives in the multi-experiment conversion suite:

// Pre-fix bug: variantId would be `varA-uuid|expB-uuid` for the first POST.
assert(
  analyticsPosts.every((p) => !String(p?.experiments?.[0]?.variant_id || '').includes('|')),
  'no variant_id contains "|" (the pre-fix bug)'
);

That assertion exists because the bug existed. Generated tests do not get there on their own. You write the bug, the bug fires in QA, you write the assertion. The generated path stops at “the cookie was parsed.”

The second was asserting on InfluxDB-flushed data. The first version of the purchase-tracking test posted a conversion event, navigated to the results page, and asserted the count went up by one. It passed on my machine and failed in CI the next morning. InfluxDB v3’s flush is non-deterministic in local dev: events show up within a few seconds, usually, except when they don’t. Two tests in the suite now carry a comment at the top noting that they intentionally do not assert on flushed data. The replacement is to assert on the outbound HTTP shape. The endpoint accepted the payload and the payload had the right fields. The count check lives in a Go unit test.

The third was login flakiness. The login form sometimes does not paint the email input on first navigation. The symptom is rare enough that the answer was a defensive retry rather than a root-cause fix, and Claude does not write one unprompted:

try {
  await page.waitForSelector('#email', { state: 'visible', timeout: 4000 });
} catch {
  await page.reload({ waitUntil: 'networkidle' });
  await page.waitForSelector('#email', { state: 'visible', timeout: 10000 });
}

This pattern is invisible until the test runs five times and fails once. The generated version is the happy path. The two-attempt version is what you need in CI.

The fourth was backend gates that change underneath the test. The experiment-conflicts suite shipped working in April. A week later the backend added a gate on a primary metric being set before an experiment could start. The test broke. The fix was a six-line helper to set the primary goal before starting the experiment. Not really a Claude problem, but a generated-test problem: the test encoded the API as it was, with no idea that the next feature would tighten the contract. Hand-written tests carry more context about why each step is there. That makes the fix obvious instead of forensic.

What I delegate and what I write by hand

The division has gotten fairly stable.

What I let Claude write:

  • New test files that follow an existing template. The login plus create-experiment scaffold is in every test in the suite, and Claude reproduces it correctly from a single example.
  • DOM-walking and selector wiring once the selectors are listed. If I hand it the IDs and data-* attributes for the form under test, it picks the right ones.
  • Synthetic session and event fan-out in seed scripts. The “300 sessions per variant, jitter the timestamps, fan out N events per session” loop is the kind of code that gets boring fast.
  • Assertion phrasing on payload shape. Given a JSON example of a POST, Claude can write field-level asserts that cover every key.
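The session fan-out loop from the list above can be sketched as follows. This is an illustration under stated assumptions, not the seed script itself: the function and option names and the event fields are hypothetical, and the seeded generator (a mulberry32-style PRNG) is swapped in so that the same seed always produces the same data.

```javascript
// Small seeded PRNG in the mulberry32 style; returns floats in [0, 1).
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generates eventsPerSession events for each of sessionsPerVariant sessions
// per variant. All option names and event fields are hypothetical.
function generateEvents({ variantIds, sessionsPerVariant, eventsPerSession, windowStartMs, windowMs, seed }) {
  const rand = seededRandom(seed);
  const events = [];
  for (const variantId of variantIds) {
    for (let s = 0; s < sessionsPerVariant; s++) {
      const sessionId = `${variantId}-session-${s}`;
      // Jitter: each session starts at a random offset inside the window.
      let ts = windowStartMs + Math.floor(rand() * windowMs);
      for (let e = 0; e < eventsPerSession; e++) {
        events.push({ session_id: sessionId, variant_id: variantId, seq: e, ts });
        // Later events in the same session land between 1 and 30 seconds apart.
        ts += 1000 + Math.floor(rand() * 29000);
      }
    }
  }
  return events;
}
```

With two variants, 300 sessions per variant, and 3 events per session, this yields 1,800 events, and a fixed seed makes a failed seeding run reproducible.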

What I write by hand or rewrite heavily:

  • Anything writing Line Protocol to InfluxDB directly. Tag-vs-field choices, escaping rules, nanosecond timestamps. The first draft is usually wrong in a way that ingests fine and queries empty.
  • The first test for any new feature. It sets the template the rest will copy, so it is worth doing carefully.
  • Regression assertions. The |-bug assert above means nothing without the commit message it came from. A generated test will not write that line in advance of the bug.
  • Anything that crosses the Shopify/HTML divergence. The two endpoints have different payload shapes, and Claude will write a test that passes for one and silently skips the other.

The Playwright suite is 40 files and around 6,000 lines. Claude wrote most of those lines. The parts that matter (the regression assertion, the comment about not asserting on InfluxDB, the retry around the login form) are the parts I had to put back in by hand. That ratio is fine. It is not going to change.

This is the same division I keep landing on. Whether the generated artifact is a data migration or a scoring pipeline, generation does the writing and a hand-built layer does the checking. Tests are just the version of that where the checking is the whole deliverable.
