Read-back is the part I don't generate

I generate most of the code for a NetSuite-to-Shopify migration with Claude. The part I write by hand is the read-back: the code that knows what 'wrong' looks like in this specific catalog, where a bad write looks exactly like a good one.

I generate most of the script code for this migration with Claude, and it saves me an order of magnitude of time. The interesting part, the part that makes that leverage safe on live production data, is the discipline I build around it.

The project: a specialty hardware retailer with 2,000+ SKUs running inventory in NetSuite and four storefronts in Shopify. Shipping data comes from Pacejet, the freight rating service, sitting next to NetSuite. SEO has to survive a brand consolidation off an old Magento store. The connector in the middle is an iPaaS (integration platform as a service). I write the Python that handles what the iPaaS cannot or should not. Most of that Python is generated. The rest of this post is about where the line between “generated” and “hand-written” actually falls, and why.

What makes this sync hard

NetSuite stores variant products as “matrix items.” A parent record holds the option axes and a set of child records hold the actual SKUs. Shopify has a flat product-and-variant model. Mapping between them is mostly mechanical, but the edges are bad.

Child SKUs in NetSuite look like PARENT : CHILD-CODE in some places and CHILD-CODE in others, depending on which API surface you came in through. The shipping-data upsert script splits on : and takes the right half. The day NetSuite changes the formatting, it writes shipping data onto the wrong variant. There is no error.

The mapping code also stores the parent’s NetSuite internal ID in a field literally named parent_sku. There is a comment in the mapping file admitting it. Anything downstream that joins on that field as if it were a SKU silently misses.

Pacejet ships about 25 shipping and logistics fields per variant: package dimensions, freight class, LTL surcharges, lift-gate flags. They are upserted into Shopify as metaobjects, then linked to each variant via a metafield. The upsert has no idempotency key. A rerun after a partial failure produces orphan metaobjects.

That is the catalog side. The store side has its own class of bug. The same SKU shows up in more than one of the four storefronts, sometimes with different prices, sometimes only on one. Tags and collections drift between stores within a week of going live.

Images are a mess. Some come from the old Magento media path. The rest are scraped, hand-uploaded, or pulled from supplier sites. Filenames have spaces and parentheses.

Redirects come from Ahrefs, not a clean URL dump. Multiple old URLs map to the same new product. Some old URLs do not map to anything.

None of this is exotic. It is just a long tail of edges, and the failure mode for most of them is silent. That last property, silent failure, is the whole reason the rest of this post exists.

The loop

The workflow looks the same every time. I give Claude a sample of the source data, the target schema, and the API I am writing against. For the shipping-data upsert that was: a JSON sample of NetSuite item fields, the Shopify metaobject definition I had set up, and the productVariantUpdate mutation shape. For the redirects: a CSV from Ahrefs, the Shopify redirects format, and a sketch of the fallback order I wanted to try.

The first version of the script is always recognizably close. Field mapping is right. The GraphQL shape is right. The pagination loop is right. This is most of the work, and generating it is a genuine multiplier. The boilerplate, the API plumbing, the transforms between two known shapes all come back correct on the first pass far more often than not.

It is also almost always wrong in one specific way. Not a bug in the bug-tracker sense, but a silent assumption.

Where generated code goes quietly wrong

These aren’t reasons not to generate. They’re the specific category of thing generation can’t know from a schema and a sample, and once you can name the category, you can build for it. Five examples from this project.

SKU shape is a stable string. The clean_sku helper assumes that any SKU containing : is a matrix child and that the child code is on the right side of the separator. That has been true so far. It has not always been true historically. The variant lookup succeeds whether the assumption holds or not. The metaobject link succeeds. The product just has the wrong package dimensions until someone notices the freight quote is off.
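One way to turn that silent assumption into a loud one is to have the helper refuse any shape it does not recognize. This is a minimal sketch, not the project's clean_sku: the name clean_sku_strict and the CHILD_CODE pattern are assumptions for illustration, and the real catalog's child-code format may need a different pattern.

```python
import re

# Hypothetical pattern for a child code; adjust to the real catalog's format.
CHILD_CODE = re.compile(r"[A-Z0-9][A-Z0-9-]*")


def clean_sku_strict(raw: str) -> str:
    """Return the child code from 'PARENT : CHILD-CODE' or a bare 'CHILD-CODE'.

    Raises ValueError on any other shape instead of guessing.
    """
    parts = [part.strip() for part in raw.split(":")]
    if len(parts) > 2:
        raise ValueError(f"unexpected SKU shape: {raw!r}")
    child = parts[-1]
    if not CHILD_CODE.fullmatch(child):
        raise ValueError(f"child code does not look like a SKU: {raw!r}")
    return child
```

A raised error stops the batch, which is the point: a formatting change upstream becomes a failed run rather than shipping data on the wrong variant.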

HTML entities are decodable by lookup table. The mapping file’s entity decoder handles &lt;, &gt;, &quot;, &amp;, &trade; and a handful of others, plus all numeric character references. It does not handle most named entities. NetSuite descriptions are pasted in from vendor PDFs and they contain &deg;, &frac12;, &middot;, &hellip;. Those pass through as literal entity text on the storefront, in the middle of a product description, where nobody on the team looks until a customer mentions it.
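In Python, the standard library's html.unescape already covers the full HTML5 named-entity table, so a hand-rolled lookup is not needed. The sketch below pairs it with a read-back check for anything that still looks like an entity after decoding, which is what double-encoded text produces. The function names and the LEFTOVER_ENTITY pattern are illustrative assumptions, not the mapping file's actual code.

```python
import html
import re

# Text that still looks like a named or numeric entity after decoding.
LEFTOVER_ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_description(text: str) -> str:
    """Decode named and numeric character references in one pass."""
    return html.unescape(text)


def leftover_entities(text: str) -> list[str]:
    """List entity-shaped fragments that survived decoding."""
    return LEFTOVER_ENTITY.findall(text)
```

Running leftover_entities over every decoded description in the dry-run output gives a reviewer a short list to read instead of a storefront to eyeball.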

URLs are safe to concatenate. The image upload script builds a source URL by joining a base path with a filename from a CSV. The filenames contain spaces and parentheses, because the Magento export contained spaces and parentheses. Shopify accepts the URL and quietly fails to fetch the asset. The product has the right number of media slots and the wrong number of pictures.
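The fix on the write side is to percent-encode the filename before joining it to the base path. A minimal sketch, assuming the CSV holds raw, not already encoded, filenames; build_image_url and the example host are hypothetical.

```python
from urllib.parse import quote


def build_image_url(base: str, filename: str) -> str:
    """Join a base path and a raw filename, percent-encoding the filename."""
    # safe="" also encodes "/", so a filename can never add a path segment.
    return f"{base.rstrip('/')}/{quote(filename, safe='')}"


url = build_image_url("https://example.com/media/", "anchor (1).jpg")
```

If some rows in the CSV are already percent-encoded, this will double-encode them, so that case needs its own check before the call.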

Bulk writes are idempotent if the API call is idempotent. The metaobject upsert uses metafieldsSet, which is idempotent at the call level. It is not idempotent across a retry that creates fresh metaobjects, because metaobject creation is a separate mutation upstream of the link. A partial failure mid-batch followed by a rerun leaves orphan metaobjects with nothing pointed at them.
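A common way to close this kind of gap is to derive the metaobject's identity from the variant itself, so a rerun lands on the same record instead of creating a new one, and to diff created IDs against linked IDs after every batch. The sketch below assumes the create step can be keyed on a caller-chosen handle, which I have not shown here; shipping_metaobject_handle and find_orphans are hypothetical names.

```python
import hashlib


def shipping_metaobject_handle(variant_sku: str, prefix: str = "shipping") -> str:
    """Derive a stable handle from the SKU so a rerun targets the same record."""
    normalized = variant_sku.strip().upper()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"


def find_orphans(created_ids: set[str], linked_ids: set[str]) -> list[str]:
    """Metaobject IDs that exist but that no variant metafield points at."""
    return sorted(created_ids - linked_ids)
```

The handle only helps if the SKU itself is stable, so it inherits the SKU-shape assumption above; the orphan diff is the part that does not depend on it.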

Two URLs that look the same redirect the same way. The redirect generator drops the query string with split('?')[0] and does not normalize trailing slashes. The Ahrefs export has both /products/foo and /products/foo/ as separate rows. Sorted category URLs with filter parameters all collapse to the bare category once the query string is gone. Shopify de-dupes on import and emits a warning, not an error.
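A pre-import report can surface these collisions before Shopify quietly resolves them. This is a sketch under the assumption that stripping the query string and trailing slash is the right normalization for this store; normalize_path and find_collisions are hypothetical names.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_path(url: str) -> str:
    """Drop query and fragment and strip trailing slashes (root stays '/')."""
    path = urlsplit(url).path
    return path.rstrip("/") or "/"


def find_collisions(old_urls: list[str]) -> dict[str, list[str]]:
    """Group old URLs by normalized path; keep groups with more than one member."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in old_urls:
        groups[normalize_path(url)].append(url)
    return {path: urls for path, urls in groups.items() if len(urls) > 1}
```

Each group in the result is a decision for a person: which target the shared path should point at, and whether the filtered variants deserve their own redirects.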

None of these would have been caught by a unit test of the script. The code does exactly what it says. The bug is the assumption underneath the code, and the assumption is invisible to the generator because it lives in the data’s history, not in its shape.

Read-back is the part I don’t generate

Every script that writes to a production system has a dry-run mode that produces a CSV before any mutation runs. The bulk SKU update script for one storefront requires a verbatim confirmation phrase on the command line, the kind of guardrail you add after one bad run. The matching script that feeds it computes a separate match-quality score against SKU, barcode, and vendor code, and writes a CSV that shows which key it matched on. The first version of that script trusted SKU alone and got it wrong for 55 variants.
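The shape of that match-quality CSV can be sketched in a few lines. This is an illustration of the idea, not the project's matching script: the column names in MATCH_KEYS, the score (a plain count of agreeing keys), and both function names are assumptions.

```python
import csv
import io

MATCH_KEYS = ("sku", "barcode", "vendor_code")  # hypothetical column names


def score_match(source: dict, target: dict) -> tuple[int, list[str]]:
    """Count how many keys agree and report which ones."""
    matched = [
        key for key in MATCH_KEYS
        if source.get(key) and source.get(key) == target.get(key)
    ]
    return len(matched), matched


def render_dry_run(pairs: list[tuple[dict, dict]]) -> str:
    """Build the review CSV as text; nothing is written to any store."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_sku", "target_sku", "score", "matched_on"])
    for source, target in pairs:
        score, matched = score_match(source, target)
        writer.writerow(
            [source.get("sku", ""), target.get("sku", ""), score, "|".join(matched)]
        )
    return buffer.getvalue()
```

A row that matched on SKU alone, with barcode and vendor code disagreeing, is exactly the kind of row a reviewer should be made to look at.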

For images, a verifier script reads every product back from Shopify, pulls the media list, and substring-matches uploaded filenames against the source CSV. Substring, not equality, because Shopify rewrites filenames on upload and the verifier has to be wrong-tolerant in the right direction.
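A minimal sketch of that substring check, assuming the uploaded name still contains the source file's stem once both are reduced to lowercase letters and digits. How Shopify actually rewrites a given filename is not something this sketch knows; missing_images and the sample names are hypothetical.

```python
import re
from pathlib import PurePosixPath


def _key(name: str) -> str:
    """Reduce a filename to its stem, lowercased, letters and digits only."""
    stem = PurePosixPath(name).stem
    return re.sub(r"[^a-z0-9]+", "", stem.lower())


def missing_images(expected: list[str], uploaded: list[str]) -> list[str]:
    """Expected filenames whose key appears in no uploaded filename's key."""
    uploaded_keys = [_key(name) for name in uploaded]
    missing = []
    for name in expected:
        key = _key(name)
        # An empty key would match everything, so count it as missing.
        if not key or not any(key in candidate for candidate in uploaded_keys):
            missing.append(name)
    return missing
```

Substring matching can over-match (a key of anchor1 is also found inside anchor10), so a stricter second pass may be worth adding for catalogs with numbered filenames.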

For orders, a forensic query flags any NetSuite Sales Order that references the same Shopify order name more than once. It exists because a sync app pushed duplicates and nobody noticed for a while.
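The same check can be sketched over exported rows rather than as a saved search. The field names shopify_order_name and netsuite_id are assumptions for illustration, as is the function name.

```python
from collections import defaultdict


def duplicated_order_names(sales_orders: list[dict]) -> dict[str, list[str]]:
    """Map each Shopify order name to its Sales Order IDs when there are several."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for order in sales_orders:
        by_name[order["shopify_order_name"]].append(order["netsuite_id"])
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}
```

An empty result is the only passing state; anything else is a list of Sales Orders for someone to reconcile by hand.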

Nothing reaches production unreviewed. A teammate, James, QAs everything these scripts produce. The important part is that the review is a person reading the dry-run diff and the match-quality CSV, not a person eyeballing the catalog. That distinction is the whole point. The failures here are invisible to eyeballing, so the artifacts are what let the human gate catch anything at all. A reviewer staring at a storefront would never notice that a freight class landed on the wrong variant. A reviewer reading a match-quality CSV sees the 55 rows that matched on the wrong key.

The pattern across all of these is the same. Generated code writes. Hand-written code reads back and checks. A person signs off on the read-back, not on the raw output. The read-back path is the one I do not delegate, and it’s worth being precise about why.

Generated code that checks its own work tends to check the easy cases. It will assert that the API call succeeded. It will assert that the response shape matches. It will not check the thing you didn’t think to mention, because you didn’t think to mention it. The validation has to be written by the person who knows what wrong looks like in this data, and wrong-looking data in a catalog of marine hardware is different from wrong-looking data in a catalog of e-books. The generator knows the shape of the API call. It does not know that a freight class of zero is impossible, or that a fifty-pound anchor with no package dimensions is a data-entry miss rather than a valid record. That knowledge is the asset, and it belongs to the team, not the model.
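Those two examples translate directly into checks that no schema would suggest. A minimal sketch: the field names, the HEAVY_LB threshold, and the function name are all assumptions, and a real version would carry many more rules than these two.

```python
HEAVY_LB = 50  # hypothetical threshold; pick one that fits the catalog
DIMENSION_FIELDS = ("length_in", "width_in", "height_in")  # hypothetical names


def shipping_record_problems(record: dict) -> list[str]:
    """Return reasons this record looks wrong; an empty list means no flags."""
    problems = []
    if not record.get("freight_class"):
        problems.append("freight class is missing or zero")
    weight = record.get("weight_lb") or 0
    has_all_dims = all(record.get(field) for field in DIMENSION_FIELDS)
    if weight >= HEAVY_LB and not has_all_dims:
        problems.append("heavy item without package dimensions")
    return problems
```

Both rules pass the API's own validation, which is why they have to live in the read-back rather than in the write path's error handling.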

This is the same division of labor I keep landing on in other systems. The grading pipeline behind a CRO audit and a job-discovery pipeline for our intake both put the load-bearing work in a verification layer the model doesn’t get to write. Different domains, same spine: let generation do the writing, keep the checking by hand.

The line

So the boundary is explicit, and naming it is what lets me generate aggressively on everything safe.

I’m happy to generate:

  • Transforms between two known shapes.
  • API call boilerplate.
  • CSV parsers and writers.
  • One-shot exports.
  • Pagination and rate-limit handling.

I don’t generate:

  • Anything that writes to a production system without a dry-run and a diff.
  • Anything that assumes the source schema is stable.
  • Anything where silent failure looks the same as success.

And one I keep outside the list because it’s load-bearing: the verifier that watches the generated code. Everything above the line, I hand to Claude without hesitation, because the failure mode is loud: it throws, or the shape is wrong, or the dry-run diff looks visibly off. Everything below the line is where a wrong result looks exactly like a right one, and that’s precisely where a human who knows the data has to stand.

Where this still fails me

The honest weakness: the read-back layer encodes assumptions about what wrong looks like, and it has the same blind spot the generated code does, just one level up. If a category of wrong never occurred to us, we didn’t write a check for it, and now both layers miss it together. The 55 mis-matched variants got caught because the matching script scored match quality across three keys. The failure nobody has imagined yet won’t be. The discipline narrows the surface; it doesn’t close it. The read-back is a better net than self-checking generated code, but it’s still a net woven by hand, and it only catches the shapes someone anticipated.

That’s the actual frontier here, and I don’t have a clean answer to it, only the practice of widening the net every time production teaches us a new shape of wrong.

What the line buys

Generation handles the parts of a migration where the shape is known and the failure mode is loud, which is most of the parts. What’s left, the read-back that knows what wrong looks like in this catalog, is exactly where a person still earns their place. Get that division right and you can generate fearlessly on everything above the line, because you’ve built the thing that catches it when you’re wrong below it.
