Bayesian A/B testing in 200 lines of Go: what 5,000 samples actually buys you

A walkthrough of a production Bayesian A/B testing engine: Beta-Binomial for conversion, Normal for revenue, Monte Carlo sampling, and the LiftDistribution trick that makes credible intervals on dollars interpretable.

Most A/B testing tools still report p-values. Then the merchant asks “so does the new checkout work or not?” and a salesperson — or worse, a dashboard — answers “the result is statistically significant at p < 0.05.” Nobody on either side of that conversation actually knows what that means.

When I started building the stats layer of Split Test Pro, I wanted the engine to answer a different question — the one merchants are actually asking: “what is the probability that this variant beats the control, given what we’ve seen so far?” That question only has an answer in a Bayesian frame. This post is about how that engine works, what 5,000 Monte Carlo samples actually buy you, and the one design choice I’m most happy with: LiftDistribution, which makes percentage-based credible intervals on continuous metrics (revenue, AOV) just work.

The case for Bayesian, in one paragraph

Frequentist A/B testing answers: “if the variants were identical, how often would we see a difference at least this large by chance?” Bayesian A/B testing answers: “given the data, what’s the probability the variant is better?” The second is the question every product manager and merchant is actually asking. It also degrades gracefully — peeking at the dashboard 14 times during a 30-day test doesn’t blow up your false-positive rate the way it does with NHST. (More on the limits of that claim below; I want to be honest about it.) And multi-arm tests fall out for free instead of requiring Bonferroni-style corrections.

The engineering case is even simpler: a Bayesian engine lets the UI present things like “93% probability that Variant B is the best” and “95% credible interval for lift: +2.1% to +7.8%” without having to lie about what those numbers mean.

Two distributions, one interface

An experiment has two kinds of metrics: conversion-rate things (did they check out, did they click, did they sign up — binomial outcomes) and continuous things (revenue per session, average order value — real-valued outcomes). They get different distributions, but the rest of the engine doesn’t want to know about it.

So the whole system is bolted onto a single Go interface:

type Distribution interface {
    Sample() float64
    Clone() Distribution
}

That’s it. Every metric becomes a thing you can sample from and clone. Sample() draws one realization from the posterior; Clone() exists because the Monte Carlo loop runs in parallel and each goroutine needs its own random source (gonum’s distuv distributions aren’t goroutine-safe — sharing one will silently corrupt your samples). More on parallelism in a minute.

Conversion metrics: Beta-Binomial with a weak prior

For binomial outcomes, the conjugate prior is Beta. With k conversions out of n sessions, the posterior is Beta(α + k, β + n − k). Choosing α = β = 1 gives a flat prior — every conversion rate from 0 to 1 is equally likely before we see data. That’s a deliberately weak prior; on real experiments with hundreds of sessions per arm, the data dominates almost immediately, and we don’t want the prior whispering “checkouts are usually around 3%” into the merchant’s results when their checkout might be 0.5% or 12%.

// /workspace/app/internal/experiment/results.go:1163
alpha := conversions + 1.0
beta  := sessions - conversions + 1.0

// /workspace/app/internal/stats/bayes/bayes.go:76
func (mb *MeasurementBinomial) InitPosteriorDistribution() {
    if mb.Alpha > 0 && mb.Beta > 0 {
        mb.PosteriorDistribution = &distuv.Beta{
            Alpha: float64(mb.Alpha),
            Beta:  float64(mb.Beta),
            Src:   NewRandSource(),
        }
    }
}

Continuous metrics: Normal with sampling standard error

For revenue and AOV, we use a Normal posterior over the mean. The trap people fall into here is using the sample’s standard deviation (σ) instead of the standard error of the mean (σ / √n). With 10,000 sessions, those differ by a factor of 100; using the wrong one gives you laughably wide credible intervals.

func (mc *MeasurementContinuous) InitNormalDistribution() {
    sigma := mc.SamplingStdDev  // standard error of the mean
    if sigma == 0 {
        sigma = mc.StdDev       // fallback for zero-data cases
    }
    if mc.Mean != 0 && sigma != 0 {
        mc.NormalDistribution = &distuv.Normal{
            Mu:    mc.Mean,
            Sigma: sigma,
            Src:   NewRandSource(),
        }
    }
}

This is technically a Normal approximation rather than a full Bayesian posterior with a Normal-Inverse-Gamma prior. In practice, at sample sizes where merchants make decisions (≫ 30 per arm), the approximation is indistinguishable from the conjugate update, and the math is far easier to reason about when something looks off.

Monte Carlo over closed-form

You can compute “probability variant beats control” for two Beta distributions in closed form. There’s a tidy result involving a sum over binomial coefficients. We don’t use it.

Instead, the engine draws 5,000 samples from each variant’s posterior and counts how often the variant sample exceeds the control sample:

func (s *Service) CalculateProbabilityToBeatControl(
    control, variant Distribution, sampleSize int,
) float64 {
    var winningSampleCount int64
    var wg sync.WaitGroup
    wg.Add(numGoroutines)  // 8

    for i := 0; i < numGoroutines; i++ {
        go func() {
            // Each goroutine clones distributions so it has its own RNG
            localControl := control.Clone()
            localVariant := variant.Clone()

            localCount := 0
            samplesPerGoroutine := sampleSize / numGoroutines
            for j := 0; j < samplesPerGoroutine; j++ {
                if localVariant.Sample() > localControl.Sample() {
                    localCount++
                }
            }
            atomic.AddInt64(&winningSampleCount, int64(localCount))
            wg.Done()
        }()
    }
    wg.Wait()

    return float64(winningSampleCount) / float64(sampleSize) * 100
}

Why? Three reasons, in order of how much I care about each:

  1. The same code handles 2 arms and 6 arms. The closed-form for “probability of being the best variant” across three or more Beta distributions is genuinely awful. With sampling, you just draw from each arm in parallel, take the max, and increment a counter. More on this below.
  2. Continuous metrics don’t have a tidy closed-form anyway. The moment you want “probability that revenue-per-session in B beats A”, you’re integrating a difference of Normals. Yes, that’s analytically tractable; no, I don’t want two code paths.
  3. It’s plenty fast. 5,000 samples × 8 goroutines × one metric is around 10–40 ms on the box this runs on. The merchant clicked a button; nobody will notice 30 ms.

Why 5,000 specifically? Because the standard error of a Monte Carlo probability estimate is about √(p(1−p)/N). At N=5000, even worst-case p=0.5 gives ±0.7%, which is well below the precision merchants actually read off the screen. You could push it to 1,000 and save a few milliseconds; you’d have to push it to 50,000 before merchants saw a difference in displayed numbers.

The trick I’m most happy with: LiftDistribution

This is the one piece of the engine I keep pointing people at. The problem: for continuous metrics like revenue per session, you want the credible interval to be a percentage (“lift is between +2.1% and +7.8%”), not raw dollars (“lift is between +$0.043 and +$0.158 per session”). Merchants think in percentages; the raw dollar number is meaningless without a baseline they have to mentally divide by.

The naive way to do this is to compute the credible interval on raw differences and then divide each bound by the control mean. That’s wrong: it doesn’t account for the uncertainty in the control mean itself.

The right way is to put the fractional improvement itself in a Distribution:

// LiftDistribution samples the fractional improvement (variant - control) / control.
type LiftDistribution struct {
    Control Distribution
    Variant Distribution
}

func (l *LiftDistribution) Sample() float64 {
    c := l.Control.Sample()
    if c <= 0 {
        return 0
    }
    v := l.Variant.Sample()
    return (v - c) / c
}

func (l *LiftDistribution) Clone() Distribution {
    return &LiftDistribution{
        Control: l.Control.Clone(),
        Variant: l.Variant.Clone(),
    }
}

Now the same CalculateCredibleIntervals function that handles a Beta or a Normal also handles “the fractional lift of revenue-per-session.” Sample 5,000 times, take the 2.5th and 97.5th quantiles, multiply by 100. Out comes a 95% credible interval on the percentage lift — correctly propagating uncertainty from both the control and the variant.

The fact that it composes — a Distribution made of two other Distributions — is the part I like. New metric types and new ways of summarizing them slot in without touching the Monte Carlo loop.

Multi-variant for free

Three or more variants is where Bayesian gets really nice. “Probability of being best” is just: for each sample iteration, draw from every arm, see which is highest, increment that arm’s counter. No correction factors, no fiddly multiple-comparisons math.

for j := 0; j < samplesPerGoroutine; j++ {
    var maxSample float64
    var maxID uuid.UUID
    found := false

    for id, dist := range localDists {
        sample := dist.Sample()
        if !found || sample > maxSample {
            maxSample = sample
            maxID = id
            found = true
        }
    }
    if found {
        localCounts[maxID]++
    }
}

Six variants? Same loop. The probabilities sum to 100% by construction. A merchant running a 4-arm color test gets four numbers that add up; no statistician needs to be in the room.

What this engine does not give you

I want to be precise about this part because it’s where most “Bayesian A/B testing” blog posts overpromise.

It does not give you anytime-valid inference. Naive Bayesian A/B testing — including this implementation — still loses calibration under aggressive optional stopping if you keep peeking and stopping the first time probability-to-beat crosses 95%. The result is biased toward whichever variant happens to look good early. There’s good work on always-valid Bayesian sequential testing (e.g., mSPRT, e-values), and adding it is on the list. For now, we surface the probabilities continuously and trust the merchant to commit to a duration up front, the same way a frequentist tool would.

There’s no automatic “declare a winner” threshold. The engine surfaces probability-of-best and a credible interval; the merchant decides what to do with them. I went back and forth on this. Auto-declaring a winner at 95% would feel decisive, but it shifts the decision boundary into the tool — and merchants who don’t understand what’s behind it will trust the green checkmark uncritically. Showing the numbers and the interval forces a moment of thought.

It does not guard against Simpson’s paradox. Stats are computed at the experiment level; we don’t currently stratify by traffic source or device. A future version probably should.

Sample ratio mismatch is handled separately. If the 50/50 split is off by more than 2% after 1,000+ sessions, an anomaly check raises a critical alert before the stats are trustworthy in the first place:

// Rough sketch from /workspace/app/internal/experiment/anomaly_service.go
if totalSessions > 1000 {
    expected := totalSessions / numVariants
    for _, v := range variants {
        deviation := math.Abs(float64(v.Sessions - expected)) / float64(expected)
        if deviation > 0.02 {
            raiseAlert("sample_ratio_mismatch", "critical", ...)
        }
    }
}

The Bayesian engine assumes the assignment is honest; the SRM detector is what enforces that assumption.

What I’d change

If I were starting today:

  • Sequential-valid inference. Replace the naive “probability variant beats control” with an mSPRT-style always-valid test, or report e-values alongside the probability. Merchants peek; we should make peeking honest.
  • Hierarchical priors across a workspace’s experiment history. If a merchant runs 40 checkout tests a year, the average lift is small and centered near zero. Borrow that information into the prior. The current flat-prior implementation is correct but wasteful.
  • Loss-based stopping rules. “Expected loss if we ship now” is more decision-useful than “probability of being best,” and easy to compute on top of the existing samples. I just haven’t surfaced it in the UI yet.

The 200-line punchline

The actual Bayesian engine — distributions, sampling, probability-to-beat, credible intervals, multi-arm — is right around 400 lines of Go. The composable Distribution interface and the LiftDistribution trick are doing most of the work. Everything else is just Monte Carlo and care about RNG.

The lesson I keep coming back to: the math is easy; presenting numbers that don’t lie is the hard part. A 95% credible interval in raw dollars is technically correct and operationally useless. A naive ratio of dollar-CIs is operationally useful and statistically wrong. Composing distributions until the thing you sample is the thing the merchant wants to read — that’s where the engineering work lives.

← All posts