The Hidden Costs of Running Experiments at Scale

Experimentation is essential at scale. Every major product decision should be validated with data. But experimentation has costs that are rarely discussed. Not just compute costs or engineering time, but organizational and cognitive costs that compound as the number of concurrent experiments grows.

Cost 1: Experiment Interference

When you're running 50 experiments simultaneously, they interact. User A is in experiment 1 (variant B), experiment 7 (control), experiment 23 (variant A), and experiment 41 (variant C). The experience they get is a unique combination that no single experiment was designed to account for.

Most experimentation frameworks assume independence between experiments. That assumption breaks when two experiments touch the same user surface. A change to ad frequency (experiment 1) interacts with a change to content ranking (experiment 2) because the user sees both simultaneously.

Detecting these interactions requires either mutual exclusion (expensive in terms of user allocation) or post-hoc interaction analysis (complex and easy to get wrong).
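The independence assumption comes from how assignment usually works: each experiment hashes the user independently, so every user gets an arbitrary combination of variants. A minimal sketch of that mechanism (the experiment names and `assignVariant` helper are hypothetical, not any particular framework's API):

```kotlin
import kotlin.math.absoluteValue

// Hypothetical hash-based bucketing: each (userId, experimentId) pair is
// hashed independently, so assignments across experiments are uncorrelated.
fun assignVariant(userId: String, experimentId: String, variants: List<String>): String {
    val bucket = ("$userId:$experimentId").hashCode().absoluteValue % variants.size
    return variants[bucket]
}

fun main() {
    val experiments = mapOf(
        "exp_1_ad_freq" to listOf("control", "variant_a", "variant_b"),
        "exp_2_ranking" to listOf("control", "variant_a"),
    )
    // The same user gets an independent draw per experiment -- a combination
    // no single experiment was designed to account for.
    for ((expId, variants) in experiments) {
        println("$expId -> ${assignVariant("user_42", expId, variants)}")
    }
}
```

Mutual exclusion replaces the independent draws with a shared layer, which is exactly why it eats user allocation: every exclusive experiment must carve its traffic out of the same pool.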

Cost 2: The Velocity Tax

Every experiment needs setup: flag creation, variant configuration, metric definition, dashboard creation, allocation verification, and launch review. Even in a mature experimentation culture, this can take 2-3 days of engineering time before a single line of feature code is written.

Multiply that by 20 experiments per quarter across a team of 8 engineers, and experimentation overhead becomes a significant fraction of total engineering capacity.
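A quick back-of-the-envelope makes the tax concrete (a sketch: the quarter length and the 2.5-day setup midpoint are my assumptions, not measurements):

```kotlin
// Back-of-the-envelope estimate using the figures above.
fun setupOverheadFraction(
    experimentsPerQuarter: Int,
    setupDaysPerExperiment: Double, // midpoint of the 2-3 day range
    engineers: Int,
    workingDaysPerQuarter: Int,     // ~12 weeks of 5 days
): Double {
    val overheadDays = experimentsPerQuarter * setupDaysPerExperiment
    val capacityDays = engineers * workingDaysPerQuarter
    return overheadDays / capacityDays
}

fun main() {
    val fraction = setupOverheadFraction(20, 2.5, 8, 60)
    println("Roughly ${"%.0f".format(fraction * 100)}% of quarterly capacity") // ~10%
}
```

Ten percent of capacity spent on setup alone, before analysis and cleanup, is the kind of number that justifies tooling investment.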

Cost 3: Decision Fatigue

Experiments produce data. Data needs interpretation. Interpretation requires meetings, discussions, and decisions.

When 5 experiments mature in the same week, each requiring a ship/no-ship decision, the quality of those decisions degrades. People rush through reviews. Nuances get missed. Guardrail regressions get hand-waved as "probably noise."

The teams I've seen handle this best have a structured decision framework defined before the experiment launches: "If the primary metric moves by X and guardrails stay within Y, we ship. Otherwise, we discuss."
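That kind of pre-registered rule can be written down as code as well as prose. A hypothetical encoding (the thresholds and the `decide` function are illustrative placeholders, not recommendations):

```kotlin
// Hypothetical pre-registered decision rule: thresholds are fixed before
// launch, so the mature experiment maps mechanically to a decision.
enum class Decision { SHIP, KILL, DISCUSS }

fun decide(
    primaryLiftPct: Double,         // observed lift on the primary metric
    guardrailWorstDropPct: Double,  // worst regression across guardrail metrics
    shipThresholdPct: Double = 1.0,
    guardrailLimitPct: Double = 0.5,
): Decision = when {
    primaryLiftPct >= shipThresholdPct && guardrailWorstDropPct <= guardrailLimitPct -> Decision.SHIP
    primaryLiftPct <= 0.0 && guardrailWorstDropPct > guardrailLimitPct -> Decision.KILL
    else -> Decision.DISCUSS
}
```

The value isn't the code itself; it's that the thresholds were chosen before anyone had a result to rationalize.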

Cost 4: Code Complexity

Experiment code doesn't disappear when the experiment concludes. Dead experiment flags accumulate. Code paths for variants that were never shipped linger. The codebase becomes a geological record of every experiment ever run.

// After two years of experiments, this is what a feature function looks like
fun getAdFrequency(userId: String): Int {
    // Experiment 47 - shipped Q1 2024
    // Experiment 89 - killed Q3 2024 (code still here)
    if (ExperimentService.isInVariant(userId, "exp_89_ad_freq_v2")) {
        return 4 // This variant lost. Why is this still here?
    }
    // Experiment 123 - currently running
    if (ExperimentService.isInVariant(userId, "exp_123_dynamic_freq")) {
        return DynamicFrequencyCalculator.calculate(userId)
    }
    return 5 // Default
}

Experiment cleanup needs to be part of the experiment lifecycle, not an afterthought. The shipped variant becomes the default. The killed variants get deleted. The experiment flag gets removed. If this doesn't happen within two weeks of the decision, it probably never will.
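Applied to the function above, cleanup is concrete: the losing branch and its flag are deleted outright, and the winning behavior becomes unconditional. A sketch of the post-cleanup state, assuming experiment 123 eventually ships (the stub calculator stands in for the real one):

```kotlin
// Stub standing in for the real calculator referenced above.
object DynamicFrequencyCalculator {
    fun calculate(userId: String): Int = 5
}

// After cleanup: experiment 89's dead branch and flag are gone, and the
// shipped variant from experiment 123 is the unconditional default.
fun getAdFrequency(userId: String): Int = DynamicFrequencyCalculator.calculate(userId)
```

No flag checks, no geological strata: the function states what the product does today, not the history of how it got there.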

Cost 5: Survivorship Bias

Teams celebrate experiments that show positive results. They learn less from neutral results and almost nothing from negative results. But the negative results are often the most informative.

An experiment that shows "more ads = more short-term revenue but worse retention" is incredibly valuable. It defines a boundary. It tells you where not to go. Teams that only optimize for positive results miss these signals.

Managing the Costs

The solution isn't fewer experiments. It's better experiment hygiene:

  1. Limit concurrent experiments per surface. No more than 3-4 experiments touching the same user surface simultaneously.
  2. Automate the boilerplate. Experiment setup, flag creation, and metric configuration should be templated.
  3. Clean up aggressively. Set a policy: experiment code is cleaned up within one sprint of the ship/kill decision.
  4. Pre-define decision criteria. Before launch, write down what result would make you ship, kill, or iterate.
  5. Review negative results as thoroughly as positive ones. The learning from a failed experiment is worth the engineering time invested.

Experimentation at scale is powerful. But like any powerful tool, it has costs. Acknowledging and managing those costs is what separates teams that experiment well from teams that just run a lot of A/B tests.
