Research

A/B Testing

A controlled experiment that randomly assigns users to different design variants to determine which version produces better outcomes, with statistical confidence.

Tags: A/B testing, split testing, experimentation, conversion, statistical significance, hypothesis, variants

What is it?

A/B testing (also called split testing) is a controlled experiment in which two or more variants of a design element are shown simultaneously to randomly assigned groups of users, and the variant that performs better on a predefined metric is identified with statistical confidence. Unlike qualitative methods, A/B testing does not explain why one variant works better; it only reveals which one does. It is most powerful when combined with the qualitative research that generated the hypothesis being tested.
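
One common way to show variants to different users simultaneously is deterministic hash-based bucketing. The sketch below assumes that approach, which this article does not prescribe; the function name `assign_variant` and the experiment key are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to one variant of an experiment.

    The same (user_id, experiment) pair always returns the same variant,
    so a returning user keeps seeing the same design.
    """
    # Salting with the experiment name keeps assignments independent
    # across experiments (hypothetical naming scheme).
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and bucket it evenly across variants.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "checkout-cta"))
```

Hashing rather than drawing a fresh random number on each visit is a design choice: it needs no stored state, and it keeps each user's experience stable for the duration of the test.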

Why it matters

Opinions are free and plentiful. A/B test results are evidence. In teams with strong opinions and stakeholders who override design decisions on instinct, A/B testing is the most defensible mechanism for resolving disagreements. It also guards against plausible-but-harmful changes: a change that seems logical but actually reduces the metric that matters. Amazon, Netflix, and Booking.com run experiments at very large scale, making A/B testing not just a research method but a product development philosophy. Used correctly, it turns every release into a learning opportunity.

Best Practices

Start with a hypothesis: "If we change X to Y, we expect to see Z because users currently experience [observed problem]." Never test without a directional hypothesis grounded in data.
Test one variable at a time. Changing headline, colour, and CTA position simultaneously makes it impossible to know which change drove the result.
Use multivariate testing only when you have very high traffic and need to test variable interactions — it requires significantly more traffic to reach significance.
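Calculate required sample size before starting, using a power analysis. Checking results repeatedly and stopping as soon as you see a result you like ("peeking", or optional stopping) is a form of p-hacking: it inflates the false-positive rate and produces false conclusions.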
Run tests for full business cycles (at minimum one week) to account for day-of-week variation in user behaviour.
Define your primary metric before the test — the metric you will make your decision on. Secondary metrics can inform but should not override the primary.
Set a significance threshold before testing (standard is 95% confidence / p < 0.05). Do not lower the bar when results are inconclusive.
Watch for novelty effects: a new design often outperforms initially because of curiosity, then regresses. Run tests long enough to see stable behaviour.
Document every test, hypothesis, result, and conclusion. A/B test history is one of the most valuable institutional knowledge assets a product team can build.
Do not run too many tests on the same users simultaneously — overlapping tests corrupt results.
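
The sample-size practice above can be sketched with the standard normal-approximation formula for comparing two conversion rates. This is a minimal illustration, not a replacement for a dedicated power-analysis tool; the function name, the default 80% power, and the example rates are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float,
                            relative_lift: float,
                            alpha: float = 0.05,
                            power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided test.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%
    relative_lift: smallest relative improvement worth detecting, e.g. 0.10
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must be distinct and strictly between 0 and 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2

    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical scenario: 5% baseline, detect a 10% relative lift.
    print(sample_size_per_variant(0.05, 0.10))
```

With these hypothetical inputs the result is roughly 31,000 users per variant. Halving the detectable lift roughly quadruples the requirement, which is why small expected effects need long tests or high traffic.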

Common Mistakes

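Stopping a test early when results look good. This is one of the most common causes of false-positive A/B test conclusions, because every extra look at the data is another chance for random noise to cross the significance threshold.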
Testing without a hypothesis — "let's try a red button and see what happens" produces results you cannot learn from or apply.
Insufficient traffic or sample size. Declaring a winner with 50 users per variant is statistically meaningless: a sample that small has almost no power to detect realistic effect sizes.
Optimising for proxy metrics (click-through rate) rather than business metrics (conversion, retention, revenue per user).
Running A/B tests on low-impact elements (button border radius) when high-impact elements (headline, pricing, CTA copy) have never been tested.
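Treating inconclusive results as failures rather than information. A null result from an adequately powered test tells you that any effect is probably smaller than the one you planned to detect, which is valuable. It does not prove the change has no effect at all.
Ignoring segment-level results. A test may show no overall winner while hiding a strong positive result in a specific user segment. Treat segment findings discovered after the fact as hypotheses for a follow-up test, since slicing the data many ways will surface spurious "winners".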
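
The first mistake in this list can be demonstrated with a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares checking at regular intervals with checking once at the end. All parameters, names, and the seed are hypothetical choices for illustration.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests: int = 500, users_per_arm: int = 1000,
                      peek_every: int = 100, rate: float = 0.10,
                      alpha: float = 0.05, seed: int = 7):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    flagged_by_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        peek_hit = False
        for n in range(1, users_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A peeking team would have stopped at the first "significant" look.
            if n % peek_every == 0 and p_value(conv_a, n, conv_b, n) < alpha:
                peek_hit = True
        flagged_by_peeking += peek_hit
        flagged_at_end += p_value(conv_a, users_per_arm,
                                  conv_b, users_per_arm) < alpha
    return flagged_by_peeking / n_tests, flagged_at_end / n_tests


if __name__ == "__main__":
    peeking, single_look = simulate_aa_tests()
    print(f"false positives with peeking:   {peeking:.1%}")
    print(f"false positives, one final look: {single_look:.1%}")
```

In runs of this kind, the single final look should stay close to the chosen alpha, while the peeking strategy flags noticeably more false winners, even though nothing differs between the arms.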

Checklist

A clear, directional hypothesis is written before the test starts

Only one variable is changed between control and variant

Required sample size is calculated using power analysis before launch

Primary success metric is defined before the test

Significance threshold (e.g., 95% confidence) is set before the test

Test runs for at least one full business cycle (minimum 1 week)

Results are not interpreted until the planned sample size is reached

Test, hypothesis, result, and decision are documented for institutional knowledge
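
Once the planned sample size has been reached, the final readout against the pre-set threshold could be sketched as a two-proportion z-test. This is one common choice, assumed here for illustration; the function name and the counts in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def read_out(conv_a: int, n_a: int, conv_b: int, n_b: int,
             alpha: float = 0.05) -> dict:
    """Compare control (a) and variant (b) conversion counts.

    Returns the absolute difference in rates, a two-sided p-value,
    a (1 - alpha) confidence interval, and a significance flag.
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if not 0 < pooled < 1:
        raise ValueError("need at least one conversion and one non-conversion")

    # Pooled standard error for the hypothesis test.
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se_test
    p = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se_diff = math.sqrt(rate_a * (1 - rate_a) / n_a
                        + rate_b * (1 - rate_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff
    diff = rate_b - rate_a
    return {
        "difference": diff,
        "p_value": p,
        "confidence_interval": (diff - margin, diff + margin),
        "significant": p < alpha,
    }


if __name__ == "__main__":
    # Hypothetical small test: 5/50 vs 8/50 conversions.
    print(read_out(5, 50, 8, 50))
    # Hypothetical large test: 1000/20000 vs 1150/20000 conversions.
    print(read_out(1000, 20000, 1150, 20000))
```

The small example shows a 10% versus 16% conversion rate yet is nowhere near significant, while the large example detects a much smaller difference. Reporting the confidence interval alongside the p-value makes the size of the effect, and the uncertainty around it, part of the documented result.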

Research & Theory

Fisher (1926) — Controlled Experimentation

Ronald Fisher's agricultural field experiments in the 1920s established the principles of controlled experimentation — randomisation, replication, and statistical significance — that underpin every modern A/B test.

Why it's relevant

The same statistical principles that determined which fertiliser grew better crops determine which CTA button converts better. A/B testing is classical experimental science applied to product design.

Microsoft Experimentation Platform (Kohavi et al., 2013)

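Ronny Kohavi's work at Microsoft (and later at Airbnb) reported that only about one third of ideas tested actually improved the metrics they were designed to improve; the remaining two-thirds of intuitively "good" product changes had no detectable effect or made things worse. Controlled experimentation is the most reliable way to tell these apart.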

Why it's relevant

Expert intuition is right about one third of the time when it comes to which design changes improve metrics. Testing is not a lack of confidence — it is epistemic rigour.

Booking.com Experimentation Culture

Booking.com runs over 1,000 simultaneous A/B tests at any given time and has established an internal culture where no change ships without a test. Their conversion rate improvements compound at scale.

Why it's relevant

A/B testing as a product culture, not a one-off method, produces compounding returns. Each test informs the next hypothesis. Teams that experiment consistently outperform teams that rely on instinct.

Real-World Examples

Obama 2008 Campaign

Tested 24 combinations of hero media and CTA button copy on the sign-up page. The winning variant ("Learn More" with a family photo) achieved a sign-up rate 40.6% higher than the original, which the campaign estimated was worth about $60M in additional donations. It remains one of the most-cited examples of A/B testing impact.

Booking.com

Famous for testing urgency messaging ("Only 2 rooms left!"). Rigorously A/B tested to confirm it increased conversion without increasing cancellations or damaging trust. Every urgency signal on the site has been tested.

Netflix

A/B tests every thumbnail artwork per title per user segment. The same show has dozens of artwork variants simultaneously, with personalised display based on viewing history. Artwork testing alone increased click-through by ~20–30%.