pragmatist
Patrick Joyce

February 15, 2013

Calculating Sample Sizes for AB Tests with Vanity (and R)

You want to run an AB test. How many participants do you need in your test?

As always, the answer is "it depends". In this case, it depends on:

  1. What your base conversion rate is.
  2. How large of a difference you want to be able to detect.
  3. How concerned you are about Type I and Type II errors (false positives and false negatives).

There is no generic rule of thumb. Don't trust any advice like "you want about 3,000 people in the test to be confident." The correct sample size always depends on these 3 parameters for your specific test.

Basically:

  • The lower the base conversion rate, the more participants you're going to need to detect the same relative difference.
  • To detect smaller differences you're going to need more participants.
  • If you want to increase your confidence in your result, you guessed it, you're going to need more participants.
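
To make those three effects concrete, here is a rough sketch using base R's power.prop.test (covered in the next section). The helper n_per_group and every rate in it are illustrative assumptions, not numbers from a real test.

```r
# Hypothetical helper: participants needed per group to detect a given
# relative lift over a base conversion rate. All rates are illustrative.
n_per_group <- function(base, relative_lift, power = 0.8, sig.level = 0.05) {
  result <- power.prop.test(p1 = base,
                            p2 = base * (1 + relative_lift),
                            power = power,
                            sig.level = sig.level)
  ceiling(result$n)  # round up: you can't recruit a fraction of a person
}

scenarios <- c(
  reference       = n_per_group(base = 0.25, relative_lift = 0.10),
  lower_base_rate = n_per_group(base = 0.05, relative_lift = 0.10),
  smaller_lift    = n_per_group(base = 0.25, relative_lift = 0.05),
  higher_power    = n_per_group(base = 0.25, relative_lift = 0.10, power = 0.90),
  stricter_sig    = n_per_group(base = 0.25, relative_lift = 0.10, sig.level = 0.01)
)
print(scenarios)
```

Every scenario after the first changes exactly one input relative to the reference case, and each one should come out larger than the reference.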

How to calculate necessary sample size

If you know your base conversion rate and the size of the difference you wish to detect, it is easy to calculate the necessary sample size in R with power.prop.test.

> power.prop.test(p1=0.25, p2=0.275, power=0.8, alternative='two.sided', sig.level=0.05)

     Two-sample comparison of proportions power calculation 

              n = 4861.202
             p1 = 0.25
             p2 = 0.275
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

 NOTE: n is number in *each* group
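
If you are curious where that number comes from, the sketch below recomputes it by hand. It assumes the usual normal approximation for comparing two proportions, which, as far as I can tell, is what power.prop.test solves with its default settings. The variable names are mine.

```r
p1    <- 0.25    # base conversion rate
p2    <- 0.275   # conversion rate we want to be able to detect
alpha <- 0.05    # significance level (two-sided)
power <- 0.80    # desired statistical power

z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
z_power <- qnorm(power)          # quantile matching the desired power
p_bar   <- (p1 + p2) / 2         # average of the two rates

# Normal-approximation sample size per group
n_by_hand <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
              z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2

# The same calculation via power.prop.test, for comparison
n_from_r <- power.prop.test(p1 = p1, p2 = p2, power = power,
                            sig.level = alpha)$n

print(c(by_hand = n_by_hand, from_r = n_from_r))
```

The two values should agree up to the numerical tolerance of R's root finder. Since n is the number in each group, round it up to get the number of participants each group needs.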

So, in an ideal world you would run all tests as follows:

  1. Track your base conversion rate. For example, 25% of people who reach a registration page successfully register.
  2. Agree on the size of the difference we want to detect. We may only care about detecting relative differences of 10% or more (27.5% or better conversion using the example above).
  3. Decide on the desired significance level. This is the chance of a false positive when there is no real difference. It is common to use 0.05 (which represents a 5% chance of a false positive).
  4. Decide on the desired statistical power. This is the chance of detecting a difference that really exists, so one minus the power is the chance of a false negative. It is common to use 0.80 (which means that if there is a difference of the agreed size there is a 20% chance we'll miss it).
  5. Calculate the necessary sample size as described above. Using these examples we would need to have 4862 people in each group (4861.202 rounded up).
  6. Run the test until you have enough participants in both your control and treatment. Don't look at the results while the test is running.
  7. End the test
  8. Analyze the test
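
As a minimal sketch of the last step, assuming you analyze the finished test with base R's prop.test (one reasonable choice, not the only one), the analysis could look like this. The conversion counts are made up for illustration.

```r
n_planned <- 4862  # per-group sample size from the power calculation

# Hypothetical final counts, recorded only after both groups hit n_planned
conversions  <- c(control = 1216, treatment = 1340)
participants <- c(control = n_planned, treatment = n_planned)

# Two-sample test for equality of proportions (two-sided by default)
result <- prop.test(x = conversions, n = participants)

print(result$estimate)  # observed conversion rate in each group
print(result$p.value)   # compare against the significance level chosen up front
print(result$conf.int)  # confidence interval for control rate minus treatment rate
```

The important part is that the test is run once, at the sample size decided in advance, and the p-value is compared against the significance level chosen before the test started.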

Unfortunately, that isn't how it normally goes in the real world:

  • We often don't know what the baseline conversion is. Often, conversion rates for the control aren't clearly tracked until you start the test. Sometimes you're unable to effectively baseline a conversion rate because it varies wildly. I have a little bit of experience optimizing ecommerce sites where inventory is only available for a limited time. The quality of inventory can have a large effect on the conversion rate, so it is very difficult to compare conversion rates across time.
  • Most AB testing software provides real-time results, which makes it easy to fall victim to repeated significance testing errors: if you check the results again and again and stop as soon as they look significant, your chance of a false positive is much higher than the significance level you chose.
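
To see how much damage peeking can do, here is a small simulation sketch. It assumes an A/A test (both groups convert at the same 25% rate, so every "significant" result is a false positive), ten evenly spaced looks, and a plain pooled two-proportion z-test. All of those choices are mine, for illustration only.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

true_rate   <- 0.25                      # both groups convert at the same rate
looks       <- seq(500, 5000, by = 500)  # participants per group at each peek
simulations <- 2000

# p-value of a pooled two-proportion z-test (no continuity correction)
z_test_p <- function(x1, x2, n) {
  pooled <- (x1 + x2) / (2 * n)
  se     <- sqrt(pooled * (1 - pooled) * 2 / n)
  2 * pnorm(-abs(x1 / n - x2 / n) / se)
}

one_run <- function() {
  # Running conversion totals for each group, sampled at every look
  a <- cumsum(rbinom(max(looks), 1, true_rate))[looks]
  b <- cumsum(rbinom(max(looks), 1, true_rate))[looks]
  p <- z_test_p(a, b, looks)
  c(peeking    = any(p < 0.05),         # stop at the first "significant" look
    final_only = p[length(p)] < 0.05)   # test once, at the planned end
}

false_positive_rate <- rowMeans(replicate(simulations, one_run()))
print(false_positive_rate)
```

Testing only at the end should give a false positive rate close to the nominal 5%, while stopping at the first significant look should give a noticeably higher one.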

To combat these pitfalls we can use the control's observed conversion rate as an approximation of the true base conversion rate. Plugging that rate and the treatment's observed rate into the sample size calculation tells us how many participants we need to detect a difference of the size observed so far, and therefore how much longer the test needs to run.
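
A rough sketch of that calculation, with made-up mid-test numbers and hypothetical variable names:

```r
# Hypothetical counts partway through a test
control_participants   <- 1200
control_conversions    <- 288
treatment_participants <- 1200
treatment_conversions  <- 324

observed_control   <- control_conversions / control_participants      # 0.24
observed_treatment <- treatment_conversions / treatment_participants  # 0.27

# Treat the control's rate as the base rate and ask how many participants
# per group are needed to detect a difference of the size seen so far.
needed <- ceiling(power.prop.test(p1 = observed_control,
                                  p2 = observed_treatment,
                                  power = 0.8,
                                  sig.level = 0.05)$n)

# Participants still required in each group before analyzing the test
remaining <- max(0, needed - min(control_participants, treatment_participants))

print(c(needed_per_group = needed, remaining_per_group = remaining))
```

Dividing the remaining count by your typical daily traffic per group gives a rough estimate of how many more days the test needs to run.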

Further Reading

I am not a statistician. If you want to learn more, go read Noah Lorang's post about calculating sample sizes and Evan Miller's explanation of repeated significance testing errors.
