A while back I did a post on the birthday problem. At the time, I was just happy to be able to model this problem, using R. Now, I am hoping to go a few steps further and evaluate the confidence we might feel in our result.
Quick background -the question:
How many people would you need in a group, to have an above 50% chance that two of them would have the same birthday?
There is a formula for figuring out the probability for any n-sized group.
And for determining that the correct answer is 23.
This is what I think they call a "tractable" problem (i.e. if you know enough combinatorics and "maths" you'll be able to work out the exact solution). But learning all this mathematics sounds like a lot of hard work.
I like to think not many problems out in the real world are going to be as simple as this (even this question relies upon the assumption that each birthday is equally likely). So, instead of learning the math, I utilised R to simulate a whole series of scenarios.
I ran 1,000 trials for each group size, and I was able to get the following chart (simulation runtime, using the furrr package, was ~3 minutes).
The line shows us the general outline of the situation. That (surprisingly?) we only need a ~50-sized group to have a near certainty that there will be two people sharing the same birthday. And that our "even chance threshold" lies roughly between 20 and 30 people.
The jagged bumps in various places tell us that our cycles of 1,000 trials are not sufficient to squash out any lucky streaks that occur for particular group sizes.
Intuitively, we know there is no reason for a group of 19 people to have a greater chance than a group of 20 people (as is the case above). Therefore, at this point, the individual probabilities cannot be relied upon, in comparison to their immediate neighbours.
It sets up the question:
How sure can we be about the answer we provide?
Running 20 repeats (cycles) of 1,000 trials for group sizes of 22, 23 & 24, will take roughly 11 minutes.
If we go ahead and assume normality (why not, right?), we can make some neat-looking distribution charts.
These charts make a few things clear:
- The 22-sized group is almost certainly not going to be over 50%.
- If the true probability of a 23-sized group (which appears very close to 50%) turns out to be below 50%, then our answer will have to be 24.
Therefore, we just have to focus in on determining the true success_rate for a 23-sized group. Out of the 20 cycles above, 13 of them were over 50%. At this stage, given the mean and sd (0.0143) and our assuming normality, we might say:
we can be 95% certain that the true success_rate lies between 0.479 and 0.535.
Something else we might say is that
we are 68.8% certain that the success_rate is greater than 0.50.
However, although these sound right to me, I have an invisible, miniature statistician on my shoulder, always bugging me, telling me "but aren't you forgetting about..." (and then they trail off, never providing the answer to what it is I'm overlooking).
How much flexibility is there for the peak of our 23-group distribution to shuffle over to the left side of 0.5?
It's probably also important to consider,
Say we say 23 or 24; or even 22. How are they going to assess whether we were right?
For this problem, they already know the correct answer. But for more realistic, complex business scenarios, we would have to account for the potential profit/loss of a correct or incorrect decision?
Assuming, there was no algebraic solution, are they only going to run three trials to test us? Does it then become our responsibility to outline exactly what they should expect from this (given our findings)?
At this point, we may as well just assume that we have a client and they want our utmost certainty that we will provide the correct response (and that they have some unknown but still fair system for assessing our response).
What we need to do is to raise the accuracy of each cycle... by raising the number of trails from 1,000 to something higher. Doing this will iron out those lucky streaks (and therefore tighten up the curves of our distribution).
But "how high?"
With 10,000 trials, 17 out of 20 of the cycles resulted in success_rates over 50%. This took 15 minutes.
The standard deviation has now shrunk considerably (from 0.0143 to 0.00696). We can see it below in comparison to our 20 original 1,000 trial cycles.
Utilising our new stats, we can say that there is a 87.5% probability that a 23-group has an "above even" success rate.
In this case, it seems very fortunate that our group-pair determining function was fast enough to run our simulations within a reasonable timeframe, in order to obtain a roughly satisfactory confidence bounds (we might up the number of trials for increased certainty).
More complex scenarios, however, would require more complex functions and such speed loss could require reductions in capacity to have such confidence in our decisions.
Also, the whole "above even chance" is completely arbitrary. If instead of 50%, the threshold could have been something much closer to the detail of a 23-sized groups actual success_percentage (for example "above or below 50.72%?"). Then, we would not have been able to run enough trials to obtain an answer with certainty at that level of detail.
But onto the practical business applications.
With A/B testing, you are often trying to estimate the effect that a specific change will bring (in contrast to a control or alternative). Presenting a point estimate (eg increase click-through by 20%) with confidence bounds allows the uncertainty to also be conveyed. And to reveal that our estimate of the true (unseeable) effect is really a distribution.
Presenting a distribution, rather than a point estimate allows for some rough follow-up assessment of our estimates. With all things being equal (in the behaviours of customers), if the actual effects should end up being outside our bounds on a consistent basis, it might be worth taking a second look at our processes.
If you would like to see my code, I have published a gist.
I hope this has been a relatively clear and useful walkthrough of a statistical problem I found interesting. I am very happy to receive feedback on any areas that might not be "statistically sound."
I also get the feeling that this discussion has been teetering very closely to the concept of Baysian A/B testing (so hot right now). So I think I will go and consult my much wiser friends to get their insights (and possibly write a follow up). Thanks for reading.