Our latest 'Ask a statistician' question received not one but two different solutions. We published the first response, by Sumit Rahman, back in November. Here, our editorial board chairman Mario Cortina Borja tackles the problem posed by Alec Cambell of Bellvue College in a somewhat different way.

Alec's question, again, is: "I’ve read about the birthday problem, and how you only need 23 randomly chosen people for there to be a 50% chance that two people share a birthday. But how many people would you need for there to be a 50% chance that every possible birthday is represented by at least one person?"

Mario Cortina Borja replies: More than 365 people, clearly. But how many more? According to my estimates, you would need to gather together 2285 people for there to be a greater than 50% chance that all birthdays (excluding the leap year day of 29 February) are taken by at least one person, and more than 2980 for there to be a greater than 90% chance.

In his email to Significance, Alec says he became interested in this question when he noticed that “among our 3000 or so graduates last year only one birthday was not taken”. Based on my estimates, with about 3000 people there was less than a 10% chance that any birthday would go unclaimed.

I used simulation in the statistical software R to estimate these numbers. In general terms, I was looking to work out the probability of observing all possible birth dates among a sample of people, using the simplifying assumption that births within the population are uniformly distributed over all possible days.

Mathematically, we express this as estimating p(n, M), which is the probability of observing all the elements of the set Dn = {1, 2, …, n} in a sample of M subjects, assuming a uniform distribution over Dn. For birthdays, excluding 29 February, we have n = 365. To estimate p(n, M), I simulated B samples of size M using the R function p_hat (see box). I quickly found that M ≈ 3000 was an approximate solution, so I simulated B = 10 000 samples each for sizes 1200 ≤ M ≤ 5000 in increments of 10.

The row marked uniform365d in Table 1 shows values for selected quantiles resulting from these simulations; the values were obtained as predictions from a smoothing spline model. I do not include the confidence intervals for these estimates, but they are quite tight. The median (0.5) is 2285 people, and the 0.9 quantile is 2980.
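Under the uniform assumption, p(n, M) can also be written exactly by inclusion-exclusion, as the sum over k = 0, …, n of (−1)^k C(n, k) (1 − k/n)^M, which offers a cross-check on the simulated values. The sketch below is not part of the original analysis; the function name p_exact is hypothetical, and the calculation is done on the log scale to avoid overflow in the binomial coefficients.

```r
# Exact p(n, M) under the uniform assumption, by inclusion-exclusion:
#   p(n, M) = sum_{k=0}^{n} (-1)^k * choose(n, k) * (1 - k/n)^M
# p_exact is a hypothetical name, not part of the original box.
p_exact <- function(n = 365, M = 3000) {
  k <- 0:(n - 1)                                  # the k = n term is zero for M >= 1
  log_terms <- lchoose(n, k) + M * log1p(-k / n)  # log scale avoids overflow
  sum((-1)^k * exp(log_terms))
}

p_exact(365, 2285)  # compare with the simulated median
p_exact(365, 2980)  # compare with the simulated 0.9 quantile
```

The alternating sum is well behaved for the sample sizes considered here, but it loses precision through cancellation when M is not much larger than n, so it should be treated as a check rather than a general-purpose routine.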

Table 1. Estimated quantiles for the modified birthday problem, using one uniform and two empirical distributions based on live births from England and Wales, 1979–2014

                 Probabilities
Distributions    .005   .01  .025   .05    .1    .5    .9   .95  .975   .99  .995
uniform365d      1561  1610  1686  1756  1858  2285  2980  3226  3502  3794  4050
empirical366d    1555  1657  1737  1812  1916  2435  3642  4456  5391  6758  7639
empirical365d    1553  1603  1694  1764  1862  2296  3002  3265  3531  3849  4112

The R code

p_hat <- function(n=365, M=3000, B=10000, emp.prob=rep(1,n)/n) {
    ### Returns the estimated probability of covering all labels
    ### D_n = {1,2,…n}
    ###
    ### It generates B simulations of extracting M samples
    ### from D_n with replacement using the
    ### probability distribution specified by weights emp.prob
    ### MCB, London, 02.11.16
    ###
    invisible(
        sum(
            apply(
                matrix(
                    sample(1:n, M*B, replace=TRUE, prob=emp.prob),
                    nrow=B),
                1, function(x){length(unique(x))==n})
        )/B
    )
}
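The box estimates p(n, M) for a single sample size; the quantiles in Table 1 come from repeating this over a grid of sizes and smoothing the results. A rough sketch of that idea follows. It assumes R's smooth.spline as the smoother (the article does not say which spline routine was used), the helper names cover_prob and quantile_size are hypothetical, and B and the grid are far smaller than the article's, so the answer is much noisier.

```r
# Hypothetical helper: Monte Carlo estimate of p(n, M) under a uniform
# distribution, using B simulated samples of size M.
cover_prob <- function(n, M, B) {
  mean(replicate(B, length(unique(sample.int(n, M, replace = TRUE))) == n))
}

# Hypothetical helper: smallest sample size at which the smoothed
# coverage probability reaches q (NA if it never does on the grid).
quantile_size <- function(n, sizes, B, q) {
  p <- vapply(sizes, function(M) cover_prob(n, M, B), numeric(1))
  fit <- smooth.spline(sizes, p)            # smooth the noisy estimates
  grid <- seq(min(sizes), max(sizes))       # every integer size in range
  smoothed <- predict(fit, grid)$y
  grid[which(smoothed >= q)[1]]
}

set.seed(1)
quantile_size(n = 365, sizes = seq(1500, 3500, by = 100), B = 200, q = 0.5)
```

Because the fitted curve is not forced to be monotone, the first crossing of q is used; with larger B and a finer grid the curve is smooth enough that this choice matters little.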

What would happen if we relaxed the assumption of uniformity of birthdays and a 365-day year?

Using data provided by the Office for National Statistics, I considered the birthdays of the 23 872 409 live births registered in England and Wales between 1979 and 2014. This adjusts for (i) leap year births on 29 February, which constitute just 0.068% of all births; (ii) the excess of births in the last week of September, corresponding to conceptions in the Christmas holidays, and the deficit of births in the Christmas holidays, reflecting health services management policies; and (iii) the marked dependence on day of the week of birth, which is integrated out by accumulating the live birth frequencies by day of the year.

Clearly birth dates now vary in frequency, but how does this affect the distribution quantiles? The row in Table 1 marked empirical366d is based on the frequencies of live births including 29 February. The median of 2435 is 6.6% higher than that based on the uniform distribution, while the 0.9 quantile is 22% higher. To clarify this “leap day” effect, I omitted births on 29 February and re-estimated the empirical quantiles. Results in the row marked empirical365d show that the median and 0.9 quantile are now 2296 and 3002, less than 1% greater than the uniform distribution quantiles.
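The size of the leap day effect can be illustrated without the ONS data. The sketch below uses an idealised assumption, not the empirical frequencies: every ordinary day is equally likely and 29 February is one quarter as likely as any other day. The names leap_prob and cover_prob_weighted are hypothetical.

```r
# Idealised weights (an assumption, not the ONS frequencies):
# 365 ordinary days with weight 4 each, 29 February with weight 1.
leap_prob <- c(rep(4, 365), 1) / (4 * 365 + 1)

# Hypothetical helper: Monte Carlo estimate of the probability that
# M people cover every day, for any vector of day probabilities.
cover_prob_weighted <- function(prob, M, B) {
  n <- length(prob)
  mean(replicate(B, {
    days <- sample.int(n, M, replace = TRUE, prob = prob)
    length(unique(days)) == n
  }))
}

set.seed(1)
cover_prob_weighted(leap_prob, M = 2435, B = 500)

# Probability that nobody among 2435 people was born on 29 February.
(1 - leap_prob[366])^2435
```

Under this assumption the rare day alone is quite likely to be missed at sample sizes where all the ordinary days are usually covered, which is one way to see why including 29 February stretches the upper quantiles so much.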

• Mario Cortina Borja is chairman of the Significance editorial board, and professor of biostatistics in the Population Policy and Practice Programme, Institute of Child Health, University College London.
• Our next question is: What are the odds of a person becoming a statistician?, as suggested by @BobOHara, via Twitter. Send your answer to significance@rss.org.uk.

Significance Magazine