
The central limit theorem (CLT) says that for sufficiently large random samples,
the sample *means* will be approximately normally distributed. The remarkable thing is that the original data
from which you are sampling does **not** need to be normally distributed. It can be as "abnormal" as you like, so long as the sample size is sufficiently large.

Not sure what this all *means* about the sample *means*? 😉

Let's start with a classic example... rolling a dice (or a die if you prefer). If it is a fair dice, all outcomes 1, 2, 3, 4, 5 and 6
are equally likely - they follow a "uniform distribution". But if you were to roll the dice several times, find the mean (i.e. the average) of that sample, and then repeat the process many times,
look at the way the sample means get distributed in the long run...

Notice that although the original data (the dice rolls) have a flat "uniform" distribution, the sample means have a bell-shaped "normal" distribution. This makes sense when you stop and think about it. Some dice rolls will be higher than average and some will be lower, but in a sample of several rolls the highs and lows will "average out", so the sample mean is far more likely to be close to 3.5, rather than as low as 1 or as high as 6.
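If you would like to reproduce this effect yourself, here is a minimal Python sketch (it is not the simulation used on this page). It assumes a sample size of 5 and 10,000 repeated samples; the function name `sample_means` and the fixed seed are just illustrative choices.

```python
import random
import statistics
from collections import Counter


def sample_means(sample_size, num_samples, rng):
    """Return the mean of each of `num_samples` samples of fair dice rolls."""
    return [
        statistics.fmean(rng.randint(1, 6) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(0)  # fixed seed (an arbitrary choice) so runs are repeatable
means = sample_means(sample_size=5, num_samples=10_000, rng=rng)

# Tally how often each possible sample mean occurred and draw a text histogram.
tally = Counter(round(m, 1) for m in means)
for value in sorted(tally):
    print(f"{value:.1f} {'#' * (tally[value] // 25)}")
```

The bars should bunch up around 3.5 and thin out towards 1 and 6, even though every individual roll is equally likely.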

## What effect does sample size have on the central limit theorem?

The larger the sample size, the closer the sampling distribution becomes to a normal distribution. In fact, sample size is a key part of the central limit theorem. You may have heard that "the central limit theorem only applies for samples of 30 or larger". But there is nothing special about a sample size of 30; it's just a rough generalisation. It's like being told that it takes 30 minutes of regular exercise to improve your fitness: reasonable advice in general, but also highly dependent on many other factors.

You can try running the simulation above with different sample sizes. You will find that for the dice rolls example, the normal approximation is remarkably accurate for sample sizes as low as 5 or 10. This is largely because the dice rolls distribution is "symmetrical" - numbers equally far above and below the mean have equal probabilities. If you would like to see the central limit theorem applied to "skewed" data, scroll down to the examples on "income across the world" and "popular baby names" below.

Increasing the sample size not only causes the distribution of the sample means to be more bell-shaped and normal, it also causes the sample means to be closer together. You may have noticed from the simulation above that for larger values of n, the sample means are more closely grouped around the middle - the normal distribution has a smaller "standard deviation". This relationship is quantified in the central limit theorem formula.

## What is the central limit theorem formula?

The central limit theorem states that for a random variable X with mean μ and standard deviation σ, the sampling distribution for the sample means X̄ will approach a normal distribution as the sample size n increases. Also:

- the mean of the sampling distribution equals the population mean (μ)
- the standard deviation of the sampling distribution equals the population standard deviation divided by the square root of the sample size (σ ÷ √n)
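You can check these two facts numerically for the dice example. The sketch below is only an illustration: it assumes 20,000 repeated samples for each of a few arbitrary sample sizes, and compares the spread of the simulated sample means with σ ÷ √n.

```python
import math
import random
import statistics

faces = [1, 2, 3, 4, 5, 6]
mu = statistics.fmean(faces)      # population mean of a fair dice roll
sigma = statistics.pstdev(faces)  # population standard deviation

rng = random.Random(1)  # arbitrary fixed seed
results = {}
for n in (2, 5, 10, 30):
    # Each sample is n independent rolls; keep only its mean.
    means = [statistics.fmean(rng.choices(faces, k=n)) for _ in range(20_000)]
    results[n] = (statistics.fmean(means), statistics.stdev(means))

for n, (mean_of_means, sd_of_means) in results.items():
    predicted_sd = sigma / math.sqrt(n)
    print(f"n={n:>2}  mean of sample means={mean_of_means:.3f} (mu={mu})  "
          f"sd of sample means={sd_of_means:.3f} (sigma/sqrt(n)={predicted_sd:.3f})")
```

The simulated values will not match the formula exactly, because they are themselves estimates, but they should agree to within a couple of decimal places.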

The central limit theorem gives statisticians great power to make generalisations about data in the real world from relatively small samples.

## How is the central limit theorem used in real life?

The central limit theorem allows us to make estimates from a random sample, and gives us a measure of confidence in the accuracy of those estimates. For example, let's say you suspected that a dice being rolled was not actually "fair", but was "biased" in order to give higher numbers more often. If you rolled the dice 10 times and calculated an average of 4.9, what would this allow you to conclude? Well, using the central limit theorem, you could use a normal distribution to estimate the probability of getting an average as high as 4.9 from 10 rolls of a fair dice to be about 0.5%... unlikely! This would provide some statistical evidence to support your suspicion that the dice is biased.
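The calculation behind that estimate can be sketched in a few lines of Python using the standard library's `NormalDist`. For comparison, the sketch also works out the exact probability by counting every way 10 fair dice can total 49 or more; that exact check is an addition for illustration and is not part of the normal approximation itself.

```python
import math
from statistics import NormalDist, fmean, pstdev

faces = [1, 2, 3, 4, 5, 6]
n = 10
observed_mean = 4.9

# Normal approximation: mean mu, standard deviation sigma / sqrt(n).
sampling_dist = NormalDist(mu=fmean(faces), sigma=pstdev(faces) / math.sqrt(n))
p_normal = 1 - sampling_dist.cdf(observed_mean)

# Exact probability: count the ways n dice can reach each total.
ways_by_total = {0: 1}
for _ in range(n):
    updated = {}
    for total, ways in ways_by_total.items():
        for face in faces:
            updated[total + face] = updated.get(total + face, 0) + ways
    ways_by_total = updated

threshold_total = 49  # a mean of 4.9 over 10 rolls is a total of 49
p_exact = sum(w for t, w in ways_by_total.items() if t >= threshold_total) / 6**n

print(f"normal approximation: {p_normal:.4f}")
print(f"exact probability:    {p_exact:.4f}")
```

Both figures come out at well under 1%, so either way an average of 4.9 would be a surprising result from a fair dice.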

## Applying the central limit theorem to skewed data

(Incomes across the world)

Now let's look at some data that is less "uniform" and even more "abnormal". When we compare average income data across different countries, we find the data is highly "skewed". There are many countries whose average annual income is only a few thousand dollars or less, then there are a few countries with much higher income (the highest in the dataset being Norway with an average around $66,000).

Select a sample size and watch what happens to the distribution of sample means...

Data source: World Bank 2021

Running the simulation above for the highly skewed income data, you will notice that a larger sample size is required to give a normal distribution, compared to the dice rolls example. With a sample size of 5, the peak of the sampling distribution is much further left than the peak of the normal approximation, as it is still influenced by the skew of the original population data (which has a skewness coefficient of 1.8). But for sample sizes as small as 10, this abnormality almost disappears and the normal approximation models the sample means quite well, apart from the left tail. With a sample size of 20, the left tail fits the distribution much more closely. As you choose larger values of n you can see the distribution becoming more and more normal.

So for skewed data, you would require a larger sample before applying a normal approximation to create statistical estimates. You can tell whether or not the data is skewed by plotting a histogram for your sample, or through research in your subject area. For example, economic variables such as income, wealth and house price are known to be generally positively skewed.
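To see how the skew fades as the sample size grows, here is a rough sketch that uses a stand-in population rather than the World Bank data. It assumes an exponential distribution (which has a skewness of 2, in the same region as the 1.8 quoted above) with a hypothetical mean of 10,000; the `skewness` helper is written out by hand because Python's standard library does not provide one.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    centre = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return statistics.fmean(((v - centre) / spread) ** 3 for v in values)


rng = random.Random(2)  # arbitrary fixed seed
MEAN_INCOME = 10_000    # hypothetical figure, not taken from the World Bank data

skew_by_n = {}
for n in (5, 10, 20, 50):
    # Draw 20,000 samples of size n from the skewed stand-in population.
    means = [
        statistics.fmean(rng.expovariate(1 / MEAN_INCOME) for _ in range(n))
        for _ in range(20_000)
    ]
    skew_by_n[n] = skewness(means)
    print(f"n={n:>2}  skewness of sample means={skew_by_n[n]:.2f}")
```

In theory the skewness of the sample means is the population skewness divided by √n, so for this stand-in population it should shrink from roughly 0.9 at n = 5 towards 0.3 at n = 50. It never quite reaches zero, which is why the normal approximation is always an approximation.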

## The normal approximation to the binomial distribution

(Baby names starting with "A")

A binary variable has only two possible outcomes. In this example, we will look at whether or not a baby's first name starts with the letter "A". Of the 3.6 million babies born in the USA in 2020, "A" was the most popular starting letter. About 14% of babies (or 1 in 7) had names starting with the letter A, which is quite a significant proportion considering there are 26 letters to choose from!

We can assign a value of 1 to baby names starting with A, and a value of 0 to all other names. We call the probability distribution for a binary variable a "Bernoulli distribution", and when we sum independent binary variables to create a count, we get what we call a "binomial distribution". In our example, if we take a random sample of 10 babies born in 2020, we expect around 1 or 2 of these babies to have a name starting with the letter A. The mean proportion should be 0.14, because about 14% of babies had names starting with the letter A.
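As a quick illustration of that binomial distribution, the sketch below lists the probability of each possible count of "letter A babies" in a random sample of 10, taking p = 0.14 from the figures above. The helper name `binomial_pmf` is just a label chosen for this example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.14
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, probability in enumerate(pmf):
    print(f"{k:>2} letter A babies (proportion {k / n:.1f}): {probability:.4f}")

# The average proportion, weighted by probability, recovers p.
mean_proportion = sum((k / n) * probability for k, probability in enumerate(pmf))
print(f"mean proportion: {mean_proportion:.2f}")
```

The most likely outcomes are 1 or 2 letter A babies, and the distribution has a long tail to the right.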

Try the simulation below to see how large a sample size is required before the distribution of sample proportions begins to look "normal"...

Data source: USA Social Security Administration 2021

What do we notice this time? (besides the fact that Ava and Amelia were very popular names in 2020) 😉

First, why are we talking about "sample proportions" here
instead of "sample means"? Proportion is a more appropriate term in this case, as we are dividing the number of "letter A babies"
by the total number of babies in the sample. We use the notation p̂ to represent
a sample proportion. However, this sample proportion is just a special case of
the sample mean, which is obtained using a value of 1 for letter A babies and a value of 0 for all other babies.
For example, if in a sample of 10 we found 3 letter A babies and 7 others, we obtain a sample mean of

(1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0) ÷ 10 = 0.3

So the central limit theorem can be applied in the same way to sample proportions as to sample means.

Second, you will notice there are large gaps between the bars for small sample sizes. This is because the original data is binary, so the sample proportions for small sample sizes are limited to a small set of options. In fact, for a sample of size n, there will be a maximum of (n + 1) possible options for the sample proportion. For example, in a sample of size 10, the only possible sample proportions are 0, 0.1, 0.2, 0.3 ... up to 1. It is not possible to get p̂ = 0.15, because that would require 1.5 letter A babies in our sample of 10. So the normal approximation is limited because the normal distribution is "continuous" (a line) but the sample proportions are discrete (a set of points). We can adjust for this using a "continuity correction": for example we can estimate the probability of obtaining p̂ = 0 in a sample of 10 using the probability that the normal approximation is anywhere from −∞ to 0.05. Similarly, we estimate p̂ = 0.1 in a sample of 10 using the probability that the normal approximation is anywhere between 0.05 and 0.15, as it makes sense to round any number in this range to the closest possible value of 0.1.
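Here is a small sketch of the continuity correction for a sample of 10 with p = 0.14. Each possible proportion k ÷ n is given the band of width 1 ÷ n centred on it, with the first and last bands stretched out to −∞ and +∞. The function names are illustrative only.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 10, 0.14
# Normal approximation to the sample proportion: mean p, sd sqrt(p(1-p)/n).
approx = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))


def exact_probability(k):
    """Exact binomial probability of k letter A babies out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def corrected_probability(k):
    """Normal probability of the band of width 1/n centred on k/n."""
    low_cdf = 0.0 if k == 0 else approx.cdf((k - 0.5) / n)   # first band starts at -infinity
    high_cdf = 1.0 if k == n else approx.cdf((k + 0.5) / n)  # last band ends at +infinity
    return high_cdf - low_cdf


for k in range(4):
    print(f"p-hat={k / n:.1f}  exact={exact_probability(k):.3f}  "
          f"normal with correction={corrected_probability(k):.3f}")
```

With a sample this small the two columns are in the same ballpark but clearly not identical, which leads on to the next point.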

But even with the continuity correction, the normal approximation still has limited accuracy for small samples. For the normal approximation to the binomial distribution, we have a more precise guideline than the ballpark n = 30 mentioned above. For the binomial distribution the guideline is that both np and n(1 − p) should be at least 5. Although the number 5 is somewhat arbitrary and would depend on the level of accuracy you require for your particular situation, at least this guideline incorporates not only the sample size but also the skew. For p ≈ 0.5, the binomial distribution is roughly symmetrical. For small values of p the binomial distribution is positively skewed (like our letter A example above), and for large values of p the binomial distribution is negatively skewed. To apply this test to our example where p = 0.14, we require n × 0.14 to be at least 5, which means n would need to be at least 36. You can use the simulation above to try various sample sizes above and below 36 and see for yourself how close the resulting normal approximation becomes.
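The guideline is easy to turn into a one-line check. In this sketch the function name `min_sample_size` and the `threshold` parameter are made up for illustration; the default of 5 is simply the rule of thumb described above.

```python
import math


def min_sample_size(p, threshold=5):
    """Smallest n for which both n*p and n*(1 - p) reach the threshold."""
    return math.ceil(threshold / min(p, 1 - p))


print(min_sample_size(0.14))  # 36, matching the letter A example
print(min_sample_size(0.5))   # 10, the symmetrical case needs far fewer
```

Only the smaller of p and 1 − p matters, because that is the side of the distribution that runs out of room first.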

## Summary of the Central Limit Theorem

- The central limit theorem says that for sufficiently large random samples, the sample means will be approximately normally distributed, regardless of the distribution of the original data.
- The mean of the sample means equals the population mean (μ), and the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size (σ ÷ √n).
- The central limit theorem allows statisticians to apply a normal approximation to calculate estimates about the population using a small sample of data, and have a measure of confidence in those estimates.
- How well the normal approximation models the sampling distribution depends on the sample size and the shape and skew of the data. There is nothing special about a sample size of 30; this is a rough guideline at best. For normal or symmetrical data, smaller sample sizes can be used. If your data is highly skewed or you require a low margin of error in your statistical estimates, you may need a larger sample size.