Sample Variance Intuition

This blog post focuses on the intuition behind the n-1 divisor in the sample variance. No proofs, but lots of context and motivation.

And you’ll be pleased to know that this post is self-contained. You don’t need to have read either of my previous two posts on sample variance.

For a population X_1, X_2, \ldots , X_N the population variance is defined by

\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2 where \mu = \frac{1}{N} \sum\limits_{i=1}^{N} X_i

and for a sample x_1, x_2, \ldots , x_n the sample variance is defined by

s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2 where \overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i

And the question is: why doesn’t the sample variance have the same formula as the population variance? That is, why isn’t the sample variance given by the direct analogue of the population variance, which is known as the “uncorrected sample variance”:

{s_n}^2 = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2

For the purpose of building intuition, there are two facts that I think are extremely important to keep in mind.

Fact 1. The population variance \sigma^2 is a measure of the “variety” present in the population X_1, X_2, \ldots , X_N. For example, if there’s no variety, i.e. the members are identical, X_1 = X_2 = \ldots = X_N, then \sigma^2 = 0. A special case which never exhibits any variety (and for which \sigma^2 is always 0) is the case N=1, i.e. a population with only one member. And if there is some variety, i.e. X_i \ne X_j for some i \ne j, then \sigma^2 > 0.

Furthermore, if the “variety” is modified by scaling the population by a constant factor \lambda to {\lambda}X_1, {\lambda}X_2, \ldots , {\lambda}X_N, then the population variance scales as the square \lambda^2. So, by uniformly increasing the distances between the members of the population, we increase the population variance (by the square of the scaling factor). Finally, as might be expected of any reasonable measure of variety, simply adding a constant \alpha to each member of the population, giving \alpha+X_1, \alpha+X_2, \ldots , \alpha+X_N, does not change the population variance.
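Here’s a quick numerical sanity check of Fact 1, in Python. The helper function is just a direct transcription of the definition of \sigma^2 above, and the population values are made up purely for illustration.

```python
# A direct transcription of the definition of the population variance,
# used to check the scaling and shifting properties described above.
# The population values below are arbitrary, chosen only for illustration.

def population_variance(pop):
    mu = sum(pop) / len(pop)
    return sum((x - mu) ** 2 for x in pop) / len(pop)

population = [2.0, 5.0, 5.0, 11.0]
lam, alpha = 3.0, 7.0

var = population_variance(population)

print(var)                                                   # positive, since the members differ
print(population_variance([4.0, 4.0, 4.0]))                  # 0.0: no variety at all
print(population_variance([lam * x for x in population]))    # equals lam**2 * var
print(lam ** 2 * var)
print(population_variance([alpha + x for x in population]))  # equals var: shifting changes nothing
```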

Fact 2. The sample variance is designed to be an unbiased estimator of the population variance. That is, for a given n, if you take the average of the sample variance over all possible size n samples of the population, you are supposed to get the population variance. In practice, this means that if you compute a bunch of sample variances by taking several samples of size n, their average should be very close to the population variance.

To put it another way, the sample variance uses the divisor n-1 because of the following identity:

{\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}\ =\  Average({\frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})

Another way to express the identity, in terms of the uncorrected sample variance, is

{\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}\ =\  \frac{n}{n-1}Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})

This means that asking for a reason why there is a divisor of n-1 in the sample variance is the same as asking for a reason why the identity I just stated is true.

Let’s start by dealing with what seems to be an obvious problem with the above identity: it appears to give the wrong answer when n=N. A sample of size N is the whole population, so the uncorrected sample variance is the population variance and needs no correction factor. The uncorrected sample variance is clearly the right thing to use, and the sample variance is clearly wrong.

And that would be correct, except for one detail that is typically never explicitly mentioned: the sampling is with repetition allowed. If we don’t allow repetition then yes, the uncorrected sample variance is the correct estimator to use when n=N. But we do allow repetition, and so the sample variance is back in the running as an unbiased estimator.

The above identity neglected to describe what kind of samples the average is taken over. Here’s the identity again, stated more precisely:

{\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}\ =\  \frac{n}{n-1}\underset{with\ repetition}{Average}({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})
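Since everything here is finite, the identity can be checked by brute force. Here’s a small Python script that does exactly that, interpreting “all samples with repetition” as all ordered n-tuples drawn from the population with replacement (the population values and the choice of n are arbitrary, just for illustration):

```python
# Brute-force check of the identity: average the uncorrected sample variance
# over every size-n sample with repetition allowed, then apply the n/(n-1)
# correction factor and compare with the population variance.
from itertools import product

def population_variance(pop):
    mu = sum(pop) / len(pop)
    return sum((x - mu) ** 2 for x in pop) / len(pop)

def uncorrected_sample_variance(sample):
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample) / len(sample)

population = [1.0, 4.0, 6.0, 9.0]
n = 3

samples = list(product(population, repeat=n))   # all ordered n-tuples, repetition allowed
average = sum(uncorrected_sample_variance(s) for s in samples) / len(samples)

print(population_variance(population))          # the population variance
print(n / (n - 1) * average)                    # agrees with it, up to floating-point rounding
```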

Intuitively speaking, it seems plausible that for any reasonable way of measuring variety, repetition should reduce that measure of variety. So you should expect the variety of a sample (as measured by the uncorrected sample variance, which for samples is the direct analogue of the population variance) to be smaller when averaged over all samples with repetition allowed than when averaged over all samples without repetition, i.e. we should expect the following inequality to be true:

\underset{with\ repetition}{Average}({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})\ <\ \underset{without\ repetition}{Average}({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})

So, if the uncorrected sample variance is the correct construct to use when sampling without repetition in the case n=N, then some correction factor bigger than one (e.g. \frac{n}{n-1}) needs to be applied to the uncorrected sample variance when sampling with repetition is allowed.

At this point you might conclude that the n-1 denominator in the sample variance is simply due to the type of sampling being done i.e. that it’s an artifact of sampling with repetition allowed, and if you were to use “real world” sampling (i.e. sampling with no repetition allowed) then the uncorrected sample variance would be the correct construct to use. That’s not correct. It happens to be correct for n=N but it is not correct in general. There’s more to the story.

And, just so that there’s no suspense, the correction factor when sampling without repetition is \frac{N-1}{N}\frac{n}{n-1}, which, just like the correction factor when sampling with repetition allowed, \frac{n}{n-1}, is bigger than one when n < N. In other words, I’m saying that the following identity holds for sampling without repetition

{\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}\ =\  \frac{N-1}{N}\frac{n}{n-1}\underset{without\ repetition}{Average}({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})
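The same kind of brute-force check works here, this time interpreting “all samples without repetition” as all size-n subsets of the population (again, the specific numbers are arbitrary and only for illustration):

```python
# Brute-force check of the sampling-without-repetition identity: average the
# uncorrected sample variance over every size-n subset of the population,
# apply the (N-1)/N * n/(n-1) correction factor, and compare.
from itertools import combinations

def population_variance(pop):
    mu = sum(pop) / len(pop)
    return sum((x - mu) ** 2 for x in pop) / len(pop)

def uncorrected_sample_variance(sample):
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample) / len(sample)

population = [1.0, 4.0, 6.0, 9.0]
N, n = len(population), 3

samples = list(combinations(population, n))     # all size-n subsets, no repetition
average = sum(uncorrected_sample_variance(s) for s in samples) / len(samples)

print(population_variance(population))                      # the population variance
print((N - 1) / N * n / (n - 1) * average)                  # agrees with it, up to rounding
```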

Regardless of what kind of sampling we perform, the correction factor is never less than one (it equals one only in the case n=N with sampling without repetition, which we already discussed).

In other words, the following inequality will always be true

Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})\ \leqslant\  {\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}

One way to see that this might be true is to introduce an in-between quantity that uses \mu instead of \overline{x} in the formula for the sample variance. Using that quantity, we have the following less mysterious (I claim) chain consisting of an inequality and an identity:

Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})\ \leqslant\ Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \mu)^2}) = {\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}

This is of course a shorthand for two relationships, each of which is true for different reasons.

Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2})\ \leqslant\ Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \mu)^2})

Average({\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \mu)^2}) = {\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}

I’ll now tackle each of the two relationships, starting with the identity.

The second relationship, the identity, is a special case of a more general identity involving a family of population constructs (parameterized by a real number \alpha) and their corresponding sample constructs. For a constant value \alpha, define \Sigma(\alpha)^2 and S(\alpha)^2 as follows

{\Sigma(\alpha)}^2 = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \alpha)^2

{S(\alpha)}^2 = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \alpha)^2

It turns out, and it seems quite plausible, that S(\alpha)^2 is an unbiased estimator of \Sigma(\alpha)^2, i.e.

Average(S(\alpha)^2) = \Sigma(\alpha)^2

And, in particular, for the special case \alpha = \mu we have

Average(S(\mu)^2) = \Sigma(\mu)^2

which is the identity we’re interested in, since S(\mu)^2 = {\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \mu)^2} and \Sigma(\mu)^2 = {\frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2}
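This general identity can also be checked by brute force for any fixed \alpha, and it holds whether the averaging is done over samples with repetition or without. Here’s a sketch of that check (the population, n, and \alpha are arbitrary illustrative choices):

```python
# Check that the average of S(alpha)^2 over all samples equals Sigma(alpha)^2,
# for an arbitrary constant alpha, under both kinds of sampling.
from itertools import product, combinations

def mean_squared_deviation(values, alpha):
    # This is Sigma(alpha)^2 when applied to the population,
    # and S(alpha)^2 when applied to a sample.
    return sum((v - alpha) ** 2 for v in values) / len(values)

population = [1.0, 4.0, 6.0, 9.0]
n, alpha = 3, 2.5

with_rep = list(product(population, repeat=n))
without_rep = list(combinations(population, n))

print(mean_squared_deviation(population, alpha))   # Sigma(alpha)^2
print(sum(mean_squared_deviation(s, alpha) for s in with_rep) / len(with_rep))        # same value
print(sum(mean_squared_deviation(s, alpha) for s in without_rep) / len(without_rep))  # same value
```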

Now for the first relationship, the inequality. This inequality about averages of samples is true because it is true of each individual sample. And, the same sample construct we just used to motivate the identity is also relevant here. If we keep x_1, x_2, \ldots x_n fixed and vary \alpha, then S(\alpha)^2 is minimized when \alpha = \overline{x} i.e.

S(\overline{x})^2 \leqslant S(\alpha)^2 for all \alpha for any sample x_1, x_2, \ldots x_n
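I promised no proofs, but if you want a hint as to why this minimization holds: expanding the square and using the fact that the deviations x_i - \overline{x} sum to zero gives the standard decomposition

S(\alpha)^2 = S(\overline{x})^2 + (\overline{x} - \alpha)^2

and since the second term is never negative, S(\alpha)^2 is smallest precisely when \alpha = \overline{x}.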

which, taking the special case \alpha = \mu, of course means that

S(\overline{x})^2 \leqslant S(\mu)^2 for any sample x_1, x_2, \ldots x_n

And since it’s true for each sample, it’s true of the average over all samples i.e.

Average(S(\overline{x})^2) \leqslant Average(S(\mu)^2)

And this is the first relationship, i.e the inequality, since S(\overline{x})^2 = {\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2} and S(\mu)^2 = {\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \mu)^2}

And that’s why the correction factor is never less than one, whichever kind of sampling we do, with or without repetition allowed.

Author: Walter Vannini

Hi, I'm Walter Vannini. I'm a computer programmer and I'm based in the San Francisco Bay Area. Before I wrote software, I was a mathematics professor. I think about math, computer science, and related fields all the time, and this blog is one of my outlets. I can be reached via walterv at gbbservices dot com.
