This blog post will focus on the intuition behind the $n-1$ divisor of the sample variance. No proofs, but lots of context and motivation.
And you’ll be pleased to know that this post is self-contained. You don’t need to read either of my previous two posts on sample variance.
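For a population $x_1, \dots, x_N$ with mean $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, the population variance is defined by
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
and for a sample $y_1, \dots, y_n$ with mean $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, the sample variance is defined by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$$
And the question is “why doesn’t the sample variance have the same formula as the population variance?” i.e. why isn’t the sample variance given by the direct analogue of the population variance, which is known as the “uncorrected sample variance”:
$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$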
For the purpose of building intuition, there are two facts that I think are extremely important to keep in mind.
Fact 1. The population variance $\sigma^2$ is a measure of the “variety” present in the population. For example, if there’s no variety, i.e. the members are identical, i.e. $x_1 = x_2 = \dots = x_N$, then $\sigma^2 = 0$. A special case which never exhibits any variety (and for which $\sigma^2$ is always $0$) is the case where $N = 1$, i.e. the case of a population with only one member. And if there is some variety, i.e. $x_i \neq x_j$ for some $i \neq j$, then $\sigma^2 > 0$.
Furthermore, if “variety” is modified by scaling the population by a constant factor $c$ to $c x_1, \dots, c x_N$, then the population variance scales as the square, to $c^2 \sigma^2$. So, by uniformly increasing the distance between the members of the population, we increase the population variance (by the square of the scaling factor). Finally, as might be expected of any reasonable measure of variety, simply adding a constant to each member of the population does not change the population variance.
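
Here’s a small Python sketch of Fact 1. The population values are made up for illustration, and `population_variance` is just a helper name chosen for this example. It uses exact fractions so that the comparisons aren’t blurred by floating point rounding.

```python
from fractions import Fraction

def population_variance(values):
    """Mean squared deviation from the population mean (divisor N)."""
    n_members = len(values)
    mean = Fraction(sum(values), n_members)
    return sum((Fraction(v) - mean) ** 2 for v in values) / n_members

population = [2, 3, 5, 10]  # hypothetical population, chosen arbitrarily
sigma2 = population_variance(population)

# No variety: identical members, or a population with a single member.
no_variety = population_variance([7, 7, 7, 7])
single_member = population_variance([42])

# Scaling every member by c = 3 multiplies the variance by c**2 = 9.
scaled = population_variance([3 * v for v in population])

# Adding a constant to every member leaves the variance unchanged.
shifted = population_variance([v + 100 for v in population])

print(sigma2, no_variety, single_member, scaled, shifted)
```
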
Fact 2. The sample variance is designed to be an unbiased estimator of the population variance. That is, for a given $n$, if you take the average of the sample variance over all possible samples of size $n$ from the population, you are supposed to get the population variance. So, in practice, if you compute a bunch of sample variances by taking several samples of size $n$, their average should be close to the population variance, and closer the more samples you take.
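To put it another way, the sample variance $s^2$ uses the divisor $n-1$ because of the following identity, where $\operatorname{avg}$ denotes the average over all possible samples of size $n$:
$$\operatorname{avg}\left(s^2\right) = \sigma^2$$
Another way to express the identity, in terms of the uncorrected sample variance $s_n^2$, is
$$\frac{n}{n-1}\operatorname{avg}\left(s_n^2\right) = \sigma^2$$
This means that asking for a reason why there is a divisor of $n-1$ in the sample variance is the same as asking for a reason why the identity I just stated is true.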
Let’s start by dealing with what seems to be an obvious problem with the above identity: it seems to give the wrong answer when $n = N$. A sample of size $N$ is the whole population, so the uncorrected sample variance is the population variance and needs no correction factor. The uncorrected sample variance is clearly the right thing to use and the sample variance is clearly wrong.
And that would be correct, except for one detail that is typically never explicitly mentioned: the sampling is with repetition allowed. If we don’t allow repetition then yes, the uncorrected sample variance is the correct estimator to use when $n = N$. But we do allow repetition, and so the sample variance is back in the running as an unbiased estimator.
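The above identity neglected to describe what kind of samples the average is taken over. Here’s the identity again, but more precisely stated, with the average taken over all $N^n$ samples of size $n$ drawn with repetition allowed:
$$\operatorname{avg}_{\text{with repetition}}\left(s^2\right) = \sigma^2$$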
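
The claim that the sample variance, averaged over every sample drawn with repetition allowed, equals the population variance can be checked by brute force on a tiny population. The sketch below is only an illustration: the population values, the sample size, and the helper name `mean_sq_dev` are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

def mean_sq_dev(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor

population = [2, 3, 5, 10]  # hypothetical population
N = len(population)
n = 3                       # sample size, chosen arbitrarily

sigma2 = mean_sq_dev(population, N)  # population variance (divisor N)

# All N**n ordered samples of size n, with repetition allowed.
samples = list(product(population, repeat=n))

# Average of the sample variance (divisor n - 1) over every sample.
avg_corrected = sum(mean_sq_dev(s, n - 1) for s in samples) / len(samples)

# Average of the uncorrected sample variance (divisor n) over every sample.
avg_uncorrected = sum(mean_sq_dev(s, n) for s in samples) / len(samples)

print(sigma2, avg_corrected, avg_uncorrected)
```

With these values, the averaged sample variance matches the population variance exactly, while the averaged uncorrected sample variance comes out smaller by the factor $\frac{n-1}{n}$.
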
Intuitively speaking, it seems reasonable to assume that for any reasonable way of measuring variety, repetition should reduce that measure of variety. So you should expect that the variety of a sample (as measured by the uncorrected sample variance $s_n^2$, which for samples is the direct analogue of the population variance) should be smaller when averaged over all samples with repetition allowed than when it is averaged over all samples without repetition, i.e. we should expect that the following inequality is true:
$$\operatorname{avg}_{\text{with repetition}}\left(s_n^2\right) \le \operatorname{avg}_{\text{without repetition}}\left(s_n^2\right)$$
So, if the uncorrected sample variance is the correct construct to use when sampling without repetition in the case where $n = N$, then some correction factor bigger than one (e.g. $\frac{n}{n-1}$) would need to be applied to the uncorrected sample variance when sampling with repetition allowed.
At this point you might conclude that the denominator $n-1$ in the sample variance is simply due to the type of sampling being done, i.e. that it’s an artifact of sampling with repetition allowed, and if you were to use “real world” sampling (i.e. sampling with no repetition allowed) then the uncorrected sample variance would be the correct construct to use. That’s not correct. It happens to be correct for $n = N$ but it is not correct in general. There’s more to the story.
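And, just so that there’s no suspense, the correction factor when sampling without repetition is $\frac{n}{n-1} \cdot \frac{N-1}{N}$, which, just like the correction factor when sampling with repetition allowed, $\frac{n}{n-1}$, is bigger than one when $n < N$. In other words, I’m saying that the following identity holds for sampling without repetition
$$\frac{n}{n-1} \cdot \frac{N-1}{N} \operatorname{avg}_{\text{without repetition}}\left(s_n^2\right) = \sigma^2$$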
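
The correction factor for sampling without repetition, $\frac{n}{n-1} \cdot \frac{N-1}{N}$, can be checked the same brute-force way. This is a sketch under assumptions: the population is made up, `mean_sq_dev` is a helper name chosen for this example, and the average is taken over unordered samples, which gives the same result as ordered ones because the variance doesn’t depend on order.

```python
from fractions import Fraction
from itertools import combinations

def mean_sq_dev(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor

population = [2, 3, 5, 10]  # hypothetical population
N = len(population)
sigma2 = mean_sq_dev(population, N)  # population variance (divisor N)

corrected_averages = {}
for n in range(2, N + 1):
    # Every sample of size n with no member used more than once.
    samples = list(combinations(population, n))
    avg_uncorrected = sum(mean_sq_dev(s, n) for s in samples) / len(samples)
    factor = Fraction(n * (N - 1), (n - 1) * N)
    corrected_averages[n] = factor * avg_uncorrected

print(sigma2, corrected_averages)
```

For every sample size from $2$ up to $N$, the corrected average lands exactly on the population variance, and at $n = N$ the factor reduces to one.
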
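Regardless of what kind of sampling we perform, the correction factor will never be smaller than one, and it is strictly bigger than one except in the case of sampling without repetition with $n = N$.
In other words, the following inequality will always be true, whichever kind of sampling the average is taken over
$$\operatorname{avg}\left(s_n^2\right) \le \sigma^2$$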
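One way to see that this might be true is to introduce an in-between quantity that uses the population mean $\mu$ instead of the sample mean $\bar{y}$ in the formula for the uncorrected sample variance. Using that quantity, we have the following less mysterious (I claim) chain of inequalities and identities:
$$\operatorname{avg}\left(\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2\right) \le \operatorname{avg}\left(\frac{1}{n}\sum_{i=1}^{n} (y_i - \mu)^2\right) = \sigma^2$$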
This is of course a shorthand for two relationships, each of which is true for different reasons.
I’ll now tackle each of the two relationships, starting with the identity.
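The second relationship, the identity, is a special case of a more general identity involving a family of population constructs (parameterized by a real number $a$) and their corresponding sample constructs. For a constant value $a$, define $\sigma_a^2$ and $s_a^2$ as follows
$$\sigma_a^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - a)^2 \qquad s_a^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - a)^2$$
It turns out, and it seems quite plausible, that $s_a^2$ is an unbiased estimator of $\sigma_a^2$, i.e.
$$\operatorname{avg}\left(s_a^2\right) = \sigma_a^2$$
And, in particular, for the special case $a = \mu$ we have
$$\operatorname{avg}\left(s_\mu^2\right) = \sigma_\mu^2$$
which is the identity we’re interested in, since $\sigma_\mu^2 = \sigma^2$ and $s_\mu^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \mu)^2$.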
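
Here’s a sketch that checks this unbiasedness claim for a few values of the constant, for both kinds of sampling. The population, the sample size, the list of constants, and the helper names `mean_sq_dev_about` and `average_over` are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations, product

def mean_sq_dev_about(values, a):
    """Mean squared deviation of `values` from the fixed constant `a`."""
    return sum((Fraction(v) - a) ** 2 for v in values) / len(values)

def average_over(samples, a):
    """Average of mean_sq_dev_about(sample, a) over every sample given."""
    samples = list(samples)
    return sum(mean_sq_dev_about(s, a) for s in samples) / len(samples)

population = [2, 3, 5, 10]  # hypothetical population
n = 3                       # sample size, chosen arbitrarily
mu = Fraction(sum(population), len(population))

results = []
for a in [0, Fraction(7, 2), mu, -4]:  # arbitrary constants, including mu
    target = mean_sq_dev_about(population, a)  # the population construct
    with_rep = average_over(product(population, repeat=n), a)
    without_rep = average_over(combinations(population, n), a)
    results.append((a, target, with_rep, without_rep))

for row in results:
    print(row)
```

In every row the three values agree, and in the row for the population mean they all equal the population variance.
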
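Now for the first relationship, the inequality. This inequality about averages of samples is true because it is true of each individual sample. And the same sample construct we just used to motivate the identity, $s_a^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - a)^2$, is also relevant here. If we keep the sample fixed and vary $a$, then $s_a^2$ is minimized when $a = \bar{y}$, i.e.
$$s_{\bar{y}}^2 \le s_a^2$$
for all $a$, for any sample
which, taking the special case $a = \mu$, of course means that
$$s_{\bar{y}}^2 \le s_\mu^2$$
for any sample
And since it’s true for each sample, it’s true of the average over all samples, i.e.
$$\operatorname{avg}\left(s_{\bar{y}}^2\right) \le \operatorname{avg}\left(s_\mu^2\right)$$
And this is the first relationship, i.e. the inequality, since $s_{\bar{y}}^2 = s_n^2$ and $\operatorname{avg}\left(s_\mu^2\right) = \sigma^2$.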
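
The per-sample claim, that the mean squared deviation about a constant is smallest when the constant is the sample’s own mean, can also be checked directly. The sketch below (made-up population, arbitrary grid of candidate constants, hypothetical helper name `mean_sq_dev_about`) also checks the standard decomposition that explains it: the mean squared deviation about $a$ equals the mean squared deviation about the sample mean plus the squared distance from the sample mean to $a$.

```python
from fractions import Fraction
from itertools import product

def mean_sq_dev_about(values, a):
    """Mean squared deviation of `values` from the fixed constant `a`."""
    return sum((Fraction(v) - a) ** 2 for v in values) / len(values)

population = [2, 3, 5, 10]  # hypothetical population
n = 3                       # sample size, chosen arbitrarily
mu = Fraction(sum(population), len(population))

# Candidate constants: 0, 1/2, 1, ..., 12, plus the population mean.
candidates = [Fraction(k, 2) for k in range(25)] + [mu]

violations = 0         # samples where some constant beats the sample mean
identity_failures = 0  # cases where the decomposition does not hold
for sample in product(population, repeat=n):
    sample_mean = Fraction(sum(sample), n)
    at_sample_mean = mean_sq_dev_about(sample, sample_mean)
    for a in candidates:
        about_a = mean_sq_dev_about(sample, a)
        if at_sample_mean > about_a:
            violations += 1
        if about_a != at_sample_mean + (sample_mean - a) ** 2:
            identity_failures += 1

print(violations, identity_failures)
```

Since the extra term is a square, it can never be negative, which is why no constant, the population mean included, can do better than the sample mean on any individual sample.
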
And that’s why the correction factor is bigger than one: for all sampling, with or without repetition allowed.