Sample Variance Intuition

This blog post will focus on the intuition behind the n-1 divisor in the sample variance. No proofs, but lots of context and motivation.

And you’ll be pleased to know that this post is self-contained. You don’t need to read either of my previous two posts on sample variance.

For a population X_1, X_2, \ldots , X_N the population variance is defined by

\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2 where \mu = \frac{1}{N} \sum\limits_{i=1}^{N} X_i

and for a sample x_1, x_2, \ldots , x_n the sample variance is defined by

s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2 where \overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i

And the question is “why doesn’t the sample variance have the same formula as the population variance?” That is, why isn’t the sample variance given by the direct analogue of the population variance, which is known as the “uncorrected sample variance”:

{s_n}^2 = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2
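One way to see that something really is off with the uncorrected formula is to try both divisors exhaustively on a toy example. The sketch below (the four-element population is my own made-up assumption, not from the post) enumerates every possible size-2 sample drawn with replacement and averages the two estimators:

```python
import itertools

# Assumed toy population, chosen only for illustration.
population = [2.0, 4.0, 6.0, 8.0]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2  # sample size
uncorrected = []  # divisor n
corrected = []    # divisor n - 1

# Enumerate every ordered sample of size n drawn with replacement.
for sample in itertools.product(population, repeat=n):
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    uncorrected.append(ss / n)
    corrected.append(ss / (n - 1))

avg_uncorrected = sum(uncorrected) / len(uncorrected)
avg_corrected = sum(corrected) / len(corrected)

# sigma2 is 5.0; the n divisor averages to 2.5, while the
# n - 1 divisor averages to exactly 5.0.
print(sigma2, avg_uncorrected, avg_corrected)
```

Because every possible sample is enumerated rather than randomly drawn, the averages here are exact expected values, not Monte Carlo estimates: the uncorrected estimator systematically undershoots, and the n-1 divisor exactly compensates.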

For the purpose of building intuition, there are two facts that I think are extremely important to keep in mind.
Continue reading “Sample Variance Intuition”

Sample Statistics Part 2

Since writing my earlier sample statistics blog post I’ve learned a few things and thought I’d provide an update. Firstly, a quick recap.

For a population X_1, X_2, \ldots , X_N the population variance is defined by

\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2 where \mu = \frac{1}{N} \sum\limits_{i=1}^{N} X_i

and for a sample x_1, x_2, \ldots , x_n the sample variance is defined by

s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2 where \overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i

The divisor is n-1 instead of n because that’s what makes the expected sample variance equal the population variance when the samples are taken with replacement (as discussed in detail in my first sample statistics blog post).

What I didn’t consider in the previous blog post was the case where the samples are taken without allowing repetition (which is often how real-life sampling is done). I didn’t consider it because at the time I didn’t know how to perform the relevant analysis. Since then I’ve figured it out (and, as far as I know, it’s not explained elsewhere). It turns out that the divisor isn’t n-1, and it’s not n either. It’s {\frac{N}{N-1}}(n-1).
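Before getting to the analysis, the claimed divisor can be checked by brute force on a small example (the N = 4 population below is my own assumption for illustration): enumerate every ordered size-n sample drawn without repetition and average the estimators.

```python
import itertools

# Assumed toy population, chosen only for illustration.
population = [2.0, 4.0, 6.0, 8.0]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2
divisor = (N / (N - 1)) * (n - 1)  # the claimed without-replacement divisor

estimates_nminus1 = []  # plain n - 1 divisor
estimates_new = []      # (N / (N - 1)) * (n - 1) divisor

# Enumerate every ordered sample of size n drawn without repetition.
for sample in itertools.permutations(population, n):
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    estimates_nminus1.append(ss / (n - 1))
    estimates_new.append(ss / divisor)

avg_nminus1 = sum(estimates_nminus1) / len(estimates_nminus1)
avg_new = sum(estimates_new) / len(estimates_new)

# sigma2 is 5.0; the plain n - 1 divisor now overshoots (20/3),
# while the corrected divisor averages to exactly 5.0.
print(sigma2, avg_nminus1, avg_new)
```

Note the direction of the error flips relative to the with-replacement case: without repetition, dividing by n-1 makes the estimator too large on average, and the N/(N-1) factor scales it back down.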

Here’s how to see that, together with some ideas that I think are conceptually helpful when dealing with these matters. Continue reading “Sample Statistics Part 2”

Sample Statistics

If you ever take an introductory statistics course, you’ll very quickly find yourself taking a sample of size n, x_1, x_2, \ldots , x_n, from a population of size N described by X_1, X_2, \ldots , X_N. And then you’ll want to say something about the population based on the sample.

The usual numbers derived from the population are the population mean and the population variance

\mu = E(X) = \frac{1}{N} \sum\limits_{i=1}^{N} X_i

\sigma^2 = Var(X) = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2

The usual numbers derived from the sample are the sample mean and the sample variance

\overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i

s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2

And the sample mean and sample variance are then used to provide an estimate for the corresponding population numbers.
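This population/sample split is baked into standard tooling. For instance, Python’s statistics module draws exactly this distinction: pvariance divides by N and variance divides by n-1. A small sketch, with an assumed population and an assumed sample from it:

```python
import statistics

# Assumed population and sample, chosen only for illustration.
population = [2, 4, 4, 4, 5, 5, 7, 9]
sample = [2, 4, 5, 7]

mu = statistics.mean(population)
sigma2 = statistics.pvariance(population)  # population variance: divides by N
xbar = statistics.mean(sample)
s2 = statistics.variance(sample)           # sample variance: divides by n - 1

# mu = 5, sigma2 = 4, xbar = 4.5, s2 = 13/3
print(mu, sigma2, xbar, s2)
```

So when a library offers two variance functions, the difference is precisely the divisor discussed here: use the population version only when your data is the whole population.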

The usual question that comes up is why n-1 for the sample variance, especially since N (not N-1) is being used for the population variance. Why the gratuitous inconsistency? What’s going on?

This certainly puzzled me when I took my first statistics course (many things puzzled me, even though I aced the exams). This blog post is about what I would tell my earlier self if I had access to a time machine. Continue reading “Sample Statistics”