Sample Variance Intuition

This blog post focuses on the intuition behind the $n-1$ divisor in the sample variance. No proofs, but lots of context and motivation.

And you’ll be pleased to know that this post is self-contained: you don’t need to have read either of my previous two posts on sample variance.

For a population $X_1, X_2, \ldots , X_N$ the population variance is defined by

$\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2$ where $\mu = \frac{1}{N} \sum\limits_{i=1}^{N} X_i$

and for a sample $x_1, x_2, \ldots , x_n$ the sample variance is defined by

$s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2$ where $\overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i$

And the question is “why doesn’t the sample variance have the same formula as the population variance?” i.e. why isn’t the sample variance given by the direct analogue of the population variance, which is known as the “uncorrected sample variance”:

${s_n}^2 = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2$
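To make the two competing formulas concrete, here is a direct computation of both on one small sample (the numbers are my own illustration, not from the post):

```python
# Compute the uncorrected (divide by n) and corrected (divide by n-1)
# sample variances for one tiny made-up sample.
sample = [1, 2, 3]
n = len(sample)
xbar = sum(sample) / n                      # sample mean = 2.0
ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations = 2.0

s2_uncorrected = ss / n        # 2/3, the direct analogue of sigma^2
s2 = ss / (n - 1)              # 1, the usual sample variance
print(s2_uncorrected, s2)
```

The only difference between the two is the divisor, yet the corrected version is systematically larger, which is exactly the puzzle the post addresses.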

For the purpose of building intuition, there are two facts that I think are extremely important to keep in mind.

Sample Statistics Part 2

Since writing my earlier sample statistics blog post I’ve learned a few things and thought I’d provide an update. Firstly, a quick recap.

For a population $X_1, X_2, \ldots , X_N$ the population variance is defined by

$\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2$ where $\mu = \frac{1}{N} \sum\limits_{i=1}^{N} X_i$

and for a sample $x_1, x_2, \ldots , x_n$ the sample variance is defined by

$s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2$ where $\overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i$

The reason $n-1$ is used instead of $n$ is that this is the choice that makes the average sample variance equal to the population variance, when the samples are taken with replacement allowed (as discussed in detail in my first sample statistics blog post).
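A tiny exhaustive check makes this claim concrete (my own sketch, not from the original post): enumerate every size-$n$ sample drawn with replacement from a three-element population and average both estimators.

```python
# Exhaustive check on a tiny population: average both estimators over
# ALL samples of size n drawn with replacement, and compare with the
# population variance. (Illustrative numbers of my own choosing.)
from itertools import product

population = [1, 2, 3]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2
uncorrected, corrected = [], []
for sample in product(population, repeat=n):          # with replacement
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    uncorrected.append(ss / n)        # divide by n
    corrected.append(ss / (n - 1))    # divide by n - 1

# The average of the n-1 version recovers sigma^2 exactly, while the
# 1/n version comes out a factor of (n-1)/n too small.
print(sigma2, sum(corrected) / len(corrected), sum(uncorrected) / len(uncorrected))
```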

What I didn’t consider in the previous blog post was the case where the samples are taken without allowing repetition (which is often the way that real-life sampling is done). I didn’t, because at the time I didn’t know how to perform the relevant analysis. Since then I’ve figured it out (and, as far as I know, it’s not explained elsewhere). It turns out that the divisor isn’t $n-1$, and it’s not $n$ either: it’s ${\frac{N}{N-1}}(n-1)$.
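The same kind of exhaustive enumeration verifies this divisor numerically (again a sketch of my own, on a deliberately tiny population):

```python
# Exhaustive check of the without-replacement divisor N/(N-1) * (n-1),
# enumerating every ordered sample with no repeated elements.
# (Tiny illustrative population of my own choosing.)
from itertools import permutations

population = [1, 2, 3]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2
divisor = N / (N - 1) * (n - 1)
estimates = []
for sample in permutations(population, n):            # without replacement
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    estimates.append(ss / divisor)

# The average recovers sigma^2 exactly; dividing by n-1 instead would
# overshoot by the factor N/(N-1).
print(sigma2, sum(estimates) / len(estimates))
```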

Here’s how to see that, together with some ideas that I think are conceptually helpful when dealing with these matters. Continue reading “Sample Statistics Part 2”

Sample Statistics

If you ever take an introductory statistics course, you’ll very quickly find yourself taking a sample of size $n$, like $x_1, x_2, \ldots , x_n$, from a population of size $N$ described by $X_1, X_2, \ldots , X_N$. And then you’ll want to try to say something about the population from the sample.

The usual numbers derived from the population are the population mean and the population variance

$\mu = E(X) = \frac{1}{N} \sum\limits_{i=1}^{N} X_i$

$\sigma^2 = Var(X) = \frac{1}{N} \sum\limits_{i=1}^{N} (X_i - \mu)^2$

The usual numbers derived from the sample are the sample mean and the sample variance

$\overline{x} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i$

$s^2 = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \overline{x})^2$

And the sample mean and sample variance are then used to provide an estimate for the corresponding population numbers.
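These two conventions are built directly into Python’s standard library: `statistics.pvariance` divides by $N$ (the population formula) and `statistics.variance` divides by $n-1$ (the sample formula). A quick check on some made-up data:

```python
# pvariance implements the population formula (divide by N);
# variance implements the sample formula (divide by n - 1).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # example data of my own; mean = 5
print(statistics.pvariance(data))     # sum of squared deviations 32, over 8 -> 4
print(statistics.variance(data))      # same 32, over 7 -> 32/7
```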

The usual question that comes up is why $n-1$ for the sample variance, especially since $N$ (not $N-1$) is being used for the population variance. Why the gratuitous inconsistency? What’s going on?

This certainly puzzled me when I took my first statistics course (many things puzzled me, even though I aced the exams). This blog post is about what I would tell my earlier self if I had access to a time machine. Continue reading “Sample Statistics”