[A, SfS] Chapter 4: Estimation: 4.1: Bias, Variance, and Mean Square Error
Bias, Variance and Mean Square Error (MSE)
In this lesson, you will learn about assessing the quality of a statistic.
#\text{}#
We now transition from Probability to Statistics.
Statistic
A statistic is a number computed from a set of measurements of some variable on a sample from a population, which we then use to make an inference about the distribution of that variable in the population.
For this course we will focus on inference about the unknown value of some parameter of the distribution, such as the mean. The statistic is a point estimator of the parameter, because it gives an estimate of the value of the parameter as a single point on the real number line.
At first we will not specify which parameter we want to estimate, and we will use the generic symbol #\theta# for a parameter (the Greek letter theta). Then we will use the symbol #\hat{\theta}# (“theta-hat”) to represent the statistic which is intended to estimate #\theta#.
Since the value of #\hat{\theta}# is based on a random sample, #\hat{\theta}# is a random variable, and thus has a probability distribution, with a mean and a variance. If we know this information, we can assess the quality of #\hat{\theta}# as an estimator of #\theta#.
#\text{}#
The first quality we will assess about an estimator is the degree to which it is biased.
Unbiased Estimator
The estimator #\hat{\theta}# is unbiased if its expected value equals #\theta#: \[E(\hat{\theta}) = \theta\]
This means that if we could take every possible sample of a given size from the population, and compute the value of #\hat{\theta}# for each of those samples, the values should be centered around #\theta#. But if the distribution of those values is not centered around #\theta#, then there is some systematic bias in the estimator #\hat{\theta}#. In other words, it is “off target”, i.e., less accurate.
Bias
We define the bias of #\hat{\theta}# as:
\[B(\hat{\theta}) = E(\hat{\theta}) - \theta\] Note that the bias can be positive or negative, and it is zero precisely when #\hat{\theta}# is unbiased.
For example, suppose #X_1,X_2,X_3# are independent measurements of some quantitative variable whose distribution on the population has an expectation of #2\theta#. Suppose we plan to estimate the value of #\theta# using:
\[\hat{\theta} = \cfrac{X_1+X_2+X_3}{7}\] What would be the bias of #\hat{\theta}#?
Using the linearity properties of the expectation, which we studied earlier, we have:
\[\begin{array}{rcl}
E(\hat{\theta}) &=& E\bigg(\cfrac{X_1+X_2+X_3}{7}\bigg)\\\\
&=& \cfrac{1}{7}\Big(E(X_1) + E(X_2) + E(X_3)\Big)\\\\
&=& \cfrac{1}{7}(2\theta + 2\theta + 2\theta)\\\\
&=& \cfrac{6\theta}{7}
\end{array}\] Then the bias of #\hat{\theta}# is:
\[B(\hat{\theta}) = E(\hat{\theta}) - \theta = \cfrac{6\theta}{7} - \theta = -\cfrac{\theta}{7}\] So #\hat{\theta}# is a biased estimator of the parameter #\theta#.
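The lesson does not specify the distribution of the #X_i#; as an illustration only, the following sketch assumes an exponential distribution with mean #2\theta# (so that #E(X_i) = 2\theta# holds) and checks the bias by simulation.
```python
import numpy as np

# Illustrative simulation sketch (not part of the course material).
# Assumption: X_i ~ Exponential with mean 2*theta, so E(X_i) = 2*theta.
rng = np.random.default_rng(0)
theta = 3.0

# Draw many samples of size 3 and compute theta_hat = (X1 + X2 + X3) / 7 for each.
X = rng.exponential(scale=2 * theta, size=(200_000, 3))
theta_hat = X.sum(axis=1) / 7

print("empirical E(theta_hat):", theta_hat.mean())    # close to 6*theta/7 ≈ 2.571
print("empirical bias:", theta_hat.mean() - theta)    # close to -theta/7 ≈ -0.429
```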
#\text{}#
Another quality of an estimator that is of interest is its variance. We would like to feel assured that, even if the value of #\hat{\theta}# computed from a sample is not exactly equal to #\theta#, its value isn't likely to be very far away from #\theta#, i.e., its variance is low. This means that the estimator has greater precision.
We have already learned how to compute the variance of a linear combination of independent random variables.
To review, suppose again that #X_1,X_2,X_3# are independent measurements of some quantitative variable, and suppose that the distribution of this variable on the population has a variance of #4\theta^2#.
Suppose again we plan to estimate the value of #\theta# using: \[\hat{\theta} =\cfrac{X_1 +X_2 +X_3}{7}\] What is the variance of #\hat{\theta}#?
We use the method we previously covered to compute this variance:
\[\begin{array}{rcl}V(\hat{\theta}) &=& V\bigg(\cfrac{X_1+X_2+X_3}{7}\bigg)\\\\
&=& \cfrac{1}{49}\Big(V(X_1) + V(X_2) + V(X_3)\Big)\\\\
&=& \cfrac{1}{49}(4\theta^2 + 4\theta^2 + 4\theta^2)\\\\
&=& \cfrac{12\theta^2}{49}\\\\
&\approx& 0.2449\theta^2
\end{array}\]
There is no way to decide whether this is a “good” variance or a “bad” variance. But if an alternative estimator of #\theta# is proposed, we can compute the variance of that estimator and then make a comparison. The estimator with the smaller variance might be the better estimator. But don't forget, the above estimator was slightly biased, so even if it has a smaller variance, its bias might make it less preferable.
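As a numerical check on the computation above, the earlier simulation sketch can be reused: under the illustrative assumption that each #X_i# is exponential with mean #2\theta#, the variance of each #X_i# is #(2\theta)^2 = 4\theta^2#, so the empirical variance of #\hat{\theta}# should be close to #12\theta^2/49#.
```python
import numpy as np

# Continuation of the earlier illustrative sketch: check V(theta_hat) = 12*theta^2/49.
# Assumption: X_i ~ Exponential(mean = 2*theta), which has variance (2*theta)^2 = 4*theta^2.
rng = np.random.default_rng(1)
theta = 3.0
X = rng.exponential(scale=2 * theta, size=(500_000, 3))
theta_hat = X.sum(axis=1) / 7

print("empirical V(theta_hat):", theta_hat.var())       # close to 12*theta^2/49 ≈ 2.204
print("theoretical 12*theta^2/49:", 12 * theta**2 / 49)
```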
#\text{}#
This brings up the question: How do we decide between an estimator with lower bias but higher variance and an estimator with higher bias but lower variance? We need some way of balancing these two measures of quality. The answer is the Mean Square Error.
Mean Square Error
The Mean Square Error (MSE) is defined as:
\[MSE(\hat{\theta}) = E\Big[(\hat{\theta} - \theta)^2\Big]\] With a bit of manipulation, this formula can be rewritten as:
\[MSE(\hat{\theta}) = V(\hat{\theta}) + \Big(B(\hat{\theta})\Big)^2\] We will judge one estimator to be superior to another estimator if its MSE is smaller.
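The manipulation is not carried out in the text; one way to see it is to add and subtract #E(\hat{\theta})# inside the square, the same kind of trick used later in this section for #s^2#:
\[\begin{array}{rcl}
MSE(\hat{\theta}) &=& E\Big[\big(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\big)^2\Big]\\\\
&=& E\Big[\big(\hat{\theta} - E(\hat{\theta})\big)^2\Big] + 2\,B(\hat{\theta})\,E\Big[\hat{\theta} - E(\hat{\theta})\Big] + \Big(B(\hat{\theta})\Big)^2\\\\
&=& V(\hat{\theta}) + 0 + \Big(B(\hat{\theta})\Big)^2
\end{array}\]
since #E(\hat{\theta}) - \theta = B(\hat{\theta})# is a constant and #E\big[\hat{\theta} - E(\hat{\theta})\big] = 0#.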
Going back to our example, where #\hat{\theta} = \cfrac{X_1+X_2+X_3}{7}#, we found that: \[B(\hat{\theta}) = -\cfrac{\theta}{7} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \text{and}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, V(\hat{\theta}) = \cfrac{12\theta^2}{49}\]
Then
\[MSE(\hat{\theta}) = \cfrac{12\theta^2}{49} + \Big(-\cfrac{\theta}{7}\Big)^2 = \cfrac{12\theta^2}{49} + \cfrac{\theta^2}{49} = \cfrac{13\theta^2}{49}\] If we encounter another estimator with a smaller MSE, we would consider that estimator to be preferable.
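Continuing the illustrative simulation (still assuming, for the sake of the example, an exponential distribution with mean #2\theta#), the MSE can be estimated directly as the average of #(\hat{\theta} - \theta)^2# and compared with #13\theta^2/49#.
```python
import numpy as np

# Direct Monte Carlo estimate of MSE(theta_hat) = E[(theta_hat - theta)^2].
# Assumption (illustrative only): X_i ~ Exponential(mean = 2*theta).
rng = np.random.default_rng(2)
theta = 3.0
X = rng.exponential(scale=2 * theta, size=(500_000, 3))
theta_hat = X.sum(axis=1) / 7

mse_empirical = np.mean((theta_hat - theta) ** 2)
print("empirical MSE:", mse_empirical)                  # close to 13*theta^2/49 ≈ 2.388
print("theoretical 13*theta^2/49:", 13 * theta**2 / 49)
```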
#\text{}#
Recall that the sample variance formula is: \[s^2 = \cfrac{1}{n - 1} \displaystyle\sum_{i=1}^n(X_i - \bar{X})^2\] The sample variance #s^2# is an estimator of the population variance #\sigma^2#. The question often arises: Why do we divide by #n - 1# rather than by #n#? To understand the answer, let us compute the bias of #s^2#.
Estimating the Population Variance
Let #\mu = E(X)#. By definition \[\sigma^2 = E\Big[(X - \mu)^2\Big]\] so \[E\Big[(X_i - \mu)^2\Big] = \sigma^2\] for each #i#, and thus \[\displaystyle\sum_{i=1}^nE\Big[(X_i - \mu)^2\Big] = n\sigma^2\] Likewise, \[V(\bar{X}) = E\Big[(\bar{X} - \mu)^2\Big] = \cfrac{\sigma^2}{n}\] Also, \[\displaystyle\sum_{i=1}^n(X_i - \mu) = \displaystyle\sum_{i=1}^nX_i - \displaystyle\sum_{i=1}^n\mu = n\bar{X} - n\mu\]
Then:
\[\begin{array}{rcl}
E(s^2) &=& E\Bigg(\cfrac{1}{n - 1} \displaystyle\sum_{i=1}^n(X_i - \bar{X})^2\Bigg)\\\\
&=& \cfrac{1}{n - 1} \displaystyle\sum_{i=1}^nE\Big[(X_i - \bar{X})^2\Big]\\\\
&=&\cfrac{1}{n - 1} \displaystyle\sum_{i=1}^nE\Big[(X_i - \mu + \mu - \bar{X})^2\Big] \,\,\,\,\,\,\,\,\text{(note the clever trick!)}\\\\
&=& \cfrac{1}{n - 1} \displaystyle\sum_{i=1}^n E\Big[(X_i - \mu)^2 + 2(X_i - \mu)(\mu - \bar{X}) + (\mu - \bar{X})^2\Big]\\\\
&=& \cfrac{1}{n - 1} \Bigg\{\displaystyle\sum_{i=1}^nE\Big[(X_i - \mu)^2\Big] + 2E\Bigg[(\mu - \bar{X})\displaystyle\sum_{i=1}^n(X_i - \mu)\Bigg] + nE\Big[ (\mu - \bar{X})^2\Big] \Bigg\}\\\\
&=& \cfrac{1}{n - 1} \bigg\{ n\sigma^2 + 2E\Big[(\mu - \bar{X})(n\bar{X} - n\mu)\Big] + n\Big(\cfrac{\sigma^2}{n}\Big)\bigg\}\\\\
&=& \cfrac{1}{n-1} \bigg\{ n\sigma^2 - 2nE\Big[(\bar{X} - \mu)^2\Big] + \sigma^2 \bigg\}\\\\
&=& \cfrac{1}{n - 1} \bigg\{ n\sigma^2 - 2n\Big(\cfrac{\sigma^2}{n}\Big) + \sigma^2 \bigg\}\\\\
&=& \cfrac{1}{n - 1} \Big\{ n\sigma^2 - 2\sigma^2 + \sigma^2 \Big\}\\\\
&=& \cfrac{(n-1)\sigma^2}{n - 1} \\\\
&=& \sigma^2
\end{array}\] This shows that #s^2# is an unbiased estimator of #\sigma^2#. If we divided by #n# rather than by #n - 1#, the expectation would be #\cfrac{(n-1)\sigma^2}{n}#, so the estimator would be biased, systematically underestimating #\sigma^2#.
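A short simulation sketch makes the effect of the divisor visible. Purely for illustration, it assumes a standard normal population (so #\sigma^2 = 1#) and compares the #n - 1# and #n# divisors over many samples.
```python
import numpy as np

# Compare the n-1 divisor (unbiased) with the n divisor (biased) for the sample variance.
# Assumption (illustrative only): population is standard normal, so sigma^2 = 1.
rng = np.random.default_rng(3)
n = 5
samples = rng.standard_normal(size=(200_000, n))

s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divide by n

print("mean of s^2 (divide by n-1):", s2_unbiased.mean())  # close to 1.0
print("mean when dividing by n:", s2_biased.mean())        # close to (n-1)/n = 0.8
```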