The focus of this topic is measuring variability and error in estimation. To do so, we will investigate three properties of estimators: bias, variance, and mean squared error.
The expected value of some function of a discrete random variable is defined as
\[ \text{E}[g(X)] \triangleq \sum_{x} g(x)p(x) \]
For continuous random variables we have a similar definition.
\[ \text{E}[g(X)] \triangleq \int_{-\infty}^{\infty} g(x)f(x) dx \]
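Both definitions can be checked numerically. Below is a minimal Python sketch (the fair die, the choice \(g(x) = x^2\), and the use of SciPy are illustrative assumptions, not part of the notes) that evaluates the discrete sum directly and approximates the continuous integral.

```python
import numpy as np
from scipy import integrate, stats

# Discrete case: E[g(X)] = sum over x of g(x) p(x), for a fair six-sided die and g(x) = x^2
x_vals = np.arange(1, 7)
p_vals = np.full(6, 1 / 6)
print(np.sum(x_vals ** 2 * p_vals))    # 91 / 6, about 15.17

# Continuous case: E[g(X)] = integral of g(x) f(x) dx, for X ~ N(0, 1) and g(x) = x^2
val, _ = integrate.quad(lambda x: x ** 2 * stats.norm.pdf(x), -np.inf, np.inf)
print(val)                             # about 1, the variance of a standard normal
```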
Often of interest are two particular expectations: the mean and the variance.
The mean of a random variable is defined to be
\[ \mu_{X} = \text{mean}[X] \triangleq \text{E}[X] \]
For a discrete random variable we would have
\[ \mu_{X} = \text{mean}[X] = \sum_{x} x \cdot p(x) \]
For a continuous random variable, we would essentially replace the sum with an integral.
The variance of a random variable \(X\) is given by
\[ \sigma^2_{X} = \text{var}[X] \triangleq \text{E}[(X - \mu_X)^2] = \text{E}[X^2] - (\mu_X)^2. \]
The standard deviation of a random variable \(X\) is given by
\[ \sigma_{X} = \text{sd}[X] \triangleq \sqrt{\sigma^2_{X}} = \sqrt{\text{var}[X]}. \]
The covariance of random variables \(X\) and \(Y\) is given by
\[ \text{cov}[X, Y] \triangleq \text{E}[(X - \text{E}[X])(Y - \text{E}[Y])] = \text{E}[XY] - \text{E}[X] \cdot \text{E}[Y]. \]
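These quantities, and the alternate forms of the variance and covariance, are easy to verify by simulation. A minimal sketch, assuming NumPy and an arbitrarily chosen pair of correlated normal variables:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# X has mean 2 and variance 4; Y = X + independent noise, so cov[X, Y] = var[X] = 4
x = rng.normal(loc=2, scale=2, size=n)
y = x + rng.normal(loc=0, scale=1, size=n)

# var[X] = E[X^2] - (E[X])^2, compared against NumPy's direct computation
print(np.mean(x ** 2) - np.mean(x) ** 2, np.var(x))

# cov[X, Y] = E[XY] - E[X] E[Y], compared against NumPy's direct computation
print(np.mean(x * y) - np.mean(x) * np.mean(y), np.cov(x, y, ddof=0)[0, 1])
```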
If \(X\) is a random variable with mean \(\text{E}[X]\) and variance \(\text{Var}[X]\), then
\[ \text{E}[a X + b] = a \cdot \text{E}[X] + b \]
\[ \text{Var}[a X + b] = a^2 \cdot \text{Var}[X] \]
If \(X\) and \(Y\) are random variables with means \(\mu_X\) and \(\mu_Y\), variances \(\sigma^2_X\) and \(\sigma^2_Y\), and covariance \(\sigma_{XY} = \text{cov}[X, Y]\), then
\[ \text{E}[a X + b Y + c] = a\mu_X + b\mu_Y +c \]
and
\[ \text{Var}[a X + b Y + c] = a^2\sigma^2_X + b^2\sigma^2_Y + 2ab\sigma_{XY}. \]
If \(X\) and \(Y\) are independent random variables, then \(\text{cov}[X, Y] = 0.\) (The reverse is not necessarily true.)
Thus, if \(X\) and \(Y\) are independent, the above becomes
\[ \text{Var}[a X + b Y + c] = a^2\sigma^2_X + b^2\sigma^2_Y. \]
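A quick simulation check of these rules is given below; this is a hypothetical sketch assuming NumPy, with the constants and distributions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
a, b, c = 2.0, -3.0, 5.0

# Correlated X and Y with var[X] = 4, var[Y] = 2, cov[X, Y] = 2
z = rng.normal(size=n)
x = 1.0 + 2.0 * z
y = -1.0 + z + rng.normal(size=n)

w = a * x + b * y + c
var_formula = a**2 * 4 + b**2 * 2 + 2 * a * b * 2   # a^2 var[X] + b^2 var[Y] + 2ab cov[X, Y]
print(np.var(w), var_formula)                       # both should be close to 16 + 18 - 24 = 10
```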
If \(X\) and \(Y\) are independent with \(X \sim N(\mu_X, \sigma^2_X)\) and \(Y \sim N(\mu_Y, \sigma^2_Y)\), then
\[ a X + b Y + c \sim N(a\mu_X + b\mu_Y + c,\ a^2\sigma^2_X + b^2\sigma^2_Y) \]
The above rules can often be chained together when considering more than two random variables.
A couple of general consequences of the above:
If \(X_1, X_2, \ldots X_n\) is a random sample (thus \(X_1, X_2, \ldots X_n\) are IID) from some population with finite mean \(\mu\) and variance \(\sigma^2\), then the sample mean,
\[ \bar{X} = \bar{X}(X_1, X_2, \ldots X_n) = \frac{1}{n}\sum_{i = 1}^{n} X_i \]
has the following properties
\[ \text{E}[\bar{X}] = \mu \]
\[ \text{Var}[\bar{X}] = \frac{\sigma^2}{n} \]
If additionally each \(X_i\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), then
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
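These properties of \(\bar{X}\) can be seen by simulating many samples. A minimal sketch, assuming NumPy and arbitrarily chosen values of \(\mu\), \(\sigma\), and \(n\):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 10.0, 3.0, 25, 100_000

# Each row is one random sample of size n; each row mean is one realization of the sample mean
samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))
xbar = samples.mean(axis=1)

print(xbar.mean(), mu)             # E[X-bar] should be close to mu
print(xbar.var(), sigma**2 / n)    # Var[X-bar] should be close to sigma^2 / n = 0.36
```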
An estimator is just a fancy name for a statistic that attempts to estimate a parameter of interest. Thus, like statistics, estimators are random variables that have distributions.
When we write
\[ \hat{\theta} = f(X_1, X_2, \ldots, X_n) = \frac{1}{n}\sum_{i = 1}^{n} X_i \]
this is an estimator. It is a function of random variables, so \(\hat{\theta}\) itself is a random variable which has a distribution. We think of estimators this way when we want to discuss the statistical properties of an estimator based on the fact that the estimator could be applied to any possible (random) sample.
When we write
\[ \hat{\theta} = \frac{1}{n}\sum_{i = 1}^{n} x_i = 5 \]
this is an estimate. This is the result of applying the estimator to a particular sample. Since we have the sample data and it is no longer a potential sample that has uncertainty, it is not random. An estimate does not have a distribution.
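The distinction is easy to see in code: the same function acts as an estimator when applied to many potential (simulated) samples, and yields an estimate when applied to one observed sample. A hypothetical sketch, assuming NumPy; the exponential population and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def theta_hat(sample):
    """The estimator: a function of the data, here simply the sample mean."""
    return np.mean(sample)

# The estimator as a random variable: one value per potential (simulated) sample
many_samples = rng.exponential(scale=5.0, size=(10_000, 30))
estimator_draws = np.apply_along_axis(theta_hat, 1, many_samples)
print(estimator_draws.std())   # the estimator varies from sample to sample

# An estimate: the estimator applied to one particular observed sample
observed = rng.exponential(scale=5.0, size=30)
print(theta_hat(observed))     # a single fixed number; it has no distribution
```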
As a general estimation setup, we will most often consider a random sample \(X_1, X_2, \ldots X_n\) from some population with finite mean \(\mu\) and variance \(\sigma^2\).
The bias of estimating a parameter \(\theta\) using the estimator \(\hat{\theta}\) is defined as
\[ \text{bias}\left[\hat{\theta}\right] \triangleq \text{E}\left[\hat{\theta}\right] - \theta \]
Often, when we calculate bias, it will be a function of the true parameter \(\theta\) and possibly the sample size \(n\).
The variance of estimating a parameter \(\theta\) using the estimator \(\hat{\theta}\) is defined as
\[ \text{var}\left[\hat{\theta}\right] \triangleq \text{E}\left[\left(\hat{\theta} - \text{E}\left[\hat{\theta}\right]\right) ^ 2\right] \]
Technically, this result does not depend on \(\theta\) at all. The variance of an estimator measures how variable the estimator is about its own expected value, not about the true value of the parameter.
Often, when we calculate variance, it will be a function of the population variance \(\sigma^2\) and the sample size \(n\).
The mean squared error of estimating a parameter \(\theta\) using the estimator \(\hat{\theta}\) is defined as
\[ \text{MSE}\left[\hat{\theta}\right] = \text{E}\left[(\hat{\theta} - \theta) ^ 2\right] = \left( \text{bias}\left[\hat{\theta}\right] \right)^2 + \text{var}\left[\hat{\theta}\right] \]
The mean squared error does measure how variable the estimator is, this time about the true value of the parameter; it is essentially the average squared error. When an estimator is unbiased, the mean squared error is equal to the variance.
Often, when we calculate mean squared error, it will be a function of the true parameter \(\theta\), the population variance \(\sigma^2\), and the sample size \(n\).
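All three quantities can be approximated by simulation. As a hedged illustration (assuming NumPy; the normal population and sample size are arbitrary), the sketch below compares the variance estimator that divides by \(n\) with the one that divides by \(n - 1\) when estimating \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
v_biased = samples.var(axis=1, ddof=0)     # sample variance dividing by n
v_unbiased = samples.var(axis=1, ddof=1)   # sample variance dividing by n - 1

for name, est in [("divide by n", v_biased), ("divide by n - 1", v_unbiased)]:
    bias = est.mean() - sigma2
    var = est.var()
    mse = np.mean((est - sigma2) ** 2)
    print(name, bias, var, mse, bias**2 + var)   # MSE should match bias^2 + variance
```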
In this topic we introduce two new concepts: consistency and sufficiency. Consistency is another way of evaluating an estimator, this time in an asymptotic sense. Sufficiency both indicates that we are using the available data properly and helps us start thinking about how to create estimators.
An estimator \(\hat{\theta}_n\) is said to be a consistent estimator of \(\theta\) if, for any positive \(\epsilon\),
\[ \lim_{n \rightarrow \infty} P( | \hat{\theta}_n - \theta | \leq \epsilon) =1 \]
or, equivalently,
\[ \lim_{n \rightarrow \infty} P( | \hat{\theta}_n - \theta | > \epsilon) =0 \]
We say that \(\hat{\theta}_n\) converges in probability to \(\theta\) and we write \(\hat{\theta}_n \overset P \rightarrow \theta\).
Theorem: An unbiased estimator \(\hat{\theta}_n\) for \(\theta\) is a consistent estimator of \(\theta\) if
\[ \lim_{n \rightarrow \infty} \text{Var}\left[\hat{\theta}_n\right] = 0 \]
If \(Y_1, Y_2, \ldots, Y_n\) are a random sample such that \(\text{E}[Y_i] = \mu\) and \(\text{Var}[Y_i] = \sigma^2 < \infty\), then
\[ \bar{Y}_n \overset P \rightarrow \mu. \]
(That is \(\bar{Y}_n = \frac{1}{n} \sum_{i=1}^n Y_i\) is a consistent estimator of \(\mu\).)
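This convergence can be illustrated numerically: for a fixed \(\epsilon\), the proportion of simulated samples with \(|\bar{Y}_n - \mu| > \epsilon\) shrinks toward zero as \(n\) grows. A minimal sketch, assuming NumPy and arbitrary values of \(\mu\), \(\sigma\), and \(\epsilon\):

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma, eps, reps = 5.0, 2.0, 0.25, 2_000

for n in [10, 100, 1_000, 10_000]:
    ybar = rng.normal(loc=mu, scale=sigma, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(ybar - mu) > eps))   # estimated P(|Y-bar_n - mu| > eps) shrinks toward 0
```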
Theorem: Suppose that \(\hat{\theta}_n \overset P \rightarrow \theta\) and that \(\hat{\beta}_n \overset P \rightarrow \beta\). Then \(\hat{\theta}_n + \hat{\beta}_n \overset P \rightarrow \theta + \beta\), \(\hat{\theta}_n \hat{\beta}_n \overset P \rightarrow \theta\beta\), \(\hat{\theta}_n / \hat{\beta}_n \overset P \rightarrow \theta / \beta\) provided \(\beta \neq 0\), and \(g(\hat{\theta}_n) \overset P \rightarrow g(\theta)\) for any continuous function \(g\).
Suppose we have a random sample \(Y_1, \ldots, Y_n\) from a \(N(\mu, \sigma^2)\) population, with mean \(\mu\) (unknown) and variance \(\sigma^2\) (known).
To estimate \(\mu\), we have proposed using the sample mean \(\bar{Y}\). This is a nice, intuitive, unbiased estimator of \(\mu\) – but we could ask: does it encode all the information we can glean from the data about the parameter \(\mu\)?
In this model, the answer is: \(\bar{Y}\) does encode all the information in the data about the location of \(\mu\) – there is nothing more we can get from the actual data values \(Y_1, \ldots, Y_n.\)
Definition: Let \(Y_1, \ldots, Y_n\) denote a random sample from a probability distribution with unknown parameter \(\theta.\) Then a statistic \(U = g(Y_1, \ldots, Y_n)\) is said to be sufficient for \(\theta\) if the conditional distribution of \(Y_1, \ldots, Y_n\) given \(U,\) does not depend on \(\theta.\)
Let \(U\) be a statistic based on a random sample \(Y_1, Y_2, \ldots, Y_n\). Then \(U\) is a sufficient statistic for \(\theta\) if and only if the joint probability distribution or density function can be factored into two nonnegative functions,
\[ f(y_1, y_2, \ldots, y_n | \theta) = g(u, \theta) \cdot h(y_1, y_2, \ldots, y_n), \]
where \(g(u,\theta)\) is a function only of \(u\) and \(\theta\) and \(h(y_1, y_2, \ldots, y_n)\) is not a function of \(\theta\).
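As a standard illustration of the factorization criterion, suppose \(Y_1, \ldots, Y_n\) is a random sample from a Bernoulli(\(p\)) distribution. The joint probability function is
\[ f(y_1, y_2, \ldots, y_n | p) = \prod_{i=1}^{n} p^{y_i}(1 - p)^{1 - y_i} = p^{\sum y_i}(1 - p)^{n - \sum y_i}, \]
which factors with \(u = \sum_{i=1}^{n} y_i\), \(g(u, p) = p^{u}(1 - p)^{n - u}\), and \(h(y_1, y_2, \ldots, y_n) = 1\). Hence \(U = \sum_{i=1}^{n} Y_i\) is sufficient for \(p\).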
Any one-to-one function of a sufficient statistic is sufficient.