Overview

This project consists of a large-scale simulation study to investigate how robust the simple linear regression model is to deviations from its assumptions.

Details

The basic simple linear regression model states that for \(i=1, \ldots, n\)

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where the \(\epsilon_i\) are a random sample from a \(N(0, \sigma^2)\) distribution. The goal of this project is to assess how robust our inference about \(\beta_1\) is to deviations from these assumptions.
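
For reference, the standard results under these assumptions can be summarized as follows. This is a recap sketch: it treats the \(x_i\) as fixed, and the symbols \(S_{xx}\), \(s\), and \(\alpha\) are introduced here for convenience and may differ from the notation used in your notes.

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{S_{xx}}, \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

\[ E(\hat{\beta}_1) = \beta_1, \qquad SD(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{xx}}} \]

\[ \hat{\beta}_1 \pm t_{1-\alpha/2,\, n-2} \, \frac{s}{\sqrt{S_{xx}}}, \qquad s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2 \]

The first line is the least-squares estimator, the second gives its theoretical bias (zero) and standard deviation, and the third is the usual \(100(1-\alpha)\%\) confidence interval for \(\beta_1\).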

Objectives

Provide a thorough examination of how well \(\hat{\beta}_1\) performs as an estimator of \(\beta_1\) when some of the assumptions of the simple linear model do not hold. We have derived the bias and the standard deviation of \(\hat{\beta}_1\) under the modeling assumptions above. In this project, you will repeatedly generate data from models that deviate from those assumptions in one or more ways.

For a particular simulation setting, if you run \(B\) simulations, you can save the \(B\) resulting values of \(\hat{\beta}_1\) and calculate their empirical bias (their average minus the true \(\beta_1\)) and empirical standard error (their standard deviation). You can also count how many of the \(B\) confidence intervals, calculated according to the standard formulas, contain the true value of \(\beta_1.\)

By comparing the empirical bias and standard error to their theoretical counterparts, and the observed confidence interval coverage to the nominal coverage, you can evaluate how well the purported properties of \(\hat{\beta}_1\) hold up when different assumptions fail. Some possible deviations to consider:

  1. What if the error distribution is not normal but still symmetric? (e.g., Uniform on [-0.5, 0.5], Double Exponential, Cauchy)
  2. What if the error distribution is skewed?
  3. What if the true relationship is not linear? (e.g., \(Y_i = \beta_0 + \beta_1 x_i +\beta_2 x_i^2 + \epsilon_i\))
  4. What if the \(\epsilon_i\) are not a random sample? (e.g., the individuals may be divided into “families” with each family having its own error distribution)
  5. What if the variability of the \(\epsilon_i\) is not constant?
  6. What if there are outliers? (Outliers may have extreme \(Y\)-values, extreme \(x\)-values, or both.)
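
As one possible starting point, the sketch below simulates the error-distribution deviations in items 1 and 2 using only the Python standard library. Everything in it is an illustrative assumption rather than a required design: the function names (`t_quantile`, `fit_slope`, `simulate`), the fixed equally spaced \(x_i\) on [0, 1], the true values \(\beta_0 = 1\) and \(\beta_1 = 2\), the centered exponential as the skewed example, and the numerically computed \(t\) critical value (a statistics library would normally supply it). The other deviations would need changes to how `y` is generated.

```python
import math
import random
import statistics


def t_quantile(p, df, tol=1e-8):
    """Student-t quantile for p >= 0.5: Simpson integration of the density plus bisection."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(t):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

    def cdf(t, steps=2000):  # steps must be even for Simpson's rule
        h = t / steps
        total = pdf(0.0) + pdf(t)
        for k in range(1, steps):
            total += (4 if k % 2 else 2) * pdf(k * h)
        return 0.5 + total * h / 3

    lo, hi = 0.0, 1.0
    while cdf(hi) < p:  # expand the bracket until it contains the quantile
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def fit_slope(x, y):
    """Return the least-squares slope and its estimated standard error."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1, math.sqrt(sse / (n - 2) / sxx)


def simulate(draw_error, n, B=1000, beta0=1.0, beta1=2.0, level=0.95, seed=1):
    """Empirical bias, standard error, and CI coverage of the slope over B replications."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]  # assumed design: fixed, equally spaced on [0, 1]
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    t_crit = t_quantile(1 - (1 - level) / 2, n - 2)
    estimates, covered = [], 0
    for _ in range(B):
        y = [beta0 + beta1 * xi + draw_error(rng) for xi in x]
        b1, se = fit_slope(x, y)
        estimates.append(b1)
        covered += (b1 - t_crit * se <= beta1 <= b1 + t_crit * se)
    return {
        "bias": statistics.fmean(estimates) - beta1,
        "emp_se": statistics.stdev(estimates),
        "coverage": covered / B,
        "sxx": sxx,  # theoretical SD is sigma / sqrt(sxx) when sigma exists
    }


# Hypothetical menu of error distributions, each centered at zero.
ERRORS = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "uniform": lambda rng: rng.uniform(-0.5, 0.5),
    "double_exp": lambda rng: rng.expovariate(1.0) - rng.expovariate(1.0),
    "cauchy": lambda rng: math.tan(math.pi * (rng.random() - 0.5)),
    "skewed_exp": lambda rng: rng.expovariate(1.0) - 1.0,
}

if __name__ == "__main__":
    for name, draw in ERRORS.items():
        for n in (10, 100):
            r = simulate(draw, n, B=500)
            print(f"{name:>10} n={n:3d} bias={r['bias']:+.3f} "
                  f"se={r['emp_se']:.3f} coverage={r['coverage']:.3f}")
```

The normal case serves as a baseline: its empirical standard error can be checked against \(\sigma / \sqrt{S_{xx}}\) with \(\sigma = 1\), and its coverage should sit near the nominal level. The Cauchy distribution has no finite variance, so no theoretical standard deviation is available for that comparison.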

Evaluate each deviation at several sample sizes, including at least one small and one large sample size. Present your results in graphical and tabular format, and describe your observations. Which assumptions can be relaxed in practice, and which cannot?
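
When describing your observations, it may help to keep in mind that the simulated summaries themselves carry Monte Carlo error. As a rough guide, assuming the \(B\) replications are independent, the observed coverage \(\hat{p}\) (the notation is introduced here) has an approximate standard error of

\[ SE(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{B}} \]

For illustration, with \(B = 1000\) and coverage near 0.95 this is about \(\sqrt{0.95 \times 0.05 / 1000} \approx 0.007\), so an observed coverage of 0.94 would not by itself indicate a departure from the nominal level, while 0.85 clearly would.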