Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.

Hofstadter’s Law: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”

Douglas Hofstadter, *Gödel, Escher, Bach: An Eternal Golden Braid*

For this homework we will use the `Sacramento` data from the `caret` package. You should read the documentation for this data. The **goal** of our modeling will be to predict home prices.

You may only use the following packages:

```
library(caret)
library(randomForest)
library(tidyverse)
library(knitr)
library(kableExtra)
```

Before modeling, we will perform some data preparation.

Instead of using the `city` or `zip` variables that exist in the dataset, we will simply create a variable indicating whether or not a house is technically within the city limits of Sacramento. (We do this because they would both be factor variables with a large number of levels. This is a choice that is made due to laziness, not because it is justified. Think about what issues these variables might cause.)

```
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))
```

A plot of longitude versus latitude gives us a sense of where the city limits are.

```
qplot(y = longitude, x = latitude, data = sac_data,
      col = limits, main = "Sacramento City Limits")
```

You should consider performing some additional exploratory data analysis, but we provide a histogram of the home prices.

`qplot(x = price, data = sac_data, main = "Sacramento Home Prices")`

After these modifications, we test-train split the data.

```
set.seed(42)
sac_trn_idx = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]
```

The training data should be used for all model fitting.

For this exercise, we will create linear models in an attempt to be able to predict `price`, **without** the use of the `limits`, `latitude`, or `longitude` variables. Do not modify `sac_trn_data`.

With the available variables, fit the following models:

- An additive model using all *available* predictors
- A model using *only* `sqft` as a predictor
- \(\texttt{price} = \beta_0 + \beta_1 \texttt{sqft} + \beta_2 \texttt{multi} + \beta_3 \texttt{res} + \epsilon\)
- \(\texttt{price} = \beta_0 + \beta_1 \texttt{sqft} + \beta_2 \texttt{multi} + \beta_3 \texttt{res} + \beta_4 (\texttt{sqft}\times\texttt{multi}) + \beta_5 (\texttt{sqft}\times\texttt{res}) + \epsilon\)

Here, `res` is a dummy variable indicating whether or not `type` is `Residential`. Similarly, `multi` is a dummy variable indicating whether or not `type` is `Multi_Family`.
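To make the dummy variable notation concrete, here is a minimal sketch on a made-up factor. It is not the homework data: `toy_type` and `toy_design` are hypothetical names, and the third level `Condo` is an assumption about the levels of `type` that you should confirm in the data documentation. The sketch compares dummy variables written out by hand with the columns R creates when a factor appears in a model formula.

```r
# Toy factor with made-up values; the level "Condo" is an assumption
toy_type = factor(c("Condo", "Multi_Family", "Residential", "Residential"))

# Dummy variables written out by hand (1 if the condition holds, else 0)
res   = as.numeric(toy_type == "Residential")
multi = as.numeric(toy_type == "Multi_Family")

# The design matrix R builds from the factor; the first level alphabetically
# ("Condo" here) is the reference level and gets no column of its own
toy_design = model.matrix(~ toy_type)

print(cbind(res, multi))
print(toy_design)
```

The hand-made columns match the columns of the design matrix, which suggests that the dummy variables do not need to be added to the data by hand.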

Summarize these models using a well-formatted markdown table which includes columns for:

- Model (written using `R`’s formula syntax)
- Train RMSE
- Test RMSE
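As a rough sketch of how such a summary could be assembled, the code below defines a small RMSE helper and builds a one-row results table. It uses the built-in `mtcars` data rather than the homework data, and the names `calc_rmse`, `toy_trn`, `toy_tst`, `toy_fit`, and `toy_results` are hypothetical.

```r
# Root mean squared error between observed and predicted values
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Toy illustration on the built-in mtcars data (not the homework data)
set.seed(1)
toy_idx = sample(nrow(mtcars), size = trunc(0.80 * nrow(mtcars)))
toy_trn = mtcars[toy_idx, ]
toy_tst = mtcars[-toy_idx, ]

# Fit on the training rows only
toy_fit = lm(mpg ~ wt + hp, data = toy_trn)

# One row per model; check.names = FALSE keeps the spaces in column names
toy_results = data.frame(
  Model = "mpg ~ wt + hp",
  `Train RMSE` = calc_rmse(toy_trn$mpg, predict(toy_fit, toy_trn)),
  `Test RMSE` = calc_rmse(toy_tst$mpg, predict(toy_fit, toy_tst)),
  check.names = FALSE
)
print(toy_results)
```

A data frame like `toy_results` can then be passed to `knitr::kable()` to render a markdown table.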

For this exercise, we will create models in an attempt to be able to predict `price`, using all available predictors.

Fit a total of four models:

- `add`: an *additive* linear model using all *available* predictors
- `int`: a linear model using all the main effects of all *available* predictors, as well as all possible two-way *interactions*
- `user`: a linear model that performs better than `add` and `int`
- `rf`: a *random forest* which uses all available predictors, fit using the `randomForest()` function from the `randomForest` package with all default arguments. To specify the predictors used, use the formula syntax for an additive model.
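The difference between an additive formula and one with all two-way interactions can be sketched with R's formula shorthand. This is only an illustration on the built-in `mtcars` data, not the homework data, and `toy_data`, `toy_add`, and `toy_int` are hypothetical names.

```r
# Toy illustration on the built-in mtcars data (not the homework data)
toy_data = mtcars[, c("mpg", "wt", "hp", "qsec")]

# Additive model: `.` stands for every column other than the response
toy_add = lm(mpg ~ ., data = toy_data)

# Main effects plus all possible two-way interactions
toy_int = lm(mpg ~ . ^ 2, data = toy_data)

# Compare which terms each model estimates
print(names(coef(toy_add)))
print(names(coef(toy_int)))
```

With three predictors, the additive fit estimates an intercept and three slopes, while the interaction fit adds one coefficient for each pair of predictors.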

Summarize these models using a well-formatted markdown table which includes columns for:

- Model Name
- Model Type (`lm` or `rf`)
- Variables Used (may use formula syntax)
- Train RMSE
- Test RMSE

Re-fit each of the models from Exercise 2, but with a log transformation applied to the response. (**Do not modify the data to do so.**)

Summarize the results of these four models using a well-formatted markdown table which includes columns for:

- Model Name (append **log_** to the start of the previous names)
- Model Type (`lm` or `rf`)
- Variables Used (may use formula syntax)
- Train RMSE (on the data scale, that is, units of **dollars**)
- Test RMSE (on the data scale, that is, units of **dollars**)
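One way to transform the response without modifying the data is to apply `log()` inside the model formula, then undo it with `exp()` before computing RMSE so that the error is in the original units. The following is a minimal sketch on the built-in `mtcars` data, not the homework data. The names `calc_rmse`, `toy_log_fit`, `toy_pred_log`, `toy_pred`, and `toy_rmse` are hypothetical.

```r
# Root mean squared error between observed and predicted values
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Toy illustration on the built-in mtcars data (not the homework data)
# The transformation lives in the formula, so mtcars itself is unchanged
toy_log_fit = lm(log(mpg) ~ wt + hp, data = mtcars)

# predict() returns values on the log scale; exp() maps them back
toy_pred_log = predict(toy_log_fit, mtcars)
toy_pred = exp(toy_pred_log)

# RMSE in the original units of the response
toy_rmse = calc_rmse(mtcars$mpg, toy_pred)
print(toy_rmse)
```

Computing RMSE directly from `toy_pred_log` would instead give an error on the log scale, which is not comparable to the RMSE values of the untransformed models.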

**[a]** Which model in Exercise 1 performs best?

**[b]** Which model in Exercise 2 performs best?

**[c]** Which model in Exercise 3 performs best?

**[d]** Does the log transformation appear justified? Explain.

**[e]** Does location appear to be helpful for predicting price? Explain.

**[f]** Suggest a reason for performing this **analysis**. The reason we are creating the **models** is to predict price. From an analysis perspective, why might these predictions be useful?

**[g]** With your answer to part **[f]** in mind, is the best model we found at all useful? Explain.