Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.

“Statisticians, like artists, have the bad habit of falling in love with their models.”

For this homework, you may only use the following packages:

```
# general
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
# specific
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
```

If you feel additional general packages would be useful for future homework, please pass these along to the instructor.

**[7 points]** For this question we will use the data in `leukemia.csv`, which originates from Golub et al. 1999.

The response variable `class` is a categorical variable. There are two possible responses: `ALL` (acute lymphoblastic leukemia) and `AML` (acute myeloid leukemia), both types of leukemia. We will use the many feature variables, which are expression levels of genes, to predict these classes.

Note that this dataset is rather large, and you may have difficulty loading it using the “Import Dataset” feature in RStudio. Instead, place the file in the same folder as your `.Rmd` file and run the following command. (Which you should be doing anyway.) Again, since this dataset is large, use 5-fold cross-validation when needed.

`leukemia = read_csv("leukemia.csv", progress = FALSE)`

For use with the `glmnet` package, it will be useful to create a factor response variable `y` and a feature matrix `X` as seen below. We won’t test-train split the data since there are so few observations.

```
y = as.factor(leukemia$class)
X = as.matrix(leukemia[, -1])
```

Do the following:

- Set a seed equal to your UIN.
- Fit the full path of a logistic regression with both a lasso penalty and a ridge penalty. (Don’t use cross-validation. Also let `glmnet` choose the \(\lambda\) values.) Create side-by-side plots that show the features entering (or leaving) the models.
- Use cross-validation to tune a logistic regression with a lasso penalty. Again, let `glmnet` choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance and the largest \(\lambda\) whose deviance is within one standard error of that minimum. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with `train()` in `caret`. Use `train()` to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
- Use cross-validation to tune a logistic regression with a ridge penalty. Again, let `glmnet` choose the \(\lambda\) values. Store both the \(\lambda\) that minimizes the deviance and the largest \(\lambda\) whose deviance is within one standard error of that minimum. Create a plot of the deviances for each value of \(\lambda\) considered. Use these two \(\lambda\) values to create a grid for use with `train()` in `caret`. Use `train()` to get cross-validated classification accuracy for these two values of \(\lambda\). Store these values.
- Use cross-validation to tune \(k\)-nearest neighbors using `train()` in `caret`. Do not specify a grid of \(k\) values to try; let `caret` do so automatically. (It will use 5, 7, 9.) Store the cross-validated accuracy for each. Scale the predictors.
- Summarize these **seven** models in a table. (Two lasso, two ridge, three knn.) For each, report the cross-validated accuracy and the standard deviation of the accuracy.
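The lasso steps in the list above can be sketched roughly as follows. This is only an illustration on simulated data, not a solution: `X_sim`, `y_sim`, `cv_5`, `lasso_grid`, and `lasso_acc` are hypothetical names, and the simulated matrix merely stands in for the gene expression features. Setting `alpha = 0` instead of `alpha = 1` would give the ridge analogue.

```r
library(glmnet)
library(caret)

# Simulated stand-in for the real data (an assumption, for illustration only)
set.seed(42)
n = 80
p = 200
X_sim = matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X_sim) = paste0("gene_", seq_len(p))
y_sim = factor(ifelse(2 * X_sim[, 1] - 2 * X_sim[, 2] + rnorm(n) > 0, "AML", "ALL"))

# Full regularization paths without cross-validation; glmnet picks the lambdas
fit_lasso = glmnet(X_sim, y_sim, family = "binomial", alpha = 1)
fit_ridge = glmnet(X_sim, y_sim, family = "binomial", alpha = 0)

# Side-by-side coefficient paths
par(mfrow = c(1, 2))
plot(fit_lasso, xvar = "lambda")
plot(fit_ridge, xvar = "lambda")
par(mfrow = c(1, 1))

# 5-fold CV; deviance is the default loss for family = "binomial"
cv_lasso = cv.glmnet(X_sim, y_sim, family = "binomial", alpha = 1, nfolds = 5)
plot(cv_lasso)

# Grid built from the two stored lambda values, in the form caret expects
lasso_grid = expand.grid(
  alpha  = 1,
  lambda = c(cv_lasso$lambda.min, cv_lasso$lambda.1se)
)

# Cross-validated accuracy (and its standard deviation) for each lambda
cv_5 = trainControl(method = "cv", number = 5)
lasso_acc = train(
  x = X_sim, y = y_sim,
  method = "glmnet", trControl = cv_5, tuneGrid = lasso_grid
)
lasso_acc$results[, c("lambda", "Accuracy", "AccuracySD")]
```

The `results` data frame of a `train()` object carries both `Accuracy` and `AccuracySD`, which are the two quantities the summary table asks for.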

**[5 points]** For this exercise, we will use the `College` data from the `ISLR` package. Familiarize yourself with this dataset before performing analyses. We will attempt to predict the `Outstate` variable.

Test-train split the data using this code.

```
set.seed(42)
index = createDataPartition(College$Outstate, p = 0.75, list = FALSE)
college_trn = College[index, ]
college_tst = College[-index, ]
```

Train a total of **six** models using five-fold cross validation.

- An additive linear model.
- An elastic net model using additive predictors. Use a `tuneLength` of `10`.
- An elastic net model that also considers all two-way interactions. Use a `tuneLength` of `10`.
- A well-tuned KNN model.
- A well-tuned KNN model that also considers all two-way interactions. (Should this work?)
- A default-tuned random forest.

Before beginning, set a seed equal to your UIN.

```
uin = 123456789
set.seed(uin)
```

- Create a table which reports CV and Test RMSE for each.
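As one possible shape for a single row of that table, the sketch below fits only the elastic net with two-way interactions and computes its CV and test RMSE. The names `cv_5`, `enet_int`, `rmse`, `cv_rmse`, and `tst_rmse` are hypothetical, and the seed is the placeholder UIN shown above rather than a real one.

```r
library(caret)
library(ISLR)

# Same split as given in the exercise
set.seed(42)
index = createDataPartition(College$Outstate, p = 0.75, list = FALSE)
college_trn = College[index, ]
college_tst = College[-index, ]

# Five-fold cross-validation, shared by every model in the exercise
cv_5 = trainControl(method = "cv", number = 5)

# Elastic net with all main effects and two-way interactions
set.seed(123456789)  # placeholder UIN
enet_int = train(
  Outstate ~ . ^ 2, data = college_trn,
  method = "glmnet", trControl = cv_5, tuneLength = 10
)

# Helper for root mean squared error (hypothetical name)
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# CV RMSE of the best tuning-parameter combination, and test RMSE
cv_rmse  = min(enet_int$results$RMSE)
tst_rmse = rmse(college_tst$Outstate, predict(enet_int, newdata = college_tst))
```

The same `cv_5` object and `rmse()` helper could be reused for the other five models, so that all CV results come from the same resampling scheme.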

**[5 points]** For this exercise we will create data via simulation, then assess how well certain methods perform. Use the code below to create a train and test dataset.

```
set.seed(42)
sim_trn = mlbench.spirals(n = 2500, cycles = 1.5, sd = 0.125)
sim_trn = data.frame(sim_trn$x, class = as.factor(sim_trn$classes))
sim_tst = mlbench.spirals(n = 10000, cycles = 1.5, sd = 0.125)
sim_tst = data.frame(sim_tst$x, class = as.factor(sim_tst$classes))
```

The training data is plotted below, with colors indicating the `class` variable, which is the response.
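A plot of this kind could be produced with base graphics along the following lines. This is a sketch only: the colors, point style, and axis labels are assumptions, and the two feature columns are referenced by position rather than by name.

```r
library(mlbench)

# Recreate the training data exactly as in the exercise
set.seed(42)
sim_trn = mlbench.spirals(n = 2500, cycles = 1.5, sd = 0.125)
sim_trn = data.frame(sim_trn$x, class = as.factor(sim_trn$classes))

# Scatter plot of the two features, colored by class (colors are an assumption)
plot(
  sim_trn[, 1], sim_trn[, 2],
  col  = c("dodgerblue", "darkorange")[as.integer(sim_trn$class)],
  pch  = 20,
  xlab = "First feature", ylab = "Second feature",
  main = "Simulated training data"
)
```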