Please see the homework policy document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.
“The fool wonders, the wise man asks.”
— Benjamin Disraeli
This homework will use data in hw01-trn-data.csv and hw01-tst-data.csv, which are train and test datasets respectively. Both datasets contain a single predictor x and a numeric response y. The following chunk imports this data.
hw01_trn_data = read.csv("hw01-trn-data.csv")
hw01_tst_data = read.csv("hw01-tst-data.csv")
For this assignment, you may only use the following packages:
library(FNN)
library(rpart)
library(knitr)
library(kableExtra)
Fit a total of five polynomial models to the training data that can be used to predict y from x. Use polynomial degrees of 1, 3, 5, 7, and 9. For each, calculate both train and test RMSE. Do not output these results directly; instead, summarize them with a single well-labeled plot that shows both train and test RMSE as a function of the degree of the polynomial fit.
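One way to organize the polynomial fits is sketched below. This is a minimal illustration on simulated data, not a reference solution: the sim_data() and rmse() helpers, the cubic signal, the colors, and all object names are assumptions made for the example. Only base R functions are used.

```r
# Simulated stand-in data (assumption); the assignment itself uses the
# hw01-trn-data.csv and hw01-tst-data.csv files.
set.seed(42)
sim_data = function(n) {
  x = runif(n, min = -2, max = 2)
  data.frame(x = x, y = x ^ 3 - x + rnorm(n, sd = 0.5))
}
trn_data = sim_data(200)
tst_data = sim_data(200)

# Root mean squared error between observed and predicted values
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

degrees = c(1, 3, 5, 7, 9)

# Fit one polynomial model per degree, using the training data only
poly_fits = lapply(degrees, function(d) {
  lm(y ~ poly(x, degree = d), data = trn_data)
})

# Evaluate every fitted model on both datasets
poly_trn_rmse = sapply(poly_fits, function(fit) {
  rmse(trn_data$y, predict(fit, trn_data))
})
poly_tst_rmse = sapply(poly_fits, function(fit) {
  rmse(tst_data$y, predict(fit, tst_data))
})

# Summarize both curves in a single labeled plot
plot(degrees, poly_tst_rmse, type = "b", col = "darkorange",
     ylim = range(c(poly_trn_rmse, poly_tst_rmse)),
     xlab = "Polynomial degree", ylab = "RMSE",
     main = "Train and test RMSE by polynomial degree")
lines(degrees, poly_trn_rmse, type = "b", col = "dodgerblue", lty = 2)
legend("topright", legend = c("Test", "Train"),
       col = c("darkorange", "dodgerblue"), lty = c(1, 2), pch = 1)
```

Because the polynomial models are nested, train RMSE cannot increase as the degree grows. Test RMSE carries no such guarantee, which is why both curves belong on the same plot.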
Fit a total of five KNN models to the training data that can be used to predict y from x. Use k (number of neighbors) values of 1, 11, 21, 31, and 41. For each, calculate both train and test RMSE. Do not output these results directly; instead, summarize them using a well-formatted markdown table that shows k, train RMSE, and test RMSE.
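A possible structure for the KNN fits is sketched below, again on simulated data rather than the assignment files. The sim_data(), rmse(), and knn_rmse() helpers and the column labels are assumptions made for the example. One detail worth noting: when the test argument of knn.reg() is omitted, it performs leave-one-out cross-validation, so the sketch passes the training predictor explicitly as test to obtain an ordinary train RMSE.

```r
library(FNN)

# Simulated stand-in data (assumption); the assignment itself uses the
# hw01-trn-data.csv and hw01-tst-data.csv files.
set.seed(42)
sim_data = function(n) {
  x = runif(n, min = -2, max = 2)
  data.frame(x = x, y = x ^ 3 - x + rnorm(n, sd = 0.5))
}
trn_data = sim_data(200)
tst_data = sim_data(200)

# Root mean squared error between observed and predicted values
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

k_values = c(1, 11, 21, 31, 41)

# Hypothetical helper: RMSE on new_data of a k-nearest-neighbors fit
# that always uses the training data as its neighbor pool
knn_rmse = function(k, new_data) {
  pred = knn.reg(train = trn_data["x"], test = new_data["x"],
                 y = trn_data$y, k = k)$pred
  rmse(new_data$y, pred)
}

knn_results = data.frame(
  k        = k_values,
  trn_rmse = sapply(k_values, knn_rmse, new_data = trn_data),
  tst_rmse = sapply(k_values, knn_rmse, new_data = tst_data)
)

# Render the summary as a markdown table
knitr::kable(knn_results, digits = 3,
             col.names = c("k", "Train RMSE", "Test RMSE"))
```

With k = 1 and distinct x values, each training point is its own nearest neighbor, so the train RMSE in the first row is zero.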
Fit a total of five tree models to the training data that can be used to predict y from x. To do so, use the rpart() function from the rpart package. The rpart() syntax is very similar to lm(). For example:
rpart(y ~ x, data = some_data, control = rpart.control(cp = 0.5, minsplit = 2))
This code fits a tree with a cost complexity parameter of 0.5, as defined using the cp argument to rpart.control. We will consider this to be the single tuning parameter of tree fitting. (More on this much later in the course.) The minsplit argument could also be considered a tuning parameter, but we will keep it fixed at 2.
Use cp values of 0, 0.001, 0.01, 0.1, and 1. For each, calculate both train and test RMSE. Do not output these results directly; instead, summarize them using a well-formatted markdown table that shows cp, train RMSE, and test RMSE.
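The loop over cp values might be organized as in the following sketch. It runs on simulated data, and the sim_data() and rmse() helpers and all object names are assumptions made for the example. It prints a plain data frame; kable() from knitr, which the assignment permits, could be used to render the same data frame as a markdown table.

```r
library(rpart)

# Simulated stand-in data (assumption); the assignment itself uses the
# hw01-trn-data.csv and hw01-tst-data.csv files.
set.seed(42)
sim_data = function(n) {
  x = runif(n, min = -2, max = 2)
  data.frame(x = x, y = x ^ 3 - x + rnorm(n, sd = 0.5))
}
trn_data = sim_data(200)
tst_data = sim_data(200)

# Root mean squared error between observed and predicted values
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

cp_values = c(0, 0.001, 0.01, 0.1, 1)

# Fit one tree per cp value, keeping minsplit fixed at 2
tree_fits = lapply(cp_values, function(cp_val) {
  rpart(y ~ x, data = trn_data,
        control = rpart.control(cp = cp_val, minsplit = 2))
})

tree_results = data.frame(
  cp       = cp_values,
  trn_rmse = sapply(tree_fits, function(fit) {
    rmse(trn_data$y, predict(fit, trn_data))
  }),
  tst_rmse = sapply(tree_fits, function(fit) {
    rmse(tst_data$y, predict(fit, tst_data))
  })
)

print(tree_results)
```

Smaller cp values permit more splits, so the tree with cp = 0 is the most flexible of the five and the tree with cp = 1 is the least flexible.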
Add lines (curves) to the following plot which correspond to the fitted model for the best polynomial model, best KNN model, and best tree model based on the results of the previous exercises. Use different line types and colors for the different models. Add a legend to indicate which line is which model.
plot(y ~ x, data = hw01_trn_data, col = "darkgrey", pch = 20,
main = "Homework 01, Training Data")
grid()
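The general pattern for overlaying fitted curves is to predict on a fine grid of x values and pass the result to lines(). The sketch below shows this for a polynomial and a tree on simulated data. The degree of 3 and cp of 0.01 are arbitrary placeholders, not "best" values, and the helper, colors, and object names are assumptions made for the example.

```r
library(rpart)

# Simulated stand-in data (assumption); the assignment itself uses the
# hw01-trn-data.csv file.
set.seed(42)
sim_data = function(n) {
  x = runif(n, min = -2, max = 2)
  data.frame(x = x, y = x ^ 3 - x + rnorm(n, sd = 0.5))
}
trn_data = sim_data(200)

# Placeholder models; the tuning values here are arbitrary
poly_fit = lm(y ~ poly(x, degree = 3), data = trn_data)
tree_fit = rpart(y ~ x, data = trn_data,
                 control = rpart.control(cp = 0.01, minsplit = 2))

plot(y ~ x, data = trn_data, col = "darkgrey", pch = 20,
     main = "Simulated Training Data")
grid()

# Dense grid over the observed range of x, used to draw smooth curves
x_grid = data.frame(x = seq(min(trn_data$x), max(trn_data$x),
                            length.out = 500))
poly_curve = predict(poly_fit, x_grid)
tree_curve = predict(tree_fit, x_grid)

# Distinct colors and line types, matched in the legend
lines(x_grid$x, poly_curve, col = "dodgerblue", lty = 1, lwd = 2)
lines(x_grid$x, tree_curve, col = "darkorange", lty = 2, lwd = 2)
legend("topleft", legend = c("Polynomial (degree 3)", "Tree (cp = 0.01)"),
       col = c("dodgerblue", "darkorange"), lty = c(1, 2), lwd = 2)
```

A KNN curve can be added the same way by evaluating knn.reg() from FNN with the grid as its test argument and plotting the resulting predictions with a third color and line type.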
(a) Which, if any, of the polynomial models are likely underfitting based on the results you obtained?
(b) Which, if any, of the polynomial models are likely overfitting based on the results you obtained?
(c) Which, if any, of the KNN models are likely underfitting based on the results you obtained?
(d) Which, if any, of the KNN models are likely overfitting based on the results you obtained?
(e) Which, if any, of the tree models are likely underfitting based on the results you obtained?
(f) Which, if any, of the tree models are likely overfitting based on the results you obtained?