Please see the homework instructions document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.

For this homework we will use data found in `wisc-trn.csv` and `wisc-tst.csv`, which contain train and test data respectively. `wisc.csv` is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable.

You should use the `caret` package and training pipeline to complete this homework. Any time you use the `train()` function, first run `set.seed(1337)`.
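As an illustration of the factor coercion mentioned above, here is a minimal sketch in base R. The small data frame is a made-up stand-in for the training data, and the column names `class` and `radius` are assumptions, not taken from the actual files.

```r
# Hypothetical stand-in for the training data. With the real files you would
# start from something like: wisc_trn <- read.csv("wisc-trn.csv")
# The column names `class` and `radius` are assumptions for illustration.
wisc_trn <- data.frame(
  class  = c("B", "M", "B", "M"),
  radius = c(12.1, 20.3, 11.4, 18.9),
  stringsAsFactors = FALSE
)

# Coerce the response to a factor so that modeling functions treat the
# problem as classification rather than regression.
wisc_trn$class <- factor(wisc_trn$class)

# Inspect the result: `class` should now be a factor with two levels.
str(wisc_trn)
levels(wisc_trn$class)
```

The same coercion would need to be applied to the test data so that both data frames share the same factor levels.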

**[6 points]** Train a KNN model using all available predictors, **no data preprocessing**, 5-fold cross-validation, and a well chosen value of the tuning parameter. Consider \(k = 1, 3, 5, 7, \ldots, 101\). Store the tuned model fit to the training data for later use. Plot the cross-validated accuracies as a function of the tuning parameter.
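The general shape of this kind of tuning can be sketched as follows. This is not a solution to the exercise: it uses simulated data in place of `wisc-trn.csv`, and the names `sim_trn`, `cv_5`, `k_grid`, and `knn_mod` are hypothetical.

```r
library(caret)

# Simulated two-class data standing in for the homework data (assumption).
set.seed(42)
n <- 300
sim_trn <- data.frame(x1 = rnorm(n), x2 = rnorm(n, sd = 50))
signal <- sim_trn$x1 + sim_trn$x2 / 50 + rnorm(n, sd = 0.5)
sim_trn$class <- factor(ifelse(signal > 0, "M", "B"))

# 5-fold cross-validation and the grid of odd k values from 1 to 101.
cv_5 <- trainControl(method = "cv", number = 5)
k_grid <- expand.grid(k = seq(1, 101, by = 2))

# Set the seed immediately before train() so the folds are reproducible.
set.seed(1337)
knn_mod <- train(
  class ~ .,
  data = sim_trn,
  method = "knn",
  trControl = cv_5,
  tuneGrid = k_grid
)

# The chosen k, and the cross-validated accuracy plotted against k.
knn_mod$bestTune
plot(knn_mod)
```

Because no `preProcess` argument is given, the predictors are used on their original scales.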

**[6 points]** Train a KNN model using all available predictors, predictors scaled to have mean 0 and variance 1, 5-fold cross-validation, and a well chosen value of the tuning parameter. Consider \(k = 1, 3, 5, 7, \ldots, 101\). Store the tuned model fit to the training data for later use. Plot the cross-validated accuracies as a function of the tuning parameter.
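One way to request centering and scaling is the `preProcess` argument of `train()`, sketched below on simulated data. Again, this is an illustration rather than a solution: the data frame `sim_trn` and the object names `cv_5`, `k_grid`, and `knn_scaled` are hypothetical.

```r
library(caret)

# Simulated two-class data with predictors on very different scales
# (assumption; stands in for the homework data).
set.seed(42)
n <- 300
sim_trn <- data.frame(x1 = rnorm(n), x2 = rnorm(n, sd = 50))
signal <- sim_trn$x1 + sim_trn$x2 / 50 + rnorm(n, sd = 0.5)
sim_trn$class <- factor(ifelse(signal > 0, "M", "B"))

cv_5 <- trainControl(method = "cv", number = 5)
k_grid <- expand.grid(k = seq(1, 101, by = 2))

# "center" subtracts each predictor's mean and "scale" divides by its
# standard deviation, giving mean 0 and variance 1.
set.seed(1337)
knn_scaled <- train(
  class ~ .,
  data = sim_trn,
  method = "knn",
  preProcess = c("center", "scale"),
  trControl = cv_5,
  tuneGrid = k_grid
)

knn_scaled$bestTune
plot(knn_scaled)
```

Passing the preprocessing through `train()` means the centering and scaling are estimated within each resample rather than once on the full training data.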

**[6 points]** Now that we’ve introduced `caret`, it becomes extremely easy to try different statistical learning methods. Train a random forest using all available predictors, **no data preprocessing**, 5-fold cross-validation, and a well-chosen value of the tuning parameter. When using `caret` to perform the tuning, there is only a single tuning parameter, `mtry`. Consider `mtry` values between 1 and 10. Store the tuned model fit to the training data for later use. Report the cross-validated accuracies as a function of the tuning parameter using a well-formatted table.
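A rough sketch of tuning `mtry` with `caret` is shown below, assuming the `randomForest` package is installed (it backs `method = "rf"`). The simulated data frame `sim_trn` and the names `cv_5`, `rf_grid`, and `rf_mod` are hypothetical, and this is not a solution to the exercise.

```r
library(caret)

# Simulated data with 10 predictors, standing in for the homework data
# (assumption). The predictors are named V1, ..., V10 by as.data.frame().
set.seed(42)
n <- 200
sim_trn <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
signal <- sim_trn$V1 + sim_trn$V2 + rnorm(n, sd = 0.5)
sim_trn$class <- factor(ifelse(signal > 0, "M", "B"))

cv_5 <- trainControl(method = "cv", number = 5)
rf_grid <- expand.grid(mtry = 1:10)

set.seed(1337)
rf_mod <- train(
  class ~ .,
  data = sim_trn,
  method = "rf",
  trControl = cv_5,
  tuneGrid = rf_grid
)

# Cross-validated accuracy for each candidate value of mtry.
rf_mod$results[, c("mtry", "Accuracy")]
rf_mod$bestTune
```

In an R Markdown document, the data frame of results could be passed to a table-formatting function such as `knitr::kable()` to produce the requested table.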

**[1 point each]** Answer the following questions based on your results from the three exercises. Format your answer to this exercise as a table with one column indicating the part, and the other column for your answer. See the `rmarkdown` source for a template of this table.

**(a)** What value of \(k\) is chosen for KNN without predictor scaling?

**(b)** What is the cross-validated accuracy for KNN without predictor scaling?

**(c)** What is the test accuracy for KNN without predictor scaling?

**(d)** What value of \(k\) is chosen for KNN **with** predictor scaling?

**(e)** What is the cross-validated accuracy for KNN **with** predictor scaling?

**(f)** What is the test accuracy for KNN **with** predictor scaling?

**(g)** Do you think that KNN is performing better with or without predictor scaling?

**(h)** What value of `mtry` is chosen for the random forest?

**(i)** Using the random forest, what is the (estimated) probability that the 10th observation of the test data is a cancerous tumor?

**(j)** Using the random forest, what is the (test) sensitivity?

**(k)** Using the random forest, what is the (test) specificity?

**(l)** Based on these results, is the random forest or KNN model performing better?
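For reference, the metrics named in the questions above can be computed from a confusion table in base R. The sketch below uses made-up label vectors `actual` and `pred`, and it assumes that `"M"` (malignant) is the positive class; neither the vectors nor the level names come from the homework data.

```r
# Hypothetical true and predicted labels for 10 test observations.
# Assumption: "M" (malignant) is the positive class, "B" the negative class.
actual <- factor(c("M", "M", "M", "M", "B", "B", "B", "B", "B", "B"),
                 levels = c("B", "M"))
pred   <- factor(c("M", "M", "M", "B", "B", "B", "B", "B", "M", "M"),
                 levels = c("B", "M"))

# Rows are predictions, columns are the true classes.
conf_mat <- table(predicted = pred, actual = actual)
conf_mat

# Accuracy: proportion of predictions that match the truth.
accuracy <- mean(pred == actual)

# Sensitivity: true positives divided by all actual positives.
sensitivity <- as.numeric(conf_mat["M", "M"] / sum(conf_mat[, "M"]))

# Specificity: true negatives divided by all actual negatives.
specificity <- as.numeric(conf_mat["B", "B"] / sum(conf_mat[, "B"]))

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With a model fit by `train()`, class predictions on new data are available from `predict()` with the `newdata` argument, and `type = "prob"` returns estimated class probabilities for models that support them.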