Please see the homework instructions document for detailed instructions and some grading notes. Failure to follow instructions will result in point reductions.
[7 points] For this exercise we will use the data found in `wisc-trn.csv` and `wisc-tst.csv`, which contain train and test data, respectively. `wisc.csv` is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)
You should consider coercing the response to be a factor variable. Use KNN with all available predictors. For simplicity, do not scale the data. (In practice, scaling would slightly increase performance on this dataset.) Consider \(k = 1, 3, 5, 7, \ldots, 51\). Plot train and test error vs \(k\) on a single plot.
Use the seed value provided below for this exercise.
`set.seed(314)`
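A minimal sketch of one way to set this up, assuming the response is the first column and is named `class` (check the actual column names after import and adjust the indexing):

```r
# Sketch: KNN train/test error curves.
# Assumes the response is the first column, named "class".
library(class)  # knn()

wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")
wisc_trn$class = as.factor(wisc_trn$class)
wisc_tst$class = as.factor(wisc_tst$class)

set.seed(314)
k_vals = seq(1, 51, by = 2)

calc_err = function(actual, predicted) mean(actual != predicted)

knn_err = function(k, tst) {
  pred = knn(train = wisc_trn[, -1], test = tst[, -1],
             cl = wisc_trn$class, k = k)
  calc_err(tst$class, pred)
}

trn_errs = sapply(k_vals, knn_err, tst = wisc_trn)
tst_errs = sapply(k_vals, knn_err, tst = wisc_tst)

plot(k_vals, trn_errs, type = "b", col = "dodgerblue",
     ylim = range(c(trn_errs, tst_errs)),
     xlab = "k", ylab = "classification error")
lines(k_vals, tst_errs, type = "b", col = "darkorange")
legend("topright", legend = c("train", "test"),
       col = c("dodgerblue", "darkorange"), lty = 1)
```

Note that `knn()` re-randomizes tie-breaking, which is why the seed is set before the error calculations.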
[5 points] Continue with the cancer data from Exercise 1. Now consider an additive logistic regression that uses only two predictors, `radius` and `symmetry`. Plot the test data with `radius` on the \(x\)-axis and `symmetry` on the \(y\)-axis, with the points colored according to their tumor status. Add a line representing the decision boundary for a classifier using 0.5 as the cutoff for predicted probability.
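A sketch of one way to draw this, assuming the response column is named `class` with levels `B`/`M` (substitute the actual names). With a 0.5 cutoff, the logistic decision boundary is the line where the linear predictor equals zero, i.e. \(\hat{\beta}_0 + \hat{\beta}_1 \cdot \texttt{radius} + \hat{\beta}_2 \cdot \texttt{symmetry} = 0\):

```r
# Sketch: assumes the response is named "class" with levels "B"/"M".
wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")
wisc_trn$class = as.factor(wisc_trn$class)
wisc_tst$class = as.factor(wisc_tst$class)

mod = glm(class ~ radius + symmetry, data = wisc_trn, family = binomial)

# At cutoff 0.5 the boundary satisfies b0 + b1 * radius + b2 * symmetry = 0,
# i.e. symmetry = -(b0 + b1 * radius) / b2, a line in the (radius, symmetry) plane.
b = coef(mod)
plot(symmetry ~ radius, data = wisc_tst, pch = 20,
     col = ifelse(wisc_tst$class == "M", "darkorange", "dodgerblue"))
abline(a = -b[1] / b[3], b = -b[2] / b[3], lwd = 2)
```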
[5 points] Continue with the cancer data from Exercise 1. Again consider an additive logistic regression that uses only two predictors, `radius` and `symmetry`. Report test sensitivity, test specificity, and test accuracy for three classifiers, each using a different cutoff for predicted probability:

Consider `M` to be the "positive" class when calculating sensitivity and specificity. Summarize these results using a single well-formatted table.
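A sketch of computing the three metrics at a set of cutoffs, again assuming the response is named `class` with levels `B`/`M`. The cutoff values below are placeholders; use the ones specified in the assignment:

```r
# Sketch: sensitivity, specificity, and accuracy at several cutoffs,
# with "M" as the positive class. Cutoff values are placeholders.
wisc_trn = read.csv("wisc-trn.csv")
wisc_tst = read.csv("wisc-tst.csv")
wisc_trn$class = as.factor(wisc_trn$class)
wisc_tst$class = as.factor(wisc_tst$class)

mod  = glm(class ~ radius + symmetry, data = wisc_trn, family = binomial)
prob = predict(mod, newdata = wisc_tst, type = "response")
actual = wisc_tst$class

metrics_at = function(cutoff) {
  pred = ifelse(prob > cutoff, "M", "B")
  c(sensitivity = mean(pred[actual == "M"] == "M"),  # true positive rate
    specificity = mean(pred[actual == "B"] == "B"),  # true negative rate
    accuracy    = mean(pred == actual))
}

cutoffs = c(0.1, 0.5, 0.9)  # placeholder cutoffs
results = t(sapply(cutoffs, metrics_at))
rownames(results) = paste0("c = ", cutoffs)
knitr::kable(results)  # a single well-formatted table
```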
[7 points] Use the data found in `hw05-trn.csv` and `hw05-tst.csv`, which contain train and test data, respectively. Use `y` as the response. Coerce `y` to be a factor after importing the data if it is not already.

Create a pairs plot with ellipses for the training data, then train the following models using both available predictors:

Calculate test and train error rates for each model. Summarize these results using a single well-formatted table.
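A sketch of this workflow, with several assumptions: the predictor names `x1`/`x2` and the particular discriminant models shown (LDA, QDA, naive Bayes) are illustrative stand-ins for the assignment's actual model list, and `featurePlot()` with `plot = "ellipse"` requires the `ellipse` package to be installed:

```r
# Sketch: pairs plot with ellipses, then discriminant models and an error table.
# Predictor names x1/x2 and the model list here are assumptions.
library(MASS)    # lda(), qda()
library(e1071)   # naiveBayes()
library(caret)   # featurePlot()

hw05_trn = read.csv("hw05-trn.csv")
hw05_tst = read.csv("hw05-tst.csv")
hw05_trn$y = as.factor(hw05_trn$y)
hw05_tst$y = as.factor(hw05_tst$y)

# pairs plot with ellipses for the training data
featurePlot(x = hw05_trn[, c("x1", "x2")], y = hw05_trn$y, plot = "ellipse")

calc_err = function(actual, predicted) mean(actual != predicted)

lda_mod = lda(y ~ ., data = hw05_trn)
qda_mod = qda(y ~ ., data = hw05_trn)
nb_mod  = naiveBayes(y ~ ., data = hw05_trn)

err_tab = data.frame(
  model = c("LDA", "QDA", "Naive Bayes"),
  train = c(calc_err(hw05_trn$y, predict(lda_mod, hw05_trn)$class),
            calc_err(hw05_trn$y, predict(qda_mod, hw05_trn)$class),
            calc_err(hw05_trn$y, predict(nb_mod,  hw05_trn))),
  test  = c(calc_err(hw05_tst$y, predict(lda_mod, hw05_tst)$class),
            calc_err(hw05_tst$y, predict(qda_mod, hw05_tst)$class),
            calc_err(hw05_tst$y, predict(nb_mod,  hw05_tst)))
)
knitr::kable(err_tab)  # single well-formatted table
```

If the model list includes flat-prior variants, those can be fit the same way via the `prior` argument to `lda()`/`qda()`, e.g. `prior = rep(1/4, 4)` for four classes.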
[1 point each] Answer the following questions based on your results from the three exercises.
(a) Which \(k\) performs best in Exercise 1?
(b) In Exercise 4, which model performs best?
(c) In Exercise 4, why does Naive Bayes perform poorly?
(d) In Exercise 4, which performs better, LDA or QDA? Why?
(e) In Exercise 4, which prior performs better? Estimating from data, or using a flat prior? Why?
(f) In Exercise 4, of the four classes, which is the easiest to classify?
(g) [Not Graded] In Exercise 3, which classifier would be the best to use in practice?