This is not an ordinary homework. Many of the usual homework policies apply, but there are several additional policies outlined below.

“In theory there is no difference between theory and practice. In practice there is.”

Yogi Berra

This homework has four parts: a classification task, a regression task, a spam filter task, and a brief report.


# suggested packages
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
library(klaR)
library(gam)
library(e1071)
# feel free to use additional packages

The Autograder

For this homework, we will use an autograder and a leaderboard.

The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat432ag/

In order to submit to the autograder, you will need to install the devtools package, and then use it to install the autograde package from GitHub.

install.packages("devtools")
library(devtools)
devtools::install_github("coatless/autograde")

The gen_agfile() function in the autograde package will write a vector of predicted results to a .csv file in the format the autograder expects.

library(autograde)
gen_agfile(prediction_results, file.name = "file_name")

Classification Task

There are two datasets for the classification task. Our goal is to predict the response y in the test data.

class_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-trn.csv")
class_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-tst.csv")

We will use accuracy as our metric for comparing models.

accuracy = function(actual, predicted) {
  mean(actual == predicted)
}
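
Before spending autograder submissions, you can estimate accuracy yourself with cross-validation or a held-out split. The sketch below is optional and purely illustrative: it assumes y can be coerced to a factor, and the two candidate methods (logistic regression and a random forest via caret) are arbitrary choices.

set.seed(432)
class_trn$y = as.factor(class_trn$y)
cv_5 = trainControl(method = "cv", number = 5)

# cross-validated accuracy for two candidate models
glm_cv = train(y ~ ., data = class_trn, method = "glm", trControl = cv_5)
rf_cv  = train(y ~ ., data = class_trn, method = "rf",  trControl = cv_5)

# compare the best cross-validated accuracy of each candidate
c(glm = max(glm_cv$results$Accuracy), rf = max(rf_cv$results$Accuracy))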

After finding a good model using the training data, make predictions on the test data and submit them to the autograder to obtain a test accuracy. Your goal is to create a model that obtains the highest possible accuracy on the test data. You may use the training data any way you like, and you may submit to the autograder a total of 11 times. Your first submission should be an additive logistic regression, simply to verify that you can use the autograder; you may use the remaining ten submissions however you choose.
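
For reference, a minimal sketch of that first submission might look like the following. It assumes a 0.5 probability cutoff and that glm() models the probability of the second factor level of y; the file name is only an example.

class_trn$y = as.factor(class_trn$y)

# additive logistic regression: y on all predictors
cls_glm  = glm(y ~ ., data = class_trn, family = "binomial")
cls_prob = predict(cls_glm, newdata = class_tst, type = "response")

# probabilities above 0.5 map to the second factor level
cls_pred = ifelse(cls_prob > 0.5,
                  levels(class_trn$y)[2],
                  levels(class_trn$y)[1])

gen_agfile(cls_pred, file.name = "class_glm_pred")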


Regression Task

There are two datasets for the regression task. Our goal is to predict the response y in the test data.

reg_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-trn.csv")
reg_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-tst.csv")

We will use RMSE as our metric for comparing models.

rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

After finding a good model using the training data, make predictions on the test data and submit them to the autograder to obtain a test RMSE. Your goal is to create a model that obtains the lowest possible RMSE on the test data. You may use the training data any way you like, and you may submit to the autograder a total of 11 times. Your first submission should be an additive linear model, simply to verify that you can use the autograder; you may use the remaining ten submissions however you choose.
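
For reference, a minimal sketch of that first submission might look like the following; the file name is only an example.

# additive linear model: y on all predictors
reg_lm   = lm(y ~ ., data = reg_trn)
reg_pred = predict(reg_lm, newdata = reg_tst)

gen_agfile(reg_pred, file.name = "reg_lm_pred")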


Spam Filter Task

The data for this task is originally from the UCI Machine Learning Repository, but it has been modified for our purposes. (Any attempt to use the original data will be a waste of time; assume that various data augmentations have been applied to the provided data.) Take a look at the documentation to get an understanding of the feature variables. Their names in our data are slightly different, but they are descriptive enough to match.

Our data will store the spam status of an email (response) in a variable named type. This variable will have two levels: spam and nonspam.

Additional documentation can be found in the spam data from the kernlab package. This documentation will use the same feature names as the data provided.

Your task is to create an actual spam filter. Concerning ourselves only with overall accuracy would treat marking spam as non-spam and marking non-spam as spam as equally costly errors. They clearly are not: marking a non-spam email as spam could have terrible consequences.

spam_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-trn.csv")
spam_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-tst.csv")

To assess how well a spam filter is working, we will calculate the following score, where “ham” is an email that is not spam:

\[\begin{aligned} \text{score} &= 1 \times \{ \text{spam labeled spam} \} \\ &+ 2 \times \{ \text{ham labeled ham} \} \\ &- 25 \times \{ \text{ham labeled spam} \} \\ &- 1 \times \{ \text{spam labeled ham} \} \\ \end{aligned}\]
score = function(actual, predicted) {
  1   * sum(predicted == "spam" & actual == "spam") +
  -25 * sum(predicted == "spam" & actual == "nonspam") +
  -1  * sum(predicted == "nonspam" & actual == "spam") +
  2   * sum(predicted == "nonspam" & actual == "nonspam")
}

Positive weights are assigned to correct decisions and negative weights to incorrect decisions, with marking a non-spam email as spam penalized much more heavily. Your goal is to create a spam filter that achieves the highest possible score.
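
One way (certainly not the only way) to respect these asymmetric costs is to fit a probability model and then choose the cutoff for labeling an email spam by maximizing score() on a held-out split, rather than defaulting to 0.5. The sketch below is illustrative only: the 80/20 split, the logistic regression, and the grid of cutoffs are all arbitrary choices, and it relies on "spam" being the second factor level of type.

set.seed(432)
spam_trn$type = as.factor(spam_trn$type)

# hold out 20% of the training data for choosing a cutoff
idx = sample(nrow(spam_trn), size = floor(0.8 * nrow(spam_trn)))
spam_est = spam_trn[idx, ]
spam_val = spam_trn[-idx, ]

# logistic regression models P(type = "spam"), the second factor level
spam_glm = glm(type ~ ., data = spam_est, family = "binomial")
val_prob = predict(spam_glm, newdata = spam_val, type = "response")

# evaluate score() over a grid of cutoffs on the held-out data
cutoffs = seq(0.05, 0.95, by = 0.05)
val_scores = sapply(cutoffs, function(cut) {
  val_pred = ifelse(val_prob > cut, "spam", "nonspam")
  score(actual = spam_val$type, predicted = val_pred)
})

# cutoff with the best held-out score
cutoffs[which.max(val_scores)]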

After training a spam filter, make predictions on the test data and submit them to the autograder to obtain a score. Your goal is to create a spam filter that obtains the highest possible score on the test data. You may use the training data any way you like, and you may submit to the autograder a total of 11 times. Your first submission should be a simple model, just to verify that you can use the autograder; you may use the remaining ten submissions however you choose.


Report

Create a very brief report, using this template, that shows only the steps necessary to train your best model for each task and to create the predictions you submitted to the autograder. Submit a .zip to Compass as you would for other homework assignments, following the usual filename conventions.


Grading

As with all other homework assignments, the total number of possible points is 30.

Late Policy

  • The usual late policy applies to the report submitted to Compass as Homework 08.
  • No late submission to the autograder will be accepted. Because of this, you should submit the suggested test submissions as soon as possible. The deadline for submitting to the autograder is: Friday, April 20 by 11:59 PM.

Academic Integrity

Since this homework is partially a competition, there should be no collaboration or discussion with other students. Your work should be entirely your own.


Hints and Suggestions