This is not an ordinary homework. Many of the usual homework policies apply, but there are several additional policies outlined below.
“In theory there is no difference between theory and practice. In practice there is.”
This homework will have four parts:
- An autograder submission to Classification-sp18
- An autograder submission to Regression-sp18
- An autograder submission to Spammer-sp18
- An rmarkdown report detailing the three competition tasks

# suggested packages
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
library(klaR)
library(gam)
library(e1071)
# feel free to use additional packages
For this homework, we will utilize an autograder and leaderboard.
The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat432ag/
In order to submit to the autograder, you will need to install the devtools package, and then install the autograde package.
install.packages("devtools")
library(devtools)
devtools::install_github("coatless/autograde")
The gen_agfile() function in the autograde package will write a vector of predicted results to a .csv file in the format the autograder expects.
library(autograde)
gen_agfile(prediction_results, file.name = "file_name")
There are two datasets for the classification task. Our goal is to predict the response y in the test data.
- class-trn.csv contains a binary response y and several predictors.
- class-tst.csv contains the same predictors as class-trn.csv but no response.

class_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-trn.csv")
class_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-tst.csv")
We will use accuracy as our metric for comparing models.
accuracy = function(actual, predicted) {
mean(actual == predicted)
}
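Since the metric is accuracy, one way to compare candidate models without spending autograder submissions is cross-validation on the training data; caret makes this convenient. Below is a minimal sketch: the random forest method and the five-fold setup are illustrative choices, not requirements.

# illustrative sketch: cross-validated accuracy via caret
# (random forest and 5 folds are arbitrary choices here)
library(caret)
class_trn_cv = class_trn
class_trn_cv$y = factor(class_trn_cv$y)  # caret expects a factor response for classification
set.seed(432)
cv_fit = train(y ~ ., data = class_trn_cv, method = "rf",
               trControl = trainControl(method = "cv", number = 5))
cv_fit$results  # cross-validated accuracy for each tuning value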
After finding a good model using the training data, make predictions on the test data and submit them to the autograder to obtain test accuracy. Your goal is to create a model that obtains the highest possible accuracy on the test data. You may use the training data any way you would like. You can submit to the autograder a total of 11 times. Your first submission should be an additive logistic regression, in order to test your ability to use the autograder. You may use the remaining ten submissions any way you choose.
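A minimal sketch of that first required submission, assuming the response y in class-trn.csv is coded 0/1 (adjust the cutoff step if the labels differ); the file name passed to gen_agfile() is just an example.

# first-submission sketch: additive logistic regression on all predictors
class_glm = glm(y ~ ., data = class_trn, family = "binomial")

# predicted probabilities on the test data, classified with a 0.5 cutoff
class_prob = predict(class_glm, newdata = class_tst, type = "response")
class_pred = ifelse(class_prob > 0.5, 1, 0)  # assumes y is coded 0/1

# write the predictions in the format the autograder expects
gen_agfile(class_pred, file.name = "class-glm")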
There are two datasets for the regression task. Our goal is to predict the response y in the test data.
- reg-trn.csv contains a numerical response y and several predictors.
- reg-tst.csv contains the same predictors as reg-trn.csv but no response.

reg_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-trn.csv")
reg_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-tst.csv")
We will use RMSE as our metric for comparing models.
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
After finding a good model using the training data, make predictions on the test data and submit them to the autograder to obtain test RMSE. Your goal is to create a model that obtains the lowest possible RMSE on the test data. You may use the training data any way you would like. You can submit to the autograder a total of 11 times. Your first submission should be an additive linear model, in order to test your ability to use the autograder. You may use the remaining ten submissions any way you choose.
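A minimal sketch of that first required submission, the additive linear model; again the file name is only an example.

# first-submission sketch: additive linear model on all predictors
reg_lm = lm(y ~ ., data = reg_trn)

# predictions on the test data, written in the autograder's format
reg_pred = predict(reg_lm, newdata = reg_tst)
gen_agfile(reg_pred, file.name = "reg-lm")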
The data for this task is originally from the UCI Machine Learning Repository. It has been modified for our purposes. (Any attempt to use the original data will be a waste of time. Assume that various data augmentations have been performed on the provided data.) Take a look at the documentation to get an understanding of the feature variables. Their names in our data are slightly different, but the names are descriptive enough to match them.
Our data will store the spam status of an email (the response) in a variable named type. This variable will have two levels: spam and nonspam.

Additional documentation can be found in the spam data from the kernlab package. This documentation will use the same feature names as the data provided.
Your task is to actually create a spam filter. If we concerned ourselves only with overall accuracy, marking spam as non-spam and marking non-spam as spam would count as equal errors. But they clearly are not! Marking a non-spam email as spam could have terrible consequences.
spam_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-trn.csv")
spam_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-tst.csv")
To assess how well a spam filter is working, we will calculate the following score, where “ham” is an email that is not spam:
\[\begin{aligned} \text{score} &= 1 \times \{ \text{spam labeled spam} \} \\ &+ 2 \times \{ \text{ham labeled ham} \} \\ &- 25 \times \{ \text{ham labeled spam} \} \\ &- 1 \times \{ \text{spam labeled ham} \} \\ \end{aligned}\]

score = function(actual, predicted) {
1 * sum(predicted == "spam" & actual == "spam") +
-25 * sum(predicted == "spam" & actual == "nonspam") +
-1 * sum(predicted == "nonspam" & actual == "spam") +
2 * sum(predicted == "nonspam" & actual == "nonspam")
}
Positive weights are assigned to correct decisions. Negative weights are assigned to incorrect decisions. (Marking a non-spam email as spam is penalized much more heavily.) Your goal is to create a spam filter that achieves the highest possible score.
After training a spam filter, make predictions on the test data and submit them to the autograder to obtain a score. Your goal is to create a spam filter that obtains the highest possible score on the test data. You may use the training data any way you would like. You can submit to the autograder a total of 11 times. Your first submission should be a simple model, in order to test your ability to use the autograder. You may use the remaining ten submissions any way you choose.
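Because labeling ham as spam costs -25 while missing a spam email costs only -1, simply minimizing misclassification rate is unlikely to maximize the score; raising the probability cutoff for calling an email spam is one simple way to respect that asymmetry. Below is a minimal sketch of such a simple first submission, where the 0.95 cutoff is an illustrative choice rather than a tuned value, and the file name is only an example.

# illustrative sketch: a deliberately conservative logistic-regression spam filter
spam_trn$type = factor(spam_trn$type)  # levels: "nonspam", "spam"
spam_glm = glm(type ~ ., data = spam_trn, family = "binomial")

# predicted probability of "spam" (the second factor level)
spam_prob = predict(spam_glm, newdata = spam_tst, type = "response")

# only label an email spam when the model is very confident, since false alarms cost -25
spam_pred = ifelse(spam_prob > 0.95, "spam", "nonspam")

gen_agfile(spam_pred, file.name = "spam-filter")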
Create a very brief report using this template that shows only the steps necessary to train your best model for each task and create the predictions that you submitted to the autograders. Submit a .zip to Compass as you would for other homework assignments, following the usual filename conventions.
Like all other homework assignments, the total possible points is 30.
Since this homework is partially a competition, there should be no collaboration or discussion with other students. Your work should be entirely your own.
Don't forget to submit your .Rmd files! caret is your friend.