This is not an ordinary homework. Many of the usual homework policies apply, but there are several additional policies outlined below.
“In theory there is no difference between theory and practice. In practice there is.”
This homework will have four parts:
- An autograder submission to Classification-sp18
- An autograder submission to Regression-sp18
- An autograder submission to Spammer-sp18
- An rmarkdown report detailing the three competition tasks

# suggested packages
library(MASS)
library(caret)
library(tidyverse)
library(knitr)
library(kableExtra)
library(mlbench)
library(ISLR)
library(ellipse)
library(randomForest)
library(gbm)
library(glmnet)
library(rpart)
library(rpart.plot)
library(klaR)
library(gam)
library(e1071)
# feel free to use additional packages
For this homework, we will utilize an autograder and leaderboard.
The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat432ag/
In order to submit to the autograder, you will need to install the devtools package, and then install the autograde package.
install.packages("devtools")
library(devtools)
devtools::install_github("coatless/autograde")
The gen_agfile() function in the autograde package will write a vector of predicted results to a .csv file in the format the autograder expects.
library(autograde)
gen_agfile(prediction_results, file.name = "file_name")
There are two datasets for the classification task. Our goal is to predict the response y in the test data.
- class-trn.csv contains a binary response y and several predictors.
- class-tst.csv contains the same predictors as class-trn.csv but no response.

class_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-trn.csv")
class_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/class-tst.csv")
We will use accuracy as our metric for comparing models.
accuracy = function(actual, predicted) {
mean(actual == predicted)
}
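Since the metric is accuracy, one way to compare candidate models without spending autograder submissions is cross-validation on the training data; caret makes this convenient. Below is a minimal sketch: the random forest method and the five-fold setup are illustrative choices, not requirements.

# illustrative sketch: cross-validated accuracy via caret
# (random forest and 5 folds are arbitrary choices here)
library(caret)
class_trn_cv = class_trn
class_trn_cv$y = factor(class_trn_cv$y)  # caret expects a factor response for classification
set.seed(432)
cv_fit = train(y ~ ., data = class_trn_cv, method = "rf",
               trControl = trainControl(method = "cv", number = 5))
cv_fit$results  # cross-validated accuracy for each tuning value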
After finding a good model using the training data, make predictions on the test data and submit them to the autograder to obtain test accuracy. Your goal is to create a model that obtains the highest possible accuracy on the test data. You may use the training data any way you would like. You can submit to the autograder a total of 11 times. Your first submission should be an additive logistic regression, in order to test your ability to use the autograder. You may use the remaining ten submissions any way you choose.
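A minimal sketch of that first required submission, assuming the response y in class-trn.csv is coded 0/1 (adjust the cutoff step if the labels differ); the file name passed to gen_agfile() is just an example.

# first-submission sketch: additive logistic regression on all predictors
class_glm = glm(y ~ ., data = class_trn, family = "binomial")

# predicted probabilities on the test data, classified with a 0.5 cutoff
class_prob = predict(class_glm, newdata = class_tst, type = "response")
class_pred = ifelse(class_prob > 0.5, 1, 0)  # assumes y is coded 0/1

# write the predictions in the format the autograder expects
gen_agfile(class_pred, file.name = "class-glm")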
There are two datasets for the regression task. Our goal is to predict the response y in the test data.
- reg-trn.csv contains a numerical response y and several predictors.
- reg-tst.csv contains the same predictors as reg-trn.csv but no response.

reg_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-trn.csv")
reg_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/reg-tst.csv")
We will use RMSE as our metric for comparing models.
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
After finding a good model using the training data, make predictions on the test data and submit them to the autograder to obtain test RMSE. Your goal is to create a model that obtains the lowest possible RMSE on the test data. You may use the training data any way you would like. You can submit to the autograder a total of 11 times. Your first submission should be an additive linear model, in order to test your ability to use the autograder. You may use the remaining ten submissions any way you choose.
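A minimal sketch of that first required submission, the additive linear model; again the file name is only an example.

# first-submission sketch: additive linear model on all predictors
reg_lm = lm(y ~ ., data = reg_trn)

# predictions on the test data, written in the autograder's format
reg_pred = predict(reg_lm, newdata = reg_tst)
gen_agfile(reg_pred, file.name = "reg-lm")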
The data for this task is originally from the UCI Machine Learning Repository. It has been modified for our purposes. (Any attempt to use the original data will be a waste of time. Assume that various data augmentations have been performed on the provided data.) Take a look at the documentation to get an understanding of the feature variables. Their names in our data are slightly different, but the names are descriptive enough to match them.
Our data will store the spam status of an email (the response) in a variable named type. This variable will have two levels: spam and nonspam.

Additional documentation can be found in the spam data from the kernlab package. This documentation will use the same feature names as the data provided.
Your task is to actually create a spam filter. If we concerned ourselves only with overall accuracy, marking spam as non-spam and marking non-spam as spam would count as equal errors. But they clearly are not! Marking a non-spam email as spam could have terrible consequences.
spam_trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-trn.csv")
spam_tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/hw/hw08/spam-tst.csv")
To assess how well a spam filter is working, we will calculate the following score, where “ham” is an email that is not spam:
\[\begin{aligned} \text{score} &= 1 \times \{ \text{spam labeled spam} \} \\ &+ 2 \times \{ \text{ham labeled ham} \} \\ &- 25 \times \{ \text{ham labeled spam} \} \\ &- 1 \times \{ \text{spam labeled ham} \} \\ \end{aligned}\]

score = function(actual, predicted) {
1 * sum(predicted == "spam" & actual == "spam") +
-25 * sum(predicted == "spam" & actual == "nonspam") +
-1 * sum(predicted == "nonspam" & actual == "spam") +
2 * sum(predicted == "nonspam" & actual == "nonspam")
}
Positive weights are assigned to correct decisions. Negative weights are assigned to incorrect decisions. (Marking a non-spam email as spam is penalized much more heavily.) Your goal is to create a spam filter that achieves the highest possible score.
After training a spam filter, make predictions on the test data and submit them to the autograder to obtain a score. Your goal is to create a spam filter that obtains the highest possible score on the test data. You may use the training data any way you would like. You can submit to the autograder a total of 11 times. Your first submission should be a simple model, in order to test your ability to use the autograder. You may use the remaining ten submissions any way you choose.
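Because labeling ham as spam costs -25 while missing a spam email costs only -1, simply minimizing misclassification rate is unlikely to maximize the score; raising the probability cutoff for calling an email spam is one simple way to respect that asymmetry. Below is a minimal sketch of such a simple first submission, where the 0.95 cutoff is an illustrative choice rather than a tuned value, and the file name is only an example.

# illustrative sketch: a deliberately conservative logistic-regression spam filter
spam_trn$type = factor(spam_trn$type)  # levels: "nonspam", "spam"
spam_glm = glm(type ~ ., data = spam_trn, family = "binomial")

# predicted probability of "spam" (the second factor level)
spam_prob = predict(spam_glm, newdata = spam_tst, type = "response")

# only label an email spam when the model is very confident, since false alarms cost -25
spam_pred = ifelse(spam_prob > 0.95, "spam", "nonspam")

gen_agfile(spam_pred, file.name = "spam-filter")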
Create a very brief report using this template that shows only the steps necessary to train your best model for each task and create the predictions that you submitted to the autograders. Submit a .zip to Compass as you would for other homework assignments, following the usual filename conventions.
Like all other homework assignments, the total possible points is 30.
Since this homework is partially a competition, there should be no collaboration or discussion with other students. Your work should be entirely your own.
Don't forget to submit your .Rmd files! caret is your friend.