Data Background

The data for this project is originally from the UCI Machine Learning Repository. It has been modified for our purposes. (Any attempt to use the original data will be a waste of time. Assume that various data augmentations have been performed to the provided data.) Take a look at the documentation to get an understanding of the feature variables. Their names in our data are slightly different, but the names are descriptive enough to match them.

Our data stores the spam status of an email (the response) in a variable named type. This is a factor variable with two levels: spam and nonspam.

Additional documentation can be found in the help page for the spam data from the kernlab package. That documentation uses the same feature names as the data provided.


Learning Task

Your task is to create a spam filter. If we only concern ourselves with overall accuracy, then marking spam as non-spam and marking non-spam as spam are treated as equal errors. But they clearly are not! Marking a non-spam email as spam could have terrible consequences.

To assess how well a spam filter is working, we will calculate the following score, where “ham” is an email that is not spam:

\[\begin{aligned} \text{score} &= \phantom{0}1 \times \{ \text{spam labeled spam} \} \\ &+ \phantom{0}1 \times \{ \text{ham labeled ham} \} \\ &- 30 \times \{ \text{ham labeled spam} \} \\ &- \phantom{0}1 \times \{ \text{spam labeled ham} \} \end{aligned}\]

Positive weights are assigned to correct decisions. Negative weights are assigned to incorrect decisions. (Marking a non-spam email as spam is penalized much more heavily than marking spam as non-spam.) Your goal is to create a spam filter that achieves the highest possible score.
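As a rough illustration, the score could be computed locally along the following lines. This is a minimal sketch, not the autograder's implementation: the function name spam_score is hypothetical, and it assumes each braced term in the formula is a raw count of emails and that labels use the levels spam and nonspam. The autograder may scale or report the score differently.

```r
# Hypothetical helper: compute the project score from actual and predicted labels.
# Assumes each braced term in the score formula is a count of emails.
spam_score = function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  actual    = as.character(actual)
  predicted = as.character(predicted)

  spam_as_spam = sum(actual == "spam"    & predicted == "spam")
  ham_as_ham   = sum(actual == "nonspam" & predicted == "nonspam")
  ham_as_spam  = sum(actual == "nonspam" & predicted == "spam")
  spam_as_ham  = sum(actual == "spam"    & predicted == "nonspam")

  1 * spam_as_spam + 1 * ham_as_ham - 30 * ham_as_spam - 1 * spam_as_ham
}

# Toy example: five emails, one ham wrongly flagged as spam
lvls      = c("nonspam", "spam")
actual    = factor(c("spam", "spam", "nonspam", "nonspam", "nonspam"), levels = lvls)
predicted = factor(c("spam", "nonspam", "nonspam", "spam", "nonspam"), levels = lvls)

toy_score = spam_score(actual, predicted)
toy_score
# 1 (spam as spam) + 2 (ham as ham) - 30 (ham as spam) - 1 (spam as ham) = -28
```

A single ham email labeled as spam wipes out the credit from thirty correct decisions, which is why a filter with high accuracy can still score poorly.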


Datasets


Autograder Submission

The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat430ag/

You may find the gen_agfile function from the autograde package (installed from the coatless GitHub account) useful.

devtools::install_github("coatless/autograde")
?autograde::gen_agfile

After training a spam filter, make predictions on the test data, and submit these to the autograder to obtain a score. Your goal is to create a spam filter that obtains the highest possible score on the test data. You can use the training data any way you would like. You can submit to the autograder 3 times a day. (Start early!) Consider submitting a test submission with a simple model ASAP.
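The train-then-predict workflow might look something like the sketch below. Everything here is an assumption made for illustration: the data is simulated (with hypothetical features x1 and x2) rather than the provided data, logistic regression is just one simple model, and the cutoff of 0.9 is an arbitrary choice meant to show one way of being cautious about labeling email as spam. Creating the submission file is left out; see ?autograde::gen_agfile for that step.

```r
set.seed(42)

# Simulated stand-in for the provided data (feature names x1, x2 are hypothetical)
n   = 200
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$type = factor(ifelse(sim$x1 + rnorm(n) > 0, "spam", "nonspam"),
                  levels = c("nonspam", "spam"))

trn = sim[1:150, ]    # plays the role of the training data
tst = sim[151:200, ]  # plays the role of the test data

# Simple baseline model: logistic regression on all features.
# With a factor response, glm models the probability of the second level ("spam").
fit = glm(type ~ ., data = trn, family = binomial)

# Estimated probability that each test email is spam
prob_spam = predict(fit, newdata = tst, type = "response")

# Label as spam only when the estimated probability is high (cutoff is an assumption)
cutoff = 0.9
pred = factor(ifelse(prob_spam > cutoff, "spam", "nonspam"),
              levels = c("nonspam", "spam"))

table(predicted = pred)
```

Raising the cutoff trades missed spam for fewer ham emails flagged as spam. A reasonable cutoff would be chosen using held-out training data, not the test data.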


Report

Create a very brief report using rmarkdown that shows how to fit your spam filter to the provided data and generate predictions. Only show the steps needed for your final spam filter. Submit this report to Compass.


Grading

Do not become disheartened by your place on the leaderboard; scores will only be semi-competitive.


Late Policy


Academic Integrity


Hints and Suggestions