Data Background
Learning Task
Datasets
Autograder Submission
Report
Grading
Late Policy
Academic Integrity
Hints and Suggestions

Data Background

The data for this project is originally from the UCI Machine Learning Repository and has been modified for our purposes. (Any attempt to use the original data will be a waste of time. Assume that various data augmentations have been applied to the provided data.) Take a look at the documentation to get an understanding of the feature variables. Their names in our data are slightly different, but they are descriptive enough to match to the documentation.

UC Irvine Machine Learning Repository

Our data will store the spam status of an email (the response) in a variable named type. This will be a factor variable with two levels: spam and nonspam.

Additional documentation can be found in the spam data from the kernlab package. This documentation will use the same feature names as the data provided.
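As a quick way to look at that documentation and at the shape of the response, the following sketch loads the spam data that ships with kernlab. It assumes the kernlab package is installed, and it inspects kernlab's copy only, not the modified project data.

```r
# Sketch only: inspects the kernlab copy of the data, not the provided project data.
# Assumes kernlab is installed; the block does nothing otherwise.
if (requireNamespace("kernlab", quietly = TRUE)) {
  data(spam, package = "kernlab")

  # The response is a factor named `type`
  print(class(spam$type))
  print(levels(spam$type))

  # How many emails fall in each class
  print(table(spam$type))
}

# The help page describing the features can be opened with:
# ?kernlab::spam
```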

Learning Task

Your task is to create a spam filter. Concerning ourselves only with overall accuracy would mean that marking spam as non-spam and marking non-spam as spam are equal errors. But they clearly are not! Marking a non-spam email as spam could have terrible consequences.

To assess how well a spam filter is working, we will calculate the following score, where “ham” is an email that is not spam:

\[\begin{aligned} \text{score} &= 1 \times \{ \text{spam labeled spam} \}\\ &+ 1 \times \{ \text{ham labeled ham} \} \\ &- 30 \times \{ \text{ham labeled spam} \} \\ &- 1 \times \{ \text{spam labeled ham} \} \end{aligned}\]

Positive weights are assigned to correct decisions. Negative weights are assigned to incorrect decisions, with marking a non-spam email as spam penalized far more heavily than the reverse. Your goal is to create a spam filter that achieves the highest possible score.
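To make the scoring rule concrete, here is a minimal sketch of it as an R function. The function name spam_score is hypothetical, and the sketch assumes each braced term in the formula is a count of emails; the autograder may scale the result differently.

```r
# Hypothetical helper: score predictions under the weights given above.
# Assumes each term in the formula is a count of emails.
# `predicted` and `actual` take the values "spam" or "nonspam".
spam_score <- function(predicted, actual) {
  predicted <- as.character(predicted)
  actual    <- as.character(actual)
  stopifnot(length(predicted) == length(actual))

  spam_as_spam <- sum(predicted == "spam"    & actual == "spam")
  ham_as_ham   <- sum(predicted == "nonspam" & actual == "nonspam")
  ham_as_spam  <- sum(predicted == "spam"    & actual == "nonspam")
  spam_as_ham  <- sum(predicted == "nonspam" & actual == "spam")

  1 * spam_as_spam + 1 * ham_as_ham - 30 * ham_as_spam - 1 * spam_as_ham
}

# Tiny made-up example: one ham email wrongly marked as spam dominates the score.
actual    <- factor(c("spam", "spam",    "nonspam", "nonspam", "nonspam"))
predicted <- factor(c("spam", "nonspam", "nonspam", "nonspam", "spam"))
spam_score(predicted, actual)  # 1 + 2 - 30 - 1 = -28
```

A function like this can be applied to held-out training data to compare candidate filters before spending an autograder submission.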

Datasets

Create your spam filter using the training data.
Make predictions to submit using the test data.

Autograder Submission

The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat430ag/

You may find the gen_agfile function from the autograde package (available from the coatless GitHub account) useful.

devtools::install_github("coatless/autograde")
?autograde::gen_agfile

After training a spam filter, make predictions on the test data, and submit these to the autograder to obtain a score. Your goal is to create a spam filter that obtains the highest possible score on the test data. You can use the training data any way you would like. You can submit to the autograder 3 times a day. (Start early!) Consider submitting a test submission with a simple model ASAP.
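One possible shape for a simple first model is sketched below: an additive logistic regression fit with glm(), with emails labeled spam only when the predicted probability is high. Everything here is an illustrative assumption: the data frame toy is synthetic stand-in data (not the project data), the feature names x1 and x2 are made up, and the cutoff of 31/33 is what the scoring weights imply only if the predicted probabilities are well calibrated.

```r
set.seed(430)

# Synthetic stand-in data (NOT the project data): two made-up numeric features.
n   <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
p   <- plogis(-0.5 + 2 * toy$x1 - 1 * toy$x2)
toy$type <- factor(ifelse(runif(n) < p, "spam", "nonspam"),
                   levels = c("nonspam", "spam"))

# Hold out some rows to play the role of test data
trn <- toy[1:300, ]
tst <- toy[301:400, ]

# Additive logistic regression; with a factor response, glm() models
# the probability of the second level, here "spam".
fit <- glm(type ~ ., data = trn, family = binomial)

# Predicted probability of spam for the held-out rows
prob_spam <- predict(fit, newdata = tst, type = "response")

# Labeling spam beats labeling ham in expected score when
# p - 30 * (1 - p) > -p + (1 - p), which simplifies to p > 31 / 33.
cutoff <- 31 / 33
pred <- factor(ifelse(prob_spam > cutoff, "spam", "nonspam"),
               levels = c("nonspam", "spam"))

# Confusion table on the held-out rows
table(predicted = pred, actual = tst$type)
```

With real data, the cutoff could instead be tuned on held-out training rows, since estimated probabilities are rarely perfectly calibrated.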

Report

Create a very brief report using rmarkdown that shows how to fit your spam filter to the provided data and generate predictions. Only show the steps needed for your final spam filter. Submit this report to Compass.

Grading

A score of 100% will be given to the student who obtains the highest score.
Students who obtain a score equal to the score obtained using an additive logistic regression will be given a passing grade.
Between these two extremes, scores will be given according to a distribution that is at least as good as Quiz I.
Failure to provide the requested report, or providing an unreasonable report, will result in deductions.

Do not become disheartened by your place on the leaderboard. Scores will only be semi-competitive.

Late Policy

No late submissions to the autograder will be accepted. Because of this, you should submit the suggested test submissions as soon as possible. The deadline for submitting to the autograder is: Saturday, December 9, 11:59 PM.

Academic Integrity

If you think you are operating in a grey area, you probably are.

Hints and Suggestions

caret is your friend.
Start early. Don’t wait until the last minute. Submit the test submissions as early as possible. Late submissions to the autograder will not be accepted.
You may find it helpful to set various seed values to make your work reproducible. You may do so if you wish.
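As a small illustration of the seed hint, the sketch below shows that resetting the seed reproduces the same random draw. The seed value 42 is arbitrary.

```r
# Setting the same seed before a random operation reproduces its result.
set.seed(42)
first <- sample(1:100, 5)

set.seed(42)
second <- sample(1:100, 5)

identical(first, second)  # TRUE
```

The same idea applies before any step that involves randomness, such as splitting data or resampling during model tuning.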

STAT 430: Graduate Student Project

Fall 2017, Dalpiaz

Due: Saturday, December 9, 11:59 PM