Data Background

For this project, we will use the classic MNIST handwritten digits dataset. MNIST stands for “Modified National Institute of Standards and Technology database” as MNIST is actually a subset of a larger NIST dataset. The MNIST dataset has been modified to make it well suited to training machine learning algorithms. For details, see the following references:

Learning Task

Your goal will be to create a digit recognizer. That is, you should train a model that takes as input the pixel information from an image of a handwritten digit, and outputs a digit from 0 to 9.

Data

Instead of using the data directly from the MNIST website, we have pre-processed the data for easy loading into R. The following code will download, extract, and load three datasets into R. Any attempt to reverse-engineer solutions by using the original data will be considered an academic integrity violation. (Although, for fun, consider trying to load the original data yourself. Note that the instructor has code for this somewhere on the web. Also for fun, try to find this code.)

library(readr)
trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/projects/mnist-train.csv.gz")
val = read_csv("https://daviddalpiaz.github.io/stat432sp18/projects/mnist-validation.csv.gz")
tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/projects/mnist-test.csv.gz")

Each of these datasets is a subset of the MNIST data, and together they form a different split than the original MNIST data.

The trn dataset contains the training images with their labels.
- The labels are contained in the y variable.
The val dataset is a validation set that does not contain labels.
- Although you do not have the labels, you will be able to check your progress on this dataset.
The tst dataset is a test set that does not contain labels.

There are 784 predictors in each dataset. Each predictor corresponds to grayscale information for one pixel of the 28x28 images in the datasets. The first predictor corresponds to the top left pixel, while the 784th predictor corresponds to the bottom right pixel. The value of each predictor measures the intensity of that pixel. A value of 0 corresponds to white, while 255 codes for solid black. Values between 0 and 255 are increasing intensities of gray.
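
As a small illustration of this layout, the sketch below maps a predictor index to a position in the 28x28 image. It assumes the pixels are stored row by row starting at the top left, which is consistent with how show_digit() reshapes the data but is not stated explicitly here. The helper name pixel_position() is hypothetical and not part of the provided code.

```r
# Map a predictor index (1..784) to its (row, col) position in the image,
# ASSUMING row-by-row storage that starts at the top left pixel.
pixel_position = function(idx, width = 28) {
  c(row = (idx - 1) %/% width + 1,
    col = (idx - 1) %% width + 1)
}

pixel_position(1)    # row 1,  col 1  (top left)
pixel_position(300)  # row 11, col 20
pixel_position(784)  # row 28, col 28 (bottom right)
```

Under this assumption, a predictor such as X300 sits in the middle rows of the image, where ink is more likely to appear than in the corners.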

summary(trn$X300)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   16.00   94.67  231.00  255.00

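# helper function for visualization of observations
# arr784: a single observation (one row of a dataset)
# col: palette running from white (low values) to black (high values)
show_digit = function(arr784, col = gray(12:1 / 12), ...) {
  # drop column 785 (presumably the label y in trn), reshape to 28x28,
  # and reverse the columns so that image() draws the digit upright
  image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, ...)
}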

The show_digit() function allows us to recreate the images from the given data. Here are a few of the training images:

par(mfrow = c(1, 3))
show_digit(trn[1, ])
show_digit(trn[2, ])
show_digit(trn[3, ])

Autograder Submission

The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat432ag/

You may find the gen_agfile() function from the autograde package (available on GitHub as coatless/autograde) useful.

devtools::install_github("coatless/autograde")
?autograde::gen_agfile

After training a digit recognizer, make predictions on the validation and test data, then submit these to the autograder to obtain a score.

accuracy = function(actual, predicted) {
  mean(actual == predicted)
}
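
Because the validation and test sets come without labels, one way to estimate accuracy locally is to hold out part of the training data. The sketch below shows the idea on toy values only. The 80/20 split, the seed, and the names est_idx and hold_idx are illustrative assumptions, and the toy labels are not real MNIST results.

```r
accuracy = function(actual, predicted) {
  mean(actual == predicted)
}

# split row indices into an estimation part and a held-out part
set.seed(42)
n = 1000                               # stand-in for nrow(trn)
est_idx  = sample(n, size = 0.8 * n)   # rows used to fit a model
hold_idx = setdiff(seq_len(n), est_idx) # rows used only for checking

# toy stand-ins for held-out labels and a model's predictions
actual    = c(3, 7, 7, 0, 9)
predicted = c(3, 7, 1, 0, 4)
accuracy(actual, predicted)  # 0.6, since 3 of 5 predictions match
```

With the real data, a model would be fit on trn[est_idx, ] and its predictions for trn[hold_idx, ] compared against trn$y[hold_idx].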

Your goal is to create a digit recognizer that obtains the highest possible accuracy on the test data. Public leaderboards will be maintained for the validation data and test data. (Only scores will be public. Names will be anonymized to other students.)

You can use the training data any way you would like.
You can submit to the validation autograder 3 times a day. Validation-sp18
You can submit to the test autograder only once. Test-sp18

In some sense both the validation and test sets are test data, but you will be given feedback via the leaderboard about how well you are doing on the validation dataset, while you will receive no such feedback on the test dataset. This is to prevent using the feedback about the validation set to overfit to that dataset.

Report Submission

Create a very brief report using rmarkdown that shows:

how to fit your digit recognizer to the provided data.
how you generated validation predictions for your best submission.
how you generated the test predictions that you submitted.

Only show the steps needed for your final digit recognizer. Submit this report to Compass. Essentially, you should provide enough information for the course staff to recreate the predictions you submitted to the autograder.

Grading

A score of 100% will be given to the student who obtains the highest accuracy on the test data.
Students who obtain an accuracy above 0.90 will be given a passing grade.
Between these two extremes, scores will be given according to a distribution that is at least as good as Quiz I.
Failure to provide the requested report, or providing an unreasonable report, will result in deductions.
Do not become disheartened by your place on the leaderboard. Scores will only be semi-competitive.

Late Policy

No late submission to the autograder will be accepted. Because of this, you should submit a practice submission as soon as possible. The deadline for submitting to the autograder is: Friday, April 13, 11:59 PM.

Academic Integrity

If you think you are operating in a grey area, you probably are.

Hints and Suggestions

Start early.
When starting early, your first task should not be to make a good submission to the autograder, but instead a valid submission.
The caret package and the train() function will be useful. However, be careful about using cross-validation or other resampling techniques, as they will cause a large increase in computation time.
Due to the size of the dataset, using all of the available data to train a model will require a lot of computation time. Consider searching for models on a subset of the available data.
You may find it helpful to set various seed values to make your work reproducible. You may do so if you wish.
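
The last two suggestions can be combined: draw a reproducible random subset of the training rows and search for models on that subset first. The sketch below is one possible way to do this, using a small synthetic data frame in place of trn. The helper draw_subset(), the seed values, and the subset size are all hypothetical choices, not part of the provided code.

```r
# synthetic stand-in for trn: a label y plus two fake predictors
set.seed(432)
fake_trn = data.frame(
  y  = sample(0:9, 500, replace = TRUE),
  X1 = runif(500),
  X2 = runif(500)
)

# draw n rows at random; fixing the seed makes the draw repeatable
draw_subset = function(data, n, seed) {
  set.seed(seed)
  data[sample(nrow(data), n), , drop = FALSE]
}

sub_a = draw_subset(fake_trn, n = 100, seed = 1)
sub_b = draw_subset(fake_trn, n = 100, seed = 1)
identical(sub_a, sub_b)  # TRUE: same seed, same rows
```

Once a promising model has been found on the subset, it can be refit on more (or all) of the training data before generating predictions.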

Graduate Project - MNIST Data

STAT 432 - Spring 2018 - Dalpiaz

Due: Friday, April 13, 11:59 PM