Data Background

For this project, we will use the classic MNIST handwritten digits dataset. MNIST stands for “Modified National Institute of Standards and Technology database”; it is a subset of a larger NIST dataset that has been modified to make it well suited to training machine learning algorithms. For details, see the following references:


Learning Task

Your goal is to create a digit recognizer. That is, you should train a model that takes as input the pixel information from an image of a handwritten digit and outputs a digit from 0 to 9.


Data

Rather than using the data directly from the MNIST website, you will use a version that we have pre-processed for easy loading into R. The following code will download, extract, and load three datasets into R. Any attempt to reverse-engineer solutions by using the original data will be considered an academic integrity violation. (Although, for fun, consider trying to load the original data yourself. Note that the instructor has code for this somewhere on the web. Also for fun, try to find this code.)

library(readr)
trn = read_csv("https://daviddalpiaz.github.io/stat432sp18/projects/mnist-train.csv.gz")
val = read_csv("https://daviddalpiaz.github.io/stat432sp18/projects/mnist-validation.csv.gz")
tst = read_csv("https://daviddalpiaz.github.io/stat432sp18/projects/mnist-test.csv.gz")

Together, these datasets are a subset of the MNIST data, and they are split differently than the original MNIST data.

There are 784 predictors in each dataset. Each predictor corresponds to grayscale information for one pixel of the 28x28 images in the datasets. The first predictor corresponds to the top left pixel, while the 784th predictor corresponds to the bottom right pixel. The value of each predictor measures the intensity of that pixel. A value of 0 corresponds to white, while 255 codes for solid black. Values between 0 and 255 are increasingly dark shades of gray.
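To make the pixel layout concrete, the sketch below maps a predictor index to its position in the 28x28 grid. It assumes the pixels are stored row by row (left to right, then top to bottom). That ordering agrees with the first and last predictors described above and with how the show_digit() helper draws the images, but it is not stated explicitly. The name pixel_position() is hypothetical and used only for illustration.

```r
# Map a predictor index (1 to 784) to its (row, column) position in the image.
# Assumption: pixels are stored row by row, starting from the top left.
pixel_position = function(k, width = 28) {
  stopifnot(all(k >= 1), all(k <= width^2))
  data.frame(
    index = k,
    row   = (k - 1) %/% width + 1,  # 1 = top row
    col   = (k - 1) %% width + 1    # 1 = leftmost column
  )
}

pixel_position(c(1, 28, 29, 300, 784))
```

Under this assumption, the 300th predictor would sit in row 11, column 20, fairly close to the center of the image, where pen strokes are common.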

summary(trn$X300)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   16.00   94.67  231.00  255.00

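# helper function for visualization of observations
# arr784: a single row of a dataset; column 785 (not a pixel) is dropped before plotting
show_digit = function(arr784, col = gray(12:1 / 12), ...) {
  # reshape the 784 pixel values to 28x28, then reverse the columns so that
  # image() draws the digit upright
  image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, ...)
}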

The show_digit() function allows us to recreate the images from the given data. Here are a few of the training images:

par(mfrow = c(1, 3))
show_digit(trn[1, ])
show_digit(trn[2, ])
show_digit(trn[3, ])


Autograder Submission

The autograder can be found at: https://rstudio.stat.illinois.edu/shiny/stat432ag/

You may find the gen_agfile() function from the autograde package (available from the coatless GitHub account) useful.

devtools::install_github("coatless/autograde")
?autograde::gen_agfile

After training a digit recognizer, make predictions on the validation and test data, then submit these to the autograder to obtain a score.

accuracy = function(actual, predicted) {
  mean(actual == predicted)
}
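As a quick check of how accuracy() behaves, the example below applies it to two small made-up label vectors (not project data). The cross-tabulation with table() is an optional extra that can help show which digits are being confused with each other.

```r
accuracy = function(actual, predicted) {
  mean(actual == predicted)
}

# made-up labels, for illustration only
actual    = c(3, 7, 7, 1, 0, 9, 4, 4)
predicted = c(3, 7, 1, 1, 0, 4, 4, 4)

# proportion of predictions that match the true labels
accuracy(actual, predicted)
# 0.75 (6 of the 8 predictions are correct)

# rows are predicted labels, columns are true labels
table(predicted = predicted, actual = actual)
```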

Your goal is to create a digit recognizer that obtains the highest possible accuracy on the test data. Public leaderboards will be maintained for the validation data and the test data. (Only scores will be public. Names will be anonymized to other students.)

In some sense, both the validation and test sets are test data. However, the leaderboard will give you feedback about how well you are doing on the validation dataset, but not on the test dataset. This is to prevent you from using feedback about the validation set to overfit to that dataset.
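One way to rely less on leaderboard feedback is to estimate accuracy locally by holding out part of the training data. The sketch below shows only the row-splitting step, using a small simulated data frame as a stand-in for trn. The names sim_trn, est_idx, sim_est, and sim_hold, the label column y, and the 80/20 split are all assumptions for illustration, not course requirements.

```r
set.seed(432)

# simulated stand-in for the training data: 100 rows, 4 "pixel" columns, 1 label
sim_trn = data.frame(
  X1 = sample(0:255, 100, replace = TRUE),
  X2 = sample(0:255, 100, replace = TRUE),
  X3 = sample(0:255, 100, replace = TRUE),
  X4 = sample(0:255, 100, replace = TRUE),
  y  = sample(0:9, 100, replace = TRUE)
)

# randomly assign 80% of the rows to model fitting, the rest to a local hold-out set
est_idx  = sample(nrow(sim_trn), size = 0.8 * nrow(sim_trn))
sim_est  = sim_trn[est_idx, ]
sim_hold = sim_trn[-est_idx, ]

c(estimation = nrow(sim_est), holdout = nrow(sim_hold))
```

A model fit on the estimation rows can then be scored on the hold-out rows with accuracy() before anything is submitted to the autograder.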


Report Submission

Create a very brief report using rmarkdown that shows:

Only show the steps needed for your final digit recognizer. Submit this report to Compass. Essentially, you should provide enough information for the course staff to recreate the predictions you submitted to the autograder.


Grading


Late Policy

No late submission to the autograder will be accepted. Because of this, you should submit a practice submission as soon as possible. The deadline for submitting to the autograder is: Friday, April 13, 11:59 PM.


Academic Integrity

If you think you are operating in a grey area, you probably are.


Hints and Suggestions