For this project, you will work in groups to apply what you have learned by analyzing a dataset of your choice.

Timeline

There are four assignments for the project. Their due dates are:

Project Goal

The overall goal of the project is to apply supervised statistical learning methods to a dataset of your choice to answer a question.

Data Selection

You may use any dataset of your choice, so long as it contains at minimum 500 observations. This dataset might be relevant to research outside of this course, another field, or some other interest of yours. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from another endeavor of yours, such as a research project, be sure to gain permission from the controlling authority first.

The two most common sources of data used by students:

Analysis

The final product of this project will be a written report of your analysis. It should contain the following sections:

Details of what is expected in each section will be discussed in the template document that will be provided. (This is something new we are trying this semester. Details in class as well.)

Your goal is not to use as many methods as possible. Your task is to use appropriate methods to find a good model that can perform the desired statistical learning task. Most importantly, you should motivate and discuss why that task is being completed, and how well it is being completed.

Task Specifics

Group Choice

For this project, you must work in groups of at least three students and at most four students. A portion of your grade will come from your ability to work in a group setting. You may pick your group members if you like.

If you choose your group, a roster of your group members is due by Friday, November 10, 11:59 PM. Send a single email with the list of group members to David (dalpiaz2@illinois.edu). Include all participants’ University emails in the CC line as a means to verify that all agree to the group. Also include full names and NetIDs in the body of the email.

If you would like to be assigned to a group, send an email to David (dalpiaz2@illinois.edu) and simply state in the body of the email that you would like to be assigned to a group. You must do so by Friday, November 10, 11:59 PM, but if you do so earlier, you may be assigned to a group earlier.

Groups of “one” may be considered if, and only if, you are willing to sacrifice the 2.5% of your total course grade that comes from peer evaluation. (You will still self evaluate.) The ability to work in a group is an important skill. If you would like to be a group of one, send an email to David (dalpiaz2@illinois.edu) and simply state in the body of the email that you would like to work as a group of one. You must do so by Friday, November 10, 11:59 PM.

Analysis Proposal

A proposal of your intended project is due by Friday, December 1, 11:59 PM. It should be submitted online via Compass by a single group member.

After review of the proposal, it will be evaluated in one of two ways:

  • Approved - Your group may proceed with your plans for the data and project.
  • Pending - We will provide suggestions, concerns, or needed information that must be addressed before the proposal will be approved.

A proposal of your intended project should include the following:

  • The names and NetIDs of the students who will be contributing to the group project.
  • A tentative title for the project.
  • Description of the dataset. You do not necessarily have to list all the variables, but at least mention those of greatest importance.
  • Background information on the dataset, including specific citation of its source.
  • The statistical learning task that the dataset will be used to accomplish. (Regression or classification.)
  • A reason the statistical learning task is being applied to this dataset. (Other than for completion of this project.) That is, provide motivation. Why this data?
  • Evidence that the data can be loaded into R. Load the data, and print the first few values of the response variable as evidence.
  • Evidence that the data can be modeled in R. Fit either lm() (regression) or glm() (classification), then call predict() on the result and print the first few predicted values. You may need to perform some data cleaning before this step.
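The last two items above can be sketched roughly as follows. This is only an illustration under stated assumptions: the built-in mtcars data stands in for your own dataset (it has only 32 rows, so it would not meet the 500-observation minimum), and the response and predictor choices are arbitrary. Your own proposal would load your data instead, for example with read.csv().

```r
# Assumption: mtcars is a stand-in for your own data.
# With your own file you might instead use something like:
# project_data <- read.csv("your-data.csv")  # hypothetical file name
project_data <- mtcars

# Evidence the data loads: first few values of the response variable
head(project_data$mpg)

# Regression task: fit lm(), then print the first few predictions
fit_reg <- lm(mpg ~ wt + hp, data = project_data)
reg_preds <- predict(fit_reg)
head(reg_preds)

# Classification task: fit glm() with a binary (0/1) response,
# then print the first few predicted probabilities
fit_cls <- glm(am ~ wt, data = project_data, family = binomial)
cls_preds <- predict(fit_cls, type = "response")
head(cls_preds)
```

Only one of the two fits is needed, depending on whether your task is regression or classification. Note that predict() on a glm() fit returns values on the link scale by default, so type = "response" is used here to obtain probabilities.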

As a group, you will submit a .zip file as you would for homework that contains an .html and .Rmd file, as well as the data if it cannot be linked online. If your data is too large to submit, and cannot be linked, please let us know and we will find an alternative.

Final Report

The final report of your analysis is due by Thursday, December 21, 10:00 PM. It should be submitted online via Compass by a single group member.

As a group, you will submit a .zip file as you would for homework that contains a .pdf and .Rmd file, as well as the data if it cannot be linked online. Be sure to follow the suggested formatting in the template document.

Peer Evaluation

A peer evaluation of the group members is due by Thursday, December 21, 10:00 PM. It should be submitted online via Compass by each group member.

Individually, you will write a short review of each of your group members, including yourself. For each member, comment on:

  • Which parts of the project were worked on by that member
  • How well that member communicated with the team (Provide a score from 0 to 100 as well as written comments.)
  • How well that member understood the course concepts (Provide a score from 0 to 100 as well as written comments.)
  • Proportion of the project completed by that member (Provide a proportion from 0% to 100% as well as written comments.)

Individually, you will submit a single file (.pdf preferred) that contains your reviews.

Project Grading

Group Choice

  • Percent of Final Grade: 1%
  • Points Possible: 1

Grading for the group choice is all-or-nothing based on making a group selection before the deadline.

Proposal

  • Percent of Final Grade: 4%
  • Points Possible: 20

You will be graded on formatting, motivation, appropriateness of data, etc.

Final Report

  • Percent of Final Grade: 15%
  • Points Possible: 100

A breakdown of the points for the final report:

  • Use of Statistical Learning Methodology: 30
    • Have you used the appropriate methods for your dataset? Have you applied them correctly?
  • Interpretation of Statistical Learning Methodology: 20
    • Do you arrive at the correct conclusions from the analyses you perform?
  • Discussion: 20
    • Do you sufficiently motivate your analysis?
    • Do you discuss your analysis results in the context of the data and the task at hand?
  • Use of R: 10
    • Does your code perform the desired task?
    • Is your code readable?
    • Is your style consistent?
  • Use of rmarkdown: 10
    • Are you properly utilizing rmarkdown?
    • Are warnings and messages suppressed?
    • Is irrelevant code hidden? (For example, code that only produces plots, tables, etc.)
  • General Organization, Neatness, Readability: 10
    • Is your report easy to read?
    • Is it written in a manner such that a reader does not already need to be familiar with the data?
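For the rmarkdown items in the breakdown above, one possible approach is to set default chunk options in a setup chunk near the top of the .Rmd file. This is a sketch of one reasonable configuration, not a required one; hiding all code by default is an assumption about what suits your report, and the template document may suggest something different.

```r
# Place inside a setup chunk at the top of the .Rmd file.
library(knitr)

opts_chunk$set(
  echo    = FALSE,  # hide code by default
  warning = FALSE,  # suppress warnings in the rendered report
  message = FALSE   # suppress messages (e.g. package startup notes)
)
```

Individual chunks whose code is relevant to the reader can then override the default by setting echo = TRUE in their own chunk header.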

Peer Evaluation

  • Percent of Final Grade: 2.5% (Evaluation of Peers) + 2.5% (Evaluations from Peers)
  • Points Possible: 10 (Evaluation of Peers) + 10 (Evaluations from Peers)

It is more important that you honestly review your team than give each member good remarks. You will be graded on how well you review your group members. If you simply give each of your team members equally good marks, you will likely receive fewer points for the portion of the grade dedicated to evaluating your peers.

FAQ

This section will likely be updated as we progress through the remainder of the semester.

How long should the report be?

Isn’t this a lot to do at the end of the course while we have other things going on in the course? And it’s due during finals week?