Timeline

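There are three assignments for the project. Their due dates are:

  • Analysis Proposal: Friday, April 13, 11:59 PM
  • Final Report: Wednesday, May 9, 10:00 PM
  • Peer Evaluation: Wednesday, May 9, 10:00 PM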

Project Goal

The overall goal of the project is to work in groups to apply (mostly supervised) statistical learning methods to a dataset of your choice.

Data Selection

You may use any dataset of your choice, so long as it contains at least 500 observations and was not previously used in class. This dataset might be relevant to research outside of this course, another field, or some other interest of yours. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from another endeavor of yours, such as a research project, be sure to obtain permission from the controlling authority first.

The two most common sources of data used by students:

Analysis

The final product of this project will be a written report of your analysis. It should contain the following sections:

Details of what is expected in each section will be discussed in the template document that will be provided. (We will again follow the IMRD structure that was used for the individual project.)

Your goal is not to use as many methods as possible. Your task is to use appropriate methods to find a good model that can perform the desired statistical learning task. Most importantly, you should motivate and discuss why that task is being completed, and how well it is being completed.

Task Specifics

Analysis Proposal

A proposal of your intended project is due by Friday, April 13, 11:59 PM. It should be submitted online via Compass by a single group member.

After we review the proposal, it will be given one of two statuses:

  • Approved - Your group may proceed with your plans for the data and project.
  • Pending - We will provide suggestions, concerns, or needed information that must be addressed before the proposal will be approved.

A proposal of your intended project should include the following:

  • The names and NetIDs of the students who will be contributing to the group project.
  • A tentative title for the project.
  • Description of the dataset that is sufficient for a reader to understand your motivation for using the dataset.
  • Background information on the dataset, including specific citation of its source.
  • The statistical learning task (regression or classification) that will be performed on the dataset to accomplish the goal of the analysis. In your final report you will need to provide a motivation for the analysis in general, so keep that in mind.
  • Evidence that the data can be loaded into R.
    • Load the data, and print the first few values of the response variable as evidence.
    • Create at least one plot that helps the reader understand the data.
  • Evidence that the data can be modeled in R.
    • Use either lm() (regression) or glm() (classification) then call predict() on the results and return the first few values. You may need to perform some data cleaning before this step.
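
The two pieces of evidence listed above might look something like the following minimal sketch. It is only an illustration, not a required format: the simulated data, the temporary CSV file, and the names `sim_data`, `project_data`, `x1`, `x2`, `group`, `y`, and `y_high` are all hypothetical stand-ins for your own dataset and variables.

```r
# Stand-in data: simulate 500 observations and write them to a temporary CSV.
# In a real proposal, replace this with the path or URL of your own dataset.
set.seed(42)
n <- 500
sim_data <- data.frame(
  x1    = rnorm(n),
  x2    = runif(n),
  group = sample(c("a", "b", "c"), n, replace = TRUE)
)
sim_data$y <- 1 + 2 * sim_data$x1 - 3 * sim_data$x2 + rnorm(n)
csv_path <- tempfile(fileext = ".csv")
write.csv(sim_data, csv_path, row.names = FALSE)

# Evidence that the data can be loaded into R.
project_data <- read.csv(csv_path, stringsAsFactors = TRUE)
nrow(project_data)    # number of observations (the project requires at least 500)
head(project_data$y)  # first few values of the response variable

# At least one plot that helps the reader understand the data.
plot(y ~ x1, data = project_data, pch = 20, col = "dodgerblue",
     xlab = "x1 (hypothetical predictor)", ylab = "y (response)",
     main = "Response versus x1")

# Evidence that the data can be modeled in R (regression).
fit_lm <- lm(y ~ ., data = project_data)
head(predict(fit_lm))

# For a classification task, glm() with a binomial family plays the same role.
# Here a two-level response is derived from y purely for illustration.
project_data$y_high <- factor(
  ifelse(project_data$y > median(project_data$y), "high", "low")
)
fit_glm <- glm(y_high ~ x1 + x2 + group, data = project_data, family = binomial)
head(predict(fit_glm, type = "response"))  # estimated probabilities
```

Only one of the two models is needed in a proposal, depending on whether your task is regression or classification. With real data, the cleaning step mentioned above (handling missing values, fixing variable types) would come between loading and modeling.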

As a group, you will submit a .zip file, as you would for homework, that contains a .html and .Rmd file, as well as the data if it cannot be linked to online. If your data is too large to submit and cannot be linked, please let us know and we will find an alternative. There is no required format or template, but you should follow the reasonable rmarkdown practices discussed in class.

Final Report

The final report of your analysis is due by Wednesday, May 9, 10:00 PM. It should be submitted online via Compass by a single group member.

As a group, you will submit a .zip file as you would for homework which contains a .html and .Rmd file, as well as the data if it cannot be linked to online. Be sure to follow the suggested formatting in the template document.

Peer Evaluation

A peer evaluation of the group members is due by Wednesday, May 9, 10:00 PM. It should be submitted online via Compass by each group member.

Individually, you will write a short review of each of your group members, including yourself. For each member, comment on:

  • Which parts of the project were worked on by that member
  • How well that member communicated with the team (Provide a score from 0 to 100 as well as written comments.)
  • How well that member understood the course concepts (Provide a score from 0 to 100 as well as written comments.)
  • Proportion of the project completed by that member (Provide a proportion from 0% to 100% as well as written comments.)

Individually, you will submit a single file (.pdf preferred) that contains your reviews.

Project Grading

Proposal

  • Percent of Final Grade: 5%
  • Points Possible: 20

You will be graded on formatting, clarity, appropriateness of data, etc.

Final Report

  • Percent of Final Grade: 15%
  • Points Possible: 100

  • Introduction
    • [5] Analysis is clearly motivated.
      • The why of the analysis is made clear to the reader.
    • [5] Analysis has a clear goal.
      • Reader should understand why statistical models will be useful.
    • [5] Data is clearly explained to the reader.
      • Reader should understand what the data is, and how it can be used to achieve the goal.
      • Only the most relevant information should be placed in the introduction.
      • A full data dictionary should be included in an appendix.
    • [5] Exploratory data analysis
      • Only the most relevant EDA should be placed in the introduction.
      • Additional EDA may be placed in the appendix.
  • Methods
    • [5] Appropriate methods from class are used.
    • [10] Methods are used correctly.
  • Results
    • [5] Results are clearly organized either visually or as a table.
    • [5] Correct and useful metrics are used.
  • Discussion
    • [5] Correct conclusions are drawn from the results.
    • [5] How the results relate to the goal is discussed.
    • [10] Results are connected to the motivation of the analysis.
  • Abstract
    • [5] Abstract appropriately summarizes the analysis performed.
  • Code
    • [10] R is used appropriately.
      • Does your code perform the desired tasks?
      • Is your code readable?
      • Is your style consistent?
    • [10] rmarkdown is used appropriately.
      • Are you properly utilizing rmarkdown? (Headers, chunks, etc.)
      • Are warnings and messages suppressed when appropriate?
      • Is irrelevant code hidden? (Plots, tables, etc.)
  • General
    • [5] Narrative text is well written.
      • Text is free of spelling errors.
      • Text is written with clarity. (You will not be held to a strict grammar standard.)
      • Text is written in a manner such that a reader does not already need to be familiar with the data. (Minimal familiarity with statistical learning is assumed.)
    • [5] Directions are followed.
      • Report is submitted using correct filetypes and filenames.
      • Report has a title.
      • Name and NetID are included in the report.
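
For the rmarkdown items under Code in the rubric above (suppressing warnings and messages, hiding irrelevant code), one common approach is to set default chunk options once in a setup chunk near the top of the .Rmd file. The sketch below uses knitr's `opts_chunk$set()`; the particular values are one reasonable starting point, not a requirement of this project, and the `requireNamespace()` guard is only there so the snippet can also be run outside of a knitted document.

```r
# Intended for a setup chunk at the top of the .Rmd file.
# The option values are an assumption about sensible defaults, not a project rule.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo      = FALSE,    # hide code by default (e.g., code that only makes plots or tables)
    warning   = FALSE,    # keep warnings out of the rendered report
    message   = FALSE,    # keep package startup messages out of the report
    fig.align = "center"  # center figures
  )
}
```

Individual chunks can override these defaults, for example by setting `echo = TRUE` on a chunk whose code is relevant to the reader.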

Peer Evaluation

  • Percent of Final Grade: 2.5% (Evaluation of Peers) + 2.5% (Evaluations from Peers)
  • Points Possible: 10 (Evaluation of Peers) + 10 (Evaluations from Peers)

It is more important that you honestly review your team than give each member good marks. You will be graded on how well you review your group members. If you simply give each of your team members good marks, you will likely receive far fewer points for the portion of the grade dedicated to evaluating your peers.

Formatting and clarity will also account for a portion of the grade for your evaluations.

The instructor reserves the right to further reduce a student's overall project grade if their team reports that they did not attempt to make a significant contribution to the project.

FAQ

This section will likely be updated as we progress through the remainder of the semester.

How long should the report be?

Isn’t this a lot to do at the end of the course while we have other things going on in the course? And it’s due during finals week?