Goal

There are two main goals of the project:

• Apply statistical learning methods seen in class as part of an analysis.
• Practice writing a report.

Format

The final product of this project will be a written report of your analysis. The report will be written using rmarkdown and rendered into a .html file. The YAML header should use the following template:

---
title: 'A Unique and Interesting Title'
date: 'NetID'
abstract: 'This will be the abstract!'
output:
  html_document:
    theme: simplex
---

Be sure to include your name and NetID and to give the report an interesting title. (Using the above YAML header will take care of this.)

For the report format we will utilize the IMRD organization structure. See also this helpful IMRD cheat sheet from CMU.

Using the IMRD structure, your report must contain the following sections:

Abstract

Even though it is the first thing to appear in the report, the abstract should be the last thing that you write. Generally the abstract should serve as a summary of the entire report. Reading only the abstract, the reader should have a good idea about what to expect from the rest of the document. Abstracts can be extremely variable in length, but a good heuristic is to use a sentence for each of the main sections of the IMRD:

• Why are you doing this analysis? (Introduction)
• What did you do? (Methods)
• What did you find? (Results)
• What does it mean? Why does it matter? (Discussion)

To add an abstract to a report written in rmarkdown, see the use of abstract in the YAML example above.

Introduction

Since we are providing data, but not a scenario, you can create any reasonable scenario that you would like.

You do not need to provide a complete data dictionary in the introduction, but you should include one in the appendix. Often the data would be introduced in the Methods section, but here the data is very closely linked to the motivation of the analysis.

Consider performing some exploratory data analysis here, and include some of it in the report if you feel it helps present the data.

Methods

The methods section should discuss what you did. The methods that you are using are those learned in class. This section should contain the bulk of your “work.” This section will contain most of the R code that is used to generate the results. Your R code is not expected to be perfect idiomatic R, but it is expected to be understood by a reader without too much effort. The majority of your code should be suppressed from the final report, but consider displaying code that helps illustrate the analysis you performed, for example, training of models.
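As a minimal sketch of how code could be suppressed by default, the chunk defaults below hide code, messages, and warnings throughout the report. It assumes the knitr package is installed and that these lines sit in a setup chunk near the top of the .Rmd file; the specific defaults are a suggestion, not a requirement of the project.

```r
# Assumed to live in a setup chunk at the top of the .Rmd
# (for example, a chunk created with the option include = FALSE).
library(knitr)

# Hide code, messages, and warnings for every chunk by default.
opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

An individual chunk can then override the default with echo = TRUE in its own chunk options when its code helps illustrate the analysis, such as a chunk that trains a model.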

Consider adding subsections in this section. One potential set of subsections could be data and models. The data section would describe your data. How will it be used in performing your analysis? What if any preprocessing have you done to it? The models section would describe the modeling methods that you will consider, as well as strategies for comparison.

Your goal is not to use as many methods as possible. Your task is to use appropriate methods to find a good model that can perform the desired statistical learning task.
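One possible strategy for comparing models is sketched below: hold out a test set, fit each candidate on the training data only, and compare test RMSE. This is an illustration under stated assumptions, not a required workflow. The data is simulated so the block runs on its own, the column names (price, sqft_living, bedrooms) are hypothetical stand-ins, and linear regression is assumed only as an example of a method from class.

```r
set.seed(42)

# Simulated stand-in for the housing data; column names are hypothetical.
n <- 500
houses <- data.frame(
  sqft_living = runif(n, min = 500, max = 4000),
  bedrooms    = sample(1:6, size = n, replace = TRUE)
)
houses$price <- 50000 + 200 * houses$sqft_living +
  10000 * houses$bedrooms + rnorm(n, sd = 50000)

# Hold out 20% of the rows as a test set.
trn_idx    <- sample(nrow(houses), size = round(0.8 * nrow(houses)))
houses_trn <- houses[trn_idx, ]
houses_tst <- houses[-trn_idx, ]

# Root mean squared error between observed and predicted values.
calc_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Two candidate models, both fit using only the training data.
fit_small <- lm(price ~ sqft_living, data = houses_trn)
fit_large <- lm(price ~ sqft_living + bedrooms, data = houses_trn)

# Compare the candidates on the held-out test data.
rmse_results <- data.frame(
  model = c("sqft_living only", "sqft_living + bedrooms"),
  test_rmse = c(
    calc_rmse(houses_tst$price, predict(fit_small, newdata = houses_tst)),
    calc_rmse(houses_tst$price, predict(fit_large, newdata = houses_tst))
  )
)
rmse_results
```

Fitting on the training rows and evaluating on the held-out rows keeps the comparison honest, since neither model has seen the test data.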

Results

The results section should contain numerical or graphical summaries of your results. What are the results of applying your chosen methods? Consider reporting a “final” or “best” model you have chosen. There is not necessarily a single correct model, but certainly some methods and models are better than others in certain situations. The results section is about reporting your results.
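A results table could be rendered along the following lines, assuming the knitr package is installed. The model names are hypothetical and the metric values are deliberately left as NA placeholders, to be replaced with values computed in your own analysis.

```r
library(knitr)

# Placeholder results; replace the NA values with metrics from your own models.
model_results <- data.frame(
  model     = c("Model A", "Model B"),
  test_rmse = c(NA_real_, NA_real_)
)

# Render a formatted table with readable column names and a caption.
results_table <- kable(
  model_results,
  digits    = 0,
  col.names = c("Model", "Test RMSE"),
  caption   = "Test RMSE for each candidate model"
)
results_table
```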

Discussion

The discussion section should comment on your results and frame them in the context of the data. What do your results mean? Results are often just numbers; here you need to explain what they tell you about the analysis you are performing. The results section tells the reader what the results are. The discussion section tells the reader why those results matter.

Appendix

The appendix section should contain any additional code, tables, and graphics that are not explicitly referenced in the narrative of the report. The appendix must contain a data dictionary.

Data

The data for this project originates from Kaggle. We will use the House Sales in King County data.

The data has been re-hosted on the course website. Do not supply the data when you submit your project. Instead, read the file into your .Rmd document by linking to the provided data.

read.csv("https://daviddalpiaz.github.io/stat432sp18/projects/kc_house_data.csv")
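After reading the data, a quick structural check can confirm that it loaded as expected. The helper below is a sketch; check_columns is a hypothetical name, not part of the assignment, and it works on any data frame.

```r
# Summarize each column of a data frame: its class and its count of missing values.
# `check_columns` is a hypothetical helper name used only for illustration.
check_columns <- function(df) {
  data.frame(
    column    = names(df),
    class     = unname(vapply(df, function(x) class(x)[1], character(1))),
    n_missing = unname(vapply(df, function(x) sum(is.na(x)), integer(1))),
    row.names = NULL
  )
}
```

For the project data, the helper might be applied to the data frame returned by the read.csv call above, which gives a starting point for the data dictionary in the appendix.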

Some background:

Rubric

The 100 points for the project will be assigned as follows:

• Introduction
  • [5] Analysis is clearly motivated.
    • The why of the analysis is made clear to the reader.
  • [5] Analysis has a clear goal.
    • Reader should understand why statistical models will be useful.
  • [5] Data is clearly explained to the reader.
    • Reader should understand what the data is, and how it can be used to achieve the goal.
    • Only the most relevant information should be placed in the introduction.
    • A full data dictionary should be included in an appendix.
  • [5] Exploratory data analysis
    • Only the most relevant EDA should be placed in the introduction.
    • Additional EDA may be placed in the appendix.
• Methods
  • [5] Appropriate methods from class are used.
  • [10] Methods are used correctly.
• Results
  • [5] Results are clearly organized either visually or as a table.
  • [5] Correct and useful metrics are used.
• Discussion
  • [5] Correct conclusions are drawn from the results.
  • [5] How the results relate to the goal is discussed.
  • [10] Results are connected to the motivation of the analysis.
• Abstract
  • [5] Abstract appropriately summarizes the analysis performed.
• Code
  • [10] R is used appropriately.
  • [10] rmarkdown is used appropriately.
    • Are you properly utilizing rmarkdown? (Headers, chunks, etc.)
    • Are warnings and messages suppressed when appropriate?
    • Is irrelevant code hidden? (Plots, tables, etc.)
• General
  • [5] Narrative text is well written.
    • Text is free of spelling errors.
    • Text is written with clarity. (You will not be held to a strict grammar standard.)
    • Text is written so that a reader does not need prior familiarity with the data. (Minimal familiarity with statistical learning is assumed.)
  • [5] Directions are followed.
    • Report is submitted using correct filetypes and filenames.
    • Report has a title.
    • Name and NetID are included in the report.

Submission

• Your analysis is due by Sunday, April 1, 11:59 PM.
• No late submissions will be accepted.
• If you submit early, before Friday, March 30, 11:59 PM, you will receive 5 bonus (buffer) points. (These points cannot be used to obtain a score above 100.)
• Your report must be submitted online via Compass.
• You will submit a single .zip file, as you would for homework, which contains a .html file and a .Rmd file.
• Any external images used should be placed in a folder named img.
• Do not submit the data. (It should be linked to in your .Rmd file from the course website.)
• These files should be named as follows, substituting your NetID for netID:
• ind-proj-netID.zip
• ind-proj-netID.Rmd
• ind-proj-netID.html

FAQ

How long should the report be?

• There is no explicit minimum. There is an implicit maximum. On one hand, you need to provide results and evidence to support your decisions, and you need to be thorough and diligent as you walk through the steps that lead to your findings. On the other hand, a well-crafted data analysis is brief and concise. If you have a point to make, get to it. If you find yourself writing things simply for the sake of padding the word count, you’re writing the wrong things. This project is intentionally open-ended to see how you do without being given explicit steps, so have fun!