The first group project will consist of analysis of a data set of interest to group participants. Provided below is a list of possible data sets with a short description of each. The final product of this project will be a presentation and written report of your analysis.


Goals

There are several goals for this project:


Main Idea

The projects typically ask you to fit several probability distributions to one or more variables in a data set, estimating any needed parameters and assessing model fit. In addition to distributions we’ve encountered in this class, Wikipedia has a long list of distributions that may be useful to look through.

To estimate parameters, you can use estimators mentioned in class, in the textbook, or on Wikipedia! The purpose of this project is to use these tools to effectively analyze data, not necessarily to derive the tools themselves.
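As a rough illustration of that workflow, the sketch below fits a gamma distribution to one variable in two ways (method of moments and numerical maximum likelihood) and overlays the fitted density on a histogram. The data are simulated stand-ins, and the choice of a gamma model, the variable names, and the helper `neg_loglik` are assumptions made for illustration, not requirements of any particular project.

```r
# Simulated stand-in for a positive-valued project variable (hypothetical data).
set.seed(42)
x <- rgamma(200, shape = 2, rate = 0.5)

# Method of moments for a gamma(shape, rate) model:
# mean = shape / rate and variance = shape / rate^2.
x_bar <- mean(x)
s2 <- mean((x - x_bar)^2)
mom <- c(shape = x_bar^2 / s2, rate = x_bar / s2)

# Maximum likelihood: minimize the negative log-likelihood numerically.
# Optimizing on the log scale keeps both parameters positive.
neg_loglik <- function(log_par, data) {
  -sum(dgamma(data, shape = exp(log_par[1]), rate = exp(log_par[2]), log = TRUE))
}
fit <- optim(log(mom), neg_loglik, data = x)
mle <- exp(fit$par)
names(mle) <- c("shape", "rate")

# Visual check of fit: histogram with the fitted density overlaid.
hist(x, probability = TRUE, main = "Gamma fit (MLE)", xlab = "x")
curve(dgamma(x, shape = mle["shape"], rate = mle["rate"]), add = TRUE)
```

Starting the optimizer at the method of moments estimates is a convenience; any reasonable positive starting values should also work.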

In addition to point estimation, if you see a way to extend your project to include interval estimation, you can do so. (However, you are not required to do so.)
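One possible way to attach an interval to a point estimate is a bootstrap percentile interval, sketched below for the rate of an exponential model. This is only an illustration under assumptions: the data are simulated, the exponential model is a placeholder, and whether the bootstrap is an appropriate tool for your project is something to confirm against the course material.

```r
# Simulated stand-in data (hypothetical); replace with your own variable.
set.seed(1)
x <- rexp(100, rate = 0.2)

# Point estimate: the MLE of the exponential rate is 1 / sample mean.
rate_hat <- 1 / mean(x)

# Nonparametric bootstrap: resample the data with replacement and
# re-compute the estimate on each resample.
boot_rates <- replicate(2000, 1 / mean(sample(x, replace = TRUE)))

# Percentile interval with nominal 95% coverage.
ci <- quantile(boot_rates, probs = c(0.025, 0.975))
rate_hat
ci
```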


Timeline

There are five “tasks” associated with the project. Their due dates are:


Tasks


Group Selection

For this project you will work in groups of four. (Due to the rules of division, a few groups of either three or five may be necessary.) Groups for this project will be partially student selected. You need to do one of two things:

  • Nothing. By default, you will be placed into a group by the course staff.
  • Form your own group. To do so, have one member of a proposed group send an email to dalpiaz.14@osu.edu with the subject line [STAT 3202] Group Selection, being sure to carbon copy the other proposed group members, by Friday, February 8, 11:59 PM.
    • You may submit proposed groups of size two, three, or four. Groups of four will be considered final. Groups of two or three will be merged with other proposed groups or unassigned students.
    • Group members must all be registered for the same lab section.

Project Selection

After groups are finalized, there are two ways to proceed:

  • Pick a project from the list below.
  • Suggest a new dataset and project.
    • This option will allow for the possibility of up to three “buffer” points. (See grading details below.)
    • Some possible sources of data are listed below.
    • If you choose this route, do read through the available projects to get a sense of the scope of what is possible and expected.

For either option, a single group member must send an email to dalpiaz.14@osu.edu by Friday, February 15, 11:59 PM, with the exact subject line [STAT 3202] Group #N, Project Choice, where #N is your group number. You must carbon copy all other group members.

If you would like to use a pre-made project, you must provide a ranked list of three possible projects. (We will place some limits on the number of repeats of pre-made projects.) If you would like to suggest a new dataset and project, you must do the following:

  • Provide the data and/or an R Markdown file demonstrating that you can read the data into R.
  • Provide a brief description of the data and its source.
  • Suggest a use of the data.
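A minimal sketch of what "reading the data into R" might look like inside an R Markdown chunk follows. The file name and columns are hypothetical; the example writes a tiny CSV to a temporary directory first only so that it is self-contained, and in a real submission you would point `read.csv()` at your own file.

```r
# Hypothetical file name; replace with the path to your own data.
data_path <- file.path(tempdir(), "my_project_data.csv")

# Only for this self-contained demonstration: create a small example file.
write.csv(
  data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
  data_path,
  row.names = FALSE
)

# Read the data and show its structure and first rows.
project_data <- read.csv(data_path)
str(project_data)
head(project_data)
```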

If you are planning to suggest a new dataset and project, it is highly recommended that you discuss it with the instructor in office hours as soon as possible.

After collecting this information, the instructor will approve suggested projects and assign pre-made projects based on rankings. (Pre-made projects will be given out based on when emails are received.)


Possible Projects

These projects were prepared by professors Kubatko and Sinnott in previous semesters. They are presented here with some modifications. (You may notice some missing letters. This is not an error.)

  • Project C NCAA basketball data: This data set contains data on every NCAA tournament game ever played.
  • Project D Forest ecology study: This data set contains information on the types and sizes of trees in an old-growth beech-maple forest in northeastern Ohio.
  • Project G MLB game data: This data set contains data on the number of runs scored by the home and the away teams for all games played during the 2011 through 2015 seasons.
  • Project H MLB height and weight data: This data set contains data on heights and weights of 1033 major league baseball players.
  • Project I Seattle Real-Time Fire 911 Calls: This data set contains times and dates of all Seattle Fire Department 911 dispatches in 2015.
  • Project J Seattle Real-Time Fire 911 Calls: This data set contains times and dates of all Seattle Fire Department 911 dispatches in 2015.
  • Project K NBA Season Data: This data set contains data on team performance by season for the 2008-09 through 2011-12 seasons.
  • Project L Alcohol Consumption Data: This data set contains data on alcohol consumption per capita across countries in the world.
  • Project M French and English Word Lengths: The data consists of English and French word lists for use by computer systems.
  • Project N Word Lengths in George Orwell’s 1984: The data consists of an English word list, as well as a word list from George Orwell’s novel 1984.
  • Project O Temperature In Charlotte, North Carolina: The data set consists of actual temperatures in 2014-2015 (mean, min, and max); average temperatures (min and max); and record temperatures (min and max) in Charlotte, North Carolina.

Project Presentation

Some general guidelines for preparing group presentations:

  • Prepare PowerPoint (or similar) slides for your presentation.
  • Each group member should participate in the group presentation in some way.
  • Structure your presentation as follows:
    • Introduction and description of the data.
    • Methods used in the project.
    • Presentation of results, including graphs, estimators, comparisons by groups, etc.
    • Discussion of results.
  • Aim to have a presentation that is no longer than six minutes. We will be on a tight schedule, and we would like a couple of minutes for questions.

Presentation Rubric

The 25 points for the presentation will be assigned as follows:

  • [5] All group members participate.
  • [10] Organization and presentation of slides
    • Slides are easy to read. (Not overly cluttered with words)
    • Flow of presentation is well organized.
  • [5] Statistical content of presentation is correct.
  • [3] Time is managed effectively.
  • [2] Directions are followed.
    • PDF of presentation is received in Carmen by date and time specified.
    • Group number and names are included on the first slide.

Project Report

  • The report must be written using rmarkdown and rendered into a .html file.
  • Reports should be “approximately” three pages (double-spaced) of text at most. In addition, include relevant plots and figures, with titles and captions, which do not count towards this length suggestion. Plots and figures should be near where they are referenced in the text. Because we are rendering to a .html file, this reference to “pages” is somewhat irrelevant, but you could easily copy-paste your text into a text editor to get an idea of how much you have written.
  • Be sure to include your group number and the names of all group members in the report.
  • Be sure to give your report and presentation an interesting title.
  • You do not need to include mathematical derivations of estimators in your project write-up.
  • Pay attention to grammar, spelling, formatting, etc. This is designed to provide practice for the real world, where you would provide reports to clients or to your boss. Use professional language, provide references, write paragraphs of complete sentences, etc.

For the report format we will utilize the IMRD organization structure. See also this helpful IMRD cheat sheet from CMU. In general your presentation can more loosely follow the IMRD structure, but the report must exactly follow the IMRD structure containing the following sections:

  • Abstract
  • Introduction
  • Methods
  • Results
  • Discussion
  • (Optional) Appendix

Abstract

Even though it is the first thing to appear in the report, the abstract should be the last thing that you write. Generally the abstract should serve as a summary of the entire report. Reading only the abstract, the reader should have a good idea about what to expect from the rest of the document. Abstracts can be extremely variable in length, but a good heuristic is to use a sentence for each of the main sections of the IMRD:

  • Why are you doing this analysis? (Introduction)
  • What did you do? (Methods)
  • What did you find? (Results)
  • What does it mean? Why does it matter? (Discussion)

Introduction

The introduction should discuss the “why” of your analysis and the “what” of your data. Essentially, you need to motivate why the analysis that you’re about to do should be done. In particular you should state a clear problem of interest. Why does this analysis need to be done? What is the goal of this analysis? The introduction should also provide enough background on the subject area for a reader to understand your analysis. Do not assume your reader knows anything about the subject area that your data comes from. If the reader does not understand your data, there is no way the reader will understand your motivation. Since you did not collect this data, you can create any reasonable scenario that you would like. (In the real world, you would often have some input before collecting data.)

You do not need to provide a complete data dictionary in the introduction, but you should include one in the appendix. Often the data would be introduced in the Methods section, but here the data should be very closely linked to the motivation of the analysis.

Consider performing some exploratory data analysis here, and include some of it in the report if you feel it helps present the data.

Methods

The methods section should discuss what you did. The methods that you are using are those learned in class. This section should contain the bulk of your “work.” (But do not take that to mean that it should be the longest section.) This section will contain most of the R code that is used to generate the results. Your R code is not expected to be perfect idiomatic R, but it is expected to be understood by a reader without too much effort. The majority of your code should be suppressed from the final report, but consider displaying code if it is concise and helps explain what you did. (In rmarkdown, you can set the chunk option echo = FALSE to suppress code.)
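One common way to suppress code throughout a report, sketched here as a suggestion rather than a requirement, is to set the default once in a setup chunk near the top of the .Rmd file using knitr's `opts_chunk$set()`. This assumes the knitr package is installed, which is normally the case when rendering R Markdown.

```r
# Place this in a setup chunk near the top of the .Rmd file.
# It makes echo = FALSE the default for every later chunk.
knitr::opts_chunk$set(echo = FALSE)
```

An individual chunk can still display its code by setting echo = TRUE in its own chunk header, which is useful for the short, explanatory snippets mentioned above.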

Consider adding subsections in this section. One potential set of subsections could be data and modeling. (Here we use modeling to mean fitting probability distributions.) The data section would describe your data. How will it be used in performing your analysis? What if any preprocessing have you done to it? The modeling section would describe the modeling methods that you will consider, as well as strategies for comparison.
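A comparison strategy could look something like the sketch below, which fits three candidate distributions by maximum likelihood and tabulates their log-likelihoods and AIC values, followed by a QQ plot for one candidate. The data are simulated, the three candidates are arbitrary, and the use of AIC is an assumption made for illustration; use whichever comparison tools were covered in class.

```r
# Simulated stand-in data (hypothetical); replace with your own variable.
set.seed(3202)
x <- rlnorm(150, meanlog = 1, sdlog = 0.5)
n <- length(x)

# Closed-form maximum likelihood estimates for each candidate model.
exp_rate <- 1 / mean(x)
norm_mean <- mean(x)
norm_sd <- sqrt(mean((x - norm_mean)^2))
lnorm_meanlog <- mean(log(x))
lnorm_sdlog <- sqrt(mean((log(x) - lnorm_meanlog)^2))

# Log-likelihood of the data under each fitted model.
loglik <- c(
  exponential = sum(dexp(x, rate = exp_rate, log = TRUE)),
  normal      = sum(dnorm(x, mean = norm_mean, sd = norm_sd, log = TRUE)),
  lognormal   = sum(dlnorm(x, meanlog = lnorm_meanlog, sdlog = lnorm_sdlog, log = TRUE))
)

# AIC = 2 * (number of parameters) - 2 * log-likelihood; smaller is better.
n_par <- c(exponential = 1, normal = 2, lognormal = 2)
aic <- 2 * n_par - 2 * loglik

comparison <- data.frame(
  model = names(loglik),
  loglik = loglik,
  aic = aic,
  row.names = NULL
)
comparison[order(comparison$aic), ]

# QQ plot for one candidate: points near the line suggest a reasonable fit.
qqplot(
  qlnorm(ppoints(n), meanlog = lnorm_meanlog, sdlog = lnorm_sdlog), x,
  xlab = "Fitted log-normal quantiles", ylab = "Sample quantiles"
)
abline(0, 1)
```

Numerical summaries and plots complement each other here: a table ranks the candidates, while the QQ plot shows where a given model fits poorly.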

Your goal is not to use as many methods as possible. Your task is to use appropriate methods to find a good model to help answer a question about your data.

Results

The results section should contain numerical or graphical summaries of your results. What are the results of applying your chosen methods? Consider reporting “final” or “best” models you have chosen. There is not necessarily one, singular correct model, but certainly some methods and models are better than others in certain situations. The results section is about reporting your results.

Discussion

The discussion section should contain discussion of your results. That is, the discussion section is used for commenting on your results. This should also frame your results in the context of the data. What do your results mean? What other data do you wish had been collected? What interesting observations arose from your analysis? Results are often just numbers or graphics; here you need to explain what they tell you about the analysis you are performing. The results section tells the reader what the results are. The discussion section tells the reader why those results matter.

Any concluding remarks should be placed here.

Appendix

The appendix section should contain any additional code, tables, and graphics that are not explicitly referenced in the narrative of the report. (If you use rmarkdown and supply the .Rmd file that contains the suppressed code, this is not necessary.) The appendix must contain a data dictionary. Appropriate citations should be placed here.
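One way to start a data dictionary is to build a small table with one row per variable, as in the sketch below. The data frame, its columns, and the descriptions are all made up for illustration; the descriptions and units for your own data must come from its source and be written by hand.

```r
# Hypothetical example data; replace with your project data frame.
project_data <- data.frame(
  height = c(74, 72, 75),
  weight = c(180, 215, 210),
  position = c("Catcher", "Pitcher", "Outfielder")
)

# One row per variable: name, storage type, and a hand-written description.
# The descriptions below are placeholders, not taken from any real data source.
data_dictionary <- data.frame(
  variable = names(project_data),
  type = vapply(project_data, function(col) class(col)[1], character(1)),
  description = c(
    "Player height (placeholder description)",
    "Player weight (placeholder description)",
    "Fielding position (placeholder description)"
  ),
  row.names = NULL
)
data_dictionary
```

In a rendered report, passing `data_dictionary` to `knitr::kable()` would display it as a formatted table.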

Report Rubric

The 65 points for the report will be assigned as follows:

  • [5] Abstract
    • Does the abstract appropriately summarize the analysis performed?
  • [10] Introduction
    • Is the analysis clearly motivated? Is the why of the analysis made clear to the reader?
    • Is the goal of the analysis clear to the reader?
    • Is the problem of interest stated in terms of a statistical task?
    • Does the reader have a clear understanding of the data?
  • [10] Methods
    • Are appropriate methods from class used?
    • Are the methods used correctly?
  • [10] Results
    • Are the results clearly organized, potentially either visually or as a table?
    • Are correct and useful metrics used?
  • [10] Discussion
    • Are correct conclusions drawn from the results?
    • Is it clear how the results relate to the goal and motivation outlined in the introduction?
  • [5] R Code
    • Does the provided code perform the desired tasks?
    • Is the provided code reasonably readable?
    • Does the provided code have a consistent style?
  • [5] R Markdown Usage
    • Is R Markdown used to suppress code from the final rendered report?
    • Is R Markdown used to specify a document structure through the use of headers?
  • [10] Writing and Directions
    • Is the text free of spelling errors?
    • Is the text written with clarity?
    • Does the report have a meaningful title?
    • Are the group number and names included in the report?

Peer Review

Each group member will write an anonymous peer review of every member of their group, including themselves. Peer reviews will comment on communication, knowledge of course concepts, and R programming skills. Peer reviews will remain anonymous. Grading will be based largely on completion. It is better to give honest comments than to simply give all team members high praise. Formatting and directions will appear with the associated assignment listing on Carmen.


Grading

The project is worth 100 points in total. These are broken down by task:

If you submitted a request to complete an original dataset and project, you will receive three “buffer” points. (These points can get you to, but not over, 100 points for the project.)

Submissions

Details on what must be submitted for each task (which is partially described in this document) can be found in the description of the corresponding item on Carmen. The presentation and report will be group assignments, thus only one member of the group needs to submit.


Tips