The first group project will consist of analysis of a data set of interest to group participants. Provided below is a list of possible data sets with a short description of each. Group assignments can be found on Carmen.
Goals
There are several goals for this project:
- Gain experience reading a large data set into R and manipulating it.
- Learn to think critically about how to estimate population parameters of interest.
- Develop skill in describing and presenting analysis, both in written and in oral formats.
Main Idea
The projects typically ask you to fit several probability distributions to one or more variables in a data set, estimating any needed parameters and assessing model fit. In addition to the distributions we’ve encountered in this class, Wikipedia has a long list of distributions that may be useful to look through.
To estimate parameters, you can use estimators mentioned in class, in the textbook, or on Wikipedia! The purpose of this project is to use these tools to effectively analyze data, not necessarily to derive the tools themselves.
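As a rough illustration of this workflow, the sketch below fits two candidate distributions to a single variable and checks the fit visually. Everything in it is an assumption made for illustration: the data are simulated rather than taken from any of the project data sets, the variable name `obs` is hypothetical, and the choice of exponential and gamma models is arbitrary.

```r
# Simulated stand-in data; replace `obs` with a variable from your own data set.
set.seed(42)
obs <- rexp(500, rate = 2)

# Exponential model: the maximum likelihood estimate of the rate is 1 / sample mean.
rate_hat <- 1 / mean(obs)

# Gamma model: moment-based estimates using the sample mean and sample variance.
shape_mom <- mean(obs)^2 / var(obs)
rate_mom  <- mean(obs) / var(obs)

# Visual check of fit: histogram with both fitted densities overlaid.
hist(obs, probability = TRUE, breaks = 30,
     main = "Simulated data with fitted densities", xlab = "Observed value")
curve(dexp(x, rate = rate_hat), add = TRUE, lwd = 2, col = "dodgerblue")
curve(dgamma(x, shape = shape_mom, rate = rate_mom), add = TRUE,
      lwd = 2, lty = 2, col = "darkorange")

# QQ plot against the fitted exponential; points near the line suggest a good fit.
qqplot(qexp(ppoints(length(obs)), rate = rate_hat), obs,
       xlab = "Fitted exponential quantiles", ylab = "Sample quantiles")
abline(0, 1)
```

With real data, the same pattern applies: estimate the parameters, overlay the fitted density on a histogram, and compare sample quantiles to fitted quantiles for each candidate distribution.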
If you see a way to extend your project beyond point estimation, for example to interval estimation, you may do so. (However, you are not required to do so.)
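One possible extension is to report an interval alongside a point estimate. The sketch below computes a large-sample 95% confidence interval for a population mean. It is only one of many options and rests on assumptions: the data are simulated, the name `obs` is hypothetical, and the normal approximation is assumed to be reasonable at this sample size.

```r
# Simulated stand-in data; replace `obs` with a variable from your own data set.
set.seed(3202)
obs <- rexp(500, rate = 2)

n     <- length(obs)
x_bar <- mean(obs)          # point estimate of the population mean
se    <- sd(obs) / sqrt(n)  # estimated standard error of the sample mean

# Large-sample 95% interval: estimate plus or minus z * standard error.
ci <- x_bar + c(-1, 1) * qnorm(0.975) * se
ci
```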
Timeline
There are four “assignments” associated with the project. Their due dates are:
- Project Selection - Friday, September 28, 11:59 PM
- Project Presentation - Tuesday, October 9, 11:59 PM
- A PDF of the presentation should be emailed to dalpiaz.14@osu.edu by this time. The presentations will take place during class on Wednesday, October 10.
- Project Report - Friday, October 12, 11:59 PM
- But it would be best if you simply finished before Fall break.
- Peer Review - Friday, October 12, 11:59 PM
Presentation Guidelines
Some general guidelines for preparing group presentations:
- Prepare PowerPoint (or similar) slides for your presentation.
- Each group member should participate in the group presentation in some way.
- Structure your presentation as follows:
- Introduction and description of the data.
- Statement of the problem of interest (e.g., what quantities are you trying to estimate, and why are they interesting?)
- Methods used in the project.
- Presentation of results, including graphs, estimators, comparisons by groups, etc.
- Discussion of results – what other data do you wish had been collected? What interesting observations arose from your analysis?
- Aim for a presentation that is no longer than six minutes. We will be on a tight schedule, and we would like a couple of minutes for questions.
Additional details and expectations will be released after selection of projects.
Report Guidelines
Some general guidelines for preparing the project report:
- Structure your report in a similar style to your presentation (see above).
- Reports should be “approximately” three pages (double-spaced) of text. In addition, include relevant plots and figures, with titles and captions. Plots and figures should be near where they are referenced in the text.
- You do not need to include code with your project write-up (though I may request your code after you submit your report).
- Using R Markdown to write the report may gain your group up to two buffer points. (See grading information below.)
- You do not need to include mathematical derivations of estimators in your project write-up.
- Pay attention to grammar, spelling, formatting, etc. This is designed to provide practice for the real world, where you would provide reports to clients or to your boss. Use professional language, provide references, etc.
Additional details and expectations will be released after selection of projects.
Data and Project Selection
There are two ways to proceed:
- Pick a project from the list below.
- Suggest a new dataset and project.
- This option will allow for the possibility of up to three “buffer” points. (See grading details below.)
- Some possible sources of data are listed below.
- Do read through the available projects to get a sense of the scope of what is possible and expected.
For either option, a single group member must send an email to dalpiaz.14@osu.edu by Friday, September 28, 11:59 PM, with the exact subject line [STAT 3202] Group #N, Project Choice, where #N is your group number. You must CC all other team members.
If you would like to use a pre-made project, you must provide a ranked list of three possible projects. (We will attempt to have no overlap between groups.) If you would like to suggest a new dataset and project, you must do the following:
- Provide the data and/or an R Markdown file that proves that you can read the data into R.
- A brief description and source of the data.
- Suggested use of the data.
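A minimal sketch of what such a check might contain is shown below. The file, its path, and the column names `id` and `value` are all hypothetical; a small temporary CSV is written first only so that the example runs on its own. In a real submission, `data_path` would point at your actual data file.

```r
# Hypothetical stand-in file so the sketch is self-contained;
# in practice, set `data_path` to the location of your real data.
data_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = c(2.1, 3.4, 1.8, 4.0, 2.7)),
          data_path, row.names = FALSE)

# Read the data into R.
dat <- read.csv(data_path)

# Quick checks that the data were read as intended.
str(dat)             # column names and types
head(dat)            # first few rows
summary(dat$value)   # basic summary of one variable
colSums(is.na(dat))  # count of missing values per column
```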
If you would like to suggest a new dataset and project, it is highly recommended that you discuss it with the instructor in office hours ASAP.
After collecting this information, the instructor will approve suggested projects, and assign pre-made projects based on rankings. (They will be given out based on when emails are received.)
Possible Projects
These projects were prepared by professors Kubatko and Sinnott in previous semesters. They are presented here with some modifications.
- Project A Internet traffic data: The data set consists of 50,000 observations of the time between arrival of packets of data over a two-minute period from the Digital Equipment Servers on March 8th, 1995.
- Project B Internet traffic data: This data set consists of 50,000 observations of the lengths of visits to the MSNBC website on September 28, 1999.
- Project C NCAA basketball data: This data set contains data on every NCAA tournament game ever played.
- Project D Forest ecology study: This data set contains information on the types and sizes of trees in an old-growth beech-maple forest in northeastern Ohio.
- Project E AnthroKids data: This data set consists of anthropometric data collected on 3,900 children in 1977 for use in consumer product safety studies.
- Project F AnthroKids data: This data set consists of anthropometric data collected on 3,900 children in 1977 for use in consumer product safety studies.
- Project G MLB game data: This data set contains data on the number of runs scored by the home and the away teams for all games played during the 2011 through 2015 seasons.
- Project H MLB height and weight data: This data set contains data on heights and weights of 1033 major league baseball players.
- Project I Seattle Real-Time Fire 911 Calls: This data set contains times and dates of all Seattle Fire Department 911 dispatches in 2015.
- Project J Seattle Real-Time Fire 911 Calls: This data set contains times and dates of all Seattle Fire Department 911 dispatches in 2015.
- Project K NBA Season Data: This data set contains data on team performance by season for the 2008-09 through 2011-12 seasons.
- Project L Alcohol Consumption Data: This data set contains data on alcohol consumption per capita across countries in the world.
- Project M French and English Word Lengths: The data consists of English and French word lists for use by computer systems.
- Project N Word Lengths in George Orwell’s 1984: The data consists of an English word list, as well as a word list from George Orwell’s novel 1984.
- Project O Temperature In Charlotte, North Carolina: The data set consists of actual temperatures in 2014-2015 (mean, min, and max); average temperatures (min and max); and record temperatures (min and max) in Charlotte, North Carolina.
Grading
The project is worth a total of 100 points, broken down by task:
- Project Selection: 5
- Project Presentation: 45
- Project Report: 45
- Peer Review: 5
There are two opportunities for “buffer” points:
- Use of R Markdown for Project Report: Up to 2 points.
- Use of an original dataset and project: Up to 3 points.
Grading for the project selection is based only on completion.
Grading of the peer review will be mostly based on completion of a brief, anonymous feedback form about your group members. (To be released.) While you will be evaluating your peers, your evaluations will not necessarily directly impact their grades. Only in rare circumstances where the instructor believes that a group member had nearly zero participation will the peer evaluations affect their grade.
Rubrics for the report and presentation can be found in the Formats and Rubrics document.
Tips
- Do not try to split up the analysis. Everyone should attempt to do a full analysis, then compare and contrast results before writing the report.
- The same goes for writing the report and making the presentation. However, in this case it would be a bad idea to have everyone do it. Someone should take the lead on each of these items, but they should still be a collaborative effort.
- When writing the report, it is always better to write too little than to write too much. In particular, don’t overstate your results. Also, if you find yourself writing something for the sake of writing something, don’t write it. Get to the point. It’s better to write nothing than something that is wrong.
- Everyone should be present when submitting the final report, for two reasons:
- To deal with any last-minute changes.
- So it isn’t a single person’s responsibility to make sure it is submitted on time.