For the second group project, everyone will analyze the same data set. The data consist of flight arrival and departure information for all commercial US flights during the years 1987 - 2008. It was provided as part of an American Statistical Association Statistical Computing / Statistical Graphics poster competition in 2009 – details are available here:
Individual years may be downloaded by clicking links here:
Additional information about the data, such as the airport codes and carrier codes, is available as supplementary data here:
The final product of this project will again be a presentation and written report of your analysis.
There are four “tasks” associated with the project. Their due dates are:
For this project you will work in groups of four. (Due to the rules of division, a few groups of either three or five may be necessary.) Groups for this project will be partially student selected. You need to do one of two things:
dalpiaz.14@osu.edu with the subject line [STAT 3202] Group Selection, being sure to carbon copy the other proposed group members, by Thursday, April 4, 11:59 PM.
Your goal is to explore this data, come up with some research questions to ask about it, and answer them using the skills and tools that we have learned in this class, such as:
To give you some guidance, we expect you to ask at least 4 to 5 questions and to answer them using statistical tools. These questions should be related to one or two overarching research questions. For example, you could try to investigate which days of the week have the most flight cancellations and delays. Your analysis could then be, for flights in 2008:
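As a rough illustration of the day-of-the-week example, the sketch below computes a cancellation rate for each day and compares two days. It uses simulated data, not the real flights; the column names DayOfWeek and Cancelled are assumptions that should be checked against the downloaded files, and the two-sample proportion test is only one possible tool, to be swapped for whatever fits your question.

```r
set.seed(3202)

# Simulated stand-in for one year of flights (NOT real data).
# Assumed columns: DayOfWeek coded 1-7, Cancelled coded 0/1.
n <- 7000
flights <- data.frame(
  DayOfWeek = sample(1:7, n, replace = TRUE),
  Cancelled = rbinom(n, size = 1, prob = 0.02)
)

# Proportion of cancelled flights for each day of the week
cancel_rate <- tapply(flights$Cancelled, flights$DayOfWeek, mean)
cancel_rate

# Cancelled and total flight counts per day
n_cancelled <- tapply(flights$Cancelled, flights$DayOfWeek, sum)
n_flights   <- tapply(flights$Cancelled, flights$DayOfWeek, length)

# Compare the cancellation proportions of two arbitrarily chosen days (5 and 6)
test_result <- prop.test(
  x = as.vector(n_cancelled[c("5", "6")]),
  n = as.vector(n_flights[c("5", "6")])
)
test_result$p.value
```

The same pattern (summarize by group, then test a specific comparison) could be repeated for delays instead of cancellations.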
Try to come up with a general question you are interested in, which you can investigate using the data in a few different ways. Some other areas to think about:
The full data is enormous, and will be difficult to deal with in R. You should give yourself time to work through such difficulties. It would also be good to think about restricting the scope of what you are interested in. For example, you can ask the questions about 2008 only. You can look at flight patterns at different times of the year in 2001. You can focus on Port Columbus through the years. You can focus on a particular airline through the years.
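One way to restrict scope is to load a single year and keep only the rows you need. The sketch below is a minimal example on a toy data frame; the helper name restrict_to_origin, the file name 2008.csv, and the column names Origin, Dest, and DepDelay are assumptions for illustration. CMH is the airport code for Port Columbus.

```r
# Hypothetical helper: keep only flights departing from one airport.
# Assumes the data has an Origin column holding airport codes.
restrict_to_origin <- function(flights, airport_code) {
  # which() drops rows where Origin is missing instead of returning NA rows
  flights[which(flights$Origin == airport_code), , drop = FALSE]
}

# Toy stand-in for one year of data. In practice this might come from
# something like read.csv("2008.csv") after downloading that year's file
# (the file name is an assumption).
flights_2008 <- data.frame(
  Year     = 2008,
  Origin   = c("CMH", "ORD", "CMH", "ATL", "CMH"),
  Dest     = c("ORD", "CMH", "ATL", "CMH", "LGA"),
  DepDelay = c(5, -2, 30, 0, 12)
)

cmh_flights <- restrict_to_origin(flights_2008, "CMH")
nrow(cmh_flights)
```

Subsetting early, and saving the smaller data frame, keeps later steps fast enough to iterate on.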
Some general guidelines for preparing group presentations:
The 25 points for the presentation will be assigned as follows:
rmarkdown and rendered into a .html file.
.html file, this reference to “pages” is somewhat irrelevant, but you could easily copy-paste your text into a text editor to get an idea of how much you have written.
For the report format we will utilize the IMRD organization structure. See also this helpful IMRD cheat sheet from CMU. In general your presentation can more loosely follow the IMRD structure, but the report must exactly follow the IMRD structure containing the following sections:
Even though it is the first thing to appear in the report, the abstract should be the last thing that you write. Generally the abstract should serve as a summary of the entire report. Reading only the abstract, the reader should have a good idea about what to expect from the rest of the document. Abstracts can be extremely variable in length, but a good heuristic is to use a sentence for each of the main sections of the IMRD:
The introduction should discuss the “why” of your analysis and the “what” of your data. Essentially, you need to motivate why the analysis that you’re about to do should be done. In particular you should state a clear problem of interest. Why does this analysis need to be done? What is the goal of this analysis? The introduction should also provide enough background on the subject area for a reader to understand your analysis. Do not assume your reader knows anything about the subject area that your data comes from. If the reader does not understand your data, there is no way the reader will understand your motivation. Since you did not collect this data, you can create any reasonable scenario that you would like. (In the real world, you would often have some input before collecting data.)
You do not need to provide a complete data dictionary in the introduction, but you should include one in the appendix. Often the data would be introduced in the Methods section, but here the data should be very closely linked to the motivation of the analysis.
Consider including some exploratory data analysis here, and providing some of it to the reader in the report if you feel it helps present the data.
The methods section should discuss what you did. The methods that you are using are those learned in class. This section should contain the bulk of your “work.” (But do not take that to mean that it should be the longest section.) This section will contain most of the R code that is used to generate the results. Your R code is not expected to be perfect idiomatic R, but it is expected to be understood by a reader without too much effort. The majority of your code should be suppressed from the final report, but consider displaying code if it is concise and helps explain what you did. (If you use rmarkdown you can set echo = FALSE to suppress code.)
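As a minimal sketch of suppressing code, the line below could go in a setup chunk near the top of the .Rmd file, assuming the report is rendered with knitr (the default for rmarkdown). It hides the code for every later chunk while still showing the output.

```r
# Placed in a setup chunk near the top of the .Rmd file:
# hide code in all later chunks unless a chunk overrides the option
knitr::opts_chunk$set(echo = FALSE)
```

A chunk whose code is worth showing can override this by setting echo = TRUE in its own chunk options.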
Consider adding subsections in this section. One potential set of subsections could be data and modeling. (Here we use modeling to mean fitting probability distributions.) The data section would describe your data. How will it be used in performing your analysis? What if any preprocessing have you done to it? The modeling section would describe the modeling methods that you will consider, as well as strategies for comparison.
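To illustrate what fitting and comparing probability distributions might look like, the sketch below fits an exponential and a log-normal model by maximum likelihood to simulated positive values. The data are simulated, not real delays, and AIC is used only as one possible comparison strategy; use whichever comparison methods were covered in class.

```r
set.seed(3202)

# Simulated positive values standing in for real delay data (an assumption)
delays <- rexp(200, rate = 1 / 15)

# Exponential model: the MLE of the rate is 1 / (sample mean)
rate_hat   <- 1 / mean(delays)
loglik_exp <- sum(dexp(delays, rate = rate_hat, log = TRUE))

# Log-normal model: the MLEs are the mean and the n-denominator
# standard deviation of log(delays)
meanlog_hat  <- mean(log(delays))
sdlog_hat    <- sqrt(mean((log(delays) - meanlog_hat)^2))
loglik_lnorm <- sum(dlnorm(delays, meanlog = meanlog_hat,
                           sdlog = sdlog_hat, log = TRUE))

# AIC = 2 * (number of parameters) - 2 * log-likelihood; smaller is better
aic <- c(
  exponential = 2 * 1 - 2 * loglik_exp,
  lognormal   = 2 * 2 - 2 * loglik_lnorm
)
aic
```

A histogram of the data with the fitted densities overlaid would be a natural graphical companion to this numerical comparison.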
Your goal is not to use as many methods as possible. Your task is to use appropriate methods to find a good model to help answer a question about your data.
The results section should contain numerical or graphical summaries of your results. What are the results of applying your chosen methods? Consider reporting “final” or “best” models you have chosen. There is not necessarily one, singular correct model, but certainly some methods and models are better than others in certain situations. The results section is about reporting your results.
The discussion section should contain discussion of your results. That is, the discussion section is used for commenting on your results. This should also frame your results in the context of the data. What do your results mean? What other data do you wish had been collected? What interesting observations arose from your analysis? Results are often just numbers or graphics; here you need to explain what they tell you about the analysis you are performing. The results section tells the reader what the results are. The discussion section tells the reader why those results matter.
Any concluding remarks should be placed here.
The appendix section should contain any additional code, tables, and graphics that are not explicitly referenced in the narrative of the report. (If you use rmarkdown and supply the .Rmd file that contains the suppressed code, this is not necessary.) The appendix must contain a data dictionary. Appropriate citations should be placed here.
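One possible starting point for the data dictionary is to generate its skeleton from the data frame itself and then fill in the descriptions by hand. The sketch below uses a toy data frame; the column names are assumptions, and the description column is deliberately left empty, to be completed from the supplementary data documentation.

```r
# Toy data frame standing in for the analysis data (column names assumed)
flights <- data.frame(
  DayOfWeek = c(1L, 2L, 3L),
  Origin    = c("CMH", "ORD", "ATL"),
  DepDelay  = c(5, -2, 30),
  stringsAsFactors = FALSE
)

# One row per variable: its name, its storage type in R, and a
# description to be written by hand from the data documentation
data_dictionary <- data.frame(
  variable    = names(flights),
  type        = vapply(flights, function(col) class(col)[1], character(1)),
  description = NA_character_,
  row.names   = NULL,
  stringsAsFactors = FALSE
)
data_dictionary
```

Only the variables actually used in the analysis need detailed descriptions, including units and how missing values are coded.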
The 65 points for the report will be assigned as follows:
Each group member will write an anonymous peer review of each group member, including themselves. Peer reviews will comment on communication, knowledge of course concepts, and R programming skills. Peer reviews will remain anonymous. Grading will be based largely on completion. It will be better to give honest comments than to simply give all team members high praise. Formatting and directions will appear with the associated assignment listing on Carmen.
The total points for the project is 100. These are broken down by task:
Peer Review: 10
Grading of the peer review will be mostly based on completion of a brief, anonymous feedback form about your group members. (To be released.) While you will be evaluating your peers, it will not necessarily directly impact their grades. Only in rare circumstances where the instructor believes that a group member had nearly zero participation will the peer evaluations affect their grade.
Details on what must be submitted for each task (which is partially described in this document) can be found in the description of the corresponding item on Carmen. The presentation and report will be group assignments, thus only one member of the group needs to submit.