Overview

What are the predictors / sources of variability of flight delays?

Details

The dataset consists of flight arrival and departure information for all commercial US flights from January to April 2008. It was provided as part of an American Statistical Association Statistical Computing / Statistical Graphics poster competition in 2009; details are available here: http://stat-computing.org/dataexpo/2009/. The data may be downloaded by clicking the 2008 link here: http://stat-computing.org/dataexpo/2009/the-data.html. Additional information about the data, such as the airport codes and carrier codes, is available as supplementary data here: http://stat-computing.org/dataexpo/2009/supplemental-data.html.

Data Description

Variable            Description
Year                1987-2008
Month               1-12
DayofMonth          1-31
DayOfWeek           1 (Monday) - 7 (Sunday)
DepTime             actual departure time (local, hhmm)
CRSDepTime          scheduled departure time (local, hhmm)
ArrTime             actual arrival time (local, hhmm)
CRSArrTime          scheduled arrival time (local, hhmm)
UniqueCarrier       unique carrier code
FlightNum           flight number
TailNum             plane tail number
ActualElapsedTime   in minutes
CRSElapsedTime      in minutes
AirTime             in minutes
ArrDelay            arrival delay, in minutes
DepDelay            departure delay, in minutes
Origin              origin IATA airport code
Dest                destination IATA airport code
Distance            in miles
TaxiIn              taxi in time, in minutes
TaxiOut             taxi out time, in minutes
Cancelled           was the flight cancelled?
CancellationCode    reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted            1 = yes, 0 = no
CarrierDelay        in minutes
WeatherDelay        in minutes
NASDelay            in minutes
SecurityDelay       in minutes
LateAircraftDelay   in minutes
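
The time variables are stored in hhmm form, so a value such as 1955 means 19:55 local time. For time-of-day questions it can help to reduce these to an hour of day. The following is a minimal Python sketch, not a required approach. The helper name hhmm_to_hour is hypothetical, and the handling of blank values, of values written with a decimal point, and of 2400 as midnight are assumptions about how the file may be encoded.

```python
from typing import Optional


def hhmm_to_hour(value: str) -> Optional[int]:
    """Return the hour (0-23) from an hhmm string such as '1955', or None if missing."""
    text = value.strip()
    if text in ("", "NA"):  # NA marks a missing value; blank is assumed to as well
        return None
    hhmm = int(float(text))  # tolerates values written as '1955.0' (assumption)
    hour = hhmm // 100       # drop the two minute digits
    return hour % 24         # fold a possible 2400 onto hour 0 (assumption)
```

For example, hhmm_to_hour("1955") gives 19, and a missing entry gives None so that it can be excluded from summaries.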

Objectives

As stated above, the overall objective is to identify variables that are associated with flight delays: departure delays (DepDelay), arrival delays (ArrDelay), or both. Compared with other project assignments, this project is very exploratory, and there are many different types of questions you can pursue here. For example:

  1. Are certain carriers more prone to delays?
  2. Are certain days of the week or times of day more prone to delays?
  3. Are certain airports more prone to delays?

You can explore these questions, and/or others that you find interesting (for example, how does Port Columbus stack up against other airports?). Describe your findings and present results numerically, in tables, or graphically, as appropriate.
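
Questions like the ones above often start with a grouped summary, such as the mean delay within each carrier, day of week, or airport. The following is a minimal Python sketch using only the standard library. The function name mean_delay_by is hypothetical, and the rows in SAMPLE are made-up values for illustration, not results from the real data; in practice you would read the downloaded CSV file instead.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; these are made-up values, not real results.
SAMPLE = """UniqueCarrier,DayOfWeek,Origin,DepDelay
AA,1,CMH,10
AA,2,ORD,30
WN,1,CMH,-5
WN,3,CMH,5
WN,5,ORD,NA
"""


def mean_delay_by(rows, key, delay="DepDelay"):
    """Average the delay column within each level of `key`, skipping missing values."""
    groups = defaultdict(list)
    for row in rows:
        if row[delay] not in ("", "NA"):
            groups[row[key]].append(float(row[delay]))
    return {level: mean(values) for level, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_carrier = mean_delay_by(rows, "UniqueCarrier")
by_origin = mean_delay_by(rows, "Origin")
```

The same function can be reused with "DayOfWeek" or "Dest" as the key. A mean alone can hide a skewed distribution, so medians, counts per group, or plots may be worth adding alongside it.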

One thing to be aware of: there are 136,246 entries with a missing departure delay (DepDelay, coded as NA). You should read a little about missing data (online sources are fine). Do you think these values are missing “completely at random”, “at random”, or “not at random”? Describe some approaches you might use for dealing with missing data. For this project, it is okay to simply remove these observations, though you are welcome to attempt an approach that includes them. Either way, you should discuss why removing them is or is not a good idea.
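
One way to start thinking about the missingness mechanism is to tabulate whether DepDelay is missing against another observed variable. If missingness depends strongly on an observed variable, that argues against "missing completely at random". The following is a minimal Python sketch of such a tabulation, followed by the simple removal of incomplete rows. The function name missing_by is hypothetical, the choice of Cancelled as the comparison variable is only one possibility, and the rows in SAMPLE are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; they do not reflect the real data.
SAMPLE = """DepDelay,Cancelled,UniqueCarrier
12,0,AA
NA,1,AA
-3,0,WN
NA,1,WN
NA,0,WN
"""


def is_missing(value):
    """Treat NA (and, by assumption, blank) as a missing value."""
    return value in ("", "NA")


def missing_by(rows, key, column="DepDelay"):
    """Count rows for each (level of key, column-is-missing) pair."""
    counts = Counter()
    for row in rows:
        counts[(row[key], is_missing(row[column]))] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
table = missing_by(rows, "Cancelled")

# Complete-case analysis: keep only rows with an observed DepDelay.
complete = [row for row in rows if not is_missing(row["DepDelay"])]
```

Comparing the counts in table across levels of the key shows whether missing values cluster in particular groups. Repeating the tabulation with other keys, such as UniqueCarrier or Origin, gives a fuller picture before deciding whether dropping the rows is defensible.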