What are the predictors / sources of variability of flight delays?
The dataset consists of flight arrival and departure information for all commercial US flights from January to April 2008. It was provided as part of an American Statistical Association Statistical Computing / Statistical Graphics poster competition in 2009; details are available here: http://stat-computing.org/dataexpo/2009/. The data may be downloaded by clicking the 2008 link here: http://stat-computing.org/dataexpo/2009/the-data.html. Additional information about the data, such as the airport codes and carrier codes, is available as supplementary data here: http://stat-computing.org/dataexpo/2009/supplemental-data.html.
Variable | Description |
---|---|
Year | 1987-2008 |
Month | 1-12 |
DayofMonth | 1-31 |
DayOfWeek | 1 (Monday) - 7 (Sunday) |
DepTime | actual departure time (local, hhmm) |
CRSDepTime | scheduled departure time (local, hhmm) |
ArrTime | actual arrival time (local, hhmm) |
CRSArrTime | scheduled arrival time (local, hhmm) |
UniqueCarrier | unique carrier code |
FlightNum | flight number |
TailNum | plane tail number |
ActualElapsedTime | in minutes |
CRSElapsedTime | in minutes |
AirTime | in minutes |
ArrDelay | arrival delay, in minutes |
DepDelay | departure delay, in minutes |
Origin | origin IATA airport code |
Dest | destination IATA airport code |
Distance | in miles |
TaxiIn | taxi in time, in minutes |
TaxiOut | taxi out time, in minutes |
Cancelled | was the flight cancelled? |
CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
Diverted | 1 = yes, 0 = no |
CarrierDelay | in minutes |
WeatherDelay | in minutes |
NASDelay | in minutes |
SecurityDelay | in minutes |
LateAircraftDelay | in minutes |
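
The four time variables in the table (DepTime, CRSDepTime, ArrTime, CRSArrTime) are stored as hhmm clock readings, which are awkward to use directly as a numeric predictor (for example, 1455 and 1500 are only five minutes apart). One possible way to handle this is to convert them to minutes after midnight. The sketch below is only an illustration: the helper name `hhmm_to_minutes` is made up, and it assumes values arrive as text with missing entries written as `NA` or left empty.

```python
from typing import Optional


def hhmm_to_minutes(value: Optional[str]) -> Optional[int]:
    """Convert an hhmm clock reading such as '1455' or '5' to minutes after midnight.

    Returns None for missing values. Treating 'NA' and the empty string as
    missing is an assumption about how the file encodes them.
    """
    if value is None or value.strip() in ("", "NA"):
        return None
    number = int(float(value))            # tolerate readings stored as '1455.0'
    hours, minutes = divmod(number, 100)  # 1455 -> (14, 55)
    return hours * 60 + minutes


# Small demonstration on hand-written values (not taken from the dataset).
examples = ["1455", "5", "NA"]
converted = [hhmm_to_minutes(v) for v in examples]
print(converted)
```

Because the times are local, differences between a departure time at one airport and an arrival time at another would still need a time zone adjustment, which this sketch does not attempt.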
As stated above, the overall objective is to identify variables that are associated with flight delays: departure delays (DepDelay), arrival delays (ArrDelay), or both. Compared with other project assignments, this project is very exploratory, and there are many different types of questions you can pursue here. For example:
You can explore these questions, and/or others that you find interesting (for example, how does Port Columbus stack up against other airports?). Describe your findings and present results numerically, in tables, or graphically, as appropriate.
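As one possible starting point for an airport comparison, the sketch below computes the mean departure delay per origin airport while skipping missing values. It is only an illustration under several assumptions: the rows are toy values invented for the example (not taken from the dataset), the function name `mean_delay_by` is hypothetical, and `CMH` is assumed to be the code under which Port Columbus appears in the airport lookup file.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; these are NOT values from the real dataset.
rows = [
    {"Origin": "CMH", "DepDelay": "10"},
    {"Origin": "CMH", "DepDelay": "NA"},
    {"Origin": "CMH", "DepDelay": "-2"},
    {"Origin": "ORD", "DepDelay": "30"},
    {"Origin": "ORD", "DepDelay": "6"},
]


def mean_delay_by(rows, group_col, delay_col="DepDelay"):
    """Return {group: mean delay in minutes}, ignoring missing delays."""
    groups = defaultdict(list)
    for row in rows:
        if row[delay_col] not in ("NA", ""):  # assumed missing-value codes
            groups[row[group_col]].append(float(row[delay_col]))
    return {group: mean(values) for group, values in groups.items()}


summary = mean_delay_by(rows, "Origin")
for airport, avg in sorted(summary.items()):
    print(f"{airport}: {avg:.1f} minutes")
```

With the real file, the same function could be fed rows from `csv.DictReader`. A mean is only one summary; delay distributions tend to be skewed, so medians or the share of flights delayed beyond some threshold may be worth reporting as well.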
One thing to be aware of: there are 136,246 entries with a missing departure delay (DepDelay coded as NA). You should read a little about missing data (online resources are fine). Do you think these values are missing “completely at random”, “at random”, or “not at random”? Describe some approaches you might use for dealing with missing data. (For this project, it is okay to simply remove these observations, though you are welcome to attempt an approach that includes them, but you should discuss why removing them is or is not a good idea.)
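One simple check that can inform the missing-data discussion is to compare the share of missing DepDelay values across levels of another variable, such as Cancelled. If the share differs sharply between groups, the values are unlikely to be missing completely at random. The sketch below illustrates the idea on toy rows invented for the example (not real data); the function name `missing_rate_by` is hypothetical, and `NA` or an empty string is assumed to mark a missing value.

```python
from collections import defaultdict

# Toy rows for illustration only; these are NOT values from the real dataset.
rows = [
    {"Cancelled": "0", "DepDelay": "12"},
    {"Cancelled": "0", "DepDelay": "-3"},
    {"Cancelled": "0", "DepDelay": "NA"},
    {"Cancelled": "1", "DepDelay": "NA"},
    {"Cancelled": "1", "DepDelay": "NA"},
]


def missing_rate_by(rows, group_col, value_col, na_codes=("NA", "")):
    """Return {group: fraction of rows whose value_col is missing}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        tally = counts[row[group_col]]
        tally[1] += 1
        if row[value_col] in na_codes:
            tally[0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


rates = missing_rate_by(rows, "Cancelled", "DepDelay")
for group, rate in sorted(rates.items()):
    print(f"Cancelled={group}: {rate:.0%} of DepDelay values missing")
```

The same comparison can be repeated with other grouping variables (for example Month, UniqueCarrier, or Origin) to see whether missingness is related to observed characteristics of the flight.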