Anyone who has traveled by air knows that delays can be frustrating and costly to the customer.
This project will attempt to predict a delay and a delay's duration in an air travel itinerary and further attempt to ascribe a possible cost to that delay.
Hopefully, this will empower consumers with a bit more information about whether their trip will be 'smooth sailing' or a 'disaster' from a customer standpoint.
Starting with 35+ million observations, I used a gradient boosting algorithm (XGBoost) and was able to improve our predictive accuracy
over the null model by 13 percentage points and over our baseline model by 4 percentage points.
- Null model: one where we would randomly guess whether there is a delay. Since the classes of delayed vs. not delayed are 50/50 in composition, we can say that it has an accuracy of 50%.
- XGBoost baseline: computed using 'out of the box' hyperparameters. The metrics shown below are pulled from the predictions made by that model.
- XGBoost tuned: computed with hyperparameter settings tuned using a grid search.
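As a quick sanity check on the null model's 50% figure, here is a minimal sketch that simulates random guessing on balanced labels. The label list, the function name `random_guess_accuracy`, and the seed are illustrative assumptions, not code from the project notebooks.

```python
import random


def random_guess_accuracy(labels, seed=0):
    """Accuracy of guessing delayed (1) / not delayed (0) uniformly at random."""
    rng = random.Random(seed)
    hits = sum(1 for y in labels if rng.randint(0, 1) == y)
    return hits / len(labels)


# Hypothetical balanced labels: half delayed (1), half not delayed (0).
labels = [0, 1] * 50_000
null_accuracy = random_guess_accuracy(labels)
print(f"null model accuracy: {null_accuracy:.3f}")
```

With an even class split, the simulated accuracy should land very close to 0.50, which is why 50% is used as the floor that any trained model has to beat.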
Metric | Null Model | XGBoost Baseline Predictions | XGBoost Tuned Predictions | Tuned vs. Baseline | Tuned vs. Null Model |
---|---|---|---|---|---|
Accuracy | 0.50 | 0.59 | 0.63 | 0.04 | 0.13 |
Recall | --- | 0.62 | 0.66 | 0.04 | --- |
Precision | --- | 0.60 | 0.63 | 0.03 | --- |
F1 | --- | 0.60 | 0.64 | 0.04 | --- |
ROC AUC | --- | 0.59 | 0.63 | 0.04 | --- |
NOTE: The values above were obtained on the original data. Your results from this notebook may vary because the sampled dataset is smaller.
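For readers less familiar with the metrics in the table, the sketch below shows how accuracy, precision, recall, and F1 follow from the four confusion-matrix counts. The function name `classification_metrics` and the toy labels are assumptions made for illustration. The project's own numbers came from the trained XGBoost models, not from this code.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics, with 1 = delayed and 0 = not delayed."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # delay predicted, delay happened
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # delay predicted, flight on time
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # delay missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # on time predicted, on time

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up labels, not project data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this framing a false positive (a predicted delay on a flight that left on time) is a Type I error, so raising precision is the direct way to reduce those.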
The prototype MVP has been promising. Additional tuning and testing of the model should be completed to improve accuracy and reduce Type I errors (false positives) on the training and testing data.
The ultimate goal of this project is to have an app that is viewable and usable based on the trained model data. The model should need, as I imagine it, monthly maintenance as new delay data is released by BTS.
Our MVP is the first step toward such a useful, customer-friendly tool.
- Data intake & cleaning
- EDA
- Data modeling: training and testing
- Data predictions/app output
  - An app of this prediction tool is still under development as of this writing, 23 December 2021.
Due to virtual hosting resource limitations, the dataset in the repository is a randomized sample of the original all-flights file noted in the notebook. You can still use the smaller subsample to work through the entire notebook end to end and reproduce the results.
Please use the `sampled_all_flights.csv` file to proceed through the notebooks.
A slide deck showing the results of this research is located here. A collection of un-annotated charts and figures is located here as well.
All data were obtained by downloading CSV files from the Bureau of Transportation Statistics (BTS). Follow the link above for access to the data portal.
This section has two parts: the description and the dictionary.
- The description talks about the dataset in general terms, while the dictionary provides a macro-level understanding of the data types and organization of the data in our dataframe.
Having obtained 65 complete months of daily flights in the North American air system, I compiled the baseline table by joining all 65 data tables together, for more than 34 million rows (34,409,230) of labeled data.
The data was then sorted into delayed and not-delayed flights and randomly sampled to generate an even split between the two classes. This was still far too many rows, so the sample was saved and resampled to something more manageable.
This setup allows us to conduct a binary classification task: will there be a delay or not?
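The even-split sampling described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the notebook's actual code: the helper name `balanced_sample`, the toy rows, and especially the rule that a flight counts as delayed when `ARR_DELAY > 0` are assumptions, since the exact threshold is not stated here.

```python
import random


def balanced_sample(rows, is_delayed, n_per_class, seed=42):
    """Draw the same number of delayed and not-delayed rows, then shuffle them."""
    rng = random.Random(seed)
    delayed = [r for r in rows if is_delayed(r)]
    on_time = [r for r in rows if not is_delayed(r)]
    # Never ask for more rows than the smaller class can supply.
    k = min(n_per_class, len(delayed), len(on_time))
    sample = rng.sample(delayed, k) + rng.sample(on_time, k)
    rng.shuffle(sample)
    return sample


def is_delayed(row):
    # Assumed labeling rule for illustration only.
    return row["ARR_DELAY"] > 0


# Toy rows standing in for flights (made-up delays in minutes).
rows = [{"ARR_DELAY": d} for d in [-5, 0, 12, 45, -3, 7, 0, 30, -10, 3]]
sample = balanced_sample(rows, is_delayed, n_per_class=3)
print(len(sample), sum(is_delayed(r) for r in sample))
```

Fixing the seed keeps the sample reproducible, which matters when the saved sample is later resampled to a smaller size.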
Below is a table of each column in the dataset.
No. | Column | Description | Units | Type |
---|---|---|---|---|
1 | YEAR | The Year of the flight YYYY format | integer | categorical |
2 | MONTH | The number representation of the month MM | integer | categorical |
3 | DAY_OF_MONTH | The day of the month dd format | integer | categorical |
4 | DAY_OF_WEEK | A number representation for the day of the week Monday = 1, Sunday = 7, Unknown = 9 | integer | categorical |
5 | FL_DATE | Full recorded Date of the flight yyyy-mm-dd | string | categorical |
6 | OP_UNIQUE_CARRIER | Reporting Airline by Two-Letter Designator, EG AA = American Airlines. | string | categorical |
7 | Tail_Number | The identification number of the aircraft used for the flight. EG N831AA | string | categorical |
8 | OP_CARRIER_FL_NUM | The flight number of the reporting airline. EG 5574 | string | categorical |
9 | Origin | The IATA three-letter airport code identifying the station of origin for the flight. EG SYD | string | categorical |
10 | ORIGIN_CITY_NAME | City, ST. formatted city name of the origin airport | string | categorical |
11 | DEST | The IATA three-letter airport code identifying the destination station for the flight. EG SFO | string | categorical |
12 | DEST_CITY_NAME | City, ST. formatted city name of the destination airport | string | categorical |
13 | CRS_DEP_TIME | Scheduled departure time stored as an integer, 11:52 pm is 2352 | integer | categorical |
14 | DEP_TIME | Actual departure time recorded at airport and stored as a float. 7:13 pm is 1913.0 | float | categorical |
15 | DEP_DELAY | Total time in minutes measured as difference between CRS_DEP_TIME and DEP_TIME | integer | discrete |
16 | CRS_ARR_TIME | Scheduled arrival time stored as an integer, 07:52 pm is 1952 | integer | categorical |
17 | ARR_TIME | Actual arrival time recorded at airport and stored as a float. 7:13 pm is 1913.0 | float | categorical |
18 | ARR_DELAY | Total time in minutes measured as difference between CRS_ARR_TIME and ARR_TIME | integer | discrete |
19 | CANCELLED | Binary designator if the flight was canceled | binary int | categorical |
20 | CANCELLATION_CODE | Cancellation reason represented by a letter code A - G | string | categorical |
21 | Arr_Delay | Measures the number of minutes delayed. This is a focus metric that is used for the entire project. | integer | discrete |
22 | Cancelled | Describes if the flight had been canceled 0 for not-canceled, 1 for canceled. | integer | categorical |
23 | CancellationCode | Describes the type of cancellation with encoded values. | string | categorical |
24 | Diverted | Describes the diverted status of the flight. 1 was diverted and 0 not diverted | integer | categorical |
25 | Distance | Measures the total distance flown from origin to destination in miles. | float | discrete |
26 | CarrierDelay | Measures the total amount of time spent on a delay that is attributed to a controllable reason for the airline | integer | discrete |
27 | WeatherDelay | Measures the total amount of time spent on a delay that is attributed to weather/environmental reasons | float | discrete |
28 | NASDelay | Measures the total amount of time spent on a delay that is attributed to Air Traffic Control (ATC) reasons | float | discrete |
29 | SecurityDelay | Measures the total amount of time spent on a delay that is attributed to security zone issues (long lines at screening areas or delay for law enforcement activity in airport) | float | discrete |
30 | LateAircraftDelay | Measures the total amount of time spent on a delay that is attributed to the aircraft's late arrival. | float | discrete |
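Because the time columns store clock times as HHMM numbers (2352 for 11:52 pm, 1913.0 for 7:13 pm), they cannot be subtracted directly. The sketch below shows one way to convert them to minutes and recover a delay such as `DEP_DELAY` from `CRS_DEP_TIME` and `DEP_TIME`. The helper names are hypothetical, and both the midnight rollover rule (any gap of more than 12 hours is treated as crossing midnight) and the handling of 2400 as midnight are assumptions, not something taken from the BTS documentation.

```python
def hhmm_to_minutes(value):
    """Convert an HHMM clock value (e.g. 2352 or 1913.0) to minutes after midnight."""
    hours, minutes = divmod(int(value), 100)
    return (hours % 24) * 60 + minutes  # assumption: 2400 is treated as midnight


def delay_minutes(scheduled_hhmm, actual_hhmm):
    """Actual minus scheduled time in minutes; negative means early."""
    diff = hhmm_to_minutes(actual_hhmm) - hhmm_to_minutes(scheduled_hhmm)
    # Assumed heuristic: a gap larger than 12 hours means the flight crossed midnight.
    if diff < -720:
        diff += 1440
    elif diff > 720:
        diff -= 1440
    return diff


print(hhmm_to_minutes(2352))       # 1432
print(delay_minutes(2352, 10.0))   # 18 (scheduled 11:52 pm, left at 12:10 am)
print(delay_minutes(1952, 1913.0)) # -39 (39 minutes early)
```

Working in minutes also makes it easy to derive features such as hour of day from the scheduled times, should the time columns be treated as numeric rather than categorical.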