Airline Delay Prediction and Analysis.

Capstone project on predicting an airline delay from a point of origin.


Nicholas Van Bergen
General Assembly Data Science Immersive
Cohort 0927-Remote

Airline Delay Prediction and Analysis, Executive Summary:

Problem statement:

Anyone who has traveled by air knows that delays can be frustrating and costly to the customer.

This project will attempt to predict a delay and a delay's duration in an air travel itinerary and further attempt to ascribe a possible cost to that delay.

Hopefully, this will give consumers a bit more information about whether their trip will be 'smooth sailing' or a 'disaster' from a customer standpoint.

Solution:

Starting with 35+ million observations, I used a gradient boosting algorithm (XGBoost) and improved predictive accuracy by 13 percentage points over the null model and by 4 percentage points over the baseline model.

Results:

  • The null model is one where we would randomly guess whether there is a delay. Since the delayed and not-delayed classes were split 50/50, its accuracy is 50%.
  • The XGBoost baseline was computed using 'out of the box' hyperparameters; the metrics shown below come from predictions made with that model.
  • The XGBoost tuned model was computed with hyperparameter settings found using a grid search.

| Metric | Null Model | XGBoost Baseline Predictions | XGBoost Tuned Predictions | $\Delta$ Baseline - Tuned | $\Delta$ Null - Tuned |
|---|---|---|---|---|---|
| Accuracy | 0.50 | 0.59 | 0.63 | 0.04 | 0.13 |
| Recall | --- | 0.62 | 0.66 | 0.04 | --- |
| Precision | --- | 0.60 | 0.63 | 0.03 | --- |
| F1 | --- | 0.60 | 0.64 | 0.04 | --- |
| ROC AUC | --- | 0.59 | 0.63 | 0.04 | --- |

NOTE: The values above were obtained on the original data. Your results from these notebooks may vary because the sampled dataset in this repository is smaller.
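
For readers who want to recompute the metrics in the table from their own predictions, here is a minimal sketch in plain Python. The function name `classification_metrics` and the confusion-matrix counts are made up for illustration and are not the project's results; only the three accuracy values in `reported` are taken from the table above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy counts for illustration only (NOT the project's confusion matrix).
example = classification_metrics(tp=6, fp=2, fn=4, tn=8)

# Accuracy values reported in the results table.
reported = {"null": 0.50, "baseline": 0.59, "tuned": 0.63}
gain_over_null = round(reported["tuned"] - reported["null"], 2)
gain_over_baseline = round(reported["tuned"] - reported["baseline"], 2)
```

The gains are differences in accuracy, so they are best read as percentage points rather than relative percent improvements.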

Conclusion

The prototype MVP has been promising. Additional tuning and testing should be completed to improve model accuracy and reduce Type I errors (false positives) on both the training and testing data.

The ultimate goal of this project is an app that is viewable and usable, built on the trained model. As I imagine it, the model would need monthly maintenance as new delay data is released by BTS.

Our MVP is the first step toward such a useful, customer-friendly tool.

Folder structure:

Code

  1. Data intake & cleaning.
  2. EDA
  3. Data Modeling Training and Testing
  4. Data predictions/app output
  • An app for this prediction tool is still under development as of this writing (23 December 2021).

Data Folder

Due to virtual hosting resource limitations, the dataset in this repository is a random sample of the original all-flights file noted in the notebook. You can still use this smaller subsample to work through the notebooks end to end and reproduce the analysis.

Please use the sampled_all_flights.csv file to proceed through the notebooks.

Presentation Folder

A slide deck showing the results of this research is located here. A collection of un-annotated charts and figures is located here as well.


Summary of sources and data dictionary

Data Sources

All data were obtained by downloading CSV files from the Bureau of Transportation Statistics (BTS). Follow the link above for access to the data portal.

This section has two parts, the description and the dictionary.

  • The Description talks about the dataset in general terms while the dictionary provides a macro-level understanding of the data types and organization of the data in our dataframe.

Data Description

Having obtained 65 complete months of daily flights in the North American air system, I compiled the baseline table by joining all 65 data tables together, for more than 34,409,230 rows of labeled data.

The data was then split into delayed and not-delayed flights and randomly sampled to generate an even split between the two classes. This was still far too many rows, so the sample was saved and resampled to something more manageable.

This setup allows us to conduct a binary classification task: will a flight be delayed or not?
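
The even split described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the `balanced_sample` helper, the `DELAYED` label column, and the rule "delayed means `ARR_DELAY` greater than 0" are all hypothetical.

```python
import random


def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Randomly draw the same number of rows from each class."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    # Never ask for more rows than the smallest class can supply.
    k = min(n_per_class, min(len(members) for members in by_class.values()))
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, k))
    rng.shuffle(sample)  # mix the classes so they are not grouped together
    return sample


# Toy arrival delays in minutes; DELAYED is an assumed 0/1 label (delay > 0).
delays = [-5, 12, 0, 30, -2, 45, -10, 3, -1, -7]
flights = [{"ARR_DELAY": d, "DELAYED": int(d > 0)} for d in delays]
balanced = balanced_sample(flights, "DELAYED", n_per_class=4)
```

With the full dataset, the same idea would more likely be expressed with a dataframe library, but the logic is the same: group by label, sample equally from each group, then shuffle.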

Data Dictionary

Below is a table of each column in the dataset.

| No. | Column | Description | Data type | Variable type |
|---|---|---|---|---|
| 1 | YEAR | The year of the flight, in YYYY format | integer | categorical |
| 2 | MONTH | The number representation of the month, MM | integer | categorical |
| 3 | DAY_OF_MONTH | The day of the month, in dd format | integer | categorical |
| 4 | DAY_OF_WEEK | A number representation of the day of the week: Monday = 1, Sunday = 7, Unknown = 9 | integer | categorical |
| 5 | FL_DATE | Full recorded date of the flight, yyyy-mm-dd | string | categorical |
| 6 | OP_UNIQUE_CARRIER | Reporting airline by two-letter designator, e.g., AA = American Airlines | string | categorical |
| 7 | Tail_Number | The identification number of the aircraft used for the flight, e.g., N831AA | string | categorical |
| 8 | OP_CARRIER_FL_NUM | The flight number of the reporting airline, e.g., 5574 | string | categorical |
| 9 | Origin | The IATA three-letter airport code identifying the station of origin for the flight, e.g., SYD | string | categorical |
| 10 | ORIGIN_CITY_NAME | City name of the origin airport, formatted as City, ST | string | categorical |
| 11 | DEST | The IATA three-letter airport code identifying the destination station for the flight, e.g., SFO | string | categorical |
| 12 | DEST_CITY_NAME | City name of the destination airport, formatted as City, ST | string | categorical |
| 13 | CRS_DEP_TIME | Scheduled departure time stored as an integer; 11:52 pm is 2352 | integer | categorical |
| 14 | DEP_TIME | Actual departure time recorded at the airport and stored as a float; 7:13 pm is 1913.0 | float | categorical |
| 15 | DEP_DELAY | Total time in minutes, measured as the difference between CRS_DEP_TIME and DEP_TIME | integer | discrete |
| 16 | CRS_ARR_TIME | Scheduled arrival time stored as an integer; 07:52 pm is 1952 | integer | categorical |
| 17 | ARR_TIME | Actual arrival time recorded at the airport and stored as a float; 7:13 pm is 1913.0 | float | categorical |
| 18 | ARR_DELAY | Total time in minutes, measured as the difference between CRS_ARR_TIME and ARR_TIME | integer | discrete |
| 19 | CANCELLED | Binary designator of whether the flight was canceled | binary int | categorical |
| 20 | CANCELLATION_CODE | Cancellation reason, represented by a letter code A - G | string | categorical |
| 21 | Arr_Delay | Measures the number of minutes delayed. This is a focus metric that is used for the entire project. | integer | discrete |
| 22 | Cancelled | Describes whether the flight was canceled: 0 for not canceled, 1 for canceled | integer | categorical |
| 23 | CancellationCode | Describes the type of cancellation with encoded values | string | categorical |
| 24 | Diverted | Describes the diverted status of the flight: 1 for diverted, 0 for not diverted | integer | categorical |
| 25 | Distance | Measures the total distance flown from origin to destination, in miles | float | discrete |
| 26 | CarrierDelay | Measures the total amount of delay time attributed to a reason the airline can control | integer | discrete |
| 27 | WeatherDelay | Measures the total amount of delay time attributed to weather/environmental reasons | float | discrete |
| 28 | NASDelay | Measures the total amount of delay time attributed to Air Traffic Control (ATC) reasons | float | discrete |
| 29 | SecurityDelay | Measures the total amount of delay time attributed to security zone issues (long lines at screening areas or delays for law enforcement activity in the airport) | float | discrete |
| 30 | LateAircraftDelay | Measures the total amount of delay time attributed to the aircraft's late arrival | float | discrete |
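
The time columns above pack hours and minutes into a single number (2352 for 11:52 pm, 1913.0 for 7:13 pm), so they need decoding before any arithmetic. Below is a minimal sketch of one way to do that. The helper names `hhmm_to_minutes` and `delay_minutes` are hypothetical, and the midnight handling assumes the scheduled and actual times are never more than 12 hours apart, which is an assumption of this sketch rather than something the dataset guarantees.

```python
def hhmm_to_minutes(value):
    """Convert an hhmm-encoded time (e.g. 2352 or 1913.0) to minutes after midnight."""
    hours, minutes = divmod(int(value), 100)
    valid = 0 <= hours <= 24 and 0 <= minutes < 60 and not (hours == 24 and minutes)
    if not valid:
        raise ValueError(f"not a valid hhmm time: {value!r}")
    # Treat 2400 as midnight (00:00).
    return (hours % 24) * 60 + minutes


def delay_minutes(scheduled, actual):
    """Signed difference (actual - scheduled) in minutes, wrapped across midnight."""
    diff = hhmm_to_minutes(actual) - hhmm_to_minutes(scheduled)
    # Wrap into [-720, 720) so 23:52 -> 00:15 counts as +23, not -1417.
    return (diff + 720) % 1440 - 720


scheduled_dep = hhmm_to_minutes(2352)    # 11:52 pm
actual_dep = hhmm_to_minutes(1913.0)     # 7:13 pm
late_past_midnight = delay_minutes(2352, 15)
early_arrival = delay_minutes(1952, 1913.0)
```

Missing values (for example, a float NaN in `DEP_TIME` for a canceled flight) would raise a `ValueError` in `int(value)`, so rows like that should be filtered out or handled before calling these helpers.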