This project predicts, one week in advance, how long U.S. commercial flights will be delayed. The data was extracted from a PostgreSQL database, and both regression and classification models were applied. The steps taken in the project are outlined below. All relevant files can be found in this repo.
- Connect to the Postgres database via JDBC (SQL Workbench)
  - Find the train and test flights data
  - Randomly extract 10,000 rows of flights train data (excluding rows with NaN values in identifying columns)
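The sampling step can be sketched in pandas as follows. This is a minimal illustration that assumes the train table has already been loaded into a DataFrame; the helper `sample_clean_rows` and the column names `fl_id`, `origin_id`, and `dep_delay` are hypothetical and not taken from the project.

```python
import numpy as np
import pandas as pd


def sample_clean_rows(df, id_cols, n=10000, seed=0):
    """Drop rows with NaN in any identifying column, then sample up to n rows."""
    clean = df.dropna(subset=id_cols)
    # Cap n so that small tables do not raise an error.
    return clean.sample(n=min(n, len(clean)), random_state=seed)


# Toy stand-in for the flights train table (hypothetical columns).
flights = pd.DataFrame({
    "fl_id": [1, 2, np.nan, 4, 5, 6],
    "origin_id": [10, np.nan, 30, 40, 50, 60],
    "dep_delay": [5.0, 12.0, -3.0, np.nan, 0.0, 45.0],
})
sample = sample_clean_rows(flights, ["fl_id", "origin_id"], n=3)
print(sample)
```

Fixing `random_state` keeps the extract reproducible between runs. NaN values in non-identifying columns such as the delay are left in place for later handling.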
- Save the data to a .txt file and use pandas in Python to display it as a DataFrame
  - exploratory_analysis.ipynb
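As a rough sketch of the loading step, the snippet below reads delimited text into a DataFrame and prints a first overview. The comma delimiter and the column names are assumptions; the actual export format from SQL Workbench is not stated here and may differ.

```python
import io

import pandas as pd

# Stand-in for the saved export; a real run would pass the .txt path instead.
raw = io.StringIO(
    "fl_date,origin,dep_delay\n"
    "2019-01-01,JFK,12\n"
    "2019-01-02,LAX,\n"
)

# sep="," is an assumption; the export may use tabs or another delimiter.
df = pd.read_csv(raw, sep=",", parse_dates=["fl_date"])

print(df.head())         # first rows
print(df.dtypes)         # column types as inferred by pandas
print(df.isna().mean())  # fraction of missing values per column
```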
- Decide which features to keep for the initial attempt
  - Drop columns containing a high percentage of NaN values
  - Drop identifying columns (e.g. flight ID, location ID)
  - Check the value counts in each column; if they are either too high or too low, drop the column
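One possible way to express these three rules in pandas is shown below. The thresholds (`max_nan_frac`, `min_unique`, `max_unique_frac`), the helper name `select_features`, and the toy columns are all illustrative assumptions. In particular, "too high or too low" is interpreted here as the number of distinct values in a column, which may not match the criterion actually used.

```python
import numpy as np
import pandas as pd


def select_features(df, id_cols, max_nan_frac=0.5, min_unique=2, max_unique_frac=0.9):
    """Return df without identifying, mostly-missing, or uninformative columns."""
    out = df.drop(columns=[c for c in id_cols if c in df.columns])
    # Drop columns whose share of NaN values exceeds the threshold.
    out = out.loc[:, out.isna().mean() <= max_nan_frac]
    keep = []
    for col in out.columns:
        n_unique = out[col].nunique(dropna=True)
        too_few = n_unique < min_unique  # constant column
        # Only non-numeric columns are checked for near-unique values.
        too_many = (not pd.api.types.is_numeric_dtype(out[col])
                    and n_unique > max_unique_frac * len(out))
        if not (too_few or too_many):
            keep.append(col)
    return out[keep]


df = pd.DataFrame({
    "fl_id": [1, 2, 3, 4],
    "mostly_nan": [np.nan, np.nan, np.nan, 1.0],
    "constant": ["US", "US", "US", "US"],
    "tail_num": ["N1", "N2", "N3", "N4"],
    "distance": [300.0, 1200.0, 800.0, 450.0],
})
reduced = select_features(df, id_cols=["fl_id"])
print(list(reduced.columns))
```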
- Pre-process data for the initial attempt
  - Drop columns containing non-numerical data (these will be revisited later with one-hot encoding)
  - Scale the remaining numerical features using fit_transform from sklearn
  - Split the training data into sub-training and sub-testing sets
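The pre-processing steps can be sketched as below, on synthetic data. `StandardScaler` is an assumption, since only "scaling" is specified, and the column names are hypothetical. Note that this sketch splits before scaling and fits the scaler on the sub-training data only, a common way to keep sub-testing statistics out of the fitted scaler; it is a design choice of the sketch rather than a description of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "distance": rng.uniform(100, 3000, 200),
    "crs_elapsed_time": rng.uniform(40, 400, 200),
    "carrier": rng.choice(["AA", "DL", "UA"], 200),  # non-numeric, dropped for now
    "arr_delay": rng.normal(10, 30, 200),            # target
})

y = df["arr_delay"]
X = df.drop(columns="arr_delay").select_dtypes(include="number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on sub-training data only
X_test_s = scaler.transform(X_test)        # reuse the training statistics
```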
- Perform an initial regression attempt with the reduced features and sub-training data
  - modeling.ipynb
  - Check that the R-squared value is > 0 (better than always predicting the mean)
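A baseline of this kind might look like the following sketch, which fits an ordinary linear regression to synthetic data and reads off the held-out R-squared. The data and coefficients are made up for illustration; only the sanity check itself comes from the step above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Synthetic target with a linear signal plus noise.
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=5.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R-squared on held-out data
print(f"R-squared: {r2:.3f}")
```

An R-squared of 0 corresponds to a model that always predicts the mean of the target, so a value above 0 only shows that the features carry some signal.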
- Increase model complexity using other regression methods
  - Polynomial
  - Logistic (a classifier despite its name; applies only to the classification variant of the problem)
  - Ridge
  - Lasso
  - SVR
  - Random forest
  - Gradient boosting
  - XGBoost
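Several of the listed methods can be compared in a single loop, as in the sketch below. It runs on synthetic data with arbitrarily chosen hyperparameters. XGBoost is left out because it requires the separate `xgboost` package, and logistic regression is left out because it is a classifier.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Synthetic target with one linear and one quadratic term.
y = 4.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "polynomial (deg 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "svr": SVR(C=10.0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # held-out R-squared

for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: R-squared = {r2:.3f}")
```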
- Choose the best model based on regression metrics
  - R-squared
  - Root mean squared error
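For reference, both metrics can be computed by hand. With true values $y_i$, predictions $\hat{y}_i$, and mean $\bar{y}$, R-squared is $1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$ and RMSE is $\sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The sketch below implements these definitions in plain Python; the sample values are made up.

```python
import math


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; 0 matches always predicting the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))
```

R-squared is unitless and convenient for comparing models, while RMSE is expressed in the units of the target, which makes the typical size of an error easier to interpret.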
- Integrate a weather API based on location
  - Departure
  - Arrival
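The weather API itself is not named here, so the sketch below skips the request and only illustrates how already-fetched weather records could be joined onto flights for both ends of a trip. The schemas, column names, and the helper `add_weather` are hypothetical.

```python
import pandas as pd

flights = pd.DataFrame({
    "fl_date": ["2019-01-01", "2019-01-01"],
    "origin": ["JFK", "LAX"],
    "dest": ["LAX", "ORD"],
})
# Stand-in for responses already fetched from a weather API (hypothetical schema).
weather = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-01"],
    "airport": ["JFK", "LAX", "ORD"],
    "condition": ["snow", "clear", "rain"],
})


def add_weather(df, weather, airport_col, prefix):
    """Left-join weather onto flights for one end of the trip."""
    w = weather.rename(columns={
        "date": "fl_date",
        "airport": airport_col,
        "condition": f"{prefix}_condition",
    })
    return df.merge(w, on=["fl_date", airport_col], how="left")


flights = add_weather(flights, weather, "origin", "dep")  # departure weather
flights = add_weather(flights, weather, "dest", "arr")    # arrival weather
print(flights)
```

A left join keeps every flight even when no weather record is found, leaving NaN in the weather columns for those rows.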
- Tune the hyperparameters of the best model with GridSearchCV()
  - Automatically perform k-fold cross-validation
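A minimal grid search could look like the sketch below. The choice of a random forest, the parameter grid, and the synthetic data are assumptions for illustration, since the selected model and its search space are not given here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=200)

param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation for every parameter combination
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
best_model = search.best_estimator_  # refit on all of X, y by default
```

Each of the four parameter combinations is fitted five times, once per fold, and `best_score_` is the mean cross-validated R-squared of the winning combination.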
- Time permitting only:
  - Stretch material (classification problems)
  - Historical data as predictors
  - Fuel consumption
  - Passengers
- Perform predictions on the flights test data
  - Use the selected model with the parameters optimized via grid search
- Transfer code blocks to .py files and generate the final CSV file
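The last two steps, predicting on the test flights and writing the CSV, can be sketched as follows. The `Ridge` model stands in for whichever tuned model is selected, and the column names `fl_id`, `distance`, and `predicted_delay` are hypothetical; the required layout of the final file is not specified here.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
train = pd.DataFrame({"distance": rng.uniform(100, 3000, 100)})
train["arr_delay"] = 0.01 * train["distance"] + rng.normal(scale=5.0, size=100)
test = pd.DataFrame({
    "fl_id": [101, 102, 103],
    "distance": [500.0, 1500.0, 2500.0],
})

# Stand-in for the tuned model chosen in the earlier steps.
model = Ridge().fit(train[["distance"]], train["arr_delay"])

submission = pd.DataFrame({
    "fl_id": test["fl_id"],
    "predicted_delay": model.predict(test[["distance"]]),
})

buffer = io.StringIO()  # a real run would pass a file path instead
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The test features must pass through the same column selection and scaling as the training data before `predict` is called, otherwise the model receives inputs on a different scale than it was fitted on.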