This project aims to predict future flight delays using Apache Spark for big data processing and Facebook Prophet for time-series forecasting. By analyzing large datasets from the U.S. Bureau of Transportation Statistics, I uncover patterns and trends that contribute to delays, helping airlines optimize operations and improve customer experiences.
Key features include:
- Efficient data loading and preprocessing with Spark.
- In-depth exploratory data analysis (EDA).
- Predictive modeling using Prophet with custom regressors.
- Model fine-tuning with cross-validation to minimize Mean Squared Error (MSE) and Mean Absolute Error (MAE).

Technologies used:
- Apache Spark
- Python 3.x
- Matplotlib
- NumPy
- Seaborn
- Pandas (in Spark)
- Facebook Prophet
- Cross Validation (Scikit-learn)

The dataset contains flight delay records, including flight times, delay durations, and related metrics used in the analysis.

Workflow:
- Extracted the dataset from the U.S. Bureau of Transportation Statistics.
- Created a function using Spark to load the data efficiently.
- Conducted EDA using Spark to understand the dataset, identifying key trends and distributions.
- Filtered and prepared the dataset for the Facebook Prophet model.
- Split the dataset into training and testing sets using Spark.
- Applied Facebook Prophet with specified parameters to predict future flight delays.
- Enhanced the model by adding regressors that may influence flight delays.
- Used Scikit-learn to evaluate model performance with Mean Squared Error (MSE) and Mean Absolute Error (MAE).
- Visualized the forecast predictions to better understand the model outputs.
- Employed cross-validation and parameter tuning to optimize the model for the lowest MSE and MAE.
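
The modeling and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the `prophet` package (older releases ship as `fbprophet`) and training data already converted to a pandas DataFrame with Prophet's `ds` and `y` columns. The function names, the regressor column names passed in, and the 30-day horizon are all hypothetical. The two metric helpers are dependency-free stand-ins that show what Scikit-learn's `mean_squared_error` and `mean_absolute_error` compute.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Squared Error: average squared residual (penalizes large misses)."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average absolute residual, in the units of y."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def forecast_delays(train_df, future_df, regressors):
    """Fit Prophet with extra regressors and predict for the rows of future_df.

    train_df: pandas DataFrame with `ds` (date), `y` (delay), and one column
              per regressor name.
    future_df: pandas DataFrame with `ds` and the same regressor columns.
    regressors: hypothetical column names, e.g. ["num_flights"].
    """
    # Imported lazily so the metric helpers work without Prophet installed.
    from prophet import Prophet

    model = Prophet()
    for name in regressors:
        model.add_regressor(name)  # must be registered before fit()
    model.fit(train_df)
    return model, model.predict(future_df)


def cross_validated_errors(model, horizon="30 days"):
    """Run Prophet's rolling-origin cross-validation on a fitted model.

    The horizon is an assumed value; Prophet derives the initial training
    window and the cutoff spacing from it when they are not given.
    """
    from prophet.diagnostics import cross_validation, performance_metrics

    df_cv = cross_validation(model, horizon=horizon)
    return performance_metrics(df_cv)[["horizon", "mse", "mae"]]
```

For example, `mse([10, 20, 30, 40], [12, 18, 33, 40])` is 4.25 and `mae` on the same inputs is 1.75. Comparing the two is useful during tuning: MSE much larger than the square of MAE suggests a few large misses dominate the error.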