Sales Time Series Forecasting Using Machine Learning Techniques (Random Forest, XGBoost, Stacked Ensemble Regressor)
Developed as a group project for the program "Business Analytics: Operational Research and Risk Analysis" at the Alliance Manchester Business School.
The Python code for this project can be viewed through Jupyter nbviewer via this link.
The objective of this project is to build a predictive model to forecast 6 weeks of daily sales for 1,115 drug stores in Europe.
Key steps of the project:
- Exploratory Data Analysis (EDA)
- Datetime object preprocessing
- Time Series K-Means clustering using Dynamic Time Warping (to effectively capture curve similarity across time)
- Generic Preprocessing and Feature Engineering
- Cluster-specific EDA
- Variance Inflation Factor (VIF) Backwards Feature Selection (per cluster)
- Development of Naive Models based on historical sales data (day of week, day of month, day of year)
- Introduction of historical sales proxy features (weekly- and monthly-based)
- Development of three sets of ML models per cluster (no proxy, weekly proxy, monthly proxy)
- Visualizations of sales predictions for randomly selected stores of each cluster
The ML models used are:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost Regressor
- Stacked Ensemble model using Random Forest and XGBoost as weak learners and Linear Regression as the meta-learner
The models were evaluated based on the Root Mean Square Percentage Error (RMSPE) and R-squared metrics for the training and validation sets. However, RMSPE was primarily used to measure the performance of each model, since R-squared may not adequately capture the goodness of fit for non-linear relationships.
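The RMSPE follows its standard definition, reconstructed here to be consistent with the symbol descriptions below:

$$\mathrm{RMSPE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$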
where N is the total number of data records used for accuracy measurement, yi is the actual sales for the i-th record, and ŷi is the sales forecast for the i-th record. Records with zero sales were excluded from the calculation to avoid division by zero.
- Editor Used: JupyterLab
- Python Version: Python 3.10.9
- General Purpose: statsmodels, scipy, time, math
- Data Manipulation: pandas, numpy
- Data Visualization: seaborn, matplotlib
- Machine Learning: scikit-learn, tslearn
- `stores.csv`: contains supplementary information for the 1,115 drug stores (1,115 rows, 10 columns)
- `train.csv`: contains the historical sales data, which covers sales from 01/01/2013 to 31/07/2015 (1,017,209 rows, 9 columns)
- `test.csv`: identical to `train.csv`, except that Sales and Customers are unknown for the period of 01/08/2015 to 17/09/2015 (41,088 rows, 9 columns)
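As an illustrative sketch, the files can be loaded and joined as follows (the `Store` key and `Date` column are assumptions based on the typical layout of this dataset):

```python
import pandas as pd

# Load the three raw files (paths are illustrative).
stores = pd.read_csv("stores.csv")
train = pd.read_csv("train.csv", parse_dates=["Date"])
test = pd.read_csv("test.csv", parse_dates=["Date"])

# Assumed: both sales files share a `Store` key with stores.csv, so
# store-level attributes can be joined onto each daily sales record.
store_train = train.merge(stores, on="Store", how="left")
store_test = test.merge(stores, on="Store", how="left")
```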
The overall project design was based on the Cross Industry Standard Process for Data Mining (CRISP-DM), which consists of Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment.
The data pre-processing pipeline is illustrated in Figure 1. Initially, `store_train` was split into training and validation data sets. A preliminary round of data transformation and feature engineering based on the `store_train` set was then applied across `store_train`, `store_validate`, and `store_test`.
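The split criterion is not specified in this summary; a minimal sketch of one plausible chronological holdout, reusing `store_train` from the loading sketch above and assuming the last six weeks of the training period are held out as validation (mirroring the test horizon), could look like:

```python
import pandas as pd

# Hypothetical cutoff: hold out the final six weeks of the training period as
# the validation set; the actual split used in the project may differ.
cutoff = store_train["Date"].max() - pd.Timedelta(weeks=6)

train_part = store_train[store_train["Date"] <= cutoff]
store_validate = store_train[store_train["Date"] > cutoff]
```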
With the pre-processed dataset in hand, clustering was performed to separate the dataset into several clusters for model partitioning. Given the cyclical nature of sales, Time Series K-Means clustering was employed using Dynamic Time Warping (DTW) as the distance metric to effectively capture curve similarity across time. To remove the effect of differences in the magnitude of sales across stores, the store-specific historical sales data was first transformed with TimeSeriesScalerMeanVariance() to standardize sales on a store-by-store basis, so that each store's series has zero mean and unit variance.
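A minimal sketch of this clustering step with tslearn (the array shape, synthetic values, and the pivot of daily sales into one series per store are illustrative assumptions):

```python
import numpy as np
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.clustering import TimeSeriesKMeans

# Assumed input: one daily-sales series per store, pivoted from the training
# data into an array of shape (n_stores, n_days); synthetic values used here.
rng = np.random.default_rng(0)
sales_per_store = rng.poisson(lam=5000, size=(50, 180)).astype(float)

# Standardize each store's series to zero mean and unit variance so that
# clustering reflects the shape of the sales curve rather than its magnitude.
scaled = TimeSeriesScalerMeanVariance(mu=0.0, std=1.0).fit_transform(sales_per_store)

# Time Series K-Means with Dynamic Time Warping (DTW) as the distance metric;
# four clusters mirror the A/B/C/D grouping reported in the results.
km = TimeSeriesKMeans(n_clusters=4, metric="dtw", random_state=0)
cluster_labels = km.fit_predict(scaled)
```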
Additional cluster-specific EDA was then performed on the clustered dataset, aiding in handling missing values and making feature engineering decisions for each individual cluster. The final clustered training datasets were checked for multicollinearity using the Variance Inflation Factor (VIF). The feature with the highest VIF was iteratively removed until all remaining features had a VIF below five. This resulted in the final training, validation, and testing datasets for each cluster.
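A minimal sketch of this VIF-based backward elimination, assuming a numeric (already encoded) feature matrix for each cluster; the helper name is illustrative:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_backward_select(features: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all VIFs fall below the threshold."""
    selected = features.copy()
    while True:
        vif = pd.Series(
            [variance_inflation_factor(selected.values, i) for i in range(selected.shape[1])],
            index=selected.columns,
        )
        if vif.max() < threshold:
            return selected
        # Remove the most collinear feature and re-check.
        selected = selected.drop(columns=[vif.idxmax()])
```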
The modelling pipeline is shown in Figure 2, with a separate set of models developed for each cluster. Naïve models were first developed as a benchmark for the ML models. For open stores, the naïve prediction was either the historical weekly average or the monthly average of that specific store, while the prediction for closed stores was 0. After establishing the benchmark results, five selected ML models were developed with and without the sales_proxy from the naïve models, and further validation was performed. Stacking was then applied to the most robust models to address potential overfitting. To prevent potential data leakage from the sales_proxy and one-hot-encoded (day-of-month) variables, validation was conducted exclusively on the validation set rather than through cross-validation. Finally, the permutation importance method was used to extract the feature importance of the final models and provide business recommendations.
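A minimal sketch of the stacked ensemble with scikit-learn and xgboost (hyperparameters and variable names are illustrative, and the project's exact stacking procedure may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

def rmspe(y_true, y_pred):
    """Root Mean Square Percentage Error, excluding zero-sales records."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))

# Random Forest and XGBoost as base ("weak") learners,
# Linear Regression as the meta-learner, per the project design.
stacked_model = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
        ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42)),
    ],
    final_estimator=LinearRegression(),
)

# Fit on a cluster's training set and evaluate on its hold-out validation set:
# stacked_model.fit(X_train_cluster, y_train_cluster)
# print(rmspe(y_validate_cluster, stacked_model.predict(X_validate_cluster)))
```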
The clustering analysis resulted in 4 clusters of stores (A, B, C, D), and the final features used for each cluster-specific model after VIF backward selection are shown below:
The basic ML models without the sales_proxy outperformed the naïve ones, confirming that ML is an effective technique for this problem. However, the more advanced ML models with the sales_proxy feature performed better still, and the monthly_proxy models achieved the lowest RMSPE scores without overfitting. Within the monthly_proxy models, Random Forest and XGBoost proved to be the most powerful, so they were also stacked as weak learners in the ensemble stacking method (meta-learner: Linear Regression). The results show that stacking was successful: in most clusters, the gap between training and validation error decreased, indicating reduced overfitting. The best-performing models selected for each cluster are depicted in the following table:
The tables above illustrate the permutation feature importance across the different clusters. First, the models reveal a strong seasonality pattern. The monthly sales_proxy successfully captured a significant amount of the monthly patterns and is the most important feature in terms of permutation importance. The day of the week (dow) was also a significant indicator for predicting sales. A trend in the historical data was that sales peaked in December, on Mondays and Sundays, and additionally at the beginning, middle, and end of each month. It was also found that school holidays affect sales more than state holidays. Other factors that influenced sales were promotions, with individual store promotions (Promo) appearing to be more effective than coupon-based mailing campaigns (Promo2). The distance to competitors (CompetitionDistance) also showed some significance within clusters A and C.
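A minimal sketch of how such importances can be extracted with scikit-learn's permutation_importance, assuming a fitted `stacked_model` and cluster validation data as in the stacking sketch above (the scorer choice is illustrative):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Assumes `stacked_model` has already been fit on the cluster's training data
# and that X_validate_cluster is a DataFrame with named feature columns.
result = permutation_importance(
    stacked_model,
    X_validate_cluster,
    y_validate_cluster,
    scoring="neg_mean_squared_error",
    n_repeats=10,
    random_state=42,
)

# Rank features by the mean drop in score when each feature is shuffled.
importances = pd.Series(result.importances_mean, index=X_validate_cluster.columns)
print(importances.sort_values(ascending=False))
```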
The predictive models were based on some assumptions:
- Monday and the first day of the month were assumed as start dates when computing Promo2Duration and CompetitiveLength
- Seasonality in sales motivated the creation of time-based information features
- Missing values in Open were imputed using the mode of other stores' Open status on the same date within the cluster, as sketched below
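A minimal sketch of the last imputation step (the `Cluster`, `Date`, and `Open` column names, and the fallback when a whole group is missing, are assumptions):

```python
import pandas as pd

def impute_open(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing `Open` values with the modal value of other stores
    sharing the same cluster and date (assumed column names)."""
    df = df.copy()
    # Per (cluster, date) mode of the Open flag; fall back to 1 (open) if the
    # whole group is missing -- an additional assumption for illustration.
    date_mode = (
        df.groupby(["Cluster", "Date"])["Open"]
        .transform(lambda s: s.mode().iloc[0] if not s.mode().empty else 1)
    )
    df["Open"] = df["Open"].fillna(date_mode)
    return df
```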
The last assumption introduces a minor limitation: the model may not be accurate when predicting sales on days when stores are closed, because it will not predict exactly zero sales. This should be unproblematic, as predicting sales when stores are open is far more valuable. Several improvements can be investigated in the future. First, the underlying patterns of when stores are open could be examined to impute missing records more pragmatically. Second, a more systematic approach could be used to evaluate the clustering results. Third, a model using cross-validation with hyperparameter tuning instead of the sales_proxy features could be developed for comparison. Lastly, since the sales proxy may introduce multicollinearity, it could reduce the interpretability of the model; with more data available and a sufficiently predictive model, the sales proxy could be removed in the future.