End-to-End Machine Learning (ML) Rossmann Sales Prediction

Rossmann is Germany's second-largest drug store chain. Dataset consists of historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column. Note that some stores in the dataset were temporarily closed for refurbishment, so these are not included as part of the training set.

Machine Learning Model

For this specific problem, a tree-based model was used specifically, XGBoostRegressor. This is to easily produce a working model because less preprocessing is needed.

It can easily handle the ff:

Imbalanced datasets
Mixed feature types like categorical and continuous (e.g. gender, age, number of customers, etc.)
Not sensitive to scale - standardization/normalization is not necessary
Both training and inference is relatively fast relative to deep learning (DL) models

Data Preparation

From the specifications, the following columns are the input data to the model (in production); hence, model was trained on those set of features/inputs to predict sales:

{
  "Store":1111,
  "DayOfWeek":4,
  "Date":"2014-07-10",
  "Customers":410,
  "Open":1,
  "Promo":0,
  "StateHoliday":"0",
  "SchoolHoliday":1
}

Feature Engineering

Split Date column to 2 columns: Month and Day
Label encoding on StateHoliday categorical column
- from ['0', 'a', 'b', 'c'] to [0, 1, 2, 3]

Model Training Experiments

Summary of results of training the model with different hyperparameters can be seen at the end of the Jupyter notebook (notebooks/1_x_Rossmann_Sales_Model_Creation.ipynb). So far, the best performing model with lowest RMSE, RMSPE, and does not overfit:

Params:

objective="reg:squarederror"
learning_rate=0.1
booster='gbtree'
n_estimators=1000
max_depth=7

Results:

Train RMSE: 414.9187115956893
Test RMSE: 464.0707667220559
Train RMSPE: 0.06328874878998308
Test RMSPE: 0.0700014959444069

Feature Importance

From the chart, we can see that the model predictions depend on which store, the number of customers during a specific day of the month, day of the week, month (New Year, Spring Break, Christmas, Thanksgiving, etc. months), and if there is an existing promo or not. Surprisingly, holidays whether it be school or state lightly affects the model's predictions.

End-to-End ML Pipeline

Model Deployment

The best performing model was exported to an external json file. Model was then packaged into a REST API service. Upon initialization of the service, the model was loaded using the XGBoost library for production use. To handle incoming requests, inputs are preprocessed as desribed in the Feature Engineering section before it is fed into the model to output sales predictions.

To know more, please see: src/deployments/model_service.py

Model Testing

Unit Testing

Simple checks are implemented such as:

Model accepts input data format
Model works or not
Model returns 200 status, and expected data types (etc. float, string, etc.)

To know more, please see: src/deployments/unit-test.py

Integration Testing

Another test is done once the model is deployed in a production server. The idea is to do a simple "unit testing" but with the model that is ready for production use.

To know more, please see: src/tests/integration-test/test.py

Load Testing

Load testing is essential to simulate actual production workload to the model that would be released in production. Goals of this test is to check:

If the model maintains a required SLA - latency (average, 95th Percentile), requests per second, error rate, etc.
If the model service scales as the traffic increases

To know more, please see: src/tests/load-test/locustfile.py

Usage

Prerequisites

Python
Python pip
Python requests library
BentoML

Deployment Usage

Make sure that the model is deployed in production, if not, just run build-deploy.yaml workflow. Inside the workflow run, it would output an endpoint URL that you can use. To test, you can use Postman, curl (bash), or requests (Python).

Python Example

import requests

input = {
    "Store": 1005,
    "DayOfWeek": 6,
    "Date": "2014-04-04",
    "Customers": 3000,
    "Open": 1,
    "Promo": 1,
    "StateHoliday": "0",
    "SchoolHoliday": 0
}
res = requests.post(url, headers=headers, data=json.dumps(input))
print(res)

Local Usage

(Optional) Install pre-commit by:

pip install -r requirements.txt
pre-commit install

Install BentoML v1.0+

pip install bentoml

Run Unit Test

python src/deployments/unit-test.py

<Output>:
INPUT: {
    "Store": 1005,
    "DayOfWeek": 6,
    "Date": "2014-04-04",
    "Customers": 3000,
    "Open": 1,
    "Promo": 1,
    "StateHoliday": "0",
    "SchoolHoliday": 0
}

OUTPUT: {'sales': 16643.078, 'status': 200}

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.github/workflows		.github/workflows
.vscode		.vscode
assets		assets
notebooks		notebooks
src		src
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

End-to-End Machine Learning (ML) Rossmann Sales Prediction

Machine Learning Model

Data Preparation

Feature Engineering

Model Training Experiments

End-to-End ML Pipeline

Model Deployment

Model Testing

Unit Testing

Integration Testing

Load Testing

Usage

Prerequisites

Deployment Usage

Local Usage

About

Releases 1

Packages

Languages

License

renzabergos/mlops-sales

Folders and files

Latest commit

History

Repository files navigation

End-to-End Machine Learning (ML) Rossmann Sales Prediction

Machine Learning Model

Data Preparation

Feature Engineering

Model Training Experiments

End-to-End ML Pipeline

Model Deployment

Model Testing

Unit Testing

Integration Testing

Load Testing

Usage

Prerequisites

Deployment Usage

Local Usage

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages