- Introduction
- Problem Statement
- Directory Layout
- Data
- Setup
- Notebooks
- Workflow Orchestration & Training Pipeline
- Streaming
- Terraform
- GitHub Actions for CI/CD
This project is part of the DataTalksClub/mlops-zoomcamp course, an initiative focused on integrating the principles of MLOps with real-world applications.
The xGoals MLOps project, inspired by the pioneering work of the Bundesliga Match Facts initiative and AWS's advanced machine learning techniques, aims to delve deeper into the realm of Expected Goals (xG). This metric, known as xGoals, quantifies the probability of a shot resulting in a goal based on various factors, offering a data-driven perspective to the age-old debate: "What are the odds of that shot finding the back of the net?"
A pivotal aspect of this project is its foundation on a simplified xGoals model, inspired by the Soccermatics course. This course, available on GitHub, offers a deep dive into the nuances of xGoals, and our project leverages its teachings to create a mathematical model tailored for modern football analytics.
But this project is not just about creating an xGoals model. It's about building an end-to-end machine learning pipeline, emphasizing scalability, reproducibility, and maintainability. Leveraging the principles of MLOps, the xGoals project ensures that the journey from data ingestion to model deployment is seamless, efficient, and robust.
The primary objective is to construct an end-to-end machine learning solution for the xG metric. This solution would:
- Predict the likelihood of a shot leading to a goal.
- Provide insights beyond just the scoreline.
- Help predict future goals more accurately than past goal tallies do.
- Guide players and coaches in their decision-making processes.
- Act as a foundational layer for more advanced football data models.
xGoals-mlops/
│
├── .github/
│ └── workflows/ # CI/CD workflow files for GitHub Actions.
│
├── infrastructure/ # Infrastructure-related files and configurations.
│
├── integration-test/ # Scripts and configurations for integration testing.
│
├── scripts/ # Utility scripts for miscellaneous tasks.
│
├── src/
│ └── pipeline/ # Source code related to the ML training pipeline.
│
├── tests/ # Scripts for unit and other tests.
│
├── .gitignore # List of files and directories ignored by Git.
├── Dockerfile # Docker container definition for the project.
├── Makefile # Commands for task automation.
├── Pipfile # Package dependencies specified by Pipenv.
├── Pipfile.lock # Dependency lock file generated by Pipenv.
├── README.md # Project overview and documentation.
├── config.env # Environment variables for the project.
├── config.json # Configuration file containing model parameters and other settings.
├── lambda_function.py # AWS Lambda function script for serverless deployment.
├── model.py # Script containing the machine learning model and related functions.
└── pyproject.toml # Configuration file for Python projects.
The primary source of data for this project is the Soccer match event dataset, a significant contribution to the field of soccer analytics.
This dataset contains event data for various tournaments, including the European Championship and the World Cup, allowing for a detailed analysis of shots and their likelihood of resulting in goals.
Soccer analytics has gained immense traction in both academia and the industry, especially with the advent of sensing technologies that offer high-fidelity data streams from every match. However, a significant challenge has been the limited public availability of such detailed data, as they are predominantly owned by specialized companies. Addressing this gap, the Soccer match event dataset, collected by Wyscout, offers the largest open collection of soccer logs. It encompasses all the spatiotemporal events, such as passes, shots, fouls, and more, from every match of an entire season across seven major competitions: La Liga, Serie A, Bundesliga, Premier League, Ligue 1, FIFA World Cup 2018, and UEFA Euro Cup 2016.
Each match event in the dataset provides insights into its position, time, outcome, involved player, and specific characteristics. This dataset has not only been pivotal for the Soccer Data Challenge but is also recognized as the most extensive public collection of soccer logs.
For our project, we specifically utilize two JSON files from this collection, chosen for their relatively smaller size, ensuring efficient processing. The direct links to these files are conveniently available in the project's config.json file. For ease of access, these files are uploaded to an S3 Bucket, with links that remain active for a maximum of 7 days due to the presigned URL's expiration constraints.
The data ingestion mechanism in our pipeline is designed to fetch data from the provided URLs and store it in a designated directory. If you encounter issues accessing the data, it's likely that the presigned URL has expired. In such cases, you'll need to:
- Download the original data from the primary source.
- Upload it to an S3 Bucket.
- Replace the expired presigned URLs in config.json with the new ones.
It's worth noting that while we've chosen the two smallest JSON files for this project, incorporating more data can enhance the model's accuracy and predictive power. Ideally, for a more robust setup, permissions would be set on the S3 Bucket where the data resides, or alternative data ingestion methods would be employed.
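The ingestion step described above can be sketched roughly as follows. This is a minimal illustration and not the project's data_ingestion.py: the `data_urls` key and the `load_urls` and `download_files` helpers are hypothetical names, and the real config.json layout may differ. The sketch uses only the standard library and derives each filename from the URL path, so the query string of a presigned URL is ignored.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def load_urls(config_path, key="data_urls"):
    """Read the list of download URLs from a JSON config file.

    The key name is an assumption made for this sketch.
    """
    with open(config_path, encoding="utf-8") as fh:
        return json.load(fh)[key]


def download_files(urls, dest_dir="data/raw"):
    """Download each URL into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use only the path component so presigned query parameters are dropped.
        filename = Path(urlparse(url).path).name
        target = dest / filename
        # An expired presigned URL typically surfaces here as an HTTP 403 error.
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
        saved.append(target)
    return saved
```

If the download fails with an HTTP error, regenerating the presigned URLs as described in the steps above is the first thing to try.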
I used a t2.large Amazon EC2 (Elastic Compute Cloud) instance for development, with Python 3.9 and an RDS database for MLflow. You will also need Anaconda, Docker, and Docker Compose. Basic instructions follow. Also configure your AWS credentials and your GitHub account credentials (from the GitHub settings).
- Clone the repository on the EC2 instance and navigate to the project directory:
git clone https://github.com/dimzachar/xGoals-mlops.git
cd xGoals-mlops
You will need to open ports 22 (SSH), 5000 (MLflow), 3000 (Grafana), 8081 (Adminer), 8080, and 5432 (PostgreSQL), and create an S3 bucket for the MLflow artifacts. Make sure the AWS CLI is configured as well.
- Environment Dependencies
Set up the environment by installing the dependencies with Pipenv:
pip install --upgrade pip
pip install pipenv
pipenv install --dev
pipenv shell
- Pre-commit Hooks
Pre-commit hooks are scripts that are executed automatically before a commit is made to the repository. They can be used to enforce coding standards, run tests, or perform any other checks to ensure the quality of the code. By using pre-commit hooks, you can ensure that only code that meets your defined standards is committed to the repository.
Use the following command:
make setup
This runs pipenv install --dev and pre-commit install.
You can explore the notebooks as a starting point. They cover:
- Data Preparation: We filter the data for shots only and then convert the pitch coordinates from percentages to an actual size of 105m x 68m.
- Exploratory Data Analysis (EDA): We then perform a detailed analysis of the shot data. This includes examining the distribution of shots and goals across the pitch, and investigating the relationship between shot outcomes and factors like the distance and angle to the goal.
- Feature Selection: Based on our EDA, we identify which features are most relevant for predicting whether a shot results in a goal:
  - Distance to Goal: Distance plays a pivotal role in determining the likelihood of a shot turning into a goal. Generally, shots taken closer to the goal have a higher probability of success.
  - Angle to Goal: The angle from which the shot is taken can significantly impact its success rate. A shot taken directly in front of the goal typically has a higher chance of scoring compared to one from a tight angle.
  - Shot Type: The type of shot (e.g., header, right foot, left foot) can also affect the likelihood of scoring. This information might be contained in the tags or subEventName field.
These features form the basis of our xG model.
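To make the distance and angle features concrete, here is a minimal sketch of how they could be derived from percentage coordinates. It assumes the attacked goal is centred at x=100, y=50 and uses the standard goal width of 7.32 m. The `shot_features` function is a hypothetical name, and the notebooks may compute these values differently. The angle is the one subtended by the two goalposts as seen from the shot location.

```python
import math

PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0
GOAL_WIDTH_M = 7.32  # standard distance between the posts


def shot_features(x_pct, y_pct):
    """Return (distance_m, angle_rad) for a shot given in percentage coordinates.

    Assumes the attacked goal is centred at x=100, y=50.
    """
    # Convert percentages to metres relative to the centre of the goal.
    dx = (100.0 - x_pct) * PITCH_LENGTH_M / 100.0
    dy = abs(50.0 - y_pct) * PITCH_WIDTH_M / 100.0
    distance = math.hypot(dx, dy)
    # Angle between the lines from the shot location to each post.
    # atan2 keeps the result in [0, pi], including very close to the goal line.
    angle = math.atan2(
        GOAL_WIDTH_M * dx,
        dx**2 + dy**2 - (GOAL_WIDTH_M / 2) ** 2,
    )
    return distance, angle
```

A central shot and a shot from the wing at the same depth have similar distances along the pitch but very different angles, which is why both features are kept.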
We refactor the code from notebooks and orchestrate the entire training pipeline using Prefect. We use Prefect to automate tasks that can be performed before data comes in real-time. This ensures that tasks run in the right sequence and that any issues are handled gracefully.
Here's a breakdown:
- Set up MLflow tracking. You will need to start the MLflow server:
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://db_username:db_password@database.endpoint/db_name --default-artifact-root s3://s3_bucket_name/
Replace s3_bucket_name with the name of your bucket; xgoals is the default. The backend store URI points to an RDS database, which you need to set up beforehand.
- Start the Prefect Server
prefect server start
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
Before you start training, open a new terminal and export the following variables:
export MLFLOW_TRACKING_SERVER_HOST="<EC2 Public IPv4 DNS>"
export MLFLOW_EXPERIMENT_NAME="xgoals"
- data_ingestion.py: Loads data configuration and downloads data from the provided URLs. Reads all the JSON files from the ./data/raw directory.
- data_preprocessing.py: Filters and transforms the data, then splits it into training, validation, and test sets for XGBoost.
- model_training.py: Trains the model using hyperparameter optimization with Hyperopt.
- model_registry.py: Registers the best model in the MLflow Model Registry.
- orchestrate.py: Main workflow orchestrator to automate the tasks.
You can either run each file manually or execute the src/pipeline/train.sh script through the Makefile, which automates common tasks such as data preprocessing, training the model, and running the tests. You don't need to train the model now, since we train it again later on, but you could do it with:
make train
This command first runs the unit tests in the tests/ directory with pytest and then performs quality checks:
- isort . (sorts the imports in Python files)
- black . (formats Python code to adhere to the Black code style)
- pylint --recursive=y . (runs the Pylint linter on the codebase to identify and report coding standards violations)
After these checks pass, it triggers the training pipeline.
The objective function is the negative AUC: Hyperopt minimizes its objective, so minimizing the negative AUC maximizes the AUC. We log the AUC value for each set of hyperparameters and save the model using MLflow.
Only the best-performing model (in terms of AUC) is promoted as the "Production" version.
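The relationship between AUC, the loss, and model selection can be illustrated with a small standard-library sketch. It is not the project's training code: `roc_auc`, `objective_result`, and `best_trial` are hypothetical helpers, and the result dictionary only mirrors the loss/status convention that Hyperopt objectives commonly return. AUC is computed here by direct pairwise comparison, which is fine for illustration but slow for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random goal outscores a random non-goal."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def objective_result(labels, scores):
    """Package a trial result; minimizing -AUC is the same as maximizing AUC."""
    auc = roc_auc(labels, scores)
    return {"loss": -auc, "auc": auc, "status": "ok"}


def best_trial(trials):
    """Pick the trial with the lowest loss, i.e. the highest AUC."""
    return min(trials, key=lambda trial: trial["loss"])
```

Selecting the trial with the lowest loss corresponds to the promotion rule described above, where only the model with the highest AUC becomes the "Production" version.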
Once trained, this model can predict the likelihood of a shot resulting in a goal based on the features of the shot.
The production-ready model is used for real-time predictions using AWS Lambda.
The MLflow UI will run under
http://<EC2 Public IPv4 DNS>:5000
where you can see the models with logged artifacts.
The Prefect UI can be accessed under
http://127.0.0.1:4200/
The idea behind a streaming deployment is to make predictions in real time, for example live during a match as shots are taken.
Real-time predictions can be valuable in a variety of scenarios, especially in the context of sports analytics and xGoals prediction.
Here are a few reasons:
- Live Match Analysis: Real-time predictions can be used to analyze a match as it's happening. This can provide valuable insights to commentators, coaches, or even fans watching the game. For example, an xGoals model could provide a more objective measure of a team's performance than the current scoreline.
- Betting: In the world of sports betting, odds can change rapidly. Real-time predictions can help bettors make more informed decisions.
- Interactive Fan Engagement: Real-time predictions can also be used to engage fans during a live match. For instance, a mobile app or website could allow fans to see the xGoals prediction for each shot as it's taken.
- Player or Team Strategy Adjustment: While this is more theoretical and not currently used in professional football, real-time predictions could potentially be used by coaching staff to make strategic decisions during a match, adjusting tactics based on the quality of chances being created.
- Data Products: Real-time predictions can be part of data products sold to media outlets, betting companies, or other businesses in the sports industry.
As detailed shot data becomes available during a match (either from manual data entry or automated tracking systems), it could be fed into the xGoals model to update the expected goals in real time. This could then be used to provide enhanced live commentary, for in-play betting markets, or for interactive fan experiences.
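As a rough illustration of updating expected goals during a match, the sketch below keeps a running xG total per team by summing the predicted probability of each shot as it arrives. The `running_xg` generator and the (team, probability) input shape are assumptions made for this example and are not taken from the project code.

```python
from collections import defaultdict


def running_xg(shot_predictions):
    """Yield the cumulative xG per team after each shot.

    Each item is a (team, probability) pair; the shape is illustrative.
    """
    totals = defaultdict(float)
    for team, probability in shot_predictions:
        totals[team] += probability
        # Yield a copy so earlier snapshots are not mutated by later shots.
        yield dict(totals)
```

Each yielded snapshot could feed a live commentary overlay or a fan-facing app as the match progresses.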
We need to set up:
- An AWS Kinesis input stream to ingest the shot data in real time.
- An AWS Lambda function, triggered by new data in the input Kinesis stream. This function loads the trained xGoals model (from MLflow's model registry), preprocesses the new shot data, and uses the model to make a prediction. The prediction is then written to a separate output Kinesis stream.
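The Lambda step can be sketched as follows. Kinesis delivers each record to Lambda with its payload base64-encoded under `Records[].kinesis.data`, so the function has to decode it before scoring. This is a simplified illustration and not the project's lambda_function.py: `handle`, `predict`, and `publish` are hypothetical names, and the model and the output stream are injected as plain callables so the example runs without AWS or MLflow. In a real deployment, `publish` would presumably wrap a Kinesis put_record call.

```python
import base64
import json


def decode_kinesis_records(event):
    """Extract the JSON payloads from a Kinesis-triggered Lambda event."""
    shots = []
    for record in event["Records"]:
        # Kinesis delivers each record's data base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])
        shots.append(json.loads(raw.decode("utf-8")))
    return shots


def handle(event, predict, publish):
    """Score every shot in the event and hand each prediction to `publish`.

    `predict` and `publish` are injected so the sketch runs without AWS or MLflow.
    """
    results = []
    for shot in decode_kinesis_records(event):
        result = {"shot": shot, "xg": predict(shot)}
        publish(result)
        results.append(result)
    return {"predictions": results}
```

Keeping the decoding separate from the scoring makes it straightforward to unit test the handler with a hand-built event.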
You will need to manually create the two streams with
aws kinesis create-stream --stream-name shot_predictions --shard-count 1 && aws kinesis create-stream --stream-name xgoals_events --shard-count 1
and see them with
aws kinesis list-streams
Use the Makefile to set up the development environment, run tests, ensure code quality, train the model, build the Docker image, and publish the image, instead of building and running the Docker images manually.
The trained model is containerized using Docker with
make build
after training and tests are processed.
The integration_test/run.sh script sets up the necessary environment, builds a Docker image if needed, fetches the run_id and experiment_id of the production model from MLflow, starts services using Docker Compose in detached mode, creates a Kinesis stream in LocalStack, downloads model artifacts from S3, and runs integration tests against the Docker and Kinesis setups.
With
make integration_test
you can automate the build step too.
- Creating an ECR Repository:
aws ecr create-repository --repository-name xgoals_mlops
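Before you can push Docker images to ECR, you need to authenticate your Docker client to the ECR registry. With AWS CLI version 1, use the get-login command to retrieve an authentication token and log in:
$(aws ecr get-login --no-include-email)
AWS CLI version 2 removed get-login. There, pipe the output of get-login-password into docker login instead, substituting your own region and account ID:
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com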
- Tagging and Pushing the Docker Image
First, define the remote URI, tag, and image details:
REMOTE_URI=
REMOTE_TAG=
REMOTE_IMAGE=${REMOTE_URI}:${REMOTE_TAG}
LOCAL_IMAGE="xgoals_prediction_model:v1"
Next, tag your local Docker image with the remote image details:
docker tag ${LOCAL_IMAGE} ${REMOTE_IMAGE}
Finally, push the Docker image to the ECR repository:
docker push ${REMOTE_IMAGE}
Check scripts/publish.sh and replace the variables with your own. Later on, with Terraform, the Docker image is published to ECR automatically.
Monitoring the performance of machine learning models is crucial in real-world applications. As data evolves over time, the model's performance can degrade, leading to suboptimal predictions. This can be due to various reasons, such as changes in data distribution (data drift) or the emergence of new patterns that the model hasn't seen during training. To ensure that our models remain effective and relevant, we need to continuously monitor their performance and be ready to retrain or fine-tune them when necessary.
We utilize the Evidently library to monitor the model's performance. Evidently is a Python library designed for machine learning model validation, comparison, and monitoring. Here's how it works:
- Reference Data: This is a subset of the training data that the model was initially trained on. It serves as a baseline to compare against new incoming data.
- Current Data: This is the new data that the model is currently scoring. It represents the most recent data points and can be used to detect any shifts or drifts from the reference data.
- Model Predictions: For both the reference and current data, the model's predictions are recorded. These predictions are then used to compute various metrics to assess the model's performance.
The code uses several metrics provided by Evidently to monitor the model:
- Column Drift Metric: Measures the drift in individual columns or features.
- Dataset Drift Metric: Provides an overall assessment of how much the entire dataset has drifted from the reference data.
- Dataset Missing Values Metric: Monitors the proportion of missing values in the dataset.
- Column Quantile Metric: Measures the quantile values for specific columns.
All these metrics are stored in a PostgreSQL database (opt_metrics table) for further analysis and visualization. To facilitate easy database management and visualization, the setup includes:
- Adminer: A lightweight database management tool that provides a web interface to manage the PostgreSQL database. It's accessible on port 8081.
- Grafana: An open-source platform for monitoring and observability. With Grafana, you can visualize the metrics from your application in real time, making it easier to detect anomalies, drifts, and overall model performance. Grafana is accessible on port 3000 and is configured with custom data sources and dashboards.
The ModelService class is responsible for handling incoming data, making predictions, and monitoring drift. When the buffer of incoming data reaches a certain threshold (BUFFER_THRESHOLD), the drift monitoring process is triggered. The monitor_drift method computes the aforementioned metrics and stores them in the database.
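The buffering behaviour can be illustrated with a small sketch. It is not the project's ModelService: `DriftBuffer` is a hypothetical class, the monitoring step is passed in as a plain callable, and clearing the buffer after each check is an assumption about how batches are formed.

```python
class DriftBuffer:
    """Collect scored records and run a drift check once enough have arrived."""

    def __init__(self, threshold, monitor):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.monitor = monitor  # callable that receives a full batch
        self.records = []

    def add(self, record):
        """Buffer one record; return True if this call triggered monitoring."""
        self.records.append(record)
        if len(self.records) < self.threshold:
            return False
        # Hand over the full batch and start a fresh buffer (an assumption).
        batch, self.records = self.records, []
        self.monitor(batch)
        return True
```

Batching the drift computation this way avoids running the metrics on every single prediction, at the cost of detecting drift only once per batch.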
At the end, make sure to delete the streams with
aws kinesis delete-stream --stream-name shot_predictions && aws kinesis delete-stream --stream-name xgoals_events
and shut down any other services that are still running.
Terraform is an infrastructure as code (IaC) tool for building, changing, and versioning infrastructure safely and efficiently.
Terraform is used to automate the process of setting up and managing the cloud infrastructure, including the Kinesis streams and Lambda functions. The files can be found in the infrastructure folder.
You need to have Terraform installed on your EC2 machine; follow the official installation instructions. See the course notes for further details.
Before you start, ensure that you manually create the state bucket using the following command:
aws s3api create-bucket --bucket tf-state-xgoals-mlops
Initialize the Terraform working directory with the terraform init command:
cd infrastructure
terraform init
After the Terraform working directory has been successfully initialized, you can review the execution plan
terraform plan -var-file=vars/prod.tfvars
and apply the Terraform configuration
terraform apply -var-file=vars/prod.tfvars
Change to the scripts directory
cd xGoals-mlops/scripts
and run
. ./deploy_manual.sh
Export the name of the input Kinesis stream
export KINESIS_STREAM_INPUT="prod_shot_events-mlops-xgoals"
and test it with the record found in put_record_test.json. Then go to the CloudWatch log group to see the logs captured from the Lambda function.
Amazon CloudWatch can collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources. CloudWatch can monitor AWS resources such as Amazon EC2 instances, and Amazon Kinesis streams, as well as custom metrics generated by your applications and services, and any log files your applications generate. You can use CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health.
Make sure you destroy the infrastructure at the end:
terraform destroy -var-file=vars/prod.tfvars
GitHub Actions allows developers to define workflows directly within their repositories. These workflows can be triggered by various events, such as a push or pull request, and can perform a wide range of tasks, from running tests to deploying applications.
- Continuous Integration (CI) Workflow:
  - Triggered on pull requests to the develop branch
  - Checks out the code
  - Sets up the required Python version
  - Installs dependencies
  - Runs unit tests and linters
  - Configures AWS credentials
- Continuous Deployment (CD) Workflow:
  - Triggered on pushes to the develop branch
  - Checks out the code and configures AWS credentials
  - Trains the model
  - Defines and applies infrastructure changes using Terraform
  - Builds and pushes Docker images to Amazon ECR
  - Retrieves model artifacts and updates the Lambda function with the latest model
The CI/CD pipeline, as defined in the provided GitHub Actions workflows, ensures a seamless deployment process. It includes:
- Running tests to ensure code quality and functionality
- Applying code formatting using black
- Checking for linting errors with pylint
- Ensuring proper import order with isort
- Building Docker images and pushing them to Amazon ECR
- Deploying the model and updating AWS Lambda functions with the latest model
By automating these processes, you can ensure that your application is always in a deployable state, and any changes made to the codebase are immediately tested and deployed.