A data pipeline I am working on that uses RAWG's Video Games Database API to explore video game trends through Streamlit.
The pipeline is orchestrated by Airflow in the following steps:
- The data is extracted and parsed from RAWG's API
- A connection to Google Cloud Storage is made programmatically to load the data into it
- Data modeling is done through dbt for use in BigQuery
- Streamlit connects to BigQuery and visualizes the data
Terraform is used to manage the Google Cloud Platform infrastructure, while Docker containerizes the orchestration done by Airflow.
CircleCI is used for the CI/CD pipeline, ensuring that new code does not break the build and that changes are reflected in Streamlit.
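For reference, here is a minimal sketch of the extract-and-load step (the first two bullets above). The RAWG endpoint and environment variables match the setup further down; the bucket and object names are assumptions, not the exact ones used in this repo:

```python
import json
import os

import requests
from google.cloud import storage

# Pull a page of games from RAWG's API (API_KEY is exported in the setup steps below).
response = requests.get(
    "https://api.rawg.io/api/games",
    params={"key": os.environ["API_KEY"], "page_size": 40},
    timeout=30,
)
response.raise_for_status()
games = response.json()["results"]

# Load the parsed records into Cloud Storage as newline-delimited JSON.
client = storage.Client()  # authenticates via the GCP service account key
bucket = client.bucket("rawg-games-raw")  # hypothetical bucket name
blob = bucket.blob("raw/games.json")
blob.upload_from_string("\n".join(json.dumps(game) for game in games))
```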
An example of how the dashboard will look:
Poetry is used for dependency management, and the required dependencies are outlined in the pyproject.toml and poetry.lock files. Refer here for the installation guide.
Start your poetry project in the repository and install the dependencies.
$ poetry init
$ poetry install
Set the environment variable for the API key in the terminal. You'll have to register on RAWG's site here. Do the same for the service account key after creating an account with GCP.
$ export API_KEY="enter-your-key-here"
$ export SERVICE_KEY="gcp-service-key.json"
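The pipeline code can then pick these up along the lines of the sketch below; the GOOGLE_APPLICATION_CREDENTIALS handling assumes the Google client libraries are left to discover the service account key on their own:

```python
import os

API_KEY = os.environ["API_KEY"]  # RAWG API key exported above

# Point the Google Cloud client libraries at the service account key file.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", os.environ["SERVICE_KEY"])
```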
Build out the Google Cloud Platform infrastructure using Terraform. See here for how to install it based on your machine (I am using Debian).
$ terraform init
$ terraform apply
When you are done using the project, remember to destroy the infrastructure to avoid potential charges.
$ terraform destroy
Orchestrate the tasks for the pipeline via Airflow. Airflow also provides a web UI to interact with its services.
$ poetry run airflow webserver
$ poetry run airflow scheduler
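As a rough sketch, the DAG picked up by the scheduler could look like the following; the DAG id, task names, and dbt invocation are assumptions rather than the exact definitions in this repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_and_load():
    """Pull games from RAWG's API and upload them to Cloud Storage."""
    ...  # see the extract-and-load sketch above


with DAG(
    dag_id="rawg_games_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_and_load,
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="poetry run dbt run --project-dir dbt",  # build the BigQuery models
    )
    extract_load >> dbt_run
```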
Access the web app dashboard with Streamlit. It will source the data from the data models in BigQuery into your local setup.
$ poetry run streamlit run streamlit/app.py
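Under the hood, the app queries the dbt models in BigQuery and renders them with Streamlit, roughly as sketched below; the dataset and table names here are hypothetical:

```python
import streamlit as st
from google.cloud import bigquery

st.title("RAWG Video Game Trends")

# Query one of the dbt models in BigQuery (dataset and table names are assumptions).
client = bigquery.Client()
df = client.query(
    "SELECT release_year, game_count "
    "FROM `rawg_analytics.games_per_year` "
    "ORDER BY release_year"
).to_dataframe()

st.line_chart(df.set_index("release_year")["game_count"])
```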