A data pipeline I am working on that uses RAWG's Video Games Database API to explore video game trends through Streamlit.
The pipeline is orchestrated by Airflow in the following steps:
- The data is extracted and parsed from RAWG's API
- A connection to Google Cloud Storage is made programmatically to load the data into it
- Data modeling is done through dbt for use in BigQuery
- Streamlit connects to BigQuery and visualizes the data
Terraform is used to manage the Google Cloud Platform infrastructure, while Docker containerizes the orchestration done by Airflow.
CircleCI is used for the CI/CD pipeline, ensuring that new code does not break the build and that changes are reflected in Streamlit.
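For reference, here is a minimal sketch of the extract-and-load step (the first two bullets above). The RAWG endpoint and environment variables match the setup further down; the bucket and object names are assumptions, not the exact ones used in this repo:

```python
import json
import os

import requests
from google.cloud import storage

# Pull a page of games from RAWG's API (API_KEY is exported in the setup steps below).
response = requests.get(
    "https://api.rawg.io/api/games",
    params={"key": os.environ["API_KEY"], "page_size": 40},
    timeout=30,
)
response.raise_for_status()
games = response.json()["results"]

# Load the parsed records into Cloud Storage as newline-delimited JSON.
client = storage.Client()  # authenticates via the GCP service account key
bucket = client.bucket("rawg-games-raw")  # hypothetical bucket name
blob = bucket.blob("raw/games.json")
blob.upload_from_string("\n".join(json.dumps(game) for game in games))
```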
An example of how the dashboard will look:
Poetry is used for dependency management, and the required dependencies are outlined in the pyproject.toml and poetry.lock files. Refer here for the installation guide.
Start your poetry project in the repository and install the dependencies.
$ poetry init
$ poetry install
Set the environment variable for the API key in the terminal. You'll have to register on RAWG's site here. Do the same for the service account key after creating an account with GCP.
$ export API_KEY="enter-your-key-here"
$ export SERVICE_KEY="gcp-service-key.json"
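The pipeline code can then pick these up along the lines of the sketch below; the GOOGLE_APPLICATION_CREDENTIALS handling assumes the Google client libraries are left to discover the service account key on their own:

```python
import os

API_KEY = os.environ["API_KEY"]  # RAWG API key exported above

# Point the Google Cloud client libraries at the service account key file.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", os.environ["SERVICE_KEY"])
```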
Build out the Google Cloud Platform infrastructure using Terraform. See here for how to install it based on your machine (I am using Debian).
$ terraform init
$ terraform apply
When you are done using the project, remember to destroy the infrastructure to avoid potential charges.
$ terraform destroy
Orchestrate the tasks for the pipeline via Airflow. Airflow also provides a web UI to interact with its services.
$ poetry run airflow webserver
$ poetry run airflow scheduler
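As a rough sketch, the DAG picked up by the scheduler could look like the following; the DAG id, task names, and dbt invocation are assumptions rather than the exact definitions in this repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_and_load():
    """Pull games from RAWG's API and upload them to Cloud Storage."""
    ...  # see the extract-and-load sketch above


with DAG(
    dag_id="rawg_games_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_and_load,
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="poetry run dbt run --project-dir dbt",  # build the BigQuery models
    )
    extract_load >> dbt_run
```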
Access the web app dashboard with Streamlit. It will source the data from the data models in BigQuery into your local setup.
$ poetry run streamlit run streamlit/app.py
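Under the hood, the app queries the dbt models in BigQuery and renders them with Streamlit, roughly as sketched below; the dataset and table names here are hypothetical:

```python
import streamlit as st
from google.cloud import bigquery

st.title("RAWG Video Game Trends")

# Query one of the dbt models in BigQuery (dataset and table names are assumptions).
client = bigquery.Client()
df = client.query(
    "SELECT release_year, game_count "
    "FROM `rawg_analytics.games_per_year` "
    "ORDER BY release_year"
).to_dataframe()

st.line_chart(df.set_index("release_year")["game_count"])
```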