This project builds a pipeline for processing and analyzing raw data scraped from the video game digital distribution platform Steam. The data is processed in an ELT pipeline with the objective of providing insight into the best-selling video games since 2006.
The dataset originates from Kaggle, a repository of free datasets for data science. It is a mix of records scraped directly from Steam itself and from SteamSpy, a website dedicated to collecting actual and speculative data about the platform. All the files are encoded as raw JSON and require extensive transformation work to make them usable. Additionally, the reviews for each game are retrieved from another Kaggle dataset.
- Cloud - Google Cloud Platform
- Infrastructure as Code - Terraform
- Containerization - Docker, Docker Compose
- Orchestration - Airflow
- Transformation - Spark
- Transformation - dbt
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - PowerBI
- Languages - Python, SQL
To add more realism, the datasets are initially placed in an S3 bucket in AWS. The transfer from AWS to GCP is done through Terraform as part of the initialization tasks, a one-time operation.
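The project performs this copy with Terraform; purely as an illustration of the same one-time S3-to-GCS copy expressed in the project's main language, here is a hedged Python sketch (bucket names and object layout are assumptions, not the actual configuration):

```python
# Illustrative only: the real transfer is declared in Terraform. Bucket names
# below are assumptions, not the project's actual configuration.
import boto3
from google.cloud import storage

S3_BUCKET = "steam-raw-datasets"   # assumed source bucket in AWS
GCS_BUCKET = "steam-datalake"      # assumed destination bucket in GCP

s3 = boto3.client("s3")
gcs_bucket = storage.Client().bucket(GCS_BUCKET)

# Copy every object from S3 to GCS, keeping the same key
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=S3_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"]
        gcs_bucket.blob(key).upload_from_file(body)
```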
Since the reviews dataset is ~40 GB spread across ~500k files, it requires special processing in Spark. The Airflow server runs on a virtual machine deployed in GCP Compute Engine.
For a more compact visualization, only games with a Metacritic score are displayed (3.5k out of 51k).
A note about Revenue:
Revenue is calculated as the product `owners * price`. This is a fairly naive approximation; see the SteamSpy guidelines about the equivalence of "Owned" and "Sold".
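A worked illustration of that estimate (numbers are made up):

```python
# Naive revenue estimate described above: every owned copy is counted as a
# full-price sale. The values are purely illustrative.
def estimated_revenue(owners: int, price: float) -> float:
    return owners * price

print(estimated_revenue(2_000_000, 19.99))  # 39980000.0
```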
The DAG has two main branches:
- Store branch: selected JSON files are converted to Parquet and then loaded as tables into BigQuery in parallel.
- Reviews branch: JSON files are compacted before the cluster is created, then processed in Dataproc as a PySpark job. Finally, the results are loaded into BigQuery, where they are merged with the other tables using dbt. Cluster creation and deletion are fully automated.
Both branches start by retrieving their respective data from the GCS buckets, and they converge in the dbt task, which builds the definitive tables in BigQuery.
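The sketch below shows, in hedged form, how such a two-branch DAG can be wired in Airflow; the task names, project, bucket, dataset and table identifiers, and cluster settings are illustrative assumptions rather than the project's actual code.

```python
# Hedged sketch of the two-branch DAG described above. All identifiers
# (project, region, buckets, tables) are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

PROJECT_ID = "my-gcp-project"   # assumed
REGION = "europe-west1"         # assumed
BUCKET = "steam-datalake"       # assumed
CLUSTER = "reviews-cluster"     # assumed

with DAG(
    dag_id="steam_elt",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    run_dbt = EmptyOperator(task_id="run_dbt")  # placeholder for the dbt build task

    # Store branch: JSON -> Parquet conversion, then parallel loads into BigQuery
    for table in ["games", "prices", "genres"]:  # assumed table list
        convert = EmptyOperator(task_id=f"convert_{table}_to_parquet")  # placeholder
        load = GCSToBigQueryOperator(
            task_id=f"load_{table}",
            bucket=BUCKET,
            source_objects=[f"store/{table}/*.parquet"],
            destination_project_dataset_table=f"{PROJECT_ID}.steam_raw.{table}",
            source_format="PARQUET",
            write_disposition="WRITE_TRUNCATE",
        )
        start >> convert >> load >> run_dbt

    # Reviews branch: compact JSON, run the PySpark job on an ephemeral cluster
    compact_reviews = EmptyOperator(task_id="compact_reviews_json")  # placeholder
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )
    process_reviews = DataprocSubmitJobOperator(
        task_id="process_reviews_pyspark",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/spark/reviews_job.py"},
        },
    )
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",  # tear the cluster down even if the job fails
    )

    start >> compact_reviews >> create_cluster >> process_reviews >> delete_cluster >> run_dbt
```

Setting `trigger_rule="all_done"` on the delete task tears the cluster down even if the PySpark job fails, matching the fully automated cluster creation and deletion described above.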
Transformations are done with Spark (20%) and dbt (80%).
The Spark job processes all 500k review JSON files on a temporary Dataproc cluster and writes the results to a BigQuery table.
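A hedged sketch of such a job is shown below; the bucket, dataset/table names and review columns are assumptions rather than the project's actual schema.

```python
# Illustrative PySpark job; bucket, dataset/table names and review columns are
# assumptions, not the project's actual schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("steam-reviews").getOrCreate()

# Read the ~500k raw review JSON files from the data lake
reviews = spark.read.json("gs://steam-datalake/reviews/*.json")

# Light selection/cleaning before landing in the warehouse (columns assumed)
reviews_clean = (
    reviews.select(
        F.col("appid").cast("long").alias("gameid"),
        F.col("review"),
        F.col("voted_up").cast("boolean"),
        F.from_unixtime(F.col("timestamp_created")).cast("timestamp").alias("created_at"),
    )
    .dropDuplicates()
)

# Write to BigQuery through the spark-bigquery connector
(
    reviews_clean.write.format("bigquery")
    .option("table", "steam_raw.reviews")                 # assumed dataset.table
    .option("temporaryGcsBucket", "steam-dataproc-temp")  # assumed staging bucket
    .mode("overwrite")
    .save()
)
```

Depending on the Dataproc image, the spark-bigquery connector used by `format("bigquery")` may need to be supplied as an extra jar when submitting the job.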
All the transformation of the Steam store data is done in dbt, using SQL intensively. The main table `steam_games` is built following Google's recommendations, i.e. it is a denormalized table featuring nested and repeated structures to avoid the cost of performing joins. In order to build the dashboard, custom unnested tables are also derived from `steam_games`.
In the final stage of the pipeline, inside the dbt task, some checks are performed:
- Uniqueness and non-null checks on the `gameid` column in all affected tables.
- Consistency checks on the `release_date` column in the `steam_games` table.
- Partition core tables, for performance
- Add a cost analysis of a full pipeline run
- Extract more value from Steam reviews --> new dashboard?
- USE CASE: matrix factorization for recommendations
- GitHub Action to upload the Spark script to the bucket
- More tests on the pipeline
- Fully normalize the tables as an exercise
Thanks to Alexey and his community at DataTalks.Club; creating this project without the Data Engineering course would have been much more difficult.