Data Engineering project for ZoomCamp`24: JSONL -> PostgreSQL + Metabase + Mage.AI
ETL for Amazon Reviews'23` dataset.
Cloud environment: GitHub CodeSpaces, free and enough to test (120 core-hours of Codespaces compute for free monthly, 15GB of Codespaces storage for free monthly).
To reproduce and review this project it would be enough (hopefully) less than hour, and ~8GB total for described datasets. You don't need to use anything extra, like VS Code or trial accounts - just webrowser + GitHub account is totally enough.
This level of cloud resources allowed me to process and analyze-visualize datasets with at least 6 Mln reviews without issues.
This is my Data Engineering project started during DE ZoomCamp'24. And the main goal is straight-forward: Extract - Transform - Load data, then visualize some insights.
I chose to analyze Amazon Reviews dataset. Full dataset is huge, and it is available to download in parts - by product categories. There are some smaller and larger sub-datasets, some of them we would explore here (Digital_Music, Health_and_Personal_Care, All_Beauty, Software). Each subset includes 2 files: products (meta), ratings and user comments (reviews).
Some time ago I became very interested in how much we can trust all those ratings on Amazon, including 'bestseller' statuses. But without proper data it was hard to investigate. Now, years later, I have more skills in data analytics and this huge dataset, thanks to McAuley Lab. I chose to use only part of the whole information:
- Products: categories and average ratings
- Reviews: user's ratings, review dates, was that verified purchase or not (will analyze reviews texts later, awesome source for many insights I think)
Thanks to ZoomCamp for the reason to learn many new tools and get back to my 'reviews & trust' questions!
- What are the trends in reviews ratings for verified/not purchases? Are they more negative or more positive than average?
- As not verified purchase reviews can be manipulative, what is their ratio in total number?
- As automated review submission is technically possible with more advanced tech last years, what are the not verified purchase rating trends over time?
- Are there any significant differences in trends for product categories?
Let's explore together! These categories (datasets) I played with so far.
- dataset urls are defined in .csv files stored in
/data
-dataset_urls0.csv
,dataset_urls1.csv
, etc - my scripts (bash, python, sql)
- pre-configured Metabase dashboard
- pre-configured Mage AI pipelines
- Docker and docker-compose. All dockerized apps use default bridge network (172.17.0.x).
- PostgreSQL as a data base/warehouse - open source and free
- Metabase as an analitics platform - open source and free self-hosted
- Mage AI as an orchestration tool - open source and free self-hosted
π‘ Combination of CodeSpaces + PostgreSQL + Metabase + Mage AI would be a good choice for those who prefer/experiment with a simple and open source approach, or those who hesitate to deal with The BIG 3 cloud providers (AWS, Azure, GCP) with their serious payments. Looks like a good fit for early stage startups, developers, and those who learn Data Engineering as me (you?). βFind one more open source and free opportunity at the end of this instructiom.
π Oh, so many steps?! Please take a deep breath to calm down, they are all simple enough, that's why so many. One step at a time. I did it successfully many times to test it, you can do it too! Let's go!
NB: and if you 'lost the track', you can always start from scratch - create new CodeSpace and go step by step. Good luck!
- Setup environment
- Download dataset and start PostgreSQL server
- Ingest data using CLI (bash & python)
- Visualize data
- Orchestration data ingestion (Mage AI)
- Instructions to stop the apps
- Fork this repo on GitHub.
- Create GitHub CodeSpace from the repo.
- Start CodeSpace, wait for
requirements.txt
installation to finish (it starts automatically, just watch & wait). - Copy
dev.env
to.env
- runcp dev.env .env
in your CodeSpace terminal. You don't need to change anything to test the process. It contains all key settings (PostgreSQL etc).
cp dev.env .env
- Run
bash download.sh
to download 1st dataset (smallest) - fromdataset_urls0.csv
. As a result you will have 2 files in yourdata
directory:meta_Digital_Music.jsonl.gz
andmeta_Digital_Music.jsonl.gz
. Cool! β
bash download.sh
- Start PostgreSQL docker:
bash start_postgres.sh
. It will automatically download postgres container and start it.
bash start_postgres.sh
- Wait until you see the message
database system is ready to accept connections
- PostgreSQL is ready to serve you.
Let it run in this terminal so you can see logs and easily stop it later (with just Ctrl-C). CodeSpace will pop-up the notification that Your application running on port 5432 is available.
- just ignore it (close that pop-up).
- Start 2nd terminal (click
+
button on the panel in the right corner above terminal) and switch to it. - You can check database connection by running
pgcli -h 172.17.0.2 -p 5432 -u postg -d amzn_reviews
, passwordpostg
. All dockerized apps of this setup are suppose to run in defaultbridge
network,172.17.0.x
, with172.17.0.2
for PostgreSQL. If this step fails you're in trouble π But, when you run it in a CodeSpace and follow the instruction it should work fine, I tested multiple times. Typequit
to return to terminal CLI. Ok, PostgreSQL is running. β
pgcli -h 172.17.0.2 -p 5432 -u postg -d amzn_reviews
Dataset files have been downloaded, PostgreSQL is running - time to ingest your data!
- Run
bash process.sh
to ingest dataset files into PostgreSQL database.
bash process.sh
It executes python script process_jsonl.py
with default parameters from your .env
settings. By default (without parameter) it processes the 1st dataset files. As the result of successful loading you will see some progress messages and finally Finished ingesting data/... into the PostgreSQL database! Total time ... second(s) +++ PROCESSING finished: OK!
.
Congratulations, the first approach to load data accomplished and you have records in 2 tables of your database: meta
with products, reviews
with ratings. β
- Now you can ingest the next dataset with the same approach:
- download 2nd dataset: run
bash download.sh dataset_urls1.csv
(Health_and_Personal_Care files)
bash download.sh dataset_urls1.csv
- process 2nd dataset: run
bash process.sh dataset_urls1.csv
bash process.sh dataset_urls1.csv
- download 3rd dataset: run
bash download.sh dataset_urls2.csv
(All_Beauty files)
bash download.sh dataset_urls2.csv
If you want to try data workflow orchestration have some patience and follow to Visualize data step. If you don't want (or have no time) to play with Mage.AI, just process similarly dataset_urls2.csv
for All_Beauty, and download then process dataset_urls3.csv
for Software.
Finally, probably the most interesting part, let's see our data using Metabase - open sourse and free self-hosted Business Intelligence, Dashboards and Data Visualization tool.
- Start it in Docker - run
bash start_metabase.sh
It will automatically download Metabase container and start it.
- CodeSpace will pop-up the notification that
Your application running on port 3000 is available.
- clickOpen in Browser
.
New page would probably open white. Please wait a couple of seconds to let it start, then refresh the page. Now you will see login screen. Just login with john@mailinator.com
, pass: Jj123456
(no worries, it's self-hosted, all safe). Pre-configured dashboard with reports is already there for you! β
π
π‘ In case you accidentally close that pop-up or Metabase page and you need it later (after ingesting new datasets), you can always open that page from Ports
tab:
- Explore the dashboard on the main screen. There are 2 tabs:
Products
andReviews
. - Products: you can see 2 reports:
- pie chart with number of Products by Main category
- Products number distribution by Average rating
- Reviews: you can see 4 reports:
- pie chart with Reviews number by Verified/not purchase
- Reviews rating number distribution by Verified/not purchase over time (by months)
- pie chart with Reviews number by Main category
- Reviews rating number distribution by Main category over time (by months)
You can see some screenshots below.
- The more datasets you load, the more categories you can see. That's why I offer you to download and process at least 2 datasets. You can process 3rd dataset by following familiar steps in Ingest data using CLI (bash & python) or go ahead and discover Mage AI.
I wouldn't say it was so simple and easy as Matt showed us in videos, but with some time and effort I managed to find the way to convert the logic of my process_jsonl.py
script into Mage
pipelines and bricks. Why is it worth my/your time? Because larger datasets (like Kindle_Store) will probably demand a serious cloud storage, database and compute than free CodeSpace playground. And I (you?) need to learn how to deal with them with a more scalable system, providing long job execution, partitioning, monitoring and logs. So let's see 2 pipelines I managed to setup with Mage.
π‘ In case something goes wrong or complicated, you can still ingest more datasets with step 11 Ingest data using CLI (bash & python) and then reload Metabase dashboard page.
- Run
unzip mage.zip
to extract pre-configured Mage AI workflow.
unzip mage.zip
- Run
docker-compose build
to automatically download and prepare Docker container with Mage AI.
docker-compose build
- Run
docker-compose up
to start Docker container with Mage AI.
docker-compose up
- CodeSpace will pop-up the notification that
Your application running on port 6789 is available.
- clickOpen in Browser
.
That new page would probably open white. Please wait a couple of seconds to let Mage AI start, then refresh the page. Now you will see Mage dashboard. Your orchestration center is ready to serve you! β
- Go to
Pipelines
- move mouse to the left edge of the window, it will open menu slider, click onPipelines
.
You will see 2 pipelines I configured:
- Click
load_run_dataset
. It will open current pipeline triggers. Then clickRun@once
button.
- In the
Run pipeline now
dialogue you can define a variable, in this case it will be the file name of dataset.
Enter dataset_urls2.csv
(that you downloaded but not processed yet) and click Run now
button. Ingestion process started. You can see the progress on Pipeline runs
page. It appears with a little delay, no worries.
- After ingesting new dataset you can switch to Metabase page, update it and see new data in the dashboard reports. Congratulations! β π
- If you get to this point, please βοΈstarβοΈ my repo! π
Bonus for those who survived reading instructions!
In addition to using PostgreSQL in docker (local on your computer or as described in GitHub CodeSpace) you can use/experiment with Supabase - an open source Firebase alternative.
It provides a quite generous free plan (not a trial, no credit card required): 500 MB database space, 2 Core shared CPU, 1GB RAM, 5 GB bandwidth, 1 GB file storage. Free projects are paused after 1 week of inactivity. Limit of 2 active projects.
I tested it, it works with my scripts, and with Metabase as well. You just need to:
- register free account
- get access credentials (host, port, user, password, database)
- put new values to
.env
file - create in Supabase 2 tables -
meta
andreviews
, then define structure (according to my python script) - and finally use my
process.sh
(or Mage AI pipelines) to ingest data there - Metabase works with Supabase as with any PostgreSQL database - create dashboard for your new database.
Let me know how it worked for you!
- Simple way - stop all together by stopping your CodeSpace. Remember, this will leave all downloaded data in your CodeSpace - you can start it later and continue playing with tools and data. You can also delete CodeSpace with all the data.
- Stop all active Docker containers - run this command in terminal
docker stop $(docker ps -a -q)
- Stop Mage AI
- switch to 2nd terminal and press
Ctrl-C
- then run
docker-compose down
in terminal
- switch to 2nd terminal and press
- Stop PostgreSQL - switch to 1st terminal and press
Ctrl-C
- You can delete downloaded dataset .jsonl.gz files from
/data
folder.
- Refactor python code
- Export to .parquet
- Export to BigQuery
- [/] BigQuery: update Mage orchestration, visualization and instructions
- [/] Load more data to analyze
Kindle Store
dataset - Visualize using Apache Superset (as an alternative to Metabase)
- Visualize using Looker Studio
- Orchestrate with Prefect (as an alternative to Mage)
- Orchestrate with Apache Airflow (as an alternative to Mage)
Stay tuned!
π Thank you for your attention and time!
- If you experience any issue while following this instruction (or something left unclear), please add it to Issues, I'll be glad to help/fix. And your feedback, questions & suggestions are welcome as well!
- Feel free to fork and submit pull requests.
If you find this project helpful, please βοΈstarβοΈ my repo https://github.com/dmytrovoytko/data-engineering-amazon-reviews to help other people discover it π
Made with β€οΈ in Ukraine πΊπ¦ Dmytro Voytko