This is a sample project to showcase a simple Data Engineering task. Its goal is to help in learning new tools, e.g. dbt and Airflow, and to help beginners break into the field of Data Engineering by showing a simple project from start to finish.
The RSS Mining project's aim is to fetch data from various newspaper RSS feeds and aggregate it into one single database. The results should be displayed via a simple web dashboard so that end users can explore the data.
To achieve this, the RSS feeds of three different German newspapers (Frankfurter Allgemeine (FAZ), Süddeutsche (SZ) and Die ZEIT) are fetched via an HTTP request and saved to disk. The data is then loaded into a Postgres database, transformed via dbt and visualized using Metabase.
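The fetch step can be pictured roughly like the following minimal sketch. It uses the `requests` library with an example feed URL and a made-up output path; the actual feed URLs and file layout in this repository may differ.

```python
import datetime
import pathlib

import requests

# Example FAZ feed URL, for illustration only.
FEED_URL = "https://www.faz.net/rss/aktuell/"


def fetch_feed(url: str, out_dir: str = "data/raw") -> pathlib.Path:
    """Download an RSS feed via HTTP and save the raw XML to disk."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    timestamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    out_path = pathlib.Path(out_dir) / f"faz_{timestamp}.xml"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(response.content)
    return out_path


if __name__ == "__main__":
    print(fetch_feed(FEED_URL))
```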
This project was developed using:
- Python 3.10.6
- Poetry 1.4.2
- Docker 24.0.6, build ed223bc
- Enter poetry shell

  ```bash
  poetry shell
  ```

- Install dependencies

  ```bash
  poetry install

  # Install dbt plugins
  cd dbt_transformations
  dbt deps
  ```

- Start the docker containers

  ```bash
  ./up.sh
  ```
- Open the Airflow Web UI (http://localhost:8080) and log in using the default credentials (User: airflow, Password: airflow)
- Run the `init_db` DAG
- Run the individual DAGs for the RSS feeds (a sketch of such a DAG is shown after this list)
- Run the `update-overall-mart` DAG to consolidate the data of the individual feeds into one table
- Open Metabase (http://localhost:3000) and register your local account
- Explore the data marts using Metabase and build your own dashboard
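For orientation, one of the per-feed DAGs could look roughly like the hypothetical sketch below. It uses the Airflow TaskFlow API with stubbed task bodies; the actual DAG ids, schedules and operators in this repository may differ.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="fetch_faz_feed_example",  # illustrative name, not the real DAG id
    schedule="@hourly",
    start_date=datetime(2023, 1, 1),
    catchup=False,
)
def fetch_faz_feed_example():
    @task
    def fetch() -> str:
        # In the real pipeline this would perform the HTTP request and save
        # the raw XML; here it only returns a placeholder path.
        return "data/raw/faz_latest.xml"

    @task
    def load(raw_path: str) -> None:
        # In the real pipeline this would parse the XML and write to Postgres.
        print(f"would load {raw_path} into Postgres")

    load(fetch())


fetch_faz_feed_example()
```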
The RSS mining showcase is a good project to get to know Data Engineering in a nutshell. Even in this small example, one can see the challenges that arise when "moving data from A to B" using different online sources along the way. Several tools (Python, Airflow, dbt, Postgres) need to work together to set up this kind of pipeline. One also needs to think about cleaning data, upserting already fetched data, handling missing values, etc.
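The upsert problem in particular can be handled on the Postgres side. The following sketch assumes a hypothetical `articles` table with a unique constraint on the article link; the actual schema in this project may differ.

```python
import psycopg2

# Insert new articles; if the link was fetched before, update the row instead.
UPSERT_SQL = """
    INSERT INTO articles (link, title, published_at)
    VALUES (%(link)s, %(title)s, %(published_at)s)
    ON CONFLICT (link) DO UPDATE
    SET title = EXCLUDED.title,
        published_at = EXCLUDED.published_at;
"""


def upsert_articles(conn_str: str, articles: list[dict]) -> None:
    """Insert new articles and update ones that were already fetched."""
    with psycopg2.connect(conn_str) as conn:
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, articles)
```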
To further enhance the project, one could think about moving it to the cloud. The first idea that comes to mind is using cloud storage (e.g. Azure Blob Storage, AWS S3) to store the RSS feeds. For those who want to improve their DevOps skills, deploying Airflow to a managed Kubernetes service would also be a viable step. In addition, instead of CSV files, one could use modern "data lake" formats such as Parquet and Delta Lake to store the transformed data.
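Switching to Parquet would be a small change in code. The snippet below is illustrative only: it assumes pandas with pyarrow installed, and the column names are made up.

```python
import pandas as pd

# A tiny example frame standing in for the parsed feed entries.
articles = pd.DataFrame(
    {
        "link": ["https://example.com/article-1"],
        "title": ["Example headline"],
        "published_at": [pd.Timestamp("2023-01-01T12:00:00")],
    }
)

# Write one Parquet file per feed and fetch run instead of a CSV file.
articles.to_parquet("data/processed/faz.parquet", index=False)
```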
Please feel free to contact me on LinkedIn or directly in the repository if you have any questions.