An end-to-end data engineering pipeline that fetches data from Wikipedia, cleans and transforms it with Apache Airflow and saves it on Azure Data Lake. Other processing takes place on Azure Data Factory, Azure Synapse and Tableau.

airscholar/FootballDataEngineering

Football Data Engineering

This Python-based project crawls data from Wikipedia using Apache Airflow, cleans it, and pushes it to Azure Data Lake for processing.

Table of Contents

  1. System Architecture
  2. Requirements
  3. Getting Started
  4. Running the Code With Docker
  5. How It Works
  6. Video

System Architecture

system_architecture.png

Requirements

  • Python 3.9 (minimum)
  • Docker
  • PostgreSQL
  • Apache Airflow 2.6 (minimum)

Getting Started

  1. Clone the repository.

    git clone https://github.com/airscholar/FootballDataEngineering.git
  2. Install Python dependencies.

    pip install -r requirements.txt

Running the Code With Docker

  1. Start the services with Docker Compose:
    docker compose up -d
  2. Trigger the DAG on the Airflow UI.
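If you prefer not to use the UI, a DAG run can usually also be started through the stable REST API that ships with Airflow 2.x. The sketch below only builds the request. The base URL, the DAG id `example_dag`, and the `airflow`/`airflow` credentials are placeholders rather than values taken from this repository, and it assumes basic authentication is enabled for the API and the DAG is not paused.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Build (but do not send) a POST that asks Airflow 2.x to start a DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    # Basic auth header; assumes the API's basic-auth backend is enabled.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": conf or {}}).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder values: replace them with your own host, DAG id and credentials.
request = build_trigger_request(
    "http://localhost:8080", "example_dag", "airflow", "airflow"
)
```

Passing `request` to `trigger` sends it to the running webserver. The DAG id to use is the one listed in the Airflow UI.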

How It Works

  1. Fetches data from Wikipedia.
  2. Cleans the data.
  3. Transforms the data.
  4. Pushes the data to Azure Data Lake.
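The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual DAG code: the function names, the `Stadium`/`Capacity` columns, and the inline sample table are all hypothetical, and only the standard library is used. The upload to Azure Data Lake is left out. The sketch stops at producing the CSV text that would be uploaded.

```python
import csv
import io
import re
import urllib.request
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def fetch_html(url):
    """Step 1: download a page (not called in this sketch)."""
    req = urllib.request.Request(url, headers={"User-Agent": "example-pipeline/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


def extract_rows(html):
    """Step 1 (continued): pull table rows out of the HTML."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def clean(rows):
    """Step 2: strip whitespace and footnote markers such as [1]."""
    header, *body = rows
    keys = [h.strip().lower() for h in header]
    return [
        {k: re.sub(r"\[\w+\]", "", v).strip() for k, v in zip(keys, row)}
        for row in body
    ]


def transform(records):
    """Step 3: convert capacity to int and sort from largest to smallest."""
    for record in records:
        record["capacity"] = int(record["capacity"].replace(",", ""))
    return sorted(records, key=lambda r: r["capacity"], reverse=True)


def to_csv(records):
    """Step 4 (local part only): serialise the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample standing in for a fetched Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Stadium</th><th>Capacity</th></tr>
  <tr><td> Stadium A[1] </td><td>54,000</td></tr>
  <tr><td>Stadium B</td><td>81,500[2]</td></tr>
</table>
"""
```

Calling `to_csv(transform(clean(extract_rows(SAMPLE_HTML))))` produces CSV text with `Stadium B` listed before `Stadium A`. In an Airflow setup, each function would typically become its own task so that a failure in one step can be retried independently.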

Video

FootballDataEngineering
