JC3008/data_quality_on_airflow


Project Overview

This project extracts data from Kaggle and makes it available in GCP BigQuery.

The chosen technology stack is described below; a short sketch of how the pieces fit together follows the list:

  • Orchestration: Astro-Python-SDK
  • Data Quality: SODA
  • Data Transformations: DBT
  • Data Lake Storage: GCS (Google Cloud Storage)
  • Data Warehouse Storage: BigQuery
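Roughly, the DAG loads the raw file from the GCS data lake into BigQuery with the Astro-Python-SDK, and the quality checks and DBT models described later run on top of that. The sketch below is only an illustration under assumed names (bucket my-datalake-bucket, schema retail, file online_retail.csv) and the gcp connection configured later in this README; the actual DAG in the repository may differ.

    # Hedged sketch of a load step with the Astro Python SDK.
    # Bucket, schema and file names are placeholders, not the project's real ones.
    from datetime import datetime

    from airflow.decorators import dag
    from astro import sql as aql
    from astro.files import File
    from astro.sql.table import Table, Metadata


    @dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False, tags=["retail"])
    def retail():
        # Load the raw CSV from the GCS data lake into a BigQuery table,
        # reusing the "gcp" Airflow connection described further down.
        aql.load_file(
            task_id="upload_csv_to_bigquery",
            input_file=File("gs://my-datalake-bucket/raw/online_retail.csv", conn_id="gcp"),
            output_table=Table(
                name="raw_invoices",
                conn_id="gcp",
                metadata=Metadata(schema="retail"),
            ),
            use_native_support=False,
        )


    retail()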

Project Status

  • Containerization with Docker: Done
  • Python script for extracting data: Pending
  • Orchestration: extraction from the source: Pending
  • Orchestration: upload into the GCS data lake: Done
  • Orchestration: data quality on the RAW layer: Done
  • Orchestration: transformation of DBT models: Done

Considerations about helpful prior knowledge

For this project it helps to have prior experience with Python development, the Docker CLI, and Airflow practices. The Airflow environment is built using the Astro-Python-SDK. This approach speeds up the build process and avoids some common issues when setting up dependencies. Familiarity with configuration files is also useful, as SODA relies heavily on them to establish connections and perform data quality assurance. I intend to describe all the requirements in detail to make the setup as easy as possible.

Some important Airflow commands

  • To start a new environment: astro dev init
  • After the initialization finishes, you will see a folder structure inside your root folder.
  • To start the Airflow UI, just type astro dev start in your terminal. It will build all dependencies and start the UI.
  • In the Airflow UI, add a new connection called gcp and choose Google Cloud as the connection type. In the Keyfile Path field, enter the path to the service_account.json file; it is probably something like /usr/local/airflow/include/gcp/service_account.json. This file will be created on the GCP platform, as described in the next section. A sketch that sanity-checks this connection follows this list.
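Once the gcp connection is saved, one way to confirm it works is a throwaway DAG that lists your bucket. This is a hedged sketch, not project code; the bucket name is a placeholder.

    # Hypothetical sanity-check DAG for the "gcp" connection.
    # Replace "my-datalake-bucket" with the bucket created in the next section.
    from datetime import datetime

    from airflow.decorators import dag, task


    @dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
    def check_gcp_connection():
        @task
        def list_bucket_objects():
            from airflow.providers.google.cloud.hooks.gcs import GCSHook

            hook = GCSHook(gcp_conn_id="gcp")
            # Prints the object names (an empty list is fine for a fresh bucket).
            print(hook.list(bucket_name="my-datalake-bucket"))

        list_bucket_objects()


    check_gcp_connection()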

GCP environment setup

  • Create an account if you don't have one.

  • Create a bucket

  • Create a Service Account with these roles: Storage Admin and BigQuery Admin

  • For this Service Account, create a new key: click on the service account name, open the Keys menu, create a new key, and save it as JSON inside the include/gcp directory (as service_account.json, the file referenced above). A quick way to check the key is sketched after this list.
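To confirm that the key and roles are in order, a small standalone check like the one below can help. It is only a sketch: it assumes the key sits at include/gcp/service_account.json and that the google-cloud-storage client library is available in your environment (it ships with Airflow's Google provider).

    # Hedged sanity check for the downloaded service account key (not project code).
    from google.cloud import storage

    client = storage.Client.from_service_account_json(
        "include/gcp/service_account.json"
    )

    # Lists your buckets; this only works if the Storage Admin role was granted.
    for bucket in client.list_buckets():
        print(bucket.name)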

SODA configuration

SODA is a framework that enables us to implement a data quality step in our pipelines. Checks are written as yml files that hold the expected data types and column names. For a better understanding, take a look at include/soda/checks/sources/raw_invoices.yml.

  • It is required to fill in include/soda/configuration.yml with your credentials; a sketch of running a scan against this configuration follows below.
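The checks can be executed through the soda-core Python library, which is also what an Airflow task would call. The sketch below assumes a data source named retail in configuration.yml; names and paths are placeholders, so match them to your own files.

    # Hedged sketch of a programmatic Soda scan with soda-core.
    # The data source name "retail" and the scan definition name are assumptions.
    from soda.scan import Scan

    scan = Scan()
    scan.set_scan_definition_name("raw_invoices_check")
    scan.set_data_source_name("retail")
    scan.add_configuration_yaml_file("include/soda/configuration.yml")
    scan.add_sodacl_yaml_files("include/soda/checks/sources/raw_invoices.yml")

    result = scan.execute()
    print(scan.get_logs_text())
    if result != 0:
        raise ValueError("Soda scan failed: data quality checks did not pass")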
