Data Engineering Zoomcamp

Register in DataTalks.Club's Slack
Join the #course-data-engineering channel
Join the course Telegram channel with announcements
The videos are published on DataTalks.Club's YouTube channel in the course playlist
Frequently asked technical questions

Syllabus

Week 1: Introduction & Prerequisites
Week 2: Workflow Orchestration
Week 3: Data Warehouse
Week 4: Analytics engineering
Week 5: Batch processing
Week 6: Streaming
Week 7, 8 & 9: Project

Taking the course

2023 Cohort

Start: 16 January 2023 (Monday) at 18:00 CET
Registration link: https://airtable.com/shr6oVXeQvSI5HuWD
Subscribe to our public Google Calendar (it works on desktop only)
Cohort folder with homeworks and deadlines

Self-paced mode

All course materials are freely available, so you can take the course at your own pace

Follow the suggested syllabus (see below) week by week
You don't need to fill in the registration form. Just start watching the videos and join Slack
Check the FAQ if you have problems
If you can't find a solution to your problem in the FAQ, ask for help in Slack

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To make discussions in Slack more organized:

Follow these recommendations when asking for help
Read the DataTalks.Club community guidelines

Syllabus

Note: NYC TLC changed the format of the data we use to Parquet, but you can still access the CSV files here.

Week 1: Introduction & Prerequisites

Course overview
Introduction to GCP
Docker and docker-compose
Running Postgres locally with Docker
Setting up infrastructure on GCP with Terraform
Preparing the environment for the course
Homework

More details

Week 2: Workflow Orchestration

Data Lake
Workflow orchestration
Introduction to Prefect
ETL with GCP & Prefect
Parametrizing workflows
Prefect Cloud and additional resources
Homework

More details

Week 3: Data Warehouse

Data Warehouse
BigQuery
Partitioning and clustering
BigQuery best practices
Internals of BigQuery
Integrating BigQuery with Airflow
BigQuery Machine Learning

More details

Week 4: Analytics engineering

Basics of analytics engineering
dbt (data build tool)
BigQuery and dbt
Postgres and dbt
dbt models
Testing and documenting
Deployment to the cloud and locally
Visualizing the data with Google Data Studio and Metabase

More details

Week 5: Batch processing

Batch processing
What is Spark
Spark Dataframes
Spark SQL
Internals: GroupBy and joins

More details

Week 6: Streaming

Introduction to Kafka
Schemas (Avro)
Kafka Streams
Kafka Connect and KSQL

More details

Week 7, 8 & 9: Project

Putting everything we learned into practice

Week 7 and 8: working on your project
Week 9: reviewing your peers

More details

Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

More details

Overview

Architecture diagram

Technologies

Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
Terraform: Infrastructure-as-Code (IaC)
Docker: Containerization
SQL: Data Analysis & Exploration
Prefect: Workflow Orchestration
dbt: Data Transformation
Spark: Distributed Processing
Kafka: Streaming

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course, you'll need to have the following software installed on your computer:

Docker and Docker-Compose
Python 3 (e.g. via Anaconda)
Google Cloud SDK
Terraform

See Week 1 for more details about installing these tools
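
Before starting Week 1, it can help to confirm that the tools listed above are reachable from your shell. The snippet below is a minimal sketch, not part of the course materials: the executable names (`docker`, `docker-compose`, `python3`, `gcloud`, `terraform`) are assumptions about how each tool is usually exposed on PATH, and `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Executable names are assumptions: the list above names the tools, not the
# commands they install. Adjust for your platform (e.g. "python" on Windows).
REQUIRED_TOOLS = {
    "Docker": "docker",
    "Docker-Compose": "docker-compose",
    "Python 3": "python3",
    "Google Cloud SDK": "gcloud",
    "Terraform": "terraform",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the display names of tools whose executable is not on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All course tools were found on PATH.")
```

A tool reported as missing may still be installed under a different command name. For example, newer Docker releases provide Compose as the `docker compose` subcommand rather than a separate `docker-compose` executable, so treat the output as a hint rather than a verdict.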

Supporters and partners

Thanks to the course sponsors for making it possible to create this course

Do you want to support our course and our community? Please reach out to alexey@datatalks.club
