- Register in DataTalks.Club's Slack
- Join the #course-data-engineering channel
- Join the course Telegram channel with announcements
- The videos are published on DataTalks.Club's YouTube channel in the course playlist
- Frequently asked technical questions
Syllabus
- Week 1: Introduction & Prerequisites
- Week 2: Workflow Orchestration
- Week 3: Data Warehouse
- Week 4: Analytics Engineering
- Week 5: Batch processing
- Week 6: Streaming
- Week 7, 8 & 9: Project
- Start: 16 January 2023 (Monday) at 18:00 CET
- Registration link: https://airtable.com/shr6oVXeQvSI5HuWD
- Subscribe to our public Google Calendar (it works from desktop only)
- Cohort folder with homeworks and deadlines
All course materials are freely available, so you can take the course at your own pace.
- Follow the suggested syllabus (see below) week by week
- You don't need to fill in the registration form. Just start watching the videos and join Slack
- Check the FAQ if you have problems
- If you can't find a solution to your problem in the FAQ, ask for help in Slack
The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.
To make discussions in Slack more organized:
- Follow these recommendations when asking for help
- Read the DataTalks.Club community guidelines
Note: NYC TLC changed the format of the data we use to Parquet, but you can still access the CSV files here.
- Course overview
- Introduction to GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment for the course
- Homework
- Data Lake
- Workflow orchestration
- Introduction to Prefect
- ETL with GCP & Prefect
- Parametrizing workflows
- Prefect Cloud and additional resources
- Homework
- Data Warehouse
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- Integrating BigQuery with Airflow
- BigQuery Machine Learning
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- Postgres and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase
- Batch processing
- What is Spark
- Spark Dataframes
- Spark SQL
- Internals: GroupBy and joins
- Introduction to Kafka
- Schemas (Avro)
- Kafka Streams
- Kafka Connect and KSQL
Putting everything we learned into practice
- Week 7 and 8: working on your project
- Week 9: reviewing your peers
- Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure-as-Code (IaC)
- Docker: Containerization
- SQL: Data Analysis & Exploration
- Prefect: Workflow Orchestration
- dbt: Data Transformation
- Spark: Distributed Processing
- Kafka: Streaming
To get the most out of this course, you should feel comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.
Prior experience with data engineering is not required.
For this course, you'll need to have the following software installed on your computer:
- Docker and Docker-Compose
- Python 3 (e.g. via Anaconda)
- Google Cloud SDK
- Terraform
See Week 1 for more details about installing these tools.
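As a quick sanity check before starting, something like the following sketch can report which of the tools listed above are not yet on your PATH. The executable names (`docker`, `docker-compose`, `python3`, `gcloud`, `terraform`) are assumptions about a typical installation, not something the course prescribes. Newer Docker installs, for instance, may ship Compose as a `docker compose` plugin rather than a separate `docker-compose` binary, and on Windows Python is often installed as `python`.

```python
import shutil

# Assumed executable names for the required software; adjust to your setup.
REQUIRED_TOOLS = ["docker", "docker-compose", "python3", "gcloud", "terraform"]


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that each executable can be located. It does not verify versions or that the tools are configured correctly.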
Thanks to the course sponsors for making it possible to create this course.
Do you want to support our course and our community? Please reach out to alexey@datatalks.club