Model, build & automate production-ready Big Data infrastructure
- Understand how different database types suit different data use cases
- Create relational databases & ETL pipelines with PostgreSQL
- Create non-relational databases & ETL pipelines with Apache Cassandra
Create custom database schemas & ETL pipelines with PostgreSQL, Apache Cassandra and Python
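To make the PostgreSQL side concrete, here is a minimal sketch of a Python ETL step that creates a table and loads rows from a CSV file. The connection string, the data/songs.csv path, and the songs schema are illustrative assumptions, not material taken from the course; a Cassandra version would follow the same shape using the DataStax cassandra-driver, with tables modeled around the queries you need to run.

```python
# Minimal ETL sketch: load song records from a CSV into PostgreSQL.
# The connection details, file path, and "songs" schema are hypothetical.
import csv
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Create the target table if it does not exist yet.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id  TEXT PRIMARY KEY,
        title    TEXT NOT NULL,
        artist   TEXT,
        year     INT,
        duration NUMERIC
    );
""")

# Extract rows from the source file and load them, skipping duplicates.
with open("data/songs.csv", newline="") as f:
    for row in csv.DictReader(f):
        cur.execute(
            """
            INSERT INTO songs (song_id, title, artist, year, duration)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (song_id) DO NOTHING;
            """,
            (row["song_id"], row["title"], row["artist"], int(row["year"]), float(row["duration"])),
        )

conn.commit()
conn.close()
```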
- Understand how to use essential computing, storage, and analytics tools in Amazon Web Services (AWS)
- Dissect the core components of data warehouses and learn how to optimize them for different situations
- Implement a data warehouse in AWS — including scalable storage, ETL strategies and design & query optimization
Build an ETL pipeline that extracts data from Amazon S3, stages it in Redshift and transforms it into analytics tables
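The sketch below shows roughly what that S3-to-Redshift pattern can look like, assuming the staging and analytics tables already exist in the cluster. The cluster endpoint, credentials, bucket, IAM role ARN, and column names are all placeholders, not details from the project.

```python
# Sketch of an S3 -> Redshift load-and-transform step, run as plain SQL over psycopg2.
# Endpoint, credentials, bucket, IAM role, and table/column names are hypothetical,
# and the staging_events / staging_songs / songplays tables are assumed to exist.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="replace-me",
)
cur = conn.cursor()

# Stage raw JSON events from S3 into Redshift.
cur.execute("""
    COPY staging_events
    FROM 's3://example-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")

# Transform staged rows into an analytics fact table.
cur.execute("""
    INSERT INTO songplays (start_time, user_id, song_id, level)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.user_id, s.song_id, e.level
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
""")

conn.commit()
conn.close()
```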
- Practice using Apache Spark for cleaning and aggregating data
- Run Spark on a distributed cluster in AWS and learn best practices for debugging & optimizing Spark apps
- Dive into data lakes — understand their importance, core components, and different setup options & issues in the cloud
- Build data lakes & ETL pipelines with Spark
Sparkify’s data keeps growing! Time to move from a data warehouse to a data lake with Spark
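As a sketch of that workflow, the PySpark job below reads raw JSON logs, cleans and aggregates them, and writes partitioned Parquet tables back to the lake. The S3 paths, column names, and partitioning scheme are hypothetical, chosen only to illustrate the pattern.

```python
# Sketch of a Spark job that cleans raw JSON logs and writes partitioned Parquet
# to a data lake on S3. Paths, columns, and table layout are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Read raw event logs from the lake's landing zone.
events = spark.read.json("s3a://example-bucket/raw/log_data/*.json")

# Clean: drop duplicates, keep only song plays, and derive partition columns.
songplays = (
    events.dropDuplicates(["sessionId", "ts"])
          .filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
          .withColumn("year", F.year("start_time"))
          .withColumn("month", F.month("start_time"))
)

# Aggregate: song plays per user per month.
plays_per_user = songplays.groupBy("userId", "year", "month").count()

# Write analytics tables back to the lake, partitioned for efficient reads.
songplays.write.mode("overwrite").partitionBy("year", "month") \
    .parquet("s3a://example-bucket/analytics/songplays/")
plays_per_user.write.mode("overwrite") \
    .parquet("s3a://example-bucket/analytics/plays_per_user/")
```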
- Understand how Airflow works and learn to configure, schedule and debug pipeline jobs
- Track data lineage, set up schedules, partition data for optimization, and write tests that ensure data quality
- Build production data pipelines with a strong emphasis on maintainability and reusability
Automate Sparkify’s systems with dynamic, reusable pipelines that allow easy backfills
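A minimal sketch of what such a pipeline can look like in Airflow 2.x is shown below: an hourly DAG with retries, a data-quality task, and catchup enabled so past partitions can be backfilled. The task bodies, schedule, and table names are placeholders rather than the course's actual pipeline.

```python
# Sketch of an Airflow 2.x DAG: scheduled, retryable, backfill-friendly, with a
# data-quality gate. Task logic, schedule, and names are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_songplays(**context):
    # Placeholder: load the partition for this run's execution date.
    print(f"Loading partition {context['ds']}")


def check_data_quality(**context):
    # Placeholder: fail the run if the target table came back empty.
    row_count = 1  # in a real task this would come from a database query
    if row_count < 1:
        raise ValueError("Data quality check failed: songplays is empty")


default_args = {
    "owner": "sparkify",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=True,  # lets Airflow backfill past intervals automatically
    default_args=default_args,
) as dag:
    load = PythonOperator(task_id="load_songplays", python_callable=load_songplays)
    quality = PythonOperator(task_id="check_data_quality", python_callable=check_data_quality)

    load >> quality
```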
- Choose a use case that appeals to you: an analytics table, an app back-end, a source-of-truth database, etc.
- Gather the data you'll be using for your project (at least two sources and >1 million rows)
- Explore the data, clean it, model it, and then build, monitor and optimize the appropriate ETL for its consumption
Build your own end-to-end data-engineering project, then perfect your code with the help of our reviewers
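For the explore-and-clean phase, a short pandas pass like the sketch below is often enough to profile each source and assemble a first analytics table before committing to a full pipeline. The file paths, column names, and join key are invented for illustration and would be replaced by whatever datasets you choose.

```python
# Sketch of the explore / clean / model steps of a capstone project.
# File paths, columns, and the join key are hypothetical.
import pandas as pd

# Two hypothetical sources: an events log and a user reference table.
events = pd.read_csv("data/events.csv")
users = pd.read_csv("data/users.csv")

# Explore: size, schema, and missing values in each source.
for name, df in [("events", events), ("users", users)]:
    print(name, df.shape)
    print(df.dtypes)
    print(df.isna().sum(), end="\n\n")

# Clean: drop exact duplicates and rows missing the join key.
events = events.drop_duplicates().dropna(subset=["user_id"])

# Model: join the sources into one analytics-ready table.
analytics = events.merge(users, on="user_id", how="inner")

# A simple quality check before loading the result anywhere downstream.
assert len(analytics) > 0, "Join produced no rows - check the keys"
```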