↗️ Link to Session 2
Setup instructions and exercises for the Modern Data Stack training session.
NOTE: Cloud SQL will be powered down after each training session to prevent excessive/unnecessary cost accumulation.
Unfortunately because of cost and technical limitations certain services need to be shared among all participants, e.g. database instances. Therefore it is necessary to prefix these with a unique string to prevent conflicts or accidental overwrites of other people's work.
So in the remainder of this training wherever <YOUR_UNIQUE_PREFIX>
is
mentioned you should replace that with your actual unique prefix string, e.g.
using your name (i.e. <YOUR_UNIQUE_PREFIX>_database
becomes
john_database
)
Prerequisites:
- Docker is installed and running
- You have the GCP credentials
.json
key file - See Slack - You have the CloudSQL's postgres db password - See Slack
⚠️ The below shell commands are assuming a unix-like OS, e.g. macOS or Linux.
- In the root of the repository create a directory
.config
(this directory is gitignored) - Put the
.json
GCP credentials key file in there ascreds.json
- NOTE: the cloud sql proxy defined in the docker compose yaml file will use this
- Run docker-compose (from the root of the repository):
- Make sure docker
This will create the following services: * CloudSQL Proxy - to connect to the postgres instance on GCP * pgAdmin - a webapp to navigate the postgres instance and execute commands on it * Airbyte - a (local) server and fronted (webapp) to load data from postgres to BigQuery (in our case)docker compose up
- 🙏 Hopefully docker compose instantiated all services successfully 🙏
We are going to use pgAdmin to connect to the CloudSQL instance on GCP, create a database and populate it.
- Navigate to
http://localhost:8080
- Login with
- username:
admin@admin.com
- password:
root
- username:
- Click on
Add New Server
- Give the server a name, e.g.
modern-data-stack-pg-server
- In the
Connection
tab configure the following:- Host name/address:
cloudsql-proxy
- Port:
5432
- MaintenanceDB:
postgres
- Username:
postgres
- Password:
<see slack for password>
- Host name/address:
- Click
Save
- We should now be connected to the CloudSQL postgres instance on GCP
- The sidebar should now contain this server
- Expand this see its databases
- Right click on the
postgres
database and clickQuery Tool
-
Now let's create our source database:
🚨 Because we are working with only a single postgres instance every one should prefix a unique string, e.g. your name, to make the database name unique
Copy-paste the following SQL into the Query Tool, replace
<YOUR_UNIQUE_PREFIX>
with your unique prefix and run the query (F5 - keyboard shortcut).CREATE DATABASE <YOUR_UNIQUE_PREFIX>_northwind WITH OWNER = postgres ENCODING = 'UTF8' CONNECTION LIMIT = -1 IS_TEMPLATE = False;
-
In the sidebar right click on
Database
and clickRefresh
, your database should be visible there now
With the source database created we can populate it with tables and data for the dummy retail store Northwind.
⚠️ To make sure subsequent queries will be run against your new database close the existing Query Tool which is connected to thepostgres
database
- Open the
Query Tool
on the (newly created) source database, i.e. right click on the<YOUR_UNIQUE_PREFIX>_northwind
database and clickQuery Tool
- Copy-paste the SQL from
data_source/create_northwind.sql
into the Query Tool and run it - If the script ran successfully, in the sidebar, under your Databases ->
<YOUR_UNIQUE_PREFIX>_northwind
-> Schemas -> public -> Tables- If the tables do not show up a refresh of the database might be required
- To double check the query worked, clear the opened Query tool and run:
This should return all 91 customers.
select * from customers;
NOTE: after restart of airbyte, i.e.
docker compose down
->docker compose up
, if the docker volumes where not deleted then the anything configured, e.g. sources, destinations and connections, will still exist. So the setup for these components should only have to be done once.
- Go to
http://localhost:8000
- Login with the following credentials
- Username:
airbyte
- Password:
password
- Username:
- To create a data source, click
Sources
in the sidebar and select source typePostgres
- Set up Source:
- Host:
localhost
- Port:
5432
- Database Name:
<YOUR_UNIQUE_PREFIX>_northwind
- Username:
postgres
- Password:
<see slack for password>
- Host:
- Click
Set up source
- This will create the source and test the connection
- Should anything be wrong with the connection details Airbyte should pick this up.
- To create a data destination, click on
Destinations
and select destination typeBigQuery
- Set up Destination:
- Project ID:
modern-data-stack-training
- Dataset Location:
EU
- Default Dataset ID:
<YOUR_UNIQUE_PREFIX>_northwind_raw
- 🚨 A prefix is again required to avoid conflicts because we are working in the same BigQuery instance
- Note also that the dataset does not have to exist yet, Airbyte will create a dataset in BigQuery by this name
- Service Account Key JSON: copy-paste the contents of the
.json
credentials file into the field- The file should exist at
.config/creds.json
, as we've set it up in the beginning
- The file should exist at
- Project ID:
- Click
Set up destination
- Just like with the source Airbyte will verify the connection details.
- Create a
Connection
: go toConnections
(in the sidebar) and clickCreate your first connection
- Select an existing source:
Postgres
- Click
Use existing source
- Select an existing destination:
BigQuery
- Click
Use existing destination
- A new connection setup should be shown. All fields can be left unchanged. At the bottom click
Set up Connection
- This create the connection and automatically start the initial sync process
- Under
Connections
->Postgres<>BigQuery
->Status
->Sync History
the sync should be running. - After the sync is done navigate to BigQuery
https://console.cloud.google.com/bigquery?project=modern-data-stack-training
- Under the
<YOUR_UNIQUE_PREFIX>_northwind_raw
some_airbyte_*
tables should be created alongside the source tables, e.g.customers
,orders
, etc. - For each table Airbyte has added some metadata columns:
_airbyte_ab_id
,_airbyte_emitted_at
,_airbyte_normalized_at
,_airbyte_products_hashid
. - The types for each table should be the same as in the source (postgres). However the nullability has not been carried over.
- Under the
Now that the we have loaded the raw data into our analytics database, BigQuery,
we can use dbt to define transformations (e.g. join
s and aggregations) that
will help us reach our analytics goals, be that business intelligence or ML.
- Install dbt
pip install dbt-bigquery
should suffice- Perhaps using a Python virtual environment if preferred.
The dbt CLI comes with a command to help you scaffold a dbt project. To create your dbt project:
-
Ensure dbt is installed by runningÂ
dbt --version
:dbt --version
-
Initialize your dbt project, run:
dbt init northwind
Follow the instructions. You will be asked about: * Which database adapter to use:
BigQuery
* Authentication method:service_account
* Authentication keyfile path:/full/path/to/repo/.config/creds.json
* GCP Project ID:modern-data-stack-training
* Your dataset name:<YOUR_UNIQUE_PREFIX>_dbt
* NOTE: we haven't created this (BigQuery) dataset yet, which is okay, dbt will create it for us and place all output tables there. * Region:EU
cd
 into your project:cd northwind
You can use pwd
 to confirm that you are in the right spot.
- Open your project (i.e. the directory you just created) in a code editor like Atom or VSCode.
You should see a directory structure withÂ
.sql
 andÂ.yml
 files that were generated by theÂinit
 command.
Test dbt's connection to BigQuery:
- Run
dbt debug
to test the connection was setup correctly- If the connection test failed have a look at the
profiles.yaml
configuration- This is a file that dbt created during the init process and is saved at
~/.dbt/profiles.yaml
by default - It contains the connection information
- This is a file that dbt created during the init process and is saved at
profiles.yaml
should look like this
northwind: outputs: dev: dataset: <YOUR_UNIQUE_PREFIX>_dbt job_execution_timeout_seconds: 300 job_retries: 1 keyfile: /full/path/to/repo/.config/creds.json location: EU method: service-account priority: interactive project: modern-data-stack-training threads: 2 type: bigquery target: dev
- If the connection test failed have a look at the
Our freshly created project has some example models in it. We're going to check that we can run them to confirm everything is in order.
Execute the dbt compile
 command to build the example models.
- Look at what the compiled sql code looks like in
target/compiled/northwind/models/example
- How does the model differ from the compiled
.sql
file?
- How does the model differ from the compiled
- Compare the two examples, do you see any differences in the DDL language?
- How do the model configurations differ?
- Which Jinja function are used?
NOTE: the
models/example
directory can be deleted from this point onwards
For now the most important parts of the generated dbt project structure are:
models/
directory- Here we create
.yml
files to define source tables and data models we wish to test and document. - And the actual
.sql
(Jinja-templated) models that define our transformation.
- Here we create
dbt_project.yml
- The
models.northwind
property is used to define defaults for our models, e.g. the materialization
- The
In theory dbt will work just fine if we create a single .yml
file in models/
and
configure all our sources and models in there.
However splitting up the .yml
files and creating some directories to include
specific models should improve organization of our models a bit.
Under models/
we will create
- A
staging/
directory- This is the entry point to our transformation pipeline(s)
- Models defined here are prefixed with
stg_
, e.g.stg_orders.sql
- Best practice is to only apply the following transformations in this phase of the pipeline:
- Renaming
- Type casting
- Basic computations (e.g. cents to dollars)
- Categorizing (using conditional logic to group values into buckets or booleans, such as in the case when statements above)
- ❌ Staging models should not include any joins or aggregations.
- A
marts/
directory- Transformations in these models generally contain joins or aggregations
- Mart models are refined from staging models and prepared for use by an external tool (e.g. dashboard or ML training pipeline)
This results in the following structure:
northwind/models
├── marts
│  ├── _models.yml
│  └── orders.sql
└── staging
├── _models.yml
├── _sources.yml
└── stg_orders.sql
_sources.yml
: here source tables are defined which should be referenced in staging models- Source tables should generally only be referenced directly in staging models
_models.yml
: here data models, e.g.stg_orders
, are defined alongside possible tests and documentation
Create staging models for orders
, order_details
, products
, categories
and employees
- Define sources in
_sources.yml
- Create
stg_[NAME].sql
model files with the relevant select statement- Use the
source
Jinja function to refer to a source table - Select all but the
_airbyte*
columns
- Use the
- Run dbt to create the staging models in BigQuery
dbt run
Navigate to BigQuery and have a look at the staging tables dbt should have created (a refresh might be required).
Supplementary exercises:
- Change the materialization of
stg_orders.sql
, what happens?- Look at the dbt documentation to learn about the different materializations strategies for dbt models. What is the difference between a table and a view?
- Run only a single model at a time (docs)
- Write some bad SQL to cause an error — can you debug this error?
- Note that if an error persists even after it should have been fixed, removing the
target/
directory might help.
- Note that if an error persists even after it should have been fixed, removing the
Adding tests to a project helps validate that your models are working correctly. So let's add some tests to our project!
Models are configured with tests (and documented) in the .yml
file using the following structure:
Test are defined in the .yml
files under the relevant directory
version: 2
models:
- name: stg_orders
columns:
- name: order_id
Read test docs on how to define tests. Working with dbt's out of the box four generic tests can often be enough to significantly improve data quality insurances.
Try to come up with relevant tests by studying the staging tables on BigQuery.
Here are some test suggestions:
- Test
order_id
is unique and not null for thestg_orders
model - Test
employee_id
is unique and not null for thestg_employees
model - Test referential integrity,
order_id
instg_orders
matchesorder_id
in thestg_order_details
model. - Test accepted values for
city
in thestg_employees
model- London, Tacoma, Redmond, Seattle, Kirkland
Run the tests to confirm they all pass:
dbt test
Supplementary exercises:
- Write a test that fails, for example, omit one of the cities in theÂ
accepted_values
 list. What does a failing test look like? Can you debug the failure? - Run the tests for one model only. If you grouped yourÂ
stg_
 models into a directory, try running the tests for all the models in that directory.
Exercise: Document your models
Adding documentation to your project allows you to describe your models in rich detail, and share that information with your team. Here, we're going to add some basic documentation to our project.
- Update yourÂ
models/staging/_models.yml
 file to include some descriptions.- Both on model and column level.
- ExecuteÂ
dbt docs generate
 to generate the documentation for your project.- dbt introspects your project and your warehouse to generate a json file with rich documentation about your project, which can be found at
target/catalog.json
- dbt introspects your project and your warehouse to generate a json file with rich documentation about your project, which can be found at
- [CLI]Â ExecuteÂ
dbt docs serve
 to launch the documentation in a local website.- if necessary specify a different port by using
--port 9090
- if necessary specify a different port by using
Supplementary exercise:
- Use a docs block to add a Markdown description to a model.
Let's create models in the marts area. Here we will combine and/or aggregate staging models to create new models which are ready to be used for analytical purposes.
NOTE: Use the
ref
Jinja function to refer to staging models.
A solution is provided for each exercise for if you get stuck.
- Create a model
marts/order_details.sql
.- Calculate sales price for each order after discount is applied, i.e. create a column like
discounted_total_price
.
- Calculate sales price for each order after discount is applied, i.e. create a column like
Solution
with
order_details as (select * from {{ ref("stg_order_details") }}),
products as (select * from {{ ref("stg_products") }}),
order_details_extended as (
select
order_details.order_id,
order_details.product_id,
order_details.unit_price,
order_details.quantity,
order_details.discount,
products.product_name,
(
order_details.unit_price
* order_details.quantity
* (1 - order_details.discount)
) as discounted_total_price
from products
inner join order_details using (product_id)
)
select *
from order_details_extended
- Create a model
marts/category_sales.sql
- This model should calculate the total sales for each product category
- Group orders by category and sum the total sales for each group.
- The sales amount calculation for each product sale should use the
discounted_total_price
created in the previous model.
Solution
with
products as (select * from {{ ref("stg_products") }}),
categories as (select * from {{ ref("stg_categories") }}),
order_details as (select * from {{ ref("order_details") }}),
category_sales as (
select
products.category_id,
categories.category_name,
sum(order_details.discounted_total_price) as total_sales
from products
inner join order_details on products.product_id = order_details.product_id
inner join categories on products.category_id = categories.category_id
group by products.category_id, categories.category_name
)
select *
from category_sales
- Create a model
marts/employee_sales.sql
.- Calculate the total sale amount for each employee
Solution
with
employees as (select * from {{ ref("stg_employees") }}),
orders as (select * from {{ ref("stg_orders") }}),
order_details as (select * from {{ ref("order_details") }}),
order_subtotals as (
select
order_id,
sum(order_details.discounted_total_price) as sale_amount
from order_details
group by order_id
),
employee_sales as (
select
orders.employee_id,
min(order_date) as first_order_date,
max(order_date) as last_order_date,
sum(order_subtotals.sale_amount) as total_sale_amount
from orders
inner join order_subtotals on orders.order_id = order_subtotals.order_id
group by orders.employee_id
),
final as (
select
employees.first_name,
employees.last_name,
employees.country,
employee_sales.total_sale_amount,
employee_sales.first_order_date,
employee_sales.last_order_date,
date_diff(employee_sales.last_order_date, employee_sales.first_order_date, day) as days_active
from employee_sales
inner join employees on employee_sales.employee_id = employees.employee_id
order by employee_sales.total_sale_amount desc
)
select *
from final
- Create a model
marts/orders_per_month.sql
- Calculate the total number of orders per month (per year)
Solution
with
orders as (select * from {{ ref("stg_orders") }}),
orders_with_ym as (
select
extract(year from order_date) as order_year,
extract(month from order_date) as order_month,
*
from orders
),
final as (
select order_year, order_month, count(1) as num_orders
from orders_with_ym
group by order_year, order_month
order by order_year, order_month
)
select *
from final
This (data) model could be used to train a forecaster model on.
Supplementary exercise:
- Rerun
dbt docs generate
anddbt docs serve
and have a look at the lineage graph- By using the
ref
Jinja function dbt can create this graph - This is especially useful for large dbt projects
- By using the
- Best practice how to structure your project
- Learn how to refactor long SQL queries: https://docs.getdbt.com/tutorial/refactoring-legacy-sql