Analytics Engineering

  • Note: Check branch week 4 for dbt content

What is Analytics Engineering?

analytics_engineer.png

tools.png

  • We will focus on

    • Data modelling
    • Data presentation
  • Data modelling concepts

    • ETL vs ELT
      • In ETL the data is transformed before loading it into the data warehouse
      • In ELT the data is transformed inside the data warehouse after loading
      • ETL is more stable
      • ELT is faster and more flexible

    etl_elt.png

    • Kimball's Dimensional Modeling:

      • Objective:

        • Deliver data that is understandable to the business users
        • Deliver fast query performance
      • Approach:

        • Prioritise user understandability and query performance over non-redundant data (3NF)
      • Other approaches:

        • Bill Inmon
        • Data vault
    • Elements of Dimensional Modeling

      • Fact tables
        • Measurements, metrics or facts
        • Correspond to a business process
        • "verbs" (e.g. sales, orders)
      • Dimension tables
        • Correspond to a business entity
        • Provide context to a business process
        • "nouns" (e.g. customer, product)
      • Also known as a star schema (a small sketch follows after this list)
    • Architecture of Dimensional Modeling

      • Staging area

        • Contains the raw data
        • Not meant to be exposed to everyone
      • Processing area

        • From raw data to data models
        • Focuses on efficiency
        • Ensures standards
      • Presentation area

        • Final presentation of the data
        • Exposed to the business stakeholders
      • Restaurant analogy:

        • Staging area: where the raw ingredients arrive and are stored (limited to people who know how to handle them)
        • Processing area: the kitchen, where the food is prepared (limited to people who know how to do it)
        • Presentation area: the dining area, where the food is presented to the guests
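
As a rough illustration of the star schema mentioned above, a query joins the fact table to its dimension tables and aggregates the measures; all table and column names here are invented for this sketch, they are not part of the course data:

select
  d_customer.customer_name,
  d_product.product_category,
  sum(f_sales.amount) as total_amount -- the measurable fact
from fact_sales as f_sales
join dim_customer as d_customer
  on f_sales.customer_id = d_customer.customer_id
join dim_product as d_product
  on f_sales.product_id = d_product.product_id
group by 1, 2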

What is dbt?

  • dbt is a transformation tool that allows anyone who knows SQL to deploy analytics code following software engineering best practices like modularity, portability, CI / CD, and documentation.

dbt.png

  • dbt workflow

    • develop models
    • test and document them
    • deployment phase (version control and CI / CD)

dbt_workflow.png

  • How does dbt work?
    • Adds a modelling layer on top of our data, where the data is transformed
    • The transformed data is persisted back to the data warehouse
    • The transformations are written in SQL; these are our dbt models
    • dbt compiles the code and pushes it to the data warehouse

dbt2.png
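
For intuition, a model file contains only a SELECT statement; for a table materialization dbt roughly compiles and runs DDL like the following in the warehouse (a sketch; my_model, my_schema and source_table are placeholder names):

-- models/my_model.sql (what we write)
select *
from raw_data.source_table

-- roughly what dbt compiles and executes
create or replace table my_schema.my_model as (
  select *
  from raw_data.source_table
);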

How to use dbt?

  • dbt Core
    • Open-source project that allows data transformations
    • Builds and runs a dbt project (.sql and .yml files)
    • Includes a CLI interface to run dbt commands locally
    • Open source and free to use
  • dbt Cloud
    • SaaS application to develop and manage dbt projects
    • Web-based IDE to develop, run and test a dbt project
    • Jobs orchestration
    • Logging and alerting
    • Integrated documentation
    • Free for individuals (one developer seat)

How are we going to use dbt?

  • BigQuery
    • Development using cloud IDE
    • No local installation of dbt Core
  • Postgres
    • Development using a local IDE of your choice
    • Local installation of dbt core connecting to the Postgres database
    • Running dbt models through the CLI

dbt3.png

Starting a dbt project: BigQuery

  • dbt provides a starter project with all the basic folders and files.

  • There are essentially two ways to use it:

    • With the CLI
      • After having installed dbt locally and set up the profiles.yml, run dbt init in the path where we want to start the project to clone the starter project
    • With dbt cloud
      • After having set up dbt cloud credentials (repo and dwh) we can start the project from the web-based IDE
  • Upload needed data (if not already done)

  • In Google BigQuery create two new datasets (schemas) in our project: "dbt-some-name" (for development) and "production" (for production)

  • Make a dbt-cloud account and connect to BigQuery: https://www.getdbt.com/signup/

  • Create a service account for dbt in BigQuery following this instruction: https://github.com/froukje/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/dbt_cloud_setup.md

    • Note: to add the dbt key to the GitHub repo, go to your repo and click on "Settings" in the top menu of the repository, not the one below your profile
  • Create a new branch on github

  • After finishing the setup go to "Develop" in the dbt cloud

  • Switch to the new branch

  • Initialize the project by clicking the button

  • This creates several folders and files

  • Open the dbt_project.yml

    • Change the name to e.g. "taxi_rides_ny"
    • Change "models" to "taxi_rides_ny"
    • Delete the last two rows ("example ..."); we are not using this here

Starting a dbt Project locally (alternative)

  pg-de-zoomcamp:
    target: dev
    outputs:
      dev:
        type: postgres
        host: localhost
        user: my_user
        password: my_password
        port: 5432
        dbname: production
        schema: dbt_victoria_mola
        threads: 4
        keepalives_idle: 0 # default 0, indicating the system default

  bq-de-zoomcamp: # example name for the BigQuery profile; use the name referenced in dbt_project.yml
    target: dev
    outputs:
      dev:
        type: bigquery
        method: service-account
        project: taxi-rides-ny
        dataset: dbt_frauke
        threads: 4
        keyfile: my_key_file.json
  • This file is stored outside of the project in the home directory: .dbt/profiles.yml
  • Move to the repository and initialize the project: dbt init
  • This creates a new folder, which is the project; name it e.g. "taxi_rides_ny"
  • In this folder several files are created
    • dbt_project.yml needs to be adapted: change profile to the profile name we created in profiles.yml
  • We can use dbt debug to check our connection

Development of dbt Models

  • We can use SQL and Jinja for the model
  • Inside a Jinja block we can use functions (macros)
    • At the beginning of a model, usually the config macro is used
    • In the example below we use the "table" materialization strategy; other provided strategies are view, incremental, and ephemeral
    • Using this materialization strategy drops the table if it already exists and creates a table in the schema we are working in, named after the model (filename), in the example my_model

dbt_model.png
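
A minimal model using this strategy might look like this; the schema and table names are placeholders for whatever the model reads from:

{{ config(materialized='table') }}

select *
from staging.source_table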

  • The model is translated to a SELECT statement
  • The FROM clause in this statement is defined by the Jinja macro source
  • The sources can be defined in a .yml-file
  • Data can also be read from a .csv-file ("Seeds")
  • This .csv-file is stored in our repository

from.png

ref.png

  • Move to the console and create our own model: https://cloud.getdbt.com/
  • Create two more folders in the "models" folder: "staging" (for raw models) and "core" (for models that are exposed at the end)
  • Create a first model under "staging":
    • stg_green_tripdata.sql
    • We use a "view" (not a "table"), so that we don't have to refresh it constantly but still have the latest data
    • We want to model the SQL query: select * from stoked-mode-375206.dezoomcamp.rides_green
    • For that we need to create a file schema.yml in the "staging" folder and define the source (see the example after the screenshots below)

first_model.png

first_schema.png
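
The schema.yml shown in the screenshot might look roughly like this; project, dataset, and table names are taken from the query we want to model:

version: 2

sources:
  - name: staging
    database: stoked-mode-375206 # BigQuery project
    schema: dezoomcamp # BigQuery dataset
    tables:
      - name: rides_green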

  • Run the model
    • At the bottom of the page is a field to type in: dbt run
    • This runs all the models that are defined

first_run.png

  • Change the model:
{{ config(materialized='view') }}

select 
  --identifiers
  cast(vendorid as integer) as vendorid,
  cast(ratecodeid as integer) as ratecodeid,
  cast(pulocationid as integer) as pickup_locationid,
  cast(dolocationid as integer) as dropoff_locationid,
  --timestamps
  cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
  cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
  --trip info
  store_and_fwd_flag,
  cast(passenger_count as integer) as passenger_count,
  cast(trip_distance as numeric) as trip_distance,
  cast(trip_type as integer) as trip_type,
  --payment info
  cast(fare_amount as numeric) as fare_amount,
  cast(extra as numeric) as extra,
  cast(mta_tax as numeric) as mta_tax,
  cast(tip_amount as numeric) as tip_amount,
  cast(tolls_amount as numeric) as tolls_amount,
  cast(improvement_surcharge as numeric) as improvement_surcharge,
  cast(total_amount as numeric) as total_amount,
  cast(payment_type as integer) as payment_type,
  cast(congestion_surcharge as numeric) as congestion_surcharge

from {{ source('staging', 'rides_green') }}
limit 100
  • To run just one model we can use dbt run --select stg_green_tripdata
  • In BigQuery we can now see the created view (called stg_green_tripdata)

Macros

  • What are macros?
    • Macros are similar to functions, but return code

macros.png

  • Create a macro in the project
    • In the folder "macros" create a new file called "get_payment_type_description.sql"
{#
	This macro returns the description of the payment_type
#}

{% macro get_payment_type_description(payment_type) -%}
	
	case {{ payment_type }}
		when 1 then 'Credit card'
		when 2 then 'Cash'
		when 3 then 'No charge'
		when 4 then 'Dispute'
		when 5 then 'Unknown'
		when 6 then 'Voided trip'
	end

{%- endmacro %}
  • Then we can call the macro in the stg_green_tripdata.sql file: {{ get_payment_type_description('payment_type') }} as payment_type_description
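
At compile time dbt replaces the macro call with the code it returns, so the line above roughly expands to:

case payment_type
    when 1 then 'Credit card'
    when 2 then 'Cash'
    when 3 then 'No charge'
    when 4 then 'Dispute'
    when 5 then 'Unknown'
    when 6 then 'Voided trip'
end as payment_type_description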

  • In dbt packages can be used

    • dbt projects with models and macros that can be reused
    • Create a packages.yml file in the project, where all packages you want to use are listed

dbt_package.png

  • In the project create a packages.yml file at the root of the repo with
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
  • To install the packages, run in the field at the bottom of the page: dbt deps
  • Add it to the model: {{ dbt_utils.surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid

Variables

  • Can be used across the project

variables.png

  • Add a variable to the model that defines whether a run is a test run or not, to limit the data processed in tests
  • At the end of the SQL statement, wrap the limit in a Jinja conditional that reads the variable (here defaulting to a test run):
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
  • To run with the full data, pass the variable on the command line: dbt build --select <model.sql> --vars '{is_test_run: false}'

A second model

  • Create a second model for the yellow trip data
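
A shortened sketch of what stg_yellow_tripdata.sql could look like; it follows the same pattern as the green model, but the yellow data uses the tpep_* timestamp columns, and the source table name rides_yellow is an assumption analogous to rides_green:

{{ config(materialized='view') }}

select
  cast(vendorid as integer) as vendorid,
  cast(pulocationid as integer) as pickup_locationid,
  cast(dolocationid as integer) as dropoff_locationid,
  cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
  cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,
  cast(total_amount as numeric) as total_amount
from {{ source('staging', 'rides_yellow') }}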

dbt Seeds

  • .csv files that we can have in our repo in the "Seeds" folder
  • Meant to be used for smaller files that contain data that does not change very often
  • To get the file into dbt Cloud, we need to upload it into the repo and pull it from dbt Cloud, or create the file directly there
  • Create a file called taxi_zone_lookup.csv in the "Seeds" folder
  • Run dbt seed
  • In our dbt_project.yml file add:
seeds:
  stoked-mode-375206:
    taxi_zone_lookup:
      +column_types:
        locationid: numeric
  • Create a model based on this seed
    • Create a new file in "models/core" called dim_zones.sql
    • This time we materialize it as a table
    • Also create a model called fact_trips.sql to unify everything
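
A sketch of dim_zones.sql based on the seed; the column names correspond to the taxi zone lookup file, and ref() points dbt at the seed:

{{ config(materialized='table') }}

select
  locationid,
  borough,
  zone,
  service_zone
from {{ ref('taxi_zone_lookup') }}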

Tests

  • Tests are optional, but recommended
  • A test is an assumption about our data
  • Tests in dbt are SELECT statements that are compiled and return the number of failing records
  • dbt provides 4 basic tests to check if the column values are unique, not null, accepted values, or a foreign key to another table
  • Custom tests can be defined
  • We can run dbt test [--select my_model] to run the tests

tests.png
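
The basic tests are configured on the columns in the schema.yml; a sketch for the staging model, using tripid as the tested column and warn severity as an example:

models:
  - name: stg_green_tripdata
    columns:
      - name: tripid
        tests:
          - unique:
              severity: warn
          - not_null:
              severity: warn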

Documentation

  • Descriptions can be added to the .yml file

documentation.png
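
For example, descriptions added to the same schema.yml (the wording here is just an illustration):

models:
  - name: stg_green_tripdata
    description: Trip data for green taxis, cleaned and typed in the staging layer.
    columns:
      - name: tripid
        description: Surrogate key built from vendorid and pickup_datetime.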

Deployment

  • Run the models we developed (so far in a development environment) in a production environment
  • Deployment environment runs against a different schema

deployment.png

  • dbt cloud includes a scheduler to create jobs to run in production

deployment2.png

ci.png

  • In dbt Cloud go to "Environments"
  • Create a new (Deployment) environment
    • Give it a name
    • Select the branch from the github repository (week4)
    • Give a dataset (name of schema created in BigQuery)
  • Create a job
    • Select the commands to run
    • Set a schedule
    • Press "Run now" to run it immediately

Visualization

  • Use Google Looker Studio (lookerstudio.google.com)

  • Create a data source:

    • click on "create"
    • click on "Data Source"
    • choose BigQuery
    • For this example choose the de-zoomcamp project, "production", "fact_trips"
    • click "connect"
    • change default aggregations from "sum" to "None", except for "passenger_count"
    • click "create report"
      • by default a table is created
      • we delete this table and create our own report by clicking on "Add a chart"
      • we can choose a type of chart we want to plot
      • we can add controls of e.g. the date range shown and much more!
        • Create a bar plot of trips per month
          • We can create new fields, e.g. "trips per month"
          • On the right-hand side, at the bottom of our data, we click on "Add a field"
          • We give it a name: "pickup_month"
          • We can then create the new field by applying a formula; we use MONTH(pickup_datetime) and save it
          • We can then use pickup_month as a dimension
          • We create another field "pickup_year" (YEAR(pickup_datetime)) and use it as "breakdown dimension"
  • Note that the report made in this folder looks different than in the videos, as it is based on less data

  • The report can be shared, similar to other Google documents

  • The report can be scheduled to be sent by e-mail

vis.png