This dbt package stitches together identifiers in an ID graph table.
The primary output of this package is `id_graph`. There are a few intermediate models used to create this model.
Model | Description |
---|---|
queries | Generates select statements which pull IDs from your tables. |
edges | Combines the results of those select statements to create a table containing edges (IDs) the first time it is run, and matches edges on subsequent runs. |
check_edges | Determines if there are still edges to match. |
id_graph | Creates an ID graph table. |
Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.
Set ID columns and IDs to exclude in `dbt_project.yml`:
vars:
  id-columns: ('anonymous_id', 'user_id', 'email')
  ids-to-exclude: ('sources','user@company.com')
This package searches your data warehouse for tables that include multiple columns defined in `id-columns`. Any IDs defined in `ids-to-exclude` are disregarded.
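The selection rule described above can be illustrated with a small Python sketch. It is not the package's implementation (the package generates SQL against your warehouse); the `catalog` mapping and both function names are hypothetical, and "multiple columns" is assumed to mean at least two.

```python
from typing import Dict, List, Optional, Sequence

# Values mirror the example vars shown in dbt_project.yml.
ID_COLUMNS = ("anonymous_id", "user_id", "email")
IDS_TO_EXCLUDE = ("sources", "user@company.com")


def tables_with_id_pairs(
    catalog: Dict[str, Sequence[str]],
    id_columns: Sequence[str] = ID_COLUMNS,
) -> Dict[str, List[str]]:
    """Return tables that contain at least two of the configured ID columns.

    `catalog` is a hypothetical mapping of table name -> column names.
    """
    selected = {}
    for table, columns in catalog.items():
        present = [col for col in id_columns if col in columns]
        if len(present) >= 2:  # assumption: "multiple" means two or more
            selected[table] = present
    return selected


def keep_id(value: Optional[str], excluded: Sequence[str] = IDS_TO_EXCLUDE) -> bool:
    """Drop missing values and any ID listed in ids-to-exclude."""
    return value is not None and value not in excluded


if __name__ == "__main__":
    example_catalog = {
        "pages": ["anonymous_id", "user_id", "url"],
        "orders": ["order_id", "email"],
    }
    print(tables_with_id_pairs(example_catalog))
    print(keep_id("user@company.com"))
```

In this sketch a table with only one ID column (such as `orders` above) is skipped, because a single column cannot link two identifiers to each other.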
The `edges` model must be run enough times to match all edges (IDs). Five or six passes are usually sufficient. The `check_edges` model will show 0 when all edges have been matched. Edit your job commands for dbt Cloud, or the `run.sh` script for dbt CLI, to run the `edges` model as many times as necessary.
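To see why several passes may be needed, here is a generic Python illustration of iterative matching. It assumes the matching behaves like simple label propagation, where each pass only moves a label one hop along a chain of linked IDs. The package's actual SQL may work differently, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def propagate_once(labels: Dict[str, str], edges: Iterable[Edge]) -> Dict[str, str]:
    """One pass: each ID adopts the smallest label among itself and its direct neighbours."""
    updated = dict(labels)
    for left, right in edges:
        # Read from the previous pass so a label travels only one hop per pass.
        smallest = min(labels[left], labels[right])
        updated[left] = min(updated[left], smallest)
        updated[right] = min(updated[right], smallest)
    return updated


def stitch(edges: Iterable[Edge]) -> Tuple[Dict[str, str], int]:
    """Repeat passes until nothing changes; return the labels and the number of changing passes."""
    edge_list: List[Edge] = list(edges)
    labels = {node: node for edge in edge_list for node in edge}
    passes = 0
    while True:
        updated = propagate_once(labels, edge_list)
        if updated == labels:  # analogous to check_edges showing 0
            return labels, passes
        labels = updated
        passes += 1


if __name__ == "__main__":
    chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
    labels, passes = stitch(chain)
    print(labels, passes)
```

Under this assumption, the number of passes grows with the length of the longest chain of linked IDs, which is why a fixed count such as five or six works for most data but not necessarily all.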
For dbt Cloud, create a job with the following commands:
dbt run --full-refresh --select queries edges
dbt run --select edges
dbt run --select edges
dbt run --select edges
dbt run --select edges
dbt run --select edges check_edges id_graph
For dbt CLI, run the included `run.sh` shell script:
./run.sh
Additional instrumentation can be created to evaluate the `check_edges` model and determine programmatically whether the `edges` model needs to run again.
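One possible shape for such instrumentation is sketched below in Python. It is an assumption-laden outline, not part of the package: `count_unmatched` is a hypothetical callback that you would implement yourself (for example, by querying the `check_edges` relation in your warehouse), `max_passes` is an arbitrary safety limit, and the sketch assumes `check_edges` can be rebuilt together with `edges` on each pass.

```python
import subprocess
from typing import Callable, Sequence


def run_dbt(args: Sequence[str]) -> None:
    """Run a dbt CLI command and raise CalledProcessError if it fails."""
    subprocess.run(["dbt", *args], check=True)


def match_all_edges(
    count_unmatched: Callable[[], int],
    run: Callable[[Sequence[str]], None] = run_dbt,
    max_passes: int = 10,
) -> int:
    """Repeat the edges run until check_edges reports 0, then build id_graph.

    count_unmatched: hypothetical callback returning the value shown by the
    check_edges model. Returns the number of follow-up passes that were run.
    """
    # Initial build, mirroring the first command of the fixed job.
    run(["run", "--full-refresh", "--select", "queries", "edges"])
    passes = 0
    while True:
        # Re-run edges and refresh check_edges so the callback sees current data.
        run(["run", "--select", "edges", "check_edges"])
        passes += 1
        if count_unmatched() == 0:
            break
        if passes >= max_passes:
            raise RuntimeError(f"edges still unmatched after {passes} passes")
    run(["run", "--select", "id_graph"])
    return passes
```

Passing `run` as a parameter keeps the loop testable without a dbt installation; in normal use the default `run_dbt` shells out to the dbt CLI.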