-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update dbt views to select from other dbt models where possible #71
Update dbt views to select from other dbt models where possible #71
Conversation
@@ -1,7 +1,7 @@ | |||
-- View to convert raw ACS5 variables into useable statistics | |||
WITH distinct_years AS ( | |||
SELECT DISTINCT year | |||
FROM spatial.parcel | |||
FROM {{ ref('parcel') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat annoyingly, we can't specify schemas when we select from dbt ref
s; dbt requires that model names be unique, with an alias system for models that share the same name in different schemas, and as a result the ref()
interface only accepts a unique identifier and can't be qualified with a schema. I don't love this interface -- it's helpful to know where a model lives in the schema structure when reading query code, and I don't like that we lose that information by switching to ref
s.
There is a two-argument variant of ref()
, but the second argument specifies the dbt "project" that the model comes from, not the schema (or "database" in dbt lingo). We could refactor our models so that they live in different projects, but then we would end up duplicating our project configuration files, so it doesn't really feel like the right abstraction.
If we find the lack of schemas in these references to be unacceptable, another option would be to more widely leverage aliases in order to prefix every model identifier with the schema it lives under, e.g.:
FROM {{ ref('parcel') }} | |
FROM {{ ref('schema_parcel') }} |
This would be similar to the way that the views are organized in the aws-athena/views/
directory, and would allow us to reference the schema for a model without having to change the name of the generated view/table in Athena. We do this already in the code below for a few models whose table names collide with a table in a different schema, e.g. location_census
and spatial_census
. However, this approach also comes with downsides, namely:
- The prefixing convention may not be any clearer than the unique-tablename convention; and
- SQLFluff considers hyphens to be unparseable, so we would have to turn it off and switch to sqlfmt for these view definitions
I'm open to different options here, curious what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dfsnow Based on the section of the docs where dbt explains how they organize projects, it seems like a common naming convention that could support schemas is [source]__[entity].sql
. This isn't exactly the same use case as ours; dbt uses this convention to indicate the "source" (i.e. origin database) of a "staging" model (i.e. initial transformation), whereas we would be using it to indicate the schemas of all our models, regardless of whether they're intended for initial or final transformations. But I think that their guidance for how to organize projects is overly complicated for our use case, and I'm more interested in the way that [source]__[entity].sql
seems to be an existing standard for namespacing entities in dbt.
Following this pattern, this ref would look like:
FROM {{ ref('parcel') }} | |
FROM {{ ref('spatial__parcel') }} |
After seeing it used in their sample code, I am somewhat reassured that it's not that confusing of a pattern, and it's largely similar to how aws-athena/views/
is organized anyway. The downside would be some amount of duplication between our subdirectory structure and our file prefixing pattern, but that feels like a negligible concern to me. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with [source]__[entity].sql
then, since it's a common pattern, would be consistent across all tables, and maintains the semantic info added by [location].[table]
.
TBD on whether we should keep the symlink system or eventually move all the model definitions into the dbt subdirectory.
FROM spatial.parcel AS pin | ||
LEFT JOIN location.census | ||
FROM {{ ref('parcel') }} AS pin | ||
LEFT JOIN {{ ref('location_census') }} AS census |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of a table (location.census
) where we're using an alias (location_census
) to act as a unique identifier for the ref()
function without having to change the name of the table itself. We'll see below that this corresponds to a change in the model's filename, in addition to an alias
attribute being added to the query via a config()
call.
dbt/models/default/schema.yml
Outdated
@@ -31,15 +31,15 @@ models: | |||
- pin | |||
- year | |||
config: | |||
error_if: ">280679" | |||
error_if: ">280721" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping these thresholds yet again... I'm going to chat with Billy tomorrow (Wednesday) to see if we can remove these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I recommend you comment these out for now. We can re-add them once we dive more into the tests.
dbt/models/location/access.sql
Outdated
@@ -0,0 +1,3 @@ | |||
{{ config(materialized='ephemeral') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed while testing this branch that I had neglected to add a few ephemeral location
models in #64, so the next couple of files fix that omission!
@@ -0,0 +1,3 @@ | |||
{{ config(materialized='ephemeral', alias='census') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here's the query definition for a model with a non-unique table name that's using an alias to resolve the collision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo(blocking): Change the ref system and file names to match [source]__[entity].sql
, then this is all set!
@@ -1,7 +1,7 @@ | |||
-- View to convert raw ACS5 variables into useable statistics | |||
WITH distinct_years AS ( | |||
SELECT DISTINCT year | |||
FROM spatial.parcel | |||
FROM {{ ref('parcel') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with [source]__[entity].sql
then, since it's a common pattern, would be consistent across all tables, and maintains the semantic info added by [location].[table]
.
TBD on whether we should keep the symlink system or eventually move all the model definitions into the dbt subdirectory.
dbt/models/default/schema.yml
Outdated
@@ -31,15 +31,15 @@ models: | |||
- pin | |||
- year | |||
config: | |||
error_if: ">280679" | |||
error_if: ">280721" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I recommend you comment these out for now. We can re-add them once we dive more into the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright @dfsnow, after falling down a rabbit hole today, I think I've finally figured out a better schema namespacing pattern 😪
One problem emerged with the double-underscore prefix during my testing: Even though we were using a custom generate_alias_name
schema to remove the prefix from the relation name when building relations in Athena, the dbt docs did not understand this, and they reported the full filename (e.g. spatial__parcel
) as the entity name. (There is a stale, closed feature request from another user asking dbt Core to support using aliases in the docs site... alas.) Perhaps it wasn't worth investing a whole afternoon in fixing this, but I found it to be annoying enough that it felt like it cancelled out the helpfulness of including the schema prefix in ref()
calls.
I realized that we could replace double underscores with dots in the schema prefix (e.g. spatial__parcel
➡️ spatial.parcel
) in order to more closely match the existing SQL convention, and I changed up our namespacing to follow this pattern in 4bb96d4. But testing revealed yet another problem, namely that ephemeral models can't have a dot in their name due to an open bug in dbt Core. This meant that we could use the dot to build the docs site, but it would cause errors when trying to build the models themselves.
In order to get around this second problem, I ended up refactoring our ephemeral models to turn them into sources instead. Ultimately, I think this is a better abstraction for these relations, and I was thinking about opening up a ticket to do this refactor anyway; so I think in the long term it's the right thing to do. But it ends up making this diff annoyingly large, so I apologize for that!
Here's what the updated docs site looks like with both of these changes (1. adding dot prefixes for schemas and 2. refactoring ephemeral models to sources):
Another fun benefit of using sources instead of ephemeral models is they are marked distinctly in green in the lineage graph, which helps to visually indicate which relations are built using dbt and which are built outside of it:
Let me know if you'd like to talk through any of this in person, I expect it's probably confusing 😬
ccao: | ||
# Set default materialization on a per-schema basis. Schemas that contain | ||
# _any_ persistent models should be set to `view` or `table`, or left | ||
# to inherit the `view` default which is set on the `athena` project, but | ||
# schemas that are entirely ephemeral should be marked `ephemeral`. The | ||
# point of this pattern of defaults is that schemas with non-ephemeral | ||
# defaults will be marked for deletion by the cleanup_dbt_resources.sh | ||
# script, since `dbt list --select config.materialization:view` only | ||
# reads this config file and does not consider model-level overrides. | ||
# Hence, only schemas that contain exclusively ephemeral models should | ||
# be marked with an `ephemeral` default, since we won't build those | ||
# schemas into Athena and we don't want cleanup_dbt_resources.sh to | ||
# try to clean them up. | ||
+materialized: ephemeral | ||
+schema: ccao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config is no longer required, since we don't need to configure materialization or schemas for sources (schemas are inferred from the source name thanks to the dbt-athena plugin). If and when we materialize these sources, we can add these blocks back in, but for now I removed them to suppress warnings about unused configs in the dbt CLI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: So this change actually reduces the complexity of the previous implementation. Awesome find!
# Unique by 14-digit PIN and year | ||
- unique_combination_of_columns: | ||
name: vw_pin_appeal_unique_by_14_digit_pin_and_year | ||
combination_of_columns: | ||
- pin | ||
- year | ||
config: | ||
error_if: ">280679" | ||
# Unique by case number and year | ||
- unique_combination_of_columns: | ||
name: vw_pin_appeal_unique_by_case_number_and_year | ||
combination_of_columns: | ||
- year | ||
- case_no | ||
config: | ||
error_if: ">366017" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Billy said I could delete these, so I went ahead and did that instead of commenting them out like Dan suggested!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work @jeancochrane. Love that this gets rid of all the ephemeral model stubs and differentiates loaded tables from built ones. Sorry I didn't flag it as a potential abstraction in the prior review of this PR.
Also excited about the possibility of using source freshness as another form of testing. This is especially applicable to constantly-updated iasWorld tables like HTPAR and PERMIT. I'll create a GitHub issue (edit: see #75) scoping out freshness tests for these tables.
ccao: | ||
# Set default materialization on a per-schema basis. Schemas that contain | ||
# _any_ persistent models should be set to `view` or `table`, or left | ||
# to inherit the `view` default which is set on the `athena` project, but | ||
# schemas that are entirely ephemeral should be marked `ephemeral`. The | ||
# point of this pattern of defaults is that schemas with non-ephemeral | ||
# defaults will be marked for deletion by the cleanup_dbt_resources.sh | ||
# script, since `dbt list --select config.materialization:view` only | ||
# reads this config file and does not consider model-level overrides. | ||
# Hence, only schemas that contain exclusively ephemeral models should | ||
# be marked with an `ephemeral` default, since we won't build those | ||
# schemas into Athena and we don't want cleanup_dbt_resources.sh to | ||
# try to clean them up. | ||
+materialized: ephemeral | ||
+schema: ccao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: So this change actually reduces the complexity of the previous implementation. Awesome find!
In #64 we added a bunch of new stub models to the dbt DAG, representing the remaining views and tables that we had stored in Athena but had not yet tracked in dbt. This follow-up PR tweaks our view query definitions to select data from references to those new models instead of hardcoded tablenames; for example,
SELECT * FROM iasworld.pardat
becomesSELECT * FROM {{ source('iasworld', 'pardat') }}
, which allows dbt to properly construct our DAG in order to know that the query depends on theiasworld.pardat
source.Thanks to these changes, the dbt docs can now generate a real lineage graph!
Closes #65.
Off topic but interesting
I found this blog post to be a somewhat convincing criticism of the dbt ref system. Having spent the better part of the day replacing references by hand, I'm sympathetic to the assertion that dbt should do this dependency resolution automatically without requiring us to tag all of our dependencies explicitly; but I'm also sympathetic to the idea that it's better to enforce explicitness, and that refs allow us to clearly mark which views/tables are tracked in the DAG and which are generated outside it. Not necessary context to review this repo, but I found it to be food for thought!