Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update dbt views to select from other dbt models where possible #71

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Aug 15, 2023

In #64 we added a bunch of new stub models to the dbt DAG, representing the remaining views and tables that we had stored in Athena but had not yet tracked in dbt. This follow-up PR tweaks our view query definitions to select data from references to those new models instead of hardcoded tablenames; for example, SELECT * FROM iasworld.pardat becomes SELECT * FROM {{ source('iasworld', 'pardat') }}, which allows dbt to properly construct our DAG in order to know that the query depends on the iasworld.pardat source.

Thanks to these changes, the dbt docs can now generate a real lineage graph!

Screenshot 2023-08-15 162834

Closes #65.

Off topic but interesting

I found this blog post to be a somewhat convincing criticism of the dbt ref system. Having spent the better part of the day replacing references by hand, I'm sympathetic to the assertion that dbt should do this dependency resolution automatically without requiring us to tag all of our dependencies explicitly; but I'm also sympathetic to the idea that it's better to enforce explicitness, and that refs allow us to clearly mark which views/tables are tracked in the DAG and which are generated outside it. Not necessary context to review this repo, but I found it to be food for thought!

@@ -1,7 +1,7 @@
-- View to convert raw ACS5 variables into useable statistics
WITH distinct_years AS (
SELECT DISTINCT year
FROM spatial.parcel
FROM {{ ref('parcel') }}
Copy link
Contributor Author

@jeancochrane jeancochrane Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat annoyingly, we can't specify schemas when we select from dbt refs; dbt requires that model names be unique, with an alias system for models that share the same name in different schemas, and as a result the ref() interface only accepts a unique identifier and can't be qualified with a schema. I don't love this interface -- it's helpful to know where a model lives in the schema structure when reading query code, and I don't like that we lose that information by switching to refs.

There is a two-argument variant of ref(), but the second argument specifies the dbt "project" that the model comes from, not the schema (or "database" in dbt lingo). We could refactor our models so that they live in different projects, but then we would end up duplicating our project configuration files, so it doesn't really feel like the right abstraction.

If we find the lack of schemas in these references to be unacceptable, another option would be to more widely leverage aliases in order to prefix every model identifier with the schema it lives under, e.g.:

Suggested change
FROM {{ ref('parcel') }}
FROM {{ ref('schema_parcel') }}

This would be similar to the way that the views are organized in the aws-athena/views/ directory, and would allow us to reference the schema for a model without having to change the name of the generated view/table in Athena. We do this already in the code below for a few models whose table names collide with a table in a different schema, e.g. location_census and spatial_census. However, this approach also comes with downsides, namely:

  1. The prefixing convention may not be any clearer than the unique-tablename convention; and
  2. SQLFluff considers hyphens to be unparseable, so we would have to turn it off and switch to sqlfmt for these view definitions

I'm open to different options here, curious what you think!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfsnow Based on the section of the docs where dbt explains how they organize projects, it seems like a common naming convention that could support schemas is [source]__[entity].sql. This isn't exactly the same use case as ours; dbt uses this convention to indicate the "source" (i.e. origin database) of a "staging" model (i.e. initial transformation), whereas we would be using it to indicate the schemas of all our models, regardless of whether they're intended for initial or final transformations. But I think that their guidance for how to organize projects is overly complicated for our use case, and I'm more interested in the way that [source]__[entity].sql seems to be an existing standard for namespacing entities in dbt.

Following this pattern, this ref would look like:

Suggested change
FROM {{ ref('parcel') }}
FROM {{ ref('spatial__parcel') }}

After seeing it used in their sample code, I am somewhat reassured that it's not that confusing of a pattern, and it's largely similar to how aws-athena/views/ is organized anyway. The downside would be some amount of duplication between our subdirectory structure and our file prefixing pattern, but that feels like a negligible concern to me. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with [source]__[entity].sql then, since it's a common pattern, would be consistent across all tables, and maintains the semantic info added by [location].[table].

TBD on whether we should keep the symlink system or eventually move all the model definitions into the dbt subdirectory.

FROM spatial.parcel AS pin
LEFT JOIN location.census
FROM {{ ref('parcel') }} AS pin
LEFT JOIN {{ ref('location_census') }} AS census
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of a table (location.census) where we're using an alias (location_census) to act as a unique identifier for the ref() function without having to change the name of the table itself. We'll see below that this corresponds to a change in the model's filename, in addition to an alias attribute being added to the query via a config() call.

@@ -31,15 +31,15 @@ models:
- pin
- year
config:
error_if: ">280679"
error_if: ">280721"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping these thresholds yet again... I'm going to chat with Billy tomorrow (Wednesday) to see if we can remove these tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I recommend you comment these out for now. We can re-add them once we dive more into the tests.

@@ -0,0 +1,3 @@
{{ config(materialized='ephemeral') }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed while testing this branch that I had neglected to add a few ephemeral location models in #64, so the next couple of files fix that omission!

@@ -0,0 +1,3 @@
{{ config(materialized='ephemeral', alias='census') }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here's the query definition for a model with a non-unique table name that's using an alias to resolve the collision.

@jeancochrane jeancochrane marked this pull request as ready for review August 15, 2023 21:47
@jeancochrane jeancochrane requested a review from a team as a code owner August 15, 2023 21:47
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo(blocking): Change the ref system and file names to match [source]__[entity].sql, then this is all set!

@@ -1,7 +1,7 @@
-- View to convert raw ACS5 variables into useable statistics
WITH distinct_years AS (
SELECT DISTINCT year
FROM spatial.parcel
FROM {{ ref('parcel') }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with [source]__[entity].sql then, since it's a common pattern, would be consistent across all tables, and maintains the semantic info added by [location].[table].

TBD on whether we should keep the symlink system or eventually move all the model definitions into the dbt subdirectory.

@@ -31,15 +31,15 @@ models:
- pin
- year
config:
error_if: ">280679"
error_if: ">280721"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I recommend you comment these out for now. We can re-add them once we dive more into the tests.

Copy link
Contributor Author

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright @dfsnow, after falling down a rabbit hole today, I think I've finally figured out a better schema namespacing pattern 😪

One problem emerged with the double-underscore prefix during my testing: Even though we were using a custom generate_alias_name schema to remove the prefix from the relation name when building relations in Athena, the dbt docs did not understand this, and they reported the full filename (e.g. spatial__parcel) as the entity name. (There is a stale, closed feature request from another user asking dbt Core to support using aliases in the docs site... alas.) Perhaps it wasn't worth investing a whole afternoon in fixing this, but I found it to be annoying enough that it felt like it cancelled out the helpfulness of including the schema prefix in ref() calls.

I realized that we could replace double underscores with dots in the schema prefix (e.g. spatial__parcel ➡️ spatial.parcel) in order to more closely match the existing SQL convention, and I changed up our namespacing to follow this pattern in 4bb96d4. But testing revealed yet another problem, namely that ephemeral models can't have a dot in their name due to an open bug in dbt Core. This meant that we could use the dot to build the docs site, but it would cause errors when trying to build the models themselves.

In order to get around this second problem, I ended up refactoring our ephemeral models to turn them into sources instead. Ultimately, I think this is a better abstraction for these relations, and I was thinking about opening up a ticket to do this refactor anyway; so I think in the long term it's the right thing to do. But it ends up making this diff annoyingly large, so I apologize for that!

Here's what the updated docs site looks like with both of these changes (1. adding dot prefixes for schemas and 2. refactoring ephemeral models to sources):

Screenshot 2023-08-16 171258

Another fun benefit of using sources instead of ephemeral models is they are marked distinctly in green in the lineage graph, which helps to visually indicate which relations are built using dbt and which are built outside of it:

Screenshot 2023-08-16 171403

Let me know if you'd like to talk through any of this in person, I expect it's probably confusing 😬

Comment on lines -30 to -44
ccao:
# Set default materialization on a per-schema basis. Schemas that contain
# _any_ persistent models should be set to `view` or `table`, or left
# to inherit the `view` default which is set on the `athena` project, but
# schemas that are entirely ephemeral should be marked `ephemeral`. The
# point of this pattern of defaults is that schemas with non-ephemeral
# defaults will be marked for deletion by the cleanup_dbt_resources.sh
# script, since `dbt list --select config.materialization:view` only
# reads this config file and does not consider model-level overrides.
# Hence, only schemas that contain exclusively ephemeral models should
# be marked with an `ephemeral` default, since we won't build those
# schemas into Athena and we don't want cleanup_dbt_resources.sh to
# try to clean them up.
+materialized: ephemeral
+schema: ccao
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config is no longer required, since we don't need to configure materialization or schemas for sources (schemas are inferred from the source name thanks to the dbt-athena plugin). If and when we materialize these sources, we can add these blocks back in, but for now I removed them to suppress warnings about unused configs in the dbt CLI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: So this change actually reduces the complexity of the previous implementation. Awesome find!

Comment on lines -27 to -42
# Unique by 14-digit PIN and year
- unique_combination_of_columns:
name: vw_pin_appeal_unique_by_14_digit_pin_and_year
combination_of_columns:
- pin
- year
config:
error_if: ">280679"
# Unique by case number and year
- unique_combination_of_columns:
name: vw_pin_appeal_unique_by_case_number_and_year
combination_of_columns:
- year
- case_no
config:
error_if: ">366017"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Billy said I could delete these, so I went ahead and did that instead of commenting them out like Dan suggested!

Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work @jeancochrane. Love that this gets rid of all the ephemeral model stubs and differentiates loaded tables from built ones. Sorry I didn't flag it as a potential abstraction in the prior review of this PR.

Also excited about the possibility of using source freshness as another form of testing. This is especially applicable to constantly-updated iasWorld tables like HTPAR and PERMIT. I'll create a GitHub issue (edit: see #75) scoping out freshness tests for these tables.

Comment on lines -30 to -44
ccao:
# Set default materialization on a per-schema basis. Schemas that contain
# _any_ persistent models should be set to `view` or `table`, or left
# to inherit the `view` default which is set on the `athena` project, but
# schemas that are entirely ephemeral should be marked `ephemeral`. The
# point of this pattern of defaults is that schemas with non-ephemeral
# defaults will be marked for deletion by the cleanup_dbt_resources.sh
# script, since `dbt list --select config.materialization:view` only
# reads this config file and does not consider model-level overrides.
# Hence, only schemas that contain exclusively ephemeral models should
# be marked with an `ephemeral` default, since we won't build those
# schemas into Athena and we don't want cleanup_dbt_resources.sh to
# try to clean them up.
+materialized: ephemeral
+schema: ccao
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: So this change actually reduces the complexity of the previous implementation. Awesome find!

@jeancochrane jeancochrane merged commit 7a3d0e0 into data-catalog Aug 17, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/65-data-catalog-update-views-to-point-to-dbt-stub-models branch August 17, 2023 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants