Update dbt views to select from other dbt models where possible #71

jeancochrane · 2023-08-15T20:06:38Z

In #64 we added a bunch of new stub models to the dbt DAG, representing the remaining views and tables that we had stored in Athena but had not yet tracked in dbt. This follow-up PR tweaks our view query definitions to select data from references to those new models instead of hardcoded tablenames; for example, SELECT * FROM iasworld.pardat becomes SELECT * FROM {{ source('iasworld', 'pardat') }}, which allows dbt to properly construct our DAG in order to know that the query depends on the iasworld.pardat source.

Thanks to these changes, the dbt docs can now generate a real lineage graph!

Closes #65.

Off topic but interesting

I found this blog post to be a somewhat convincing criticism of the dbt ref system. Having spent the better part of the day replacing references by hand, I'm sympathetic to the assertion that dbt should do this dependency resolution automatically without requiring us to tag all of our dependencies explicitly; but I'm also sympathetic to the idea that it's better to enforce explicitness, and that refs allow us to clearly mark which views/tables are tracked in the DAG and which are generated outside it. Not necessary context to review this repo, but I found it to be food for thought!

…views-to-point-to-dbt-stub-models

jeancochrane · 2023-08-15T21:25:53Z

aws-athena/views/census-vw_acs5_stat.sql

@@ -1,7 +1,7 @@
 -- View to convert raw ACS5 variables into useable statistics
 WITH distinct_years AS (
    SELECT DISTINCT year
-    FROM spatial.parcel
+    FROM {{ ref('parcel') }}


Somewhat annoyingly, we can't specify schemas when we select from dbt refs; dbt requires that model names be unique, with an alias system for models that share the same name in different schemas, and as a result the ref() interface only accepts a unique identifier and can't be qualified with a schema. I don't love this interface -- it's helpful to know where a model lives in the schema structure when reading query code, and I don't like that we lose that information by switching to refs.

There is a two-argument variant of ref(), but the second argument specifies the dbt "project" that the model comes from, not the schema (or "database" in dbt lingo). We could refactor our models so that they live in different projects, but then we would end up duplicating our project configuration files, so it doesn't really feel like the right abstraction.

If we find the lack of schemas in these references to be unacceptable, another option would be to more widely leverage aliases in order to prefix every model identifier with the schema it lives under, e.g.:

Suggested change

FROM {{ ref('parcel') }}

FROM {{ ref('schema_parcel') }}

This would be similar to the way that the views are organized in the aws-athena/views/ directory, and would allow us to reference the schema for a model without having to change the name of the generated view/table in Athena. We do this already in the code below for a few models whose table names collide with a table in a different schema, e.g. location_census and spatial_census. However, this approach also comes with downsides, namely:

The prefixing convention may not be any clearer than the unique-tablename convention; and

SQLFluff considers hyphens to be unparseable, so we would have to turn it off and switch to sqlfmt for these view definitions

I'm open to different options here, curious what you think!

@dfsnow Based on the section of the docs where dbt explains how they organize projects, it seems like a common naming convention that could support schemas is [source]__[entity].sql. This isn't exactly the same use case as ours; dbt uses this convention to indicate the "source" (i.e. origin database) of a "staging" model (i.e. initial transformation), whereas we would be using it to indicate the schemas of all our models, regardless of whether they're intended for initial or final transformations. But I think that their guidance for how to organize projects is overly complicated for our use case, and I'm more interested in the way that [source]__[entity].sql seems to be an existing standard for namespacing entities in dbt.

Following this pattern, this ref would look like:

Suggested change

FROM {{ ref('parcel') }}

FROM {{ ref('spatial__parcel') }}

After seeing it used in their sample code, I am somewhat reassured that it's not that confusing of a pattern, and it's largely similar to how aws-athena/views/ is organized anyway. The downside would be some amount of duplication between our subdirectory structure and our file prefixing pattern, but that feels like a negligible concern to me. What do you think?

Let's go with [source]__[entity].sql then, since it's a common pattern, would be consistent across all tables, and maintains the semantic info added by [location].[table].

TBD on whether we should keep the symlink system or eventually move all the model definitions into the dbt subdirectory.

jeancochrane · 2023-08-15T21:33:05Z

aws-athena/views/location-vw_pin10_location.sql

-FROM spatial.parcel AS pin
-LEFT JOIN location.census
+FROM {{ ref('parcel') }} AS pin
+LEFT JOIN {{ ref('location_census') }} AS census


Here's an example of a table (location.census) where we're using an alias (location_census) to act as a unique identifier for the ref() function without having to change the name of the table itself. We'll see below that this corresponds to a change in the model's filename, in addition to an alias attribute being added to the query via a config() call.

jeancochrane · 2023-08-15T21:35:15Z

dbt/models/default/schema.yml

@@ -31,15 +31,15 @@ models:
            - pin
            - year
          config:
-            error_if: ">280679"
+            error_if: ">280721"


Bumping these thresholds yet again... I'm going to chat with Billy tomorrow (Wednesday) to see if we can remove these tests.

suggestion: I recommend you comment these out for now. We can re-add them once we dive more into the tests.

jeancochrane · 2023-08-15T21:36:02Z

dbt/models/location/access.sql

@@ -0,0 +1,3 @@
+{{ config(materialized='ephemeral') }}


I noticed while testing this branch that I had neglected to add a few ephemeral location models in #64, so the next couple of files fix that omission!

jeancochrane · 2023-08-15T21:36:46Z

dbt/models/location/location_census.sql

@@ -0,0 +1,3 @@
+{{ config(materialized='ephemeral', alias='census') }}


And here's the query definition for a model with a non-unique table name that's using an alias to resolve the collision.

dfsnow

todo(blocking): Change the ref system and file names to match [source]__[entity].sql, then this is all set!

dfsnow · 2023-08-16T15:33:32Z

aws-athena/views/census-vw_acs5_stat.sql

@@ -1,7 +1,7 @@
 -- View to convert raw ACS5 variables into useable statistics
 WITH distinct_years AS (
    SELECT DISTINCT year
-    FROM spatial.parcel
+    FROM {{ ref('parcel') }}


Let's go with [source]__[entity].sql then, since it's a common pattern, would be consistent across all tables, and maintains the semantic info added by [location].[table].

TBD on whether we should keep the symlink system or eventually move all the model definitions into the dbt subdirectory.

dfsnow · 2023-08-16T15:35:48Z

dbt/models/default/schema.yml

@@ -31,15 +31,15 @@ models:
            - pin
            - year
          config:
-            error_if: ">280679"
+            error_if: ">280721"


suggestion: I recommend you comment these out for now. We can re-add them once we dive more into the tests.

jeancochrane

Alright @dfsnow, after falling down a rabbit hole today, I think I've finally figured out a better schema namespacing pattern 😪

One problem emerged with the double-underscore prefix during my testing: Even though we were using a custom generate_alias_name schema to remove the prefix from the relation name when building relations in Athena, the dbt docs did not understand this, and they reported the full filename (e.g. spatial__parcel) as the entity name. (There is a stale, closed feature request from another user asking dbt Core to support using aliases in the docs site... alas.) Perhaps it wasn't worth investing a whole afternoon in fixing this, but I found it to be annoying enough that it felt like it cancelled out the helpfulness of including the schema prefix in ref() calls.

I realized that we could replace double underscores with dots in the schema prefix (e.g. spatial__parcel ➡️ spatial.parcel) in order to more closely match the existing SQL convention, and I changed up our namespacing to follow this pattern in 4bb96d4. But testing revealed yet another problem, namely that ephemeral models can't have a dot in their name due to an open bug in dbt Core. This meant that we could use the dot to build the docs site, but it would cause errors when trying to build the models themselves.

In order to get around this second problem, I ended up refactoring our ephemeral models to turn them into sources instead. Ultimately, I think this is a better abstraction for these relations, and I was thinking about opening up a ticket to do this refactor anyway; so I think in the long term it's the right thing to do. But it ends up making this diff annoyingly large, so I apologize for that!

Here's what the updated docs site looks like with both of these changes (1. adding dot prefixes for schemas and 2. refactoring ephemeral models to sources):

Another fun benefit of using sources instead of ephemeral models is they are marked distinctly in green in the lineage graph, which helps to visually indicate which relations are built using dbt and which are built outside of it:

Let me know if you'd like to talk through any of this in person, I expect it's probably confusing 😬

jeancochrane · 2023-08-16T22:02:50Z

dbt/dbt_project.yml

-    ccao:
-      # Set default materialization on a per-schema basis. Schemas that contain
-      # _any_ persistent models should be set to `view` or `table`, or left
-      # to inherit the `view` default which is set on the `athena` project, but
-      # schemas that are entirely ephemeral should be marked `ephemeral`. The
-      # point of this pattern of defaults is that schemas with non-ephemeral
-      # defaults will be marked for deletion by the cleanup_dbt_resources.sh
-      # script, since `dbt list --select config.materialization:view` only
-      # reads this config file and does not consider model-level overrides.
-      # Hence, only schemas that contain exclusively ephemeral models should
-      # be marked with an `ephemeral` default, since we won't build those
-      # schemas into Athena and we don't want cleanup_dbt_resources.sh to
-      # try to clean them up.
-      +materialized: ephemeral
-      +schema: ccao


This config is no longer required, since we don't need to configure materialization or schemas for sources (schemas are inferred from the source name thanks to the dbt-athena plugin). If and when we materialize these sources, we can add these blocks back in, but for now I removed them to suppress warnings about unused configs in the dbt CLI.

praise: So this change actually reduces the complexity of the previous implementation. Awesome find!

jeancochrane · 2023-08-16T22:16:53Z

dbt/models/default/schema.yml

-      # Unique by 14-digit PIN and year
-      - unique_combination_of_columns:
-          name: vw_pin_appeal_unique_by_14_digit_pin_and_year
-          combination_of_columns:
-            - pin
-            - year
-          config:
-            error_if: ">280679"
-      # Unique by case number and year
-      - unique_combination_of_columns:
-          name: vw_pin_appeal_unique_by_case_number_and_year
-          combination_of_columns:
-            - year
-            - case_no
-          config:
-            error_if: ">366017"


Billy said I could delete these, so I went ahead and did that instead of commenting them out like Dan suggested!

dfsnow

Excellent work @jeancochrane. Love that this gets rid of all the ephemeral model stubs and differentiates loaded tables from built ones. Sorry I didn't flag it as a potential abstraction in the prior review of this PR.

Also excited about the possibility of using source freshness as another form of testing. This is especially applicable to constantly-updated iasWorld tables like HTPAR and PERMIT. I'll create a GitHub issue (edit: see #75) scoping out freshness tests for these tables.

dfsnow · 2023-08-17T17:19:51Z

dbt/dbt_project.yml

-    ccao:
-      # Set default materialization on a per-schema basis. Schemas that contain
-      # _any_ persistent models should be set to `view` or `table`, or left
-      # to inherit the `view` default which is set on the `athena` project, but
-      # schemas that are entirely ephemeral should be marked `ephemeral`. The
-      # point of this pattern of defaults is that schemas with non-ephemeral
-      # defaults will be marked for deletion by the cleanup_dbt_resources.sh
-      # script, since `dbt list --select config.materialization:view` only
-      # reads this config file and does not consider model-level overrides.
-      # Hence, only schemas that contain exclusively ephemeral models should
-      # be marked with an `ephemeral` default, since we won't build those
-      # schemas into Athena and we don't want cleanup_dbt_resources.sh to
-      # try to clean them up.
-      +materialized: ephemeral
-      +schema: ccao


praise: So this change actually reduces the complexity of the previous implementation. Awesome find!

jeancochrane added 3 commits August 15, 2023 14:57

Update dbt views to select from other dbt models where possible

f656233

Merge branch 'data-catalog' into jeancochrane/65-data-catalog-update-…

fe9c282

…views-to-point-to-dbt-stub-models

Bump error threshold for vw_pin_appeal tests again

29c3ead

jeancochrane commented Aug 15, 2023

View reviewed changes

jeancochrane marked this pull request as ready for review August 15, 2023 21:47

jeancochrane requested a review from a team as a code owner August 15, 2023 21:47

jeancochrane requested a review from dfsnow August 15, 2023 21:47

dfsnow requested changes Aug 16, 2023

View reviewed changes

jeancochrane added 2 commits August 16, 2023 16:26

Switch to dot notation for namespacing dbt resources

4bb96d4

Switch to using sources for dbt resources created outside the DAG

33c540c

jeancochrane commented Aug 16, 2023

View reviewed changes

jeancochrane requested a review from dfsnow August 16, 2023 22:15

jeancochrane commented Aug 16, 2023

View reviewed changes

dfsnow approved these changes Aug 17, 2023

View reviewed changes

dfsnow mentioned this pull request Aug 17, 2023

[Data catalog] Implement source freshness tests for critical iasWorld tables #75

Closed

jeancochrane merged commit 7a3d0e0 into data-catalog Aug 17, 2023
3 checks passed

jeancochrane deleted the jeancochrane/65-data-catalog-update-views-to-point-to-dbt-stub-models branch August 17, 2023 19:15

jeancochrane mentioned this pull request Aug 17, 2023

[Data catalog] Update views to point to dbt stub models #65

Closed

jeancochrane mentioned this pull request Nov 17, 2023

Contribute fix to dbt core to allow us to use ephemeral models for QC testing #238

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update dbt views to select from other dbt models where possible #71

Update dbt views to select from other dbt models where possible #71

jeancochrane commented Aug 15, 2023 •

edited

Loading

jeancochrane Aug 15, 2023 •

edited

Loading

jeancochrane Aug 16, 2023

dfsnow Aug 16, 2023

jeancochrane Aug 15, 2023

jeancochrane Aug 15, 2023

dfsnow Aug 16, 2023

jeancochrane Aug 15, 2023

jeancochrane Aug 15, 2023

dfsnow left a comment

dfsnow Aug 16, 2023

dfsnow Aug 16, 2023

jeancochrane left a comment

jeancochrane Aug 16, 2023

dfsnow Aug 17, 2023

jeancochrane Aug 16, 2023

dfsnow left a comment •

edited

Loading

dfsnow Aug 17, 2023

		@@ -0,0 +1,3 @@
		{{ config(materialized='ephemeral', alias='census') }}

Update dbt views to select from other dbt models where possible #71

Update dbt views to select from other dbt models where possible #71

Conversation

jeancochrane commented Aug 15, 2023 • edited Loading

Off topic but interesting

jeancochrane Aug 15, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dfsnow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeancochrane left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dfsnow left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeancochrane commented Aug 15, 2023 •

edited

Loading

jeancochrane Aug 15, 2023 •

edited

Loading

dfsnow left a comment •

edited

Loading