Subselectors for state:modified #2704

jtcohen6 · 2020-08-14T16:44:01Z

This is a follow-on from the initial proposals in #2465, #2641. Most of the required work is around exposing + sugaring the foundational work in #2695.

The way state comparison works is by identifying discrepancies between two manifests. When comparing between a past prod manifest and the current development manifest, discrepancies can be the result of two things:

Changes made to the project in development
Env-aware logic that causes different behavior based on the target, env vars, etc.

We're going to do our best to capture only changes that are the result of development. If someone's project has tons of intricate env-aware logic, they'll run more models than they want (i.e. more false positives). So we're giving them the option to turn off some knobs, in the form of more-specific subselectors.

Subselectors

There is potential for overlap: a single change can trigger multiple modification categories.

state:modified.contents:

Models: raw file contents changed
Snapshots: raw file contents changed
Data tests: raw file contents changed
Analyses: raw file contents changed
Seeds: raw file contents changed
- However, if the file is >1 MB, we cannot compare raw file contents, so we raise a warning and just compare based on file path instead.

This alone would get a lot of people what they want! It's basically "just hash the files," excluding YAML configuration.

state:modified.configs:

Models: changes to materialized, quoting, bind, transient, sort/dist, partition_by, incremental_strategy...
- This category captures changes in dbt_project.yml or {{config()}} blocks. If the changes are made in a {{config()}} block, they will also be picked up as content changes.
- If someone has env-aware logic for materialized, where a model is a view in dev and a table in prod, they will not want to include this.
Snapshots: unique_key, strategy, ...
Seeds: quoting, column_types
Schema tests: severity changes

state:modified.descriptions:

If persist_docs is turned on for a node, description changes count as modifications. (If just columns, just column descriptions; just relations, top-level descriptions; if both, then both.)

state:modified.database_representations:

Models: changes to the configured database, schema, identifier. This value represents the manual input only, and it's different from the resolved database representation, which depends on the target and generate_x_name macros.
- If someone manually sets schema = target.schema, or schema = target.schema ~ '_suffix' instead of using the generate_schema_name macro, that will register as a change between environments and they'll want to turn this off.
- Depending on the generate_x_name logic and the current environment, a chance to the configured value may not actually change the database representation. We'll still register it as a modification.
Seeds: treated the same as models.
Sources: database, schema, or identifier has changed. If someone has env-aware definitions, they'll want to turn this off.
Snapshots: treated the same as sources.

Default behavior

I think state:modified should include all changes from all the categories above. The question mark is whether database_representations should be included in the default, since this is the area where people do the most custom things, and it's the knob that will likely be switched off most often. For the sake of clarity, I think it's best to have the state:modified selector be a superset of all modified subselectors.

Future art

state:modified.macros:

A macro's raw contents have changed
By extension, state:modified.macros+ would include all downstream models, tests, etc. that call (directly or indirectly) a macro that has changed
This also includes implicit macro dependencies such as generate_schema_name

state:modified.vars:

A var value has changed
By extension, state:modified.vars+ would include all downstream models, tests, etc. that call (directly or indirectly) a var that has changed

We will update state:modified to include both of these as well.

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2020-09-09T19:22:00Z

There's been some discussion recently (slack thread) about the plausibility of a subselector that would enable us to differentiate between:

a test that is included in dbt test -m state:modified because it depends on a modified node
a test that is itself new or changed (severity of a schema test, file hash of a data test, etc.)

I think representing each of those with a separate modified subselector might enable users to neatly exclude certain tests from their CI environment, i.e. if those tests directly depend on parent models that are not themselves included in state:modified.

An alternative approach, within the CI job definition, is to run all modified models, plus any models that are the immediate ancestor of a modified test:

dbt run -m state:modified 1+state:modified,1+test_type:schema 1+state:modified,1+test_type:data --defer --state ...
dbt test -m state:modified --state ...

The syntax is verbose, and it probably includes a few resources it doesn't need to, but overall this approach is compelling. It also manages to handle cases where one leg of a relationships or equality test would otherwise be missing.

Previously, we were thinking that we'd resolve this wrinkle via dbt test --defer. I'm finding it increasingly hard to reason about deferred tests, and I think we shouldn't add support for them. Adding a subselector instead would allow us to close the same gap in a way that gives users greater control.

[updated 10/15 to use 1+ syntax that actually works]

beckjake · 2020-09-09T19:34:43Z

Well, I've made this pitch before so I guess I'll do it again for old times' sake: I think the best way to do this is to make it so that dbt test -m state:modified only selects modified tests, and selecting tests that descend from modified nodes requires dbt test -m state:modified+. That would apply to all test selectors, so dbt test -m xyz would select tests named xyz, but not tests that ref models named xyz.

That would mean we could get rid of the goofy concept of selector expansion, which only exists to preserve this behavior, where dependent tests are added in after selection runs for dbt test and dbt ls --resource-type=test.

jtcohen6 · 2020-09-09T19:54:11Z

@beckjake Ahh. Thanks for bringing that back. We do have a larger rework of tests on the horizon, so the time could be ripe for changing this change once and for all. Do I have it right that the biggest trade-off here would be losing the ability to do things like:

dbt test -m source:stripe+

Today, this means "run all tests that ref any resource that depends on source:stripe." Without selector expansion, this means "run all tests that directly ref source:stripe." I agree the way it works today is goofy, but that would represent a big loss of functionality. Would we support some kind of middle ground, a la dbt test -m (source:stripe+)+?

For the purposes of this issue, I'll take that as an argument against trying to code up modified subselectors to handle test selector expansion. In the meantime, I can instead encourage people to use the "build immediate ancestor" approach that I find myself preferring, clunky as it is.

beckjake · 2020-09-09T20:01:59Z

I think dbt test -m source:stripe+ would still work exactly the same, because tests are downstream from anything they source or ref.
edit: my example was wrong, after testing, but hopefully the general idea makes sense

beckjake · 2020-09-09T20:15:57Z

I do think it'd be reasonable to create some extra syntax for "all the tests that test the matched selector" to go along with this change. I'm not sure dbt is really ready for meta-selectors or whatever that would be called, but maybe it's more reasonable than I'm imagining. I do get that it's enticing to be able to say "take the dbt run command I just ran and run all of the tests for it". I just think that currently, it's very easy to do that and very painful to avoid selecting particular tests, due to the nature of selector expansion.

jtcohen6 · 2020-09-09T20:18:19Z

Ok, this clicked for me. Far less breaking than I was thinking above. I just went back to reread this, too, while we're here. Was there a time that we previously decided against doing away with test selector expansion? Or did we not reach a verdict, and kept the status quo?

I do think it'd be reasonable to create some extra syntax for "all the tests that test the matched selector" to go along with this change.

Right, I think this makes sense. I mean... maybe dbt test --models keeps the expansion, and dbt test --select does away with it? I seem to recall that as a proposal. Is it the worst of all possible worlds?

I just think that currently, it's very easy to do that and very painful to avoid selecting particular tests, due to the nature of selector expansion.

I strongly agree.

I think this discussion could become a v1.0 issue that we hit along with the other v1.0 issues for tests later this year.

In a world where this behavior exists, the mutually exclusive groups I mentioned above could be specified as:

unmodified tests that directly depend on a modified node: dbt test -s state:modified+1 --exclude state:modified
tests that are themselves modified: dbt test -s state:modified

beckjake · 2020-09-09T20:44:41Z

Was there a time that we previously decided against doing away with test selector expansion?

Yeah, long before #2203 even! I'm not sure where it was, it may unfortunately even have been private messages on slack or something. I recall it was considered just too much of a breaking change and it made it difficult for professional services (and therefore, presumably everyone else who uses dbt, but I think that was the source of the feedback?).

maybe dbt test --models keeps the expansion, and dbt test --select does away with it? I seem to recall that as a proposal. Is it the worst of all possible worlds?

At first I thought this was a great idea but then I remembered that --exclude only has one form. I guess we could add --exclude-models or something? This might be one of those instances where we can easily support both ways via yaml but it's easier to only support one way via the cli. I kind of imagine this is the sort of thing one wants to mix and match in a single command... "Give me all the tests of this selector, except for the tests that have the slow tag" just seems inherently desirable.

huntdm07 · 2020-10-14T11:23:35Z

Weighing in here with a minor comment from Slack (thread) that relates to #2701: I'd like to keep the scenario described by @jtcohen6 on that issue alive, since it's quite a common one and doesn't seem to have a clear solution at the moment:

If a model has been modified, this should run tests against the new model. If a test is new/modified, but the model it selects from is not, this should run tests against the deferred model.

An example scenario would be the one I describe in the Slack thread, where a data test references both a new model and an existing unmodified model (which might not be anywhere in the graph neighbourhood of the new model), and current deferral functionality will not defer either model (while I would expect it to defer the existing unmodified model).

huntdm07 · 2020-10-14T18:26:35Z

Adding a little bit more context that emerged: we arrived at a solution for this particular case that dragged in a few extra models (by using the 1+ parents option a couple of times across intersecting options: 1+state:modified,1+test_type:data), but the ideal case wasn't possible due to the fixed order of operations for selectors.

A possible solution for this would be adding support for order of operations - on the one hand, maybe this is overengineering (and it makes maintaining compatibility between the CLI and the YAML selectors a bit of a pain), but on the other hand lack of control over the order here does rule out some common scenarios like the one outlined in my previous comment.

This could look something like:

selectors:
  - name: parents_of_data_tests
    definition:
      # run this intersection first
      intersection:
        - 'state:modified'
        - 'test_type:data'
      # then get parents of the resulting models
      parents: true
      parents_depth: 1

... and the equivalent in the CLI:

dbt ls -s 1+(state:modified,test_type:data) --state ./target/

dwallace0723 · 2020-11-11T15:56:33Z

@jtcohen6 just here to give a big 👍 to the idea of modification subselectors. Specifically, state:modified.contents. We have such a heavy reliance on environment variables in our dbt project that using state:modified is effectively a non-starter for us right now. Would love to be able to use it in the future though!

ncolomer · 2021-02-24T08:18:45Z

This definitely looks promising!

Up to now, current state:modified feature partially help us because, for some of our dbt models, we rely a lot on variables and jinja templating. Today, this forces us to always redeploy (ie. --full-refresh) those models. And some can be costly because they materialize data.

The state:modified.macros and state:modified.vars subselectors (that would be included by default in state:modified) would a great addition to solve this problem.

jtcohen6 added the enhancement New feature or request label Aug 14, 2020

jtcohen6 modified the milestones: Marian Anderson, 0.19.0 Aug 14, 2020

jtcohen6 mentioned this issue Aug 18, 2020

Add pseudo selectors that select models based on artifact states #2465

Closed

jtcohen6 added the state Stateful selection (state:modified, defer) label Sep 9, 2020

jtcohen6 mentioned this issue Sep 15, 2020

Support deferred tests #2701

Closed

jtcohen6 removed this from the Kiyoshi Kuromiya milestone Oct 1, 2020

jtcohen6 mentioned this issue Oct 15, 2020

Relation doesn't exist for dbt test after dbt run -m state:modified if the test was modified but the model wasn't #2833

Closed

5 tasks

jtcohen6 added this to the Margaret Mead milestone Mar 22, 2021

jtcohen6 mentioned this issue Mar 24, 2021

Selecting models using -m state:modified does not work as expected #3191

Closed

5 tasks

jtcohen6 modified the milestones: Margaret Mead, Oh-Twenty-One Apr 13, 2021

This was referenced Jul 12, 2021

Store the unresolved form of source database representations #2744

Closed

Feature: state:modified.macros #3559

Merged

jtcohen6 self-assigned this Jul 12, 2021

jtcohen6 added the node selection Functionality and syntax for selecting DAG nodes label Aug 2, 2021

jtcohen6 mentioned this issue Aug 3, 2021

using partition in model config causes state:modified false positive #3645

Closed

5 tasks

jtcohen6 closed this as completed in #3559 Aug 12, 2021

jtcohen6 mentioned this issue Nov 18, 2021

[vars] [false negatives] Changes in used variables do not cause models to be selected by state:modified #4304

Open

jtcohen6 removed their assignment Nov 14, 2022

trouze mentioned this issue May 9, 2023

[CT-2551] Make State Selection Collectively Exhaustive #7564

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subselectors for state:modified #2704

Subselectors for state:modified #2704

jtcohen6 commented Aug 14, 2020 •

edited

Loading

jtcohen6 commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020 •

edited

Loading

jtcohen6 commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020

jtcohen6 commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020 •

edited

Loading

huntdm07 commented Oct 14, 2020

huntdm07 commented Oct 14, 2020

dwallace0723 commented Nov 11, 2020

ncolomer commented Feb 24, 2021

Subselectors for state:modified #2704

Subselectors for state:modified #2704

Comments

jtcohen6 commented Aug 14, 2020 • edited Loading

Subselectors

Default behavior

Future art

jtcohen6 commented Sep 9, 2020 • edited Loading

beckjake commented Sep 9, 2020 • edited Loading

jtcohen6 commented Sep 9, 2020 • edited Loading

beckjake commented Sep 9, 2020 • edited Loading

beckjake commented Sep 9, 2020

jtcohen6 commented Sep 9, 2020 • edited Loading

beckjake commented Sep 9, 2020 • edited Loading

huntdm07 commented Oct 14, 2020

huntdm07 commented Oct 14, 2020

dwallace0723 commented Nov 11, 2020

ncolomer commented Feb 24, 2021

jtcohen6 commented Aug 14, 2020 •

edited

Loading

jtcohen6 commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020 •

edited

Loading

jtcohen6 commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020 •

edited

Loading

jtcohen6 commented Sep 9, 2020 •

edited

Loading

beckjake commented Sep 9, 2020 •

edited

Loading