Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subselectors for state:modified #2704

Closed
jtcohen6 opened this issue Aug 14, 2020 · 11 comments · Fixed by #3559
Closed

Subselectors for state:modified #2704

jtcohen6 opened this issue Aug 14, 2020 · 11 comments · Fixed by #3559
Labels
enhancement New feature or request node selection Functionality and syntax for selecting DAG nodes state Stateful selection (state:modified, defer)
Milestone

Comments

@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 14, 2020

This is a follow-on from the initial proposals in #2465, #2641. Most of the required work is around exposing + sugaring the foundational work in #2695.

The way state comparison works is by identifying discrepancies between two manifests. When comparing between a past prod manifest and the current development manifest, discrepancies can be the result of two things:

  1. Changes made to the project in development
  2. Env-aware logic that causes different behavior based on the target, env vars, etc.

We're going to do our best to capture only changes that are the result of development. If someone's project has tons of intricate env-aware logic, they'll run more models than they want (i.e. more false positives). So we're giving them the option to turn off some knobs, in the form of more-specific subselectors.

Subselectors

There is potential for overlap: a single change can trigger multiple modification categories.

state:modified.contents:

  • Models: raw file contents changed
  • Snapshots: raw file contents changed
  • Data tests: raw file contents changed
  • Analyses: raw file contents changed
  • Seeds: raw file contents changed
    • However, if the file is >1 MB, we cannot compare raw file contents, so we raise a warning and just compare based on file path instead.

This alone would get a lot of people what they want! It's basically "just hash the files," excluding YAML configuration.

state:modified.configs:

  • Models: changes to materialized, quoting, bind, transient, sort/dist, partition_by, incremental_strategy...
    • This category captures changes in dbt_project.yml or {{config()}} blocks. If the changes are made in a {{config()}} block, they will also be picked up as content changes.
    • If someone has env-aware logic for materialized, where a model is a view in dev and a table in prod, they will not want to include this.
  • Snapshots: unique_key, strategy, ...
  • Seeds: quoting, column_types
  • Schema tests: severity changes

state:modified.descriptions:

  • If persist_docs is turned on for a node, description changes count as modifications. (If just columns, just column descriptions; just relations, top-level descriptions; if both, then both.)

state:modified.database_representations:

  • Models: changes to the configured database, schema, identifier. This value represents the manual input only, and it's different from the resolved database representation, which depends on the target and generate_x_name macros.
    • If someone manually sets schema = target.schema, or schema = target.schema ~ '_suffix' instead of using the generate_schema_name macro, that will register as a change between environments and they'll want to turn this off.
    • Depending on the generate_x_name logic and the current environment, a chance to the configured value may not actually change the database representation. We'll still register it as a modification.
  • Seeds: treated the same as models.
  • Sources: database, schema, or identifier has changed. If someone has env-aware definitions, they'll want to turn this off.
  • Snapshots: treated the same as sources.

Default behavior

I think state:modified should include all changes from all the categories above. The question mark is whether database_representations should be included in the default, since this is the area where people do the most custom things, and it's the knob that will likely be switched off most often. For the sake of clarity, I think it's best to have the state:modified selector be a superset of all modified subselectors.

Future art

state:modified.macros:

  • A macro's raw contents have changed
  • By extension, state:modified.macros+ would include all downstream models, tests, etc. that call (directly or indirectly) a macro that has changed
  • This also includes implicit macro dependencies such as generate_schema_name

state:modified.vars:

  • A var value has changed
  • By extension, state:modified.vars+ would include all downstream models, tests, etc. that call (directly or indirectly) a var that has changed

We will update state:modified to include both of these as well.

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Sep 9, 2020

There's been some discussion recently (slack thread) about the plausibility of a subselector that would enable us to differentiate between:

  1. a test that is included in dbt test -m state:modified because it depends on a modified node
  2. a test that is itself new or changed (severity of a schema test, file hash of a data test, etc.)

I think representing each of those with a separate modified subselector might enable users to neatly exclude certain tests from their CI environment, i.e. if those tests directly depend on parent models that are not themselves included in state:modified.

An alternative approach, within the CI job definition, is to run all modified models, plus any models that are the immediate ancestor of a modified test:

dbt run -m state:modified 1+state:modified,1+test_type:schema 1+state:modified,1+test_type:data --defer --state ...
dbt test -m state:modified --state ...

The syntax is verbose, and it probably includes a few resources it doesn't need to, but overall this approach is compelling. It also manages to handle cases where one leg of a relationships or equality test would otherwise be missing.

Previously, we were thinking that we'd resolve this wrinkle via dbt test --defer. I'm finding it increasingly hard to reason about deferred tests, and I think we shouldn't add support for them. Adding a subselector instead would allow us to close the same gap in a way that gives users greater control.

[updated 10/15 to use 1+ syntax that actually works]

@beckjake
Copy link
Contributor

beckjake commented Sep 9, 2020

Well, I've made this pitch before so I guess I'll do it again for old times' sake: I think the best way to do this is to make it so that dbt test -m state:modified only selects modified tests, and selecting tests that descend from modified nodes requires dbt test -m state:modified+. That would apply to all test selectors, so dbt test -m xyz would select tests named xyz, but not tests that ref models named xyz.

That would mean we could get rid of the goofy concept of selector expansion, which only exists to preserve this behavior, where dependent tests are added in after selection runs for dbt test and dbt ls --resource-type=test.

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Sep 9, 2020

@beckjake Ahh. Thanks for bringing that back. We do have a larger rework of tests on the horizon, so the time could be ripe for changing this change once and for all. Do I have it right that the biggest trade-off here would be losing the ability to do things like:

dbt test -m source:stripe+

Today, this means "run all tests that ref any resource that depends on source:stripe." Without selector expansion, this means "run all tests that directly ref source:stripe." I agree the way it works today is goofy, but that would represent a big loss of functionality. Would we support some kind of middle ground, a la dbt test -m (source:stripe+)+?

For the purposes of this issue, I'll take that as an argument against trying to code up modified subselectors to handle test selector expansion. In the meantime, I can instead encourage people to use the "build immediate ancestor" approach that I find myself preferring, clunky as it is.

@beckjake
Copy link
Contributor

beckjake commented Sep 9, 2020

I think dbt test -m source:stripe+ would still work exactly the same, because tests are downstream from anything they source or ref.
edit: my example was wrong, after testing, but hopefully the general idea makes sense

@beckjake
Copy link
Contributor

beckjake commented Sep 9, 2020

I do think it'd be reasonable to create some extra syntax for "all the tests that test the matched selector" to go along with this change. I'm not sure dbt is really ready for meta-selectors or whatever that would be called, but maybe it's more reasonable than I'm imagining. I do get that it's enticing to be able to say "take the dbt run command I just ran and run all of the tests for it". I just think that currently, it's very easy to do that and very painful to avoid selecting particular tests, due to the nature of selector expansion.

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Sep 9, 2020

Ok, this clicked for me. Far less breaking than I was thinking above. I just went back to reread this, too, while we're here. Was there a time that we previously decided against doing away with test selector expansion? Or did we not reach a verdict, and kept the status quo?

I do think it'd be reasonable to create some extra syntax for "all the tests that test the matched selector" to go along with this change.

Right, I think this makes sense. I mean... maybe dbt test --models keeps the expansion, and dbt test --select does away with it? I seem to recall that as a proposal. Is it the worst of all possible worlds?

I just think that currently, it's very easy to do that and very painful to avoid selecting particular tests, due to the nature of selector expansion.

I strongly agree.

I think this discussion could become a v1.0 issue that we hit along with the other v1.0 issues for tests later this year.

In a world where this behavior exists, the mutually exclusive groups I mentioned above could be specified as:

  1. unmodified tests that directly depend on a modified node: dbt test -s state:modified+1 --exclude state:modified
  2. tests that are themselves modified: dbt test -s state:modified

@beckjake
Copy link
Contributor

beckjake commented Sep 9, 2020

Was there a time that we previously decided against doing away with test selector expansion?

Yeah, long before #2203 even! I'm not sure where it was, it may unfortunately even have been private messages on slack or something. I recall it was considered just too much of a breaking change and it made it difficult for professional services (and therefore, presumably everyone else who uses dbt, but I think that was the source of the feedback?).

maybe dbt test --models keeps the expansion, and dbt test --select does away with it? I seem to recall that as a proposal. Is it the worst of all possible worlds?

At first I thought this was a great idea but then I remembered that --exclude only has one form. I guess we could add --exclude-models or something? This might be one of those instances where we can easily support both ways via yaml but it's easier to only support one way via the cli. I kind of imagine this is the sort of thing one wants to mix and match in a single command... "Give me all the tests of this selector, except for the tests that have the slow tag" just seems inherently desirable.

@jtcohen6 jtcohen6 added the state Stateful selection (state:modified, defer) label Sep 9, 2020
@jtcohen6 jtcohen6 removed this from the Kiyoshi Kuromiya milestone Oct 1, 2020
@huntdm07
Copy link

Weighing in here with a minor comment from Slack (thread) that relates to #2701: I'd like to keep the scenario described by @jtcohen6 on that issue alive, since it's quite a common one and doesn't seem to have a clear solution at the moment:

If a model has been modified, this should run tests against the new model. If a test is new/modified, but the model it selects from is not, this should run tests against the deferred model.

An example scenario would be the one I describe in the Slack thread, where a data test references both a new model and an existing unmodified model (which might not be anywhere in the graph neighbourhood of the new model), and current deferral functionality will not defer either model (while I would expect it to defer the existing unmodified model).

@huntdm07
Copy link

Adding a little bit more context that emerged: we arrived at a solution for this particular case that dragged in a few extra models (by using the 1+ parents option a couple of times across intersecting options: 1+state:modified,1+test_type:data), but the ideal case wasn't possible due to the fixed order of operations for selectors.

A possible solution for this would be adding support for order of operations - on the one hand, maybe this is overengineering (and it makes maintaining compatibility between the CLI and the YAML selectors a bit of a pain), but on the other hand lack of control over the order here does rule out some common scenarios like the one outlined in my previous comment.

This could look something like:

selectors:
  - name: parents_of_data_tests
    definition:
      # run this intersection first
      intersection:
        - 'state:modified'
        - 'test_type:data'
      # then get parents of the resulting models
      parents: true
      parents_depth: 1

... and the equivalent in the CLI:

dbt ls -s 1+(state:modified,test_type:data) --state ./target/

@dwallace0723
Copy link

@jtcohen6 just here to give a big 👍 to the idea of modification subselectors. Specifically, state:modified.contents. We have such a heavy reliance on environment variables in our dbt project that using state:modified is effectively a non-starter for us right now. Would love to be able to use it in the future though!

@ncolomer
Copy link

This definitely looks promising!

Up to now, current state:modified feature partially help us because, for some of our dbt models, we rely a lot on variables and jinja templating. Today, this forces us to always redeploy (ie. --full-refresh) those models. And some can be costly because they materialize data.

The state:modified.macros and state:modified.vars subselectors (that would be included by default in state:modified) would a great addition to solve this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request node selection Functionality and syntax for selecting DAG nodes state Stateful selection (state:modified, defer)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants