Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add state:modified and state:new selectors #2695

Merged
merged 6 commits into from
Aug 14, 2020

Conversation

beckjake
Copy link
Contributor

@beckjake beckjake commented Aug 11, 2020

resolves #2641

Description

Added state:modified and state:new selectors. All new nodes are modified nodes.

There are a few kinds of modifications:

  • names (database, schema, alias)
  • content (for seeds, a file checksum if <1mb or path if >=1mb. For tests, the test name + kwargs. For everything else, the raw sql of the node)
  • configuration (materialization, fqn, quoting, ...)
  • description fields, if persist_docs is set to persist the changed description

Tags configured via dbt_project.yml are not a change (doing it via {{ config(...) }} is a raw_sql change, sadly).

Seed files larger than 1mb do not get compared.

There's probably room to be less fiddly about what's a change and what isn't!

If you pass a state:modified selector and dbt detects that any of your macros have changed, dbt logs a blanket warning once per selector to let the user know that state:modified won't see those changes. This warning is never an error (it goes directly through logger.warning, not warn_or_error).

Two small structural changes to dbt:

  • I moved the various things about files themselves into a new contracts module alongside the manifest contracts (where they used to live) as part of this. Now that checksums live on nodes, it caused an import cycle. Also, they never really belonged in the manifest module.
  • I re-worked main to accept --state to commands that use selectors and merge some common logic on that front. This was inspired by making all these changes and then realizing I forgot to handle ls because it's totally separate... hopefully this is better.

This PR does bloat the manifest a little bit with checksums. It seemed hard to avoid, though technically we could change it so only seeds have them.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Aug 11, 2020
@beckjake beckjake force-pushed the feature/state-modified-selector branch from 6a397c7 to d77113d Compare August 11, 2020 20:43
@beckjake beckjake marked this pull request as ready for review August 11, 2020 21:07
@beckjake beckjake force-pushed the feature/state-modified-selector branch from 982395c to 153eb7e Compare August 12, 2020 13:45
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to celebrate just what a tremendous leap this is. The fact that I can define a selector,

selectors:
  - name: deltas
    definition:
      method: state
      value: modified

set an environment variable, and then just dbt run --selector deltas is pretty neat, given that these are all new constructs in v0.18.0.

Playing around with this has helped me realize which modified signals make a lot of sense, and which make very little. I totally agree with your sentiment:

There's probably room to be less fiddly about what's a change and what isn't!

Name (database/schema/alias)

  • If the target.database or target.schema is different between the current invocation and the manifest, all nodes are considered modified. As this is almost always true for development/CI runs—it's the premise behind deferred runs—we definitely need a different behavior here.
  • This gets even trickier with generate_thing_name macros for database/schema/alias. E.g. if the macro ignores schema configs in non-prod envs, changing the schema config won't mark any nodes as modified in dev/ci.
  • This has spun me around 180 degrees: I now think we should never consider database/schema/alias for marking nodes as modified, because these are high env-aware values that are likely to be different across environments. What do you think?

Vars

  • In v1 dbt_project.yml config, vars are scoped to a set of model nodes, so changing a variable marks all models in that scope as changed. I guess that's right, and it's only in the old config.
  • In v2 config, which scopes vars at the project level, changing a var does not mark any nodes as having been changed.
  • In the future, if possible, we should identify which nodes depend on vars, similar to macro dependencies. (I know vars vis-a-vis parsing can be a nightmare.) In the meantime, would it be possible to raise a similar warning?
During a state comparison, dbt detected a change in vars: 'var1', 'var2'. This will not be marked as a modification.

) -> Iterator[UniqueId]:
if self.previous_state is None or self.previous_state.manifest is None:
raise RuntimeException(
'Got a state selector method, but no deferred manifest'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit, "comparison" feels like the right word here (you use it the warning above)

Suggested change
'Got a state selector method, but no deferred manifest'
'Got a state selector method, but no comparison manifest'

if self.macros_were_modified:
logger.warning(warning_tag(
'During a state comparison, dbt detected a change in '
'macros. This will not be marked as a modification.'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to list up to three of the modified macros? Or is this an order of magnitude simpler as a boolean check? I don't feel that strongly

f'previous file had a checksum type of '
f'{second.checksum.name}, so it has changed'
)
warn_or_error(msg, node=first)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this warning/error series. Could we include the seed name as well?

@beckjake
Copy link
Contributor Author

beckjake commented Aug 12, 2020

I now think we should never consider database/schema/alias for marking nodes as modified, because these are high env-aware values that are likely to be different across environments. What do you think?

I think this would be fine, perhaps not intuitive. One thing we could look into is comparing the configured database/schema/alias values, before target has been considered (so they can be None). Would that be better? I'm not sure exactly how easy it is, but it sure feels like it should be easy! This wouldn't account for generate_*_name macros though.

Vars

First of all, if we're considering vars, I think we should just not support v1 configs here. If you set state:modified and your config is v1, it's an error, can't do it. I don't want to think about it and support every silly edge case just for deprecated functionality.

We can do a vars check - as you've correctly identified, it's just like the macro dependencies issue. One issue we'll see is that we don't really have the vars in the manifest currently, so we'll have to include those. We'll have the same problem with env_var and I don't have a solution for that: we can't dump the env, there's secrets in there!

@jtcohen6
Copy link
Contributor

This wouldn't account for generate_*_name macros though.

Use of these macros is pretty widespread, and hard to account for in its permutations. I feel strongly that we need state comparison to work for projects that override default database/schema/alias behavior in an env-based way, and I want to worry as little as possible about the ways in which they do it.

So I'm trying to think of a counterargument. What's the downside of completely ignoring changes to database/schema/alias in a CI environment? We already have good errors at compile time if dbt finds multiple resources with the same database representation.

I agree this would be less intuitive. It warrants an explanation in our docs.

What about for snapshots, where target_database+ target_schema tend to be stable across environments? I don't think it matters. It's not currently possible to dbt snapshot --select state:modified, and I don't see a use case for "only execute my modified snapshots," anyway.

Vars

  • Totally agree with eschewing v1
  • Sounds feasible for v2
  • Env vars I don't care about. I see them in the same category as profile/target configs and database/schema/alias: these things should be different across deployment + development environments, so diffing on them does not help. (I'm open to the argument that vars should fall into this category, too, but for now I don't think they do.)

@beckjake
Copy link
Contributor Author

What's the downside of completely ignoring changes to database/schema/alias in a CI environment?

Well, I imagine a world where you change a database name field or something and it ends up being queried by an external dashboard. Should dbt run --models state:modified pick that up? Maybe not, I actually don't have much intuition here.

@beckjake
Copy link
Contributor Author

I'm open to the argument that vars should fall into this category, too, but for now I don't think they do.

I have to think about this more, intuitively I think CLI vars are like env vars, but vars defined in your dbt_project.yml are like config items. That's not a very good answer though, because that's a pain to implement.

@jtcohen6
Copy link
Contributor

Well, I imagine a world where you change a database name field or something and it ends up being queried by an external dashboard.

It's a really good thought. I think that'd be a separate (and really cool) project, maybe plugging into existing thoughts on catalog comparison + having dbt know more about downstream model "exposures." Separate from running models in CI, I could imagine providing a warning in CI: "FYI you changed some columns in model X / the alias of model X, be aware that this may impact Y, Z downstream exposures that depend on it."

I'm good with punting on vars until v0.19.0.

@beckjake
Copy link
Contributor Author

beckjake commented Aug 13, 2020

I'm good with punting on vars until v0.19.0.

I would really like to do that, because then we can remove v1 config support before we do it. I think I will feel a lot better about this when calculating vars for a given model only requires its package. That's the kind of thing we can just cram into the manifest.

If we're going to say that database/schema/alias is no longer considered a change for nodes, should it be for sources?

Edit: and is a table's quoting setting not a change then? Doesn't feel like a change.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 13, 2020

[EDIT: OVERTAKEN BY EVENTS. This turned into an offline discussion that I'll summarize in a new issue]

Sources are an excellent question. Changes to a source's database representation could indicate a meaningful modification, such as switching an ingestion mechanism, which is definitely worth testing downstream dependencies. By default, dbt expects sources to have the same database representations across environments. (Same as snapshots.)

That said, users can (and often do) define sources with env-aware logic like:

sources:
  - name: application_db
    schema: "{{ 'application_db_prod' if target.name == 'prod' else 'application_db_staging' }}"

The former, which is an "actual" modification, takes place quite rarely. The latter would be identified as a false positive every single run. So while I don't feel great about it, I don't think we should compare source database/schema/identifier. That also keeps us consistent across node types.

Here are more complex approaches I can imagine taking:

  • If a source's database/schema/name/identifier is a Jinja expression, we don't try to compare it; if it's a string literal, we do. This may not be possible, and it may be too confusing to be desirable, anyway.
  • An eventual resolution (not for v0.18.0) would be the ability to compare state against one manifest and defer to a different one. This would allow someone to dbt compile --target ci, have all the namespaces in that manifest be right for comparison, but then defer to another manifest with a different set of namespaces.

Quoting still feels like a change. Accidentally switching off quoting could definitely be a breaking change. Also, I would never expect quoting behavior to be different across environments. Even though this is tightly linked to database representation, it feels more like a model config in the vein of materialized, sort/dist, etc.

@beckjake beckjake force-pushed the feature/state-modified-selector branch from 80d3bcd to 13fb235 Compare August 14, 2020 14:49
@beckjake beckjake mentioned this pull request Aug 14, 2020
4 tasks
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! I confirmed locally:

  • Macro warning lists some of the changed macros
  • Seed warning includes name of too-big seed
  • Modifications to database representation are based on the user-configured database/schema/alias rather than the resolved values

@beckjake beckjake merged commit d554835 into dev/marian-anderson Aug 14, 2020
@beckjake beckjake deleted the feature/state-modified-selector branch August 14, 2020 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slim CI runs, via state comparison and deferred refs
4 participants