Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vars] [false negatives] Changes in used variables do not cause models to be selected by state:modified #4304

Open
Tracked by #9562
CyberShadow opened this issue Nov 18, 2021 · 10 comments · Fixed by #10793
Labels
paper_cut A small change that impacts lots of users in their day-to-day state: modified state Stateful selection (state:modified, defer)
Milestone

Comments

@CyberShadow
Copy link

Hi there,

In https://docs.getdbt.com/reference/node-selection/state-comparison-caveats, it is stated:

Vars

If a model uses a var or env_var in its definition, dbt is unable today to identify that lineage in such a way that it can include the model in state:modified because the var or env_var value has changed.

I was wondering if there is an issue tracking this limitation; I searched for one and couldn't find one, so I hope it's OK that I opened this issue so that our team can follow its status. (If there was one and I missed it, apologies!)

Also, I apologize for the naive question, but would it not be easier, at least for the case where the desired effect is to re-run all changed models and their children, to check if the source code after preprocessing (Jinja macro expansion) has changed, instead of before? I would expect that this would automatically cause changes in variables to propagate to changes to the compiled code of the models, which would then indicate a need to re-run them.

Thanks!

@jtcohen6 jtcohen6 added triage enhancement New feature or request state Stateful selection (state:modified, defer) Team: Language labels Nov 18, 2021
@jtcohen6
Copy link
Contributor

@CyberShadow Thanks for opening the issue! I think the previous issue we had to track this (#2704) was closed, on account of being mostly (though not entirely) resolved.

We just recently did a bunch of work to capture env_var calls for partial parsing. That allowed us to track (a) when those env vars change, and (b) which files need to be re-parsed accordingly. We can reasonably extend that to support --vars as well, which currently trigger a full re-parse.

There's a lot of overlap between that work, and what we'd need to support this in state:modified. There's still a gap between them, however. Bridging that gap would require us to do one of two things:

  • Add just enough of this information to manifest.json, to allow it to make the same comparisons—namely, the dictionary of env var values, and a mapping from each env var to every node that depends on it in some way (including macros)
  • Switch dbt state selection to use the internal manifest (partial_parse.msgpack) rather than the documented, externally contracted manifest (manifest.json)

Also, I apologize for the naive question, but would it not be easier, at least for the case where the desired effect is to re-run all changed models and their children, to check if the source code after preprocessing (Jinja macro expansion) has changed, instead of before?

This is a totally fair question! We want state:modified to detect, as much as possible, changes that are accountable to you, the dbt developer. Those changes could occur because someone has edited a model, because someone has edited a macro that a model depends on, or because someone has changed the configuration for dozens of models via dbt_project.yml. As much as possible, we want to limit "false positives" — differences that occur not due to editing dbt project source code, but because a project is being run in a different environment, or because something in the database has changed, impacting dynamically templated SQL.

To that end, I'm starting to think that changes in --vars + env vars look less like the kind of development-time changes we want to be capturing in state:modified, and more like the kind of environment-based changes we want to ignore for state-based selection (but observe well enough to re-parse the project correctly, of course). That is, I'm starting to wonder if state:modified.vars and state:modified.env_vars are actually desirable!

So, to turn the question around: Could you say more about your use case? What sorts of changes are you hoping to test in CI, or capture in production, that depend on vars or env vars?

@CyberShadow
Copy link
Author

Hey, thank you for the detailed response!

Good to know about state:modified.vars and that it is on the roadmap :)

Thank you for the insight about state:modified. The explanation makes sense, though I would mention that occasional false positives (e.g. on a dbt upgrade) would be OK for us as it simply means that some models will be wastefully re-run. On the other hand, false negatives are potentially dangerous if some models were not rebuilt when they really should have, as it would cause the final results to be wrong.

I'm happy to talk about our use case :) One part of the topic I'm working on right now is reproducibility of our data transformation pipelines, i.e., the ability to re-run previous pipelines and obtain the exact same result. This will make it easier to reproduce past problems in order to understand and debug them. As part of this, the approach I was considering was that each time we receive updated data, it would be assigned a version identifier, with that version of the data being immutable henceforth. This data version identifier to use would then be passed to dbt as a var, which would then be expanded to the fully-qualified name of some table or snapshot in the SQL template.

You can probably see where this is going: if the specified data version to use is changed, then all models that use that data source should also be re-run. So, from this point of view, it would make sense if a change in a variable's value causes models which use it to be considered modified; but, I also understand your perspective where vars indicate environmental configuration, changes in which should not signal modifications.

Hope this helps!

@LHarris55
Copy link

Hi there,

We are experiencing the exact same issue where once a variable is changed in the dbt_project.yml file the downstream models which reference the variable are not ran when calling state:modifed. Hence, the variables not updated in these models.

It sounds like state:modified.vars would be a great solution to this.

Hope this can be implemented in the future!

@axelborja
Copy link

axelborja commented Sep 29, 2022

We are also experiencing the same limitations, so, unfortunately, because we are using some vars in our models, we cannot use the state:modified feature at all. (or we can but it's a non-sense)

@jtcohen6, any update about this feature?

@Pgillets
Copy link

Im my case, we use the DBT grant feature to manage access in BigQuery roll level security, and every time we include a new Service Account or user to the list all models are set to modified.
So, a PR that only includes an e-mail to the dbt_project.yml make all the tables in the project to run on the Slim CI.

@axelborja
Copy link

We are also experiencing the same limitations, so, unfortunately, because we are using some vars in our models, we cannot use the state:modified feature at all. (or we can but it's a non-sense)

@jtcohen6, any update about this feature?

@jtcohen6 any update about this one? :)

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 7, 2023

I've come around from some of the hesitations I expressed above (back in Nov 2021). I do think this would be desirable, and we can make it configurable for end users based on a subselector.

The prerequisite for this work is capturing var calls during parse time, and storing them in something like depends_on. I think it's the same prerequisite work for:

I'm going to mark this issue a paper_cut. Please keep upvoting / commenting to help us inform priority!

@jtcohen6 jtcohen6 added paper_cut A small change that impacts lots of users in their day-to-day Team:Language and removed Team:Execution labels Mar 7, 2023
@graciegoheen graciegoheen changed the title Changes in used variables do not cause models to be selected by state:modified [false negatives] Changes in used variables do not cause models to be selected by state:modified Feb 14, 2024
@graciegoheen graciegoheen changed the title [false negatives] Changes in used variables do not cause models to be selected by state:modified [vars] [false negatives] Changes in used variables do not cause models to be selected by state:modified Feb 14, 2024
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 13, 2024
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 21, 2024
@ChenyuLInx
Copy link
Contributor

This change is reverted in #10813 due to it would cause new parsing error in project that would work before. So reopening the issue.
We also noticed some memory issue goes away after this PR is reverted, not sure it is related but worth checking memory consumption of the fix before shipping again. (Internal Thread)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
paper_cut A small change that impacts lots of users in their day-to-day state: modified state Stateful selection (state:modified, defer)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants