[CT-168] Cache objects for selected resources only? #4688
Comments
Hi, we are trying to run individual models one at a time from Apache Airflow (by parsing manifest.json). We have a common project where multiple teams are working, so there are about 30 schemas in total across various databases (Snowflake). When running a single model, it takes 2 minutes just to start the run, because dbt first populates its relation cache across all of those schemas. I was hoping there could be a command-line or config parameter to change this default behavior. This would give users options on what is best for their use case.
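A minimal sketch of the orchestration pattern described above, assuming a standard target/manifest.json produced by dbt; mapping each printed command onto its own Airflow task is left implied:

```python
# Parse dbt's manifest.json and emit one `dbt run --select <model>` command
# per model, e.g. to wrap each in its own Airflow task. The keys used here
# ("nodes", "resource_type", "name") follow dbt's documented artifact schema.
import json

with open("target/manifest.json") as f:  # produced by `dbt compile` / `dbt run`
    manifest = json.load(f)

for unique_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    # Each of these invocations triggers a full cache population at startup,
    # which is the ~2 minute overhead described above.
    print(f"dbt run --select {node['name']}")
```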
I spoke with a dbt user today who is trying to reference the data type of a column during a post-hook macro. They observed that, if the data type of a column changes between two successive runs (say from a TEXT to a NUMBER), then the post-hook macro that references the data type will see the data type from before the run (e.g., TEXT rather than NUMBER). I think this is an example of negative/unexpected behavior of caching.
@boxysean Great catch. Confirming that the user was on Spark/Databricks? That's the only adapter I know of where we cache column-level info at the start of the run, since it's available from the caching query, such that subsequent column-metadata lookups can reuse it. This is a tricky one: it wouldn't be solved by limiting the cache to selected/run relations, nor would it be solved by updating the cache with a more detailed relation object after the materialization runs (since that would come after the post-hook). It's a good reason to prefer a less powerful caching query that doesn't return column info (option 2 here: dbt-labs/dbt-spark#296).
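To make the failure mode concrete, here is a toy Python sketch (hypothetical names and structures, not dbt's actual internals) of a cache populated once at run start and never refreshed afterward:

```python
# Cache filled at run start, while my_model.amount is still TEXT:
cache = {"my_model": {"amount": "TEXT"}}

# State of the warehouse itself:
warehouse = {"my_model": {"amount": "TEXT"}}

def rebuild_model(warehouse):
    # The materialization recreates the table with amount as NUMBER,
    # but only the warehouse changes; the cache entry is never refreshed.
    warehouse["my_model"] = {"amount": "NUMBER"}

rebuild_model(warehouse)

# A post-hook that consults cached column metadata sees the stale type:
assert cache["my_model"]["amount"] == "TEXT"        # stale, pre-run value
assert warehouse["my_model"]["amount"] == "NUMBER"  # actual, post-run value
```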
@jtcohen6 They are on Snowflake. I'll try to get them to chime in directly and share their code to illustrate. |
Background
Should we cache everything dbt cares about, or try to cache only the objects relevant to the resources that are selected to run? This is an increasingly important consideration as folks move to more granular invocations (for example, running one model at a time from an orchestrator, as in the comment above).
The intent of the relation cache is to speed up performance, after all. If it's not serving that purpose, in its current unadaptive form, then we should change its behavior.
Details
We know the set of selected resources at the time when we populate the adapter cache. We only create schemas (if they do not yet exist) for objects that are selected. Should we also limit the reach of our metadata queries, to only introspect the schemas we care about?
See dbt-core/core/dbt/task/run.py, lines 457 to 458 (at a588607).
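As a rough sketch of that idea, assuming a simplified manifest/adapter interface (the helper below is hypothetical, though list_relations is a real adapter method), cache population could be restricted to the schemas of selected nodes:

```python
# Derive the set of (database, schema) pairs from the *selected* nodes only,
# and run the metadata query just for those, instead of introspecting every
# schema the project has ever touched.
def populate_cache_for_selection(adapter, manifest, selected_unique_ids):
    required_schemas = {
        (node.database, node.schema)
        for uid, node in manifest.nodes.items()
        if uid in selected_unique_ids
    }
    for database, schema in required_schemas:
        # One metadata query per *selected* schema, rather than per known schema
        adapter.list_relations(database, schema)
```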
Should we go one step further, and use filters/wildcards to only cache the objects we care about? This seems necessary on dbt-spark (issue linked below).
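A hedged sketch of the filters/wildcards idea on Spark, where show table extended accepts a pattern; run_raw_query below is a hypothetical stand-in for the adapter's metadata call:

```python
# Build a LIKE pattern from the selected relation names so the caching query
# only returns objects we care about. Spark's pattern syntax accepts '|' as
# an alternation, e.g. 'name1|name2|name3'.
def cache_selected_relations(run_raw_query, schema, selected_names):
    pattern = "|".join(selected_names)
    return run_raw_query(
        f"show table extended in {schema} like '{pattern}'"
    )
```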
Risks
--defer depends on being able to access cache information about resources that are not selected (context in "Defer iff unselected reference does not exist in current env" #2946): see dbt-core/core/dbt/contracts/graph/manifest.py, lines 952 to 955 (at a588607).
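A simplified sketch (hypothetical structure, not the actual manifest code) of why a selection-limited cache conflicts with --defer: the fallback decision needs to know whether unselected relations exist in the current environment:

```python
# Resolving a ref under --defer falls back to the other environment only when
# the relation is absent from the current one. That check requires metadata
# about *unselected* relations.
def resolve_ref(name, current_relations, deferred_relations):
    # If unselected relations were never cached, `name not in current_relations`
    # is true even when the object actually exists in the current environment,
    # and we would wrongly defer to the other environment.
    if name in current_relations:
        return current_relations[name]
    return deferred_relations[name]
```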
Questions
Can we limit the cache to the schemas/objects needed by selected resources without breaking --defer behavior?
Related
list_relations_without_caching method (dbt-spark#228)