Cache relations in DBT #911
Really great writeup, thanks @beckjake. Some thoughts:
This was sort of a first-pass stream of consciousness response. To be sure, there's a lot of details to get right here, but I want to make sure we're thinking about this correctly at the macro-level. Let me know if you disagree with any of this: super happy to discuss!
Agree. This may inform the structure of the
I agree, but it's important that when caching "throws its hands up," it does so in a way that doesn't cause radical behavior. We have, as a rule, stayed away from SQL parsing to date, but I don't know that we'll be able to continue to do so. If we need to parse SQL to do this right, let's do it.
@beckjake I think it's important to write down the pathological cases for our materializations and caching scenarios. Materializations should, as much as possible, avoid leaving the warehouse in an incomplete state. In RDBMSes, this is easy to reason about because everything is transactional -- if the cache says a table exists, but you get an error when trying to ALTER it, you can usually roll back the entire transaction. This isn't the case for the other cloud data warehouses. If a user of dbt were to
@beckjake to your third point:
I don't think there's a general way to find this info out cross-platform. Or, if there is, it would be no better than just running the actual introspective query to see what still exists!
I guess the only real "risk" is the risk of a query failing in the middle of a dbt run, since
I wrote this like it was a computer science paper. It's probably needlessly verbose, but I wanted to define a language that we can use to talk about caching. This is info about how we can use 1) the dbt DAG and 2) the properties of modern warehouses to cache relations intelligently:

# Caching Strategy

## Overview

There are three classes of database operations that are cache-mutative: "create", "drop", and "swap" operations. All "create" operations are "local"; that is: they only mutate the cache for the entity that is being created. Both the "drop" and "swap" classes of operations are "nonlocal" in nature. For these nonlocal operations, other entities in the cache may need to be mutated as a result of the nonlocal operation. The exact cache mutation logic is dependent on 1) the type of warehouse in use and 2) the type of relations present in the database.

## Creative operations

In the

## Destructive operations

### Dropping a relation

Database relations can be "structurally dependent" on other relations. This structural dependency is a function of one relation selecting from another relation. The term "structurally dependent" is used to denote a tighter dependency than the typical "logical dependency" inherent in dbt.

#### Logical Dependency

Tables can select from other relations. When this happens, the table and the relation(s) that it selects from are logically dependent. In this scenario, the selecting table maintains its own schema and data, so there is no structural dependency present. Instead, the relationship between a table and its parent is a logical dependency.

#### Structural Dependency

Contrastingly, views can also select from other relations. A view does not maintain its own schema or data; rather, the schema and data that comprise a view are a function of 1) the view query definition and 2) the relations that it selects from. Because the definition of a view is a function of the relations that it selects from, it is said to be "structurally" dependent on its parents. More formally: structurally dependent relationships exist where the definition for one relation is inherently tied to the existence of a separate relation. Crucially, this structural dependency is recursive in nature. Consider the following example:

Here, Table A is (by definition) not structurally dependent on any relations.

### Adapter-specific behavior

Both BigQuery and Snowflake have implemented "views" in a late-binding fashion. Views on these databases are logically dependent, but not structurally dependent. In Redshift and Postgres, views are structurally dependent. Interestingly, Redshift supports Late Binding Views which are not structurally dependent.

When a relation is

### Dropping a schema

For cache purposes, dropping a schema is equivalent to dropping all of the relations in the schema, and then dropping the schema itself. When a schema is dropped, all of the relations that structurally depend on relations inside of that schema should also be dropped.
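The recursive eviction described above can be sketched as a graph traversal. This is a minimal illustration, not dbt code; the `drop_closure` helper and the relation names are hypothetical:

```python
from collections import deque

def drop_closure(dependencies, dropped):
    """Return every relation that must be evicted from the cache when
    `dropped` is dropped with CASCADE.

    `dependencies` maps a relation to the set of relations that are
    structurally dependent on it (e.g. views selecting from it). The
    traversal is recursive, so views-on-views are evicted as well.
    """
    to_drop = set()
    queue = deque([dropped])
    while queue:
        rel = queue.popleft()
        if rel in to_drop:
            continue
        to_drop.add(rel)
        queue.extend(dependencies.get(rel, ()))
    return to_drop

# table_a <- view_b <- view_c, and view_d also selects from table_a
deps = {
    "table_a": {"view_b", "view_d"},
    "view_b": {"view_c"},
}
print(sorted(drop_closure(deps, "table_a")))
# -> ['table_a', 'view_b', 'view_c', 'view_d']
```

Dropping a schema would then be the union of `drop_closure` over every relation in that schema.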
I didn't actually touch on the renaming piece... need to think about that one a little bit more
Renaming is just a drop + a create from a cache perspective, right?
@beckjake that's right. I wanted to investigate if there's any sort of difference in how postgres/redshift handles the swap-and-drop if we do it inside of a transaction. I don't think it makes any difference, but I wanted to confirm.
After chatting with Jake: The dbt graph is an approximation of the structural dependency graph inherent in the database. In situations where users use
Instead, dbt should build a separate graph of structural dependencies by querying the database at the beginning of the run.
It's worth noting that we only have to build the graph for postgres and redshift, and for redshift we can completely ignore late-binding views (since all we really care about is "will my
Implemented in #1025
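Building that graph amounts to running one introspective query against the system catalogs (on postgres/redshift, something joining `pg_depend`/`pg_rewrite`; the exact SQL varies) and folding the rows into an adjacency map. A sketch of the folding step, assuming a hypothetical row shape of `(dependent_schema, dependent_name, referenced_schema, referenced_name)`:

```python
from collections import defaultdict

def build_structural_graph(rows):
    """Fold catalog-query rows into {referenced relation -> set of
    structurally dependent relations}, keyed by (schema, name) tuples.

    `rows` is an iterable of
    (dependent_schema, dependent_name, referenced_schema, referenced_name)
    tuples, e.g. the result of an introspective query against the
    pg_depend / pg_rewrite catalogs. The row shape here is an assumption
    for illustration, not dbt's actual query output.
    """
    graph = defaultdict(set)
    for dep_schema, dep_name, ref_schema, ref_name in rows:
        graph[(ref_schema, ref_name)].add((dep_schema, dep_name))
    return dict(graph)

rows = [
    ("analytics", "view_b", "analytics", "table_a"),
    ("analytics", "view_c", "analytics", "view_b"),
]
graph = build_structural_graph(rows)
```

Once built at the start of the run, this graph is what `drop`/`rename` operations would consult to decide which cache entries to evict.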
Relation Caching plan

This is a general roadmap for how we can implement a relation caching mechanism for dbt adapters. Instead of `list_relations()` going out to the database every time, we can maintain a local cache in between adapter invocations. As long as we're careful to invalidate and update the cache as appropriate, we can save a lot of time on metadata queries.

## New concepts

- Relations will be cached on a per-schema basis.
- When we create the cache, we are actually keeping track of the state of all models that have a structural dependency upon the schema.
  - This is different from `ref`: it cares about actual exists-in-the-database dependencies only.
- As we perform operations, we can see what nodes will be dropped by `drop ... cascade` (dependent views will be dropped) and what nodes will be changed by `alter table ... rename` (views will point to the new name) by finding all dependent nodes in the DAG.

## New types/modules

### new type `RelationsCache`

Stores per-schema relations caches.

methods:

- `__contains__(self, schema)`:
- `clear(self)`: Clears the cache for all schemas.
- `clear_schema(self, schema)`: Clears the schema if it's present, otherwise does nothing.
- `rename_relation(self, from_relation, to_relation)`: Replace the old relation with a new one, if present. Otherwise insert it (or error?).
- `clear_relation(self, relation)`: Invalidate a single relation by removing it from the schema cache, if present. Otherwise does nothing (or error?).
- `list_relations(self, schema)`: See `NamespaceCache.list_relations`.
- `get_relation(self, schema, search)`: See `NamespaceCache.get_relation`.
- `add_relation(self, relation)`: See `NamespaceCache.add_relation`.

### new (internal) type `NamespaceCache`

Stores a single schema's relations cache. I don't think this has to be exposed anywhere outside of the `RelationsCache`.

attributes:

- `fully_valid`: `bool` indicating that the cache is a complete representation of the relations in the schema

methods:

- `__contains__(self, relation)`:
- `clear_relation(self, identifier)`: Invalidate a single relation by removing it from the schema cache. (TODO: ensure we don't need relationship_type!)
- `add_relation(self, relation)`: Add a new relation to the schema cache.
- `list_relations(self)`: Get the list of relations cached for this schema.
- `get_relation(self, search)`: Search for the relation in the cache.

## Changes to existing code

- Update `DefaultAdapter` and subclasses' `drop_relation` and `rename_relation`
  - `add_relation` is pretty trivial; it has no impact on the graph.
  - `drop_relation` will traverse the DAG we create above and clear all tables/views that would get hit by `drop ... cascade`.
    - We can't just use the manifest because the manifest can include relationships that don't actually exist - for example, if a user uses `ref` inside a comment/conditional or if a relationship existed that the current dbt graph doesn't know about.
  - `rename_relation` should be simple after that - drop the old and add the new.
- In `get_relation`/`list_relations`, catch any `CacheInvalid` and update the cache appropriately.
- `execute_model(cls, profile, schema, identifier, relation_type, sql, ...)` method

## Update macros

- `create_view_as`
- `create_table_as`
- `create_archive_table`?