Cache relations in DBT #911

beckjake · 2018-08-08T21:32:32Z

Relation Caching plan

This is a general roadmap for how we can implement a relation caching mechanism for dbt adapters. Instead of list_relations() going out to the database every time, we can maintain a local cache in between adapter invocations. As long as we're careful to invalidate and update the cache as appropriate, we can save a lot of time on metadata queries.

New concepts

Relations will be cached on a per-schema basis.

When we create the cache, we are actually keeping track of the state of all models that have a structural dependency upon the schema.

this is basically a second DAG, much like the manifest, but instead of caring about ref it cares about actual exists-in-the-database dependencies only.
we have a query for this for postgres and redshift
we can ignore this for snowflake, bigquery, and redshift's late-binding views

As we perform operations, we can see what nodes will be dropped by drop ... cascade (dependent views will be dropped) and what nodes will be changed by alter table ... rename (views will point to the new name) by finding all dependent nodes in the DAG

New types/modules

new type `RelationsCache`

Stores per-schema relations caches

methods

__contains__(self, schema):
- Returns True if the schema is cached.
clear(self): Clears the schema for all schemas.
- No return value
clear_schema(self, schema): Clears the schema if it's present, otherwise does nothing.
- No return value.
rename_relation(self, from_relation, to_relation): Replace the old relation with a new one, if present. Otherwise insert it (or error?).
- No return value.
clear_relation(self, relation): Invalidate a single relation by removing it from the schema cache, if present. Otherwise does nothing (or error?).
- No return value.
list_relations(self, schema): See NamespaceCache.list_relations
get_relation(self, schema, search): See NamespaceCache.get_relation.
add_relation(self, relation): See NamespaceCache.add_relation.
- Creates a new partially valid NamespaceCache if none exists.

new (internal) type `NamespaceCache`

Store a single schema's relations caches. I don't think this has to be exposed anywhere outside of the RelationsCache.

attributes

fully_valid: bool indicating that the cache is a complete representation of the relations in the schema

methods

__contains__(self, relation):
- Returns True if the relation is in the cache
clear_relation(self, identifier): Invalidate a single relation by removing it from the schema cache. (TODO: ensure we don't need relationship_type!)
- No return value.
add_relation(self, relation): Add a new relation to the schema cache.
- No return value.
list_relations(self): Get the list of relations cached for this schema
- Returns a list of Relations
get_relation(self, search): Search for the relation in the cache.
- Returns a list of Relations that match (caller must handle duplicates).

Changes to existing code

Update `DefaultAdapter` and subclasses

We will now instantiate adapters with profile, project, and manifest in the factory
- We'll want to make adapters singletons (classes are singletons and we use that fact)
- If we make it a singleton, the cache can be a member of the adapter instance
  - Integration tests will have to clear this in setup/teardown, or un-set the singleton adapter.
  - We will need to create a DAG of the structural dependencies in postgres/redshift. This only impacts early-binding views, which is why we can ignore it for snowflake/bigquery and redshift's late-binding views.
They will now call into the RelationsCache to update it in drop_relation and rename_relation
- add_relation is pretty trivial, it has no impact on the graph.
- drop_relation will traverse the DAG we create above and clear all tables/views that would get hit by 'drop ... cascade'
  - note that late-binding views will not be cleared
  - this is something we'll want to unit test extensively
    - We can't just use the manifest because the manifest can include relationships that don't actually exist - for example, if a user uses ref inside a comment/conditional or if a relationship existed that the current dbt graph doesn't know about.
- rename_relation should be simple after that - drop the old and add new.
- check out @drewbanin's comment here for some great detail on this
They will look at the cache for get_relation/list_relations, catch any CacheInvalid and update the cache appropriately
Add an execute_model(cls, profile, schema, identifier, relation_type, sql, ...) method
- add the new relation (it will always be an addition because we rename existing tables)
- call execute()

Update macros

Macros that create things will also need to add those relations to the cache (via execute_model, I think?):
- create_view_as
- create_table_as
- create_archive_table?

The text was updated successfully, but these errors were encountered:

drewbanin · 2018-08-09T17:20:14Z

Really great writeup, thanks @beckjake. Some thoughts:

I noticed that you included "truncate" in the list of cache-altering statements. Truncate will delete all of the rows in a table, but the table will still physically exist. Is there another reason to care about truncate statements here?
We're almost certainly going to care about the types of relations. Most databases care about the difference between drop table public.abc and drop view public.abc, so we'll likely want to have the type of the relation handy in the cache
I don't want to try to get too cute with caching, but I do think we should leverage the properties of the DAG where appropriate. When a relation is drop....cascaded, we can statically determine which relations in the cache will need to be invalidated. Briefly, we would need to drop dependent views (not late-binding) on Redshift and Postgres. I'm not sure if it's a good idea to encode this type of logic into the caching framework, but I think that invalidating whole schemas every time a drop... cascade is run will probably mitigate a lot of the benefit of caching in the first place!
I think it would be unwise to do any sort of sql parsing at all. I think that hooks are totally appropriate for 1) granting access to relations 2) inserting records into an audit table and 3) running maintenance tasks on tables. I think that running any sort of statements that directly affect the dbt graph is an anti-pattern, and it should be ok for caching to throw its hands up in that scenario. I just don't imagine it being feasible at all for us to parse sql, understand the caching implications, and then do intelligent things in 100% of cases, especially given the different syntaxes across warehouses. If statements like this are definitely necessary, then i'd like to provide macros that return the desired SQL and also invoke the correct caching behavior.

This was sort of a first-pass stream of consciousness response. To be sure, there's a lot of details to get right here, but I want to make sure we're thinking about this correctly at the macro-level. Let me know if you disagree with any of this: super happy to discuss!

beckjake · 2018-08-09T17:33:19Z

@drewbanin

Nope, that was a mistake on my part.
I agree. The relation field in those caches is always a Relation, which includes the relation_type field. clear_relation() does take just the identifier, but it's remove only and only operates on the cache, so I don't think it needs the type. All it has to do is iterate over the cache members, find the entry with a matching identifier, and remove it. Unless some databases allow views and tables to have the same name, is that the concern?
Is it possible to ask the database what a cascade would delete? Or heck, even did delete? I do think this is possible though, I'll look into that some more.
I'm fine with letting hooks do whatever.

cmcarthur · 2018-08-09T17:34:40Z

we should leverage the properties of the DAG where appropriate.

Agree. This may inform the structure of the SchemaCache / RelationCache object -- it may end up looking more catalog-y.

I think that running any sort of statements that directly affect the dbt graph is an anti-pattern, and it should be ok for caching to throw its hands up in that scenario.

I agree, but it's important that when caching "throws its hands up," it does so in a way that doesn't cause radical behavior. We have, as a rule, stayed away from SQL parsing to date, but I don't know that we'll be able to continue to do so. If we need to parse SQL to do this right, let's do it.

cmcarthur · 2018-08-09T17:46:19Z

@beckjake I think it's important to write down the pathological cases for our materializations and caching scenarios. Materializations should, as much as possible, avoid leaving the warehouse in an incomplete state. In RDBMSes, this is easy to reason about because everything is transactional -- if the cache says a table exists, but you get an error when trying to ALTER it, you can usually roll back the entire transaction. This isn't the case for the other cloud data warehouses.

If a user of dbt were to DROP SCHEMA ... CASCADE in the middle of their dbt run, I think that dbt would not do anything nasty. It'd just abort mid-run. I suspect that there are other types of queries that could cause real problems, but I don't know that for a fact. It could be that dbt would handle these situations well. Let's enumerate the error cases and only address the bad ones, and maybe we can forgo things like SQL parsing.

drewbanin · 2018-08-09T17:48:24Z

@beckjake to your third point:

Is it possible to ask the database what a cascade would delete? Or heck, even did delete? I do think this is possible though, I'll look into that some more.

I don't think there's a general way to find this info out cross-platform. Or, if there is, it would be no better than just running the actual introspective query to see what still exists!

beckjake · 2018-08-09T17:53:38Z

I guess the only real "risk" is the risk of a query failing in the middle of a dbt run, since drop ... cascade is only going to delete stuff and then we'd be selecting from missing tables or renaming missing tables or something. I think that would be fine, even if you can't roll back, because you could just rebuild all your models, right? It might take a long time is all.

drewbanin · 2018-08-10T19:08:14Z

I wrote this like it was a computer science paper. It's probably needlessly verbose, but I wanted to define a language that we can use to talk about caching. This is info about how we can use 1) the dbt DAG and 2) the properties of modern warehouses to cache relations intelligently:

Caching Strategy

Overview

There are three classes of database operations that are cache-mutative:

Create a schema or relation
Drop a schema or relation
Swap a relation

All “create” operations are “local”; that is: they only mutate the cache for the entity that is being created. Both the “drop” and “swap” classes operations are “nonlocal” in nature. For these nonlocal operations, other entities in the cache may need to be mutated as a result of the nonlocal operations. The exact cache mutation logic is dependent on 1) the type of warehouse in use and 2) the type of relations present in the database.

Creative operations

In the create case, a simple addition to the cache is required. No further cache mutation is required.

Destructive operations

Dropping a relation

Database relations can be “structurally dependent” on other relations. This structural dependency is a function of one relation selecting from another relation. The term “structurally dependent” is used to denote a tighter dependency than the typical ”logical dependency” inherent in dbt.

Logical Dependency

Tables can select from other relations. When this happens, the table and the relation(s) that it selects from are logically dependent. In this scenario, the selecting table maintains its own schema and data, so there is no structural dependency present. Instead, the relationship between a table and it’s parent is a logical dependency.

Structural Dependency

Contrastingly, views can also select from other relations. A view does not maintain its own schema or data — rather, the schema and data that comprise a view are a function of 1) the view query definition and 2) the relations that it selects from. Because the definition of a view is a function of the relations that it selects from, it is said to be “structurally” dependent on its parents. More formally: structurally dependent relationships exist where the definition for one relation is inherently tied to the existence of a separate relation. Crucially, this structural dependency is recursive in nature.

Consider the following example:

+--------------+               +------------------+          +-------------------+
|              +--------------->                  |          |                   |
| Table A      |               |   View B         +---------->    View C         |
+--------------+               +------------------+          +-------------------+

Here, Table A is (by definition) not structurally dependent on any relations.
View B is structurally dependent on Table A.
View C is structurally dependent on Table B, and through recursive dependence, it is also structurally dependent on Table A.

Adapter-specific behavior

Both BigQuery and Snowflake have implemented “views” in a late-binding fashion. Views on these databases are logically dependent, but not structurally dependent.

In Redshift and Postgres, views are structurally dependent. Interestingly, Redshift supports Late Binding Views which are not structurally dependent.

When a relation is drop…cascaded, all of the structurally dependent descendants of the relation should also be marked as deleted in the cache.

Dropping a schema

For cache purposes, dropping a schema is equivalent to dropping all of the relations in the schema, and then dropping the schema itself. When a schema is dropped, all of the relations that structurally depend on relations inside of that schema should also be dropped.

drewbanin · 2018-08-10T19:08:42Z

I didn't actually touch on the renaming piece... need to think about that one a little bit more

beckjake · 2018-08-10T19:50:20Z

Renaming is just a drop + a create from a cache perspective, right?

drewbanin · 2018-08-10T19:52:21Z

@beckjake that's right. I wanted to investigate if there's any sort of difference in how postgres/redshift handles the swap-and-drop if we do it inside of a transaction. I don't think it makes any difference, but I wanted to confirm.

drewbanin · 2018-08-10T21:16:35Z

After chatting with Jake:

The dbt graph is an approximation of the the structural dependency graph inherent in the database. In situations where users use {{ ref() }} outside of a select statement (eg. in a comment (eg. to hoist refs)), then the dbt graph and the database dependency graph might not be exactly equivalent.

Instead, dbt should built a separate graph of structural dependencies by querying the database at the beginning of the run.

beckjake · 2018-08-10T21:17:56Z

It's worth noting that we only have to build the graph for postgres and redshift, and for redshift we can completely ignore late-binding views (since all we really care about is "will my drop ... cascade or alter table ... rename change anything?").

Feature/cache relations (#911)

beckjake · 2018-10-15T15:26:21Z

Implemented in #1025

cmcarthur added the estimate: 32 label Aug 15, 2018

beckjake self-assigned this Aug 23, 2018

drewbanin mentioned this issue Sep 7, 2018

"dbt run" triggers identical information schema query for each model #660

Closed

beckjake added a commit that referenced this issue Oct 12, 2018

Merge pull request #1025 from fishtown-analytics/feature/cache-relations

618dee0

Feature/cache relations (#911)

beckjake mentioned this issue Oct 12, 2018

Feature/cache relations (#911) #1025

Merged

beckjake closed this as completed Oct 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache relations in DBT #911

Cache relations in DBT #911

beckjake commented Aug 8, 2018 •

edited

Loading

drewbanin commented Aug 9, 2018

beckjake commented Aug 9, 2018

cmcarthur commented Aug 9, 2018

cmcarthur commented Aug 9, 2018 •

edited

Loading

drewbanin commented Aug 9, 2018

beckjake commented Aug 9, 2018

drewbanin commented Aug 10, 2018

drewbanin commented Aug 10, 2018

beckjake commented Aug 10, 2018

drewbanin commented Aug 10, 2018 •

edited

Loading

drewbanin commented Aug 10, 2018

beckjake commented Aug 10, 2018

beckjake commented Oct 15, 2018

Cache relations in DBT #911

Cache relations in DBT #911

Comments

beckjake commented Aug 8, 2018 • edited Loading

Relation Caching plan

New concepts

New types/modules

new type RelationsCache

methods

new (internal) type NamespaceCache

attributes

methods

Changes to existing code

Update DefaultAdapter and subclasses

Update macros

drewbanin commented Aug 9, 2018

beckjake commented Aug 9, 2018

cmcarthur commented Aug 9, 2018

cmcarthur commented Aug 9, 2018 • edited Loading

drewbanin commented Aug 9, 2018

beckjake commented Aug 9, 2018

drewbanin commented Aug 10, 2018

Caching Strategy

Overview

Creative operations

Destructive operations

Dropping a relation

Logical Dependency

Structural Dependency

Adapter-specific behavior

Dropping a schema

drewbanin commented Aug 10, 2018

beckjake commented Aug 10, 2018

drewbanin commented Aug 10, 2018 • edited Loading

drewbanin commented Aug 10, 2018

beckjake commented Aug 10, 2018

beckjake commented Oct 15, 2018

beckjake commented Aug 8, 2018 •

edited

Loading

new type `RelationsCache`

new (internal) type `NamespaceCache`

Update `DefaultAdapter` and subclasses

cmcarthur commented Aug 9, 2018 •

edited

Loading

drewbanin commented Aug 10, 2018 •

edited

Loading