Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/cache relations (#911) #1025

Merged
merged 31 commits into from
Oct 12, 2018
Merged

Conversation

beckjake
Copy link
Contributor

@beckjake beckjake commented Sep 25, 2018

Relation Caching!

Perform caching on relations within a single dbt run.

Currently only the list_relations and get_relation adapter methods read from the cache, and set_relations_cache, drop_relation, rename_relation, and cache_new_relation write to the cache.

This PR also removes list_relations from the list of wrapped (and therefore supported) adapter methods - it's no longer available in jinja code. This PR makes list_relations a bit misleading, as discussed above - it's really only valid for checking the existence of downstream models. See caveats for details...

On a somewhat disappointing note, the tests did not get faster, but on a positive note drew tested it with a large real project and saw improvements (~25%, I think).

new flags/configuration

You can disable using the cache with --bypass-cache, a new flag.
You can enable extremely verbose cache logging with --log-cache-events, a new hidden flag. Integration tests turn this flag on.

Caveats:

The cache is not a real cache and isn't reliable for everything, in particular:

  • Models that fail at create time will still be in the relations list
    • this is because the cache doesn't know about transactions, and the cache add happens before the actual create statement runs
    • this should not matter due to dbt's ordering/dependency guarantees
  • when the current dbt run itself creates a view based on a table, the cache will not know that dropping the table will cascade to the view
    • adding this information reliably would be very hard
    • this also should not matter due to dbt's ordering/dependency guarantees

Given dbt's guarantees, the cache is valid for all relations downstream from the currently-executing model, however models that are currently in an error state and upstream models may be incorrect (either in the cache incorrectly or removed from the cache incorrectly).

Bounus

You can now examine the generated template code as you step through templates in pdb/ipdb. It's not the most intuitive code to step through, but it can help with tracking down issues triggered inside templates

Copy link
Member

@cmcarthur cmcarthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beckjake i'm still working on reviewing this but i'm just going to post my comments to start the conversation. i want to take some extra time to understand what's happening in the cache class, and also i want to think a little more deeply about how the relations cache / get relations relates to the catalog, especially in how we implement case-insensitive schema/table comparison logic in multiple places now. but i don't want to block for another day or two on me thinking about that.

schema, identifier, relations_list,
model_name)
table = self.get_bq_table(schema, identifier)
return self.bq_table_to_relation(table)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach.

Can you make get_bq_table and bq_table_to_relation clearly part of the private API of this adapter?

node.schema.lower()
for node in manifest.nodes.values()
})
schemas = frozenset(s.lower() for s in manifest.get_used_schemas())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice


def drop_relation(self, relation, model_name=None):
self.cache.drop(schema=relation.schema, identifier=relation.identifier)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want if dbt.flags.USE_CACHE around this, no?

def get_relation(self, schema, identifier, model_name=None):
relations_list = self.list_relations(schema, model_name)

matches = self._make_match(relations_list, schema, identifier)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had imagined this would go directly to the cache, skipping list_relations since we are planning to deprecate that. I guess this is functionally the same, but is also a little confusing. I think it'd be cleaner to hit the cache here directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where USE_CACHE is false, or the schema is not in the cache, that won't work. We can look it up in the cache first and fall back to list_relations, if you prefer that, but going through list_relations is unavoidable to some degree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, you are right. Thanks.

def _link_cached_relations(self, manifest, schemas):
# now set up any links
try:
table = self.run_operation(manifest, GET_RELATIONS_OPERATION_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is surprisingly easy. love it.

try:
table = self.run_operation(manifest, GET_RELATIONS_OPERATION_NAME)
# avoid a rollback when releasing the connection
self.commit_if_has_connection(GET_RELATIONS_OPERATION_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a rollback should work here, and i think it's less error-prone? just in case someone customizes the get_relations_data operation to actually modify the warehouse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, this is from when I was trying to have the cache handle rollbacks (a bad idea!) and I wanted to minimize the number of them.

table = self._relations_filter_table(table, schemas)

for (refed_schema, refed_name, dep_schema, dep_name) in table:
self.cache.add_link(dep_schema, dep_name, refed_schema, refed_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to myself to come back here and look at this again

'--log-cache-events',
action='store_true',
help=argparse.SUPPRESS,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if this would be better implemented as a --trace flag or something. i'm sure there are other places where we'd like to log a lot more info, e.g. connection pool management

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe some sort of --log=dbt.cache, where log is a repeatable argument that enables log propagation for the given package? Usually when I want more granular logging, one thing I don't want is more granular logging everywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm maybe @drewbanin would have something more useful to say here. My specific concern is having proliferating set of flags related to logging specific event types. Do you have a use case in mind for --log=dbt.cache? Is that for ease of development?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This logging is really only useful for debugging narrow cache-related issues. I think --log=dbt.cache and stuff like it would probably have to be a whole new PR, the way we currently set up logging doesn't really play nice with that structure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the cache logging here is way too verbose to be useful in the default case. I like the idea of turning on logs per-module, but that seems more useful for developers of dbt than users of dbt itself. We could also make it a config in profiles.yml I suppose? Like:

config:
  logging: 
    modules: ['dbt.cache', 'dbt.whatever']
    ...

I imagine there's other logging things to configure too. Regardless, don't know that we need to implement it in this PR.

I agree that --log-cache-events feels weird, but since it's set to "suppressed", I feel great about removing it in the future if we do something more comprehensive around logging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I don't think anyone should ever be passing --log-cache-events in production, unless maybe we ask someone to do so as part of tracking down a cache consistency issue. It's nice to have it for integration tests though, I've already tracked down an intermittent cache bug due to the extra output.

self.schema = schema
self.identifier = identifier
self.referenced_by = {}
self.inner = inner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a situation where self.schema would not be equivalent to self.inner.schema, and self.identifier would not be equivalent to self.inner.identifier? seems like they are redundant

schema=schema,
identifier=identifier,
inner=inner
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to the previous comment -- couldn't the api here be simply def add(self, relation)

relation = self.relations.pop(old_key)

# change the old_relation's name and schema to the new relation's
relation.rename(new_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you want to grab the re-entrant lock around lines 377-380 to make the rename appear atomic across threads

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAIT i am confusing cache.rename() with cachedrelation.rename(). this is perfect actually, you're already locking around this whole fn.

"""Clear the cache"""
with self.lock:
self.relations.clear()
self.schemas.clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, I think you did a nice job with this module. asks:

  • a lot of the comments repeat what is in the docstrings. can you give this a once over and clean some of those up?
  • these classes contain public, private, and unit-test-only APIs. as a general rule, I don't like providing functions for unit tests only as I think it complicates the APIs. but the functions here used for unit tests only are marked as such so idc so much. can you just give this a once over and make sure that the public and private APIs are clearly designataed as such?
  • it seems like there is a LOT of redundant logging that was probably useful during development, but should perhaps be removed now. can you give these cache logger debug calls a once over?
  • finally these APIs work with both (schema,identifier) pairs and relations. i'd prefer to use relations where possible. the best example here that i can see is rename() -- we specifically switched adapter.rename to use relations instead of (schema,identifier) pairs, so it seems undesirable to me to have the cache class use old_schema,old_identifier,new_schema,new_identifier as arguments

# put ourselves in the cache using the 'lazycache' method
linecache.cache[filename] = (lambda: source,)

return super(MacroFuzzEnvironment, self)._compile(source, filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd like to talk to you about this for 5min. i understand the idea on a basic level but am not familiar with the python internals here

if self.inner:
# Relations store this stuff inside their `path` dict. But they
# also store a table_name, and usually use it in their .render(),
# so we need to update that as well. It doesn't appear that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_name used to conditionally be name + '__dbt_tmp, but we've since removed that, so now table_name == identifier I believe. One thing to watch out for here might be ephemeral models... I need to do more digging to see if they're a relevant concern here, but wanted to surface it.

def cache_new_relation(self, relation):
"""Cache a new relation in dbt. It will show up in `list relations`."""
if relation is None:
dbt.exceptions.raise_compiler_error()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a small message here?

dbt.exceptions.raise_compiler_error()
if dbt.flags.USE_CACHE:
self.cache.add(
schema=relation.schema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Connor indicated this above, but i think it would be wise to operate at a higher level of abstraction here. In the future, we're probably going to make it possible to configure the database (or project) that models get rendered into, and I imagine we won't want to go back and refactor caching when that change occurs.

Would it make sense to just pass in a relation here, and somehow make the Relation responsible for reporting it's own db/schema/identifier to the cache?

@@ -216,6 +246,13 @@ def rename(self, schema, from_name, to_name, model_name=None):

def rename_relation(self, from_relation, to_relation,
model_name=None):
if dbt.flags.USE_CACHE:
self.cache.rename(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, I think this should just operate on relations, not their schemas/identifiers

)
return False
else:
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method only intended to return if a schema is cached? I imagined that the else branch would check if model_name is present in the cache from looking at the signature and how it's used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache only tracks 'in' status on a schema level, it's impossible to know if an entry is unknown to the cache or actually does not exist. The model_name bit is just for logging.

I'll rename the method to try to communicate that better.

inner=relation
)
self._link_cached_relations(manifest, schemas)
# it's possible that there were no relations in some schemas. We want
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good catch

referenced_class.schema != dependent_class.schema)
)

select
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure if this is significant, but is there any chance that these relationships can be duplicated? Like if you join a table to itself in a view:

-- models/downstream_view.sql

select t1.name, t2.name
from some_table t1
join some_table t2 on t1.parent_id = t2.child_id

Does that create two different entries in the internal relationships table between downstream_view and some_table? Regardless, might be worth distinct-ing the results here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, you're grouping by all four so that will distinct the records

@beckjake beckjake force-pushed the feature/cache-relations branch from 5c34a3c to fb7dcab Compare October 4, 2018 20:39
@cmcarthur
Copy link
Member

cmcarthur commented Oct 8, 2018

this is approved contigent on:

  • @drewbanin 's approval in general
  • also his specific approval on adding the new command line option to control cache-related debug logging

@drewbanin
Copy link
Contributor

I would really like to spend more time testing this before we merge! Are other things blocking on this PR? I've been meaning to stress test it for a while now, and will hopefully have an opportunity to dig into it in the next day or two

@beckjake
Copy link
Contributor Author

beckjake commented Oct 8, 2018

I have a branch that handles #1035 and #963 that's based off this branch, and most of my other cards have to do with adapters as well, so I was going to base them off that branch.

@drewbanin
Copy link
Contributor

It looks to me like caching fails hard when views select from tables in schema that aren't operated by dbt. A view model like:

select * from public.event -- this is a simple table I manually created

will succeed the first time around, and then the second run of dbt will fail with:

Cache inconsistency detected: in add_link, referenced link key ReferenceKey(schema='public', identifier='event') not in cache!

When views reference relations defined outside of any dbt schemas, I think caching should just ignore them, right? In this case, public would be any source data schema, so this would be a really common paradigm, and one that we should add good tests for. I will say that --bypass-cache seems to be working though :)

I'd like to run our Internal Analytics project against this branch too, but there's an unrelated bug that's preventing me from doing that. I'll open a separate issue.

@beckjake
Copy link
Contributor Author

beckjake commented Oct 9, 2018

Ooh, nice bug. I think that means we just need to change how add_link works to where if the referenced relation's schema is not in the cache, we just continue. We really only care if we're linking to a table we control.

@beckjake beckjake force-pushed the feature/cache-relations branch from 868032c to c85cb43 Compare October 9, 2018 14:43
@beckjake
Copy link
Contributor Author

beckjake commented Oct 9, 2018

Ok, fixed, and I added a test that exposes the problem (and any similar issues around external references)

@drewbanin
Copy link
Contributor

Cool! Let's get #1048 merged, then I'll be able to smoke test this with a couple of redshift projects.

@drewbanin
Copy link
Contributor

Ok. Let's merge this. Once it's in dev/guion-bluford, we'll be able to get many more other folks to help out with testing too. I'm really excited about this!

@beckjake beckjake merged commit 618dee0 into dev/guion-bluford Oct 12, 2018
@beckjake beckjake deleted the feature/cache-relations branch October 12, 2018 21:24
@beckjake
Copy link
Contributor Author

Fixes #911

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants