
[CT-203] A single show terse call for each database #83

Closed
ajbosco opened this issue Jan 19, 2022 · 8 comments
Labels: enhancement, Stale


ajbosco commented Jan 19, 2022

Describe the feature

Currently, dbt makes a call to Snowflake for every schema. This requires opening a connection and running show terse objects in schema <schema>. For projects with many custom schemas, though, this can add a lot of overhead. I think it would be faster to instead pull all the objects for a database in a single call and then cache the ones for the schemas we need (a rough sketch of the idea follows).
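A minimal sketch of the idea (not the linked commit): one round trip per database, with the rows grouped by schema locally so the cache for each project schema can be filled without further queries. The grouping helper and column handling are illustrative assumptions.

```python
from collections import defaultdict


def list_objects_by_schema(cursor, database: str) -> dict:
    """Run one `show terse objects` per database and bucket the rows by schema."""
    cursor.execute(f"show terse objects in database {database}")
    # Resolve column positions by name rather than relying on column order.
    cols = [c[0].lower() for c in cursor.description]
    name_i = cols.index("name")
    schema_i = cols.index("schema_name")
    kind_i = cols.index("kind")

    by_schema = defaultdict(list)
    for row in cursor.fetchall():
        by_schema[row[schema_i].upper()].append((row[name_i], row[kind_i]))
    return by_schema


# Usage with any DB-API style Snowflake cursor, e.g. from snowflake.connector:
#   grouped = list_objects_by_schema(cursor, "MY_DB")
#   needed = {"ANALYTICS", "STAGING"}          # schemas the project actually uses
#   cacheable = {s: grouped.get(s, []) for s in needed}
```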

Describe alternatives you've considered

Not sure if there are other options to speed this up.

Additional context

I have a working example here: DevotedHealth@f0bc4aa

Who will this benefit?

Projects with many custom schemas.

Are you interested in contributing this feature?

Yes!

ajbosco added the enhancement and triage labels Jan 19, 2022

VersusFacit commented Jan 21, 2022

@ajbosco First off, a hello and a thank you! This is a fun little issue that could easily translate to some time-savings on a dbt invocation, which I as a tools engineer by training appreciate wholeheartedly. It's awesome to see you bringing a working solution to the discussion as well.

So, I dug into your working example and rooted around the code base. I like the idea (assuming I grasp it): Rather than dispatch separate warehouse queries for each schema, request all database information in one query that is then locally trimmed down. That right?

My first "concern" is more a curiosity: It seems from that example that this is in use today? If so, have you ever had issues with caching leading to anomalous runtime quirks? I can envision at least theoretically a naive implementation, either of dbt Snowflake or the feature addition we're discussing, introducing possible state mismatches between caller and warehouse. I figure a rename or a delete would just come back as a generic error, but in theory, querying these separately might have less of those errors at runtime. If not this exact case, wondering if anything in this vein struck you as a concern -- I'm trying to think at scale.

Second, just so I understand what you mean more fully:

cache the ones for the schemas we need

Cached for the duration of a dbt run invocation? Do you expect caching between runs? (I'm guessing no, incidentally -- just reducing ambiguity.)


ajbosco commented Jan 21, 2022

Hi @VersusFacit! We've been testing this change a bit this week in our deployment and have seen good results. This doesn't actually change dbt's behavior with regard to caching, but rather just how it builds the cache.

Currently, we see a call to Snowflake for every schema in our project on every run even when only running a single model, which according to this is the intended behavior. Each call can take 2-10 seconds so this adds up to a minute+ of extra time for each run. The thought is that we can just get all the data we'd get from the multiple calls in a single one and reduce that time.


VersusFacit commented Jan 26, 2022

Hey, just want to apologize for letting this go a bit cold. I'm typing up a big ol' response over the next day or so. Rest assured, the reason is that this is an issue with more background than I certainly knew when I first responded.

I'll get some notes here soon. ⌛


VersusFacit commented Feb 2, 2022

Alright, finally circling back to this 🙇‍♀️ I'll do my best to keep this succinct.

✍️ Background (feel free to skip -- this is for my own benefit and future readers')
dbt caches a subset of all warehouse relations ahead of time, by schema:

  • dbt populates that cache with the adapter-specific list_relations_without_caching method, which returns the relations in a given schema
  • as part of this method, dbt invokes show terse objects in schema <schema> for every project-relevant schema (rough sketch below)
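Roughly what that amounts to (a sketch, not dbt's actual code): one show terse objects round trip per project-relevant schema, each paying its own query and network latency.

```python
def populate_cache_per_schema(cursor, schemas: list[str]) -> dict:
    """Status quo: one `show terse objects in schema ...` call per schema."""
    cache = {}
    for schema in schemas:  # e.g. ["MY_DB.ANALYTICS", "MY_DB.STAGING", ...]
        cursor.execute(f"show terse objects in schema {schema}")
        cache[schema] = cursor.fetchall()
    return cache
```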

Given that dbt caches relations, we can ask: "What is the most efficient way to populate that cache, given the following criteria?"

  • populates the cache at the start of the run
  • benefits the most projects
  • pulls only what dbt needs, without introducing big time delays
  • avoids resorting to information_schema, for all the reasons in the issue you linked

🤔 What now?

Since finding the right fix is a performance-based problem, we need to benchmark!

Options:

  • status quo: one show objects call per relevant schema
  • proposed: one show objects call per relevant database (see the timing sketch after this list)
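To compare the two options empirically, something like the following timing harness could be used. It assumes an existing snowflake.connector cursor; the database and schema names are placeholders, and results will vary with warehouse load and network latency.

```python
import time


def time_query(cursor, sql: str) -> float:
    """Return wall-clock seconds to execute and fully fetch one query."""
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()
    return time.perf_counter() - start


def compare(cursor, database: str, schemas: list[str]) -> None:
    # Status quo: sum of one call per schema.
    per_schema = sum(
        time_query(cursor, f"show terse objects in schema {database}.{schema}")
        for schema in schemas
    )
    # Proposed: a single call for the whole database.
    per_database = time_query(cursor, f"show terse objects in database {database}")
    print(f"status quo (per schema):  {per_schema:.2f}s")
    print(f"proposed   (per database): {per_database:.2f}s")
```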

Key research questions:

  • What is the 'average' character of a Snowflake project? What does a large organization's Snowflake objects list look like?
  • Can the 10k-object limit on show objects be accommodated, sidestepped, or safely ignored?
  • In the average case, what would be the gain from doing this? (You've already added some lovely preliminary findings on this! 🙏 )

Implementation note: Internally, we did discuss permitting both behaviors, perhaps by making the behavior configurable or by trying show objects per database and failing over to show objects per schema.

😵 Summary
Performance issues are hard! The dbt cache is complicated (I certainly am no expert even with some time to research it in depth)! So, for such a feature change, we'll have to tread lightly. It's a great problem, and I'm glad you have a working solution in the meantime for your project. Your solution, assuming it's accommodating your workflow, is great for your use case!

Just beware that, with the current implementation, too many database objects will cause this to fail 😮 You of course may never reach that reality. Just a friendly advisory 😄

jtcohen6 commented

@ajbosco Thanks again for opening this issue, and for linking some sample code in DevotedHealth@f0bc4aa! What you've got here seems really promising.

After some local testing (not scientific benchmarking), with a large randomly generated project (branch to generate, branch of dbt-core with project + simple printing):

  • One show call per database is often (though not always) faster than one show call per schema
  • It would likely be even faster to multi-thread the show calls when there are multiple databases (as the current per-schema logic already does across schemas)

Concerns:

  • show statements have a hard cap of 10k objects. We're much more likely to hit that cap when showing all objects in an entire database, versus in a single schema.
  • If we only need to cache objects from a single schema in a given database, it's faster to use show terse objects in schema than show ... in database

Proposal (a rough sketch follows the list below):

  • When there are multiple schemas in a database, call show ... in database for the entire database. (It's easy-ish to figure this out, based on the SchemaMap returned by _get_cache_schemas.)
  • If show terse objects in database returns 10k records, log a warning and "fall back" to running show terse objects in schema. (If show ... schema still returns 10k records, we should log another warning, something we don't do today.)
  • Use multi-threading for all queries, if available, via dbt.utils.executor
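A rough sketch of that fall-back flow, not the eventual dbt implementation: it uses a plain ThreadPoolExecutor where dbt itself would go through dbt.utils.executor, and SHOW_ROW_CAP, the warnings, and the cursor_factory helper are illustrative assumptions.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

SHOW_ROW_CAP = 10_000  # Snowflake's hard limit on rows returned by SHOW
logger = logging.getLogger(__name__)


def show_objects(cursor, scope: str) -> list:
    # scope is e.g. "database MY_DB" or "schema MY_DB.MY_SCHEMA"
    cursor.execute(f"show terse objects in {scope}")
    return cursor.fetchall()


def cache_database(cursor, database: str, schemas: list[str]) -> list:
    if len(schemas) == 1:
        # Only one schema needed: the narrower per-schema call is cheaper.
        return show_objects(cursor, f"schema {database}.{schemas[0]}")

    rows = show_objects(cursor, f"database {database}")
    if len(rows) < SHOW_ROW_CAP:
        return rows

    # Hit the cap: warn and fall back to one call per schema.
    logger.warning("show objects hit the 10k cap for %s; falling back per schema", database)
    rows = []
    for schema in schemas:
        schema_rows = show_objects(cursor, f"schema {database}.{schema}")
        if len(schema_rows) >= SHOW_ROW_CAP:
            logger.warning("show objects hit the 10k cap for %s.%s", database, schema)
        rows.extend(schema_rows)
    return rows


def cache_all(cursor_factory, schema_map: dict) -> dict:
    # schema_map: {database: [schemas]}; one cursor per database task, since
    # Snowflake cursors shouldn't be shared across threads.
    with ThreadPoolExecutor() as pool:
        futures = {
            db: pool.submit(cache_database, cursor_factory(), db, schemas)
            for db, schemas in schema_map.items()
        }
    return {db: fut.result() for db, fut in futures.items()}
```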

jtcohen6 removed their assignment Mar 18, 2022
joshuataylor commented

This would make a huge difference for us, mostly when iterating locally against Snowflake.

Our instance is 250-300ms away from me (Perth, AU -> US East), so collapsing this into a single call would save real time. We have 26 (and growing) schemas with dbt, as we found this a great way to organise data.

github-actions commented

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

