[CT-203] A single show terse call for each database #83
@ajbosco First off, a hello and a thank you! This is a fun little issue that could easily translate into some time savings on a dbt invocation, which I, as a tools engineer by training, appreciate wholeheartedly. It's awesome to see you bringing a working solution to the discussion as well. So, I dug into your working example and rooted around the code base.
I like the idea (assuming I grasp it): rather than dispatch separate warehouse queries for each schema, request all database information in one query that is then trimmed down locally. Is that right?
My first "concern" is more a curiosity: it seems from that example that this is in use today? If so, have you ever had issues with caching leading to anomalous runtime quirks? I can at least theoretically envision a naive implementation, either in dbt-snowflake or in the feature addition we're discussing, introducing state mismatches between caller and warehouse. I figure a rename or a delete would just come back as a generic error, but in theory, querying these separately might produce fewer of those errors at runtime. If not this exact case, I'm wondering if anything in this vein struck you as a concern -- I'm trying to think at scale.
Second, just so I understand what you mean more fully: cached for the duration of a dbt run invocation? Do you expect caching between runs? (I'm guessing no, incidentally, but just reducing ambiguity.)
Hi @VersusFacit! We've been testing this change a bit this week in our deployment and have seen good results. This doesn't actually change dbt's behavior with regard to caching, just how it builds the cache. Currently, we see a call to Snowflake for every schema in our project on every run, even when only running a single model, which according to this is the intended behavior. Each call can take 2-10 seconds, so this adds up to a minute or more of extra time for each run. The thought is that we can get all the data we'd get from the multiple calls in a single call and reduce that time.
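A minimal sketch of that idea, assuming a generic `execute` helper that runs a query on an open Snowflake connection and returns rows; neither the function nor the names below are dbt-snowflake internals:

```python
# Hypothetical sketch of the single-call cache build: one round trip for
# the whole database instead of one per schema, filtered locally.

def build_relation_cache(execute, database, needed_schemas):
    """Fetch every object in `database` once, keep only the rows for
    the schemas this run actually needs."""
    rows = execute(f"show terse objects in database {database}")

    wanted = {s.upper() for s in needed_schemas}
    cache = {}
    # Assumed row shape: (created_on, name, kind, database_name, schema_name)
    for _created_on, name, kind, _database_name, schema_name in rows:
        if schema_name.upper() in wanted:
            cache.setdefault(schema_name, []).append((name, kind))
    return cache
```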
Hey, just want to apologize for letting this go a bit cold. I'm typing up a big ol' response over the next day or so. Rest assured, the reason is that this issue has more background than I certainly knew when I first responded. I'll get some notes here soon. ⌛
Alright, finally circling back to this 🙇‍♀️ I'll do my best to keep this succinct. ✍️
Background (feel free to skip -- this is for my own benefit and future readers')
Given that dbt caches relations, we can ask, "What is the most efficient way to populate that cache?" subject to the following criteria:
🤔 What now?
Since finding the right fix is a performance problem, we need to benchmark! Options:
Key research questions:
Implementation note: Internally, we did discuss permitting both behaviors, perhaps by making the behavior configurable or by trying 😵
Summary
Just beware that, with the current implementation, too many database objects will cause this to fail 😮 You of course may never reach this reality. Just a friendly advisory 😄
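To make that advisory concrete, here is one sketch of a guarded, "both behaviors" implementation. It assumes Snowflake's cap of 10,000 rows on SHOW output; the constant, the `execute` helper, and the function below are illustrative, not dbt's actual API:

```python
# Sketch: try one database-wide SHOW, fall back to per-schema calls if the
# result may have hit the server-side row cap and been truncated.

SHOW_ROW_LIMIT = 10_000  # assumed cap on rows a SHOW command returns

def list_objects(execute, database, schemas):
    rows = execute(f"show terse objects in database {database}")
    if len(rows) < SHOW_ROW_LIMIT:
        return rows  # the single call saw everything

    # Possibly truncated: fall back to today's per-schema behavior.
    rows = []
    for schema in schemas:
        rows.extend(execute(f"show terse objects in schema {database}.{schema}"))
    return rows
```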
@ajbosco Thanks again for opening this issue, and for linking some sample code in DevotedHealth@f0bc4aa! What you've got here seems really promising. After some local testing (not scientific benchmarking) with a large randomly generated project (branch to generate, branch of dbt-core with project + simple printing):
Concerns:
Proposal:
This would make a huge difference for us, mostly when iterating locally against Snowflake. Our instance is 250-300 ms away from me (Perth, AU -> US East), so collapsing this into a single call would save a lot. We have 26 (and growing) schemas with dbt, as we found this a great way to organise data.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Describe the feature
Currently, dbt makes a call to Snowflake for every schema. This requires opening a connection and running `show terse objects in schema <schema>`. For projects with many custom schemas, though, this can add a lot of overhead. I think it would be faster to instead pull all the objects for a database in one call and then cache only the ones for the schemas we need (see the sketch below).
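For concreteness, a rough before/after of the two query patterns; the `execute` helper and the `schema_name` row attribute are assumptions for illustration, and the linked commit below is the real implementation:

```python
# Current behavior: one warehouse round trip per schema.
def list_objects_per_schema(execute, database, schemas):
    rows = []
    for schema in schemas:
        rows.extend(execute(f"show terse objects in schema {database}.{schema}"))
    return rows

# Proposed behavior: one round trip per database, filtered locally.
def list_objects_per_database(execute, database, schemas):
    wanted = {s.upper() for s in schemas}
    all_rows = execute(f"show terse objects in database {database}")
    # Assumes each row exposes its schema under a `schema_name` attribute.
    return [row for row in all_rows if row.schema_name.upper() in wanted]
```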
Describe alternatives you've considered
Not sure if there are other options to speed this up.
Additional context
I have a working example here: DevotedHealth@f0bc4aa
Who will this benefit?
Projects with many custom schemas.
Are you interested in contributing this feature?
Yes!