
[CT-203] A single show terse call for each database #83

Closed
ajbosco opened this issue Jan 19, 2022 · 8 comments
Labels: enhancement, Stale


ajbosco commented Jan 19, 2022

Describe the feature

Currently, dbt makes a call to Snowflake for every schema. This requires opening a connection and running show terse objects in schema <schema>. For projects with many custom schemas, though, this can add a lot of overhead. I think it would be faster to instead pull all the objects for a database in a single call and then cache the ones for the schemas we need (a rough sketch of the idea follows).
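A minimal sketch of the idea (not the linked commit): one round trip per database, with the rows grouped by schema locally so the cache for each project schema can be filled without further queries. The grouping helper and column handling are illustrative assumptions.

```python
from collections import defaultdict


def list_objects_by_schema(cursor, database: str) -> dict:
    """Run one `show terse objects` per database and bucket the rows by schema."""
    cursor.execute(f"show terse objects in database {database}")
    # Resolve column positions by name rather than relying on column order.
    cols = [c[0].lower() for c in cursor.description]
    name_i = cols.index("name")
    schema_i = cols.index("schema_name")
    kind_i = cols.index("kind")

    by_schema = defaultdict(list)
    for row in cursor.fetchall():
        by_schema[row[schema_i].upper()].append((row[name_i], row[kind_i]))
    return by_schema


# Usage with any DB-API style Snowflake cursor, e.g. from snowflake.connector:
#   grouped = list_objects_by_schema(cursor, "MY_DB")
#   needed = {"ANALYTICS", "STAGING"}          # schemas the project actually uses
#   cacheable = {s: grouped.get(s, []) for s in needed}
```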

Describe alternatives you've considered

Not sure if there are other options to speed this up.

Additional context

I have a working example here: DevotedHealth@f0bc4aa

Who will this benefit?

Projects with many custom schemas.

Are you interested in contributing this feature?

Yes!

ajbosco added the enhancement and triage labels Jan 19, 2022

VersusFacit commented Jan 21, 2022

@ajbosco First off, a hello and a thank you! This is a fun little issue that could easily translate to some time-savings on a dbt invocation, which I as a tools engineer by training appreciate wholeheartedly. It's awesome to see you bringing a working solution to the discussion as well.

So, I dug into your working example and rooted around the code base. I like the idea (assuming I grasp it): Rather than dispatch separate warehouse queries for each schema, request all database information in one query that is then locally trimmed down. That right?

My first "concern" is more a curiosity: It seems from that example that this is in use today? If so, have you ever had issues with caching leading to anomalous runtime quirks? I can envision at least theoretically a naive implementation, either of dbt Snowflake or the feature addition we're discussing, introducing possible state mismatches between caller and warehouse. I figure a rename or a delete would just come back as a generic error, but in theory, querying these separately might have less of those errors at runtime. If not this exact case, wondering if anything in this vein struck you as a concern -- I'm trying to think at scale.

Second, just so I understand what you mean more fully:

cache the ones for the schemas we need

Cached for the duration of a dbt run invocation? Do you expect caching between runs? (I'm guessing no, incidentally -- just reducing ambiguity.)


ajbosco commented Jan 21, 2022

Hi @VersusFacit! We've been testing this change a bit this week in our deployment and have seen good results. This doesn't actually change dbt's behavior with regard to caching, but rather just how it builds the cache.

Currently, we see a call to Snowflake for every schema in our project on every run even when only running a single model, which according to this is the intended behavior. Each call can take 2-10 seconds so this adds up to a minute+ of extra time for each run. The thought is that we can just get all the data we'd get from the multiple calls in a single one and reduce that time.


VersusFacit commented Jan 26, 2022

Hey, just want to apologize for letting this go a bit cold. I'm typing up a big ol' response over the next day or so. Rest assured, the reason is that this is an issue with more background than I certainly knew when I first responded.

I'll get some notes here soon. ⌛


VersusFacit commented Feb 2, 2022

Alright, finally circling back to this 🙇‍♀️ I'll do my best to keep this succinct.

✍️ Background (feel free to skip -- this is for my own benefit and future readers')
dbt caches a subset of all warehouse relations ahead of time, by schema:

  • dbt populates that cache with the adapter-specific list_relations_without_caching method, which returns the relations in a given schema
  • as part of this method, dbt invokes show terse objects in schema <schema> for every project-relevant schema (rough sketch below)
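Roughly what that amounts to (a sketch, not dbt's actual code): one show terse objects round trip per project-relevant schema, each paying its own query and network latency.

```python
def populate_cache_per_schema(cursor, schemas: list[str]) -> dict:
    """Status quo: one `show terse objects in schema ...` call per schema."""
    cache = {}
    for schema in schemas:  # e.g. ["MY_DB.ANALYTICS", "MY_DB.STAGING", ...]
        cursor.execute(f"show terse objects in schema {schema}")
        cache[schema] = cursor.fetchall()
    return cache
```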

Given that dbt caches relations, we can ask: "What is the most efficient way to populate that cache, given the following criteria?"

  • populates the cache at the start of the run
  • benefits the most projects
  • pulls only what dbt needs, without introducing big time delays
  • avoids resorting to information_schema, for all the reasons in the issue you linked

🤔 What now?

Since finding the right fix is a performance-based problem, we need to benchmark!

Options:

  • status quo: one show objects call per relevant schema
  • proposed: one show objects call per relevant database (see the timing sketch after this list)
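To compare the two options empirically, something like the following timing harness could be used. It assumes an existing snowflake.connector cursor; the database and schema names are placeholders, and results will vary with warehouse load and network latency.

```python
import time


def time_query(cursor, sql: str) -> float:
    """Return wall-clock seconds to execute and fully fetch one query."""
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()
    return time.perf_counter() - start


def compare(cursor, database: str, schemas: list[str]) -> None:
    # Status quo: sum of one call per schema.
    per_schema = sum(
        time_query(cursor, f"show terse objects in schema {database}.{schema}")
        for schema in schemas
    )
    # Proposed: a single call for the whole database.
    per_database = time_query(cursor, f"show terse objects in database {database}")
    print(f"status quo (per schema):  {per_schema:.2f}s")
    print(f"proposed   (per database): {per_database:.2f}s")
```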

Key research questions:

  • What is the 'average' character of a Snowflake project? What does a large organization's Snowflake objects list look like?
  • Can the 10k-object limit on show objects be accommodated, sidestepped, or safely ignored?
  • In the average case, what would be the gain from doing this? (You've already added some lovely preliminary findings on this! 🙏 )

Implementation note: Internally, we did discuss permitting both behaviors, perhaps by making the behavior configurable or by trying show objects per database and failing over to show objects per schema.

😵 Summary
Performance issues are hard! The dbt cache is complicated (I certainly am no expert even with some time to research it in depth)! So, for such a feature change, we'll have to tread lightly. It's a great problem, and I'm glad you have a working solution in the meantime for your project. Your solution, assuming it's accommodating your workflow, is great for your use case!

Just beware that, with the current implementation, too many database objects will cause this to fail 😮 You of course may never reach that reality. Just a friendly advisory 😄

jtcohen6 commented

@ajbosco Thanks again for opening this issue, and for linking some sample code in DevotedHealth@f0bc4aa! What you've got here seems really promising.

After some local testing (not scientific benchmarking), with a large randomly generated project (branch to generate, branch of dbt-core with project + simple printing):

  • One show call per database is often (though not always) faster than one show call per schema
  • It would likely be even faster to multi-thread the show calls when there are multiple databases (as the current per-schema logic already does across schemas)

Concerns:

  • show statements have a hard cap of 10k objects. We're much more likely to hit that cap when showing all objects in an entire database, versus in a single schema.
  • If we only need to cache objects from a single schema in a given database, it's faster to use show terse objects in schema than show ... in database

Proposal (a rough sketch follows the list below):

  • When there are multiple schemas in a database, call show ... in database for the entire database. (It's easy-ish to figure this out, based on the SchemaMap returned by _get_cache_schemas.)
  • If show terse objects in database returns 10k records, log a warning and "fall back" to running show terse objects in schema. (If show ... schema still returns 10k records, we should log another warning, something we don't do today.)
  • Use multi-threading for all queries, if available, via dbt.utils.executor
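A rough sketch of that fall-back flow, not the eventual dbt implementation: it uses a plain ThreadPoolExecutor where dbt itself would go through dbt.utils.executor, and SHOW_ROW_CAP, the warnings, and the cursor_factory helper are illustrative assumptions.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

SHOW_ROW_CAP = 10_000  # Snowflake's hard limit on rows returned by SHOW
logger = logging.getLogger(__name__)


def show_objects(cursor, scope: str) -> list:
    # scope is e.g. "database MY_DB" or "schema MY_DB.MY_SCHEMA"
    cursor.execute(f"show terse objects in {scope}")
    return cursor.fetchall()


def cache_database(cursor, database: str, schemas: list[str]) -> list:
    if len(schemas) == 1:
        # Only one schema needed: the narrower per-schema call is cheaper.
        return show_objects(cursor, f"schema {database}.{schemas[0]}")

    rows = show_objects(cursor, f"database {database}")
    if len(rows) < SHOW_ROW_CAP:
        return rows

    # Hit the cap: warn and fall back to one call per schema.
    logger.warning("show objects hit the 10k cap for %s; falling back per schema", database)
    rows = []
    for schema in schemas:
        schema_rows = show_objects(cursor, f"schema {database}.{schema}")
        if len(schema_rows) >= SHOW_ROW_CAP:
            logger.warning("show objects hit the 10k cap for %s.%s", database, schema)
        rows.extend(schema_rows)
    return rows


def cache_all(cursor_factory, schema_map: dict) -> dict:
    # schema_map: {database: [schemas]}; one cursor per database task, since
    # Snowflake cursors shouldn't be shared across threads.
    with ThreadPoolExecutor() as pool:
        futures = {
            db: pool.submit(cache_database, cursor_factory(), db, schemas)
            for db, schemas in schema_map.items()
        }
    return {db: fut.result() for db, fut in futures.items()}
```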

jtcohen6 removed their assignment Mar 18, 2022
joshuataylor commented

This would make a huge difference for us, mostly when iterating locally against Snowflake.

Our instance is 250-300ms away from me (Perth, AU -> US East), so collapsing this into a single call would save real time. We have 26 (and growing) schemas with dbt, as we found this a great way to organise data.

github-actions commented

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

