Evaluate schema discovery queries that happen at startup #2127

cmcarthur · 2020-02-12T17:26:35Z

dbt runs a bunch of introspective queries at startup:

finding the schemas that exist in the database
finding the tables in those schemas
creating schemas if they don't exist

On some databases (postgres/redshift) these information schemas are pretty quick. On other databases (snowflake, spark) these queries can (and do) run for 10s of seconds! This means that the typical flow for running dbt looks like:

$ dbt run
<wait a not-insignificant number of seconds>
12:00:00 | Running with 8 threads
....

These queries are unavoidable. We experimented with using faster alternatives to the information schema (#1877) and ultimately, we must reckon with the reality that hitting the information schema can be slow. The next best thing that we can do is understand and optimize when these queries happen in order to provide the best possible experience for users running dbt.

The text was updated successfully, but these errors were encountered:

beckjake · 2020-02-24T19:01:35Z

There are really only two metadata queries that occur before a run starts on snowflake:

we list all the schemas from the information schema. We do this to avoid costly unnecessary 'create schema if not exists ...' calls.
- One alternative here is to always try to create the schema if it doesn't exist. Very unscientifically: against DBT_TEST, the list_schemas call took 2.21s. A create schema if not exists for an existing schema took 2.24s. If that's at least close to normal, changing this will hurt performance in the >1 schemas case.
- We could also try to use show schemas in {database}. The downsides are:
  1. You have to account for the >10k schemas case and fall back (probably fine!)
  2. You have to account for the database case sensitivity, likely with an extra show databases call
  3. It seems like you have to use table(result_scan(last_query_id())) or similar to get useful data out of show commands, which doesn't seem substantially faster than information_schema. Maybe I'm way off here, if I am this might be worth doing.
- We could parallelize this, using one thread per database.
We fill the cache with information about relations, by querying information_schema.tables. This area is probably the only hope for improving performance on this front.
- Previous work on making catalogs faster didn't seem to indicate that describe or show had any significant performance improvements over querying the information_schema, and has similar limits around 10k. I'm happy to create a branch that does it this way and we can try it out on some large snowflake environments and see what happens.could create some frustrating performance side-effects.
- We could do this up front, but parallelize this part across all the databases we're querying. This will probably get slower in the one-db case (paying the threading overhead costs) and faster everywhere else.
- Now that we fully lazy-load our connections, it's possible that we could lazy-fill the cache on a schema by schema basis for non-postgres databases (postgres/redshift need to create the metadata links for drop ... cascades into views, so I think they'd still have to do this up-front). I don't think that's the worst idea ever, though there will be some thread coordination issues we'll have to consider, bottlenecking there could easily tank performance. This idea is the most likely to cause weird annoying deadlock-y bugs and hard-to-predict performance behavior. The graph queue would probably have to get smarter about ordering and schedule tasks in the same database more carefully.

dbt also makes a number of unnecessary BEGIN and COMMIT calls, each of which takes ~0.5s on my machine. We can save a couple seconds by skipping those, there seems to be no good reason for these queries to run in a transaction, beyond dbt's innate eagerness to wrap everything in transactions.

…tartup Use threadpools for filling the cache and listing schemas (#2127)

drewbanin changed the title ~~Deeply evaluate schema discovery queries that happen at startup, decide which ones don't need to happen~~ Evaluate schema discovery queries that happen at startup Feb 20, 2020

beckjake mentioned this issue Feb 25, 2020

Use threadpools for filling the cache and listing schemas (#2127) #2157

Merged

4 tasks

beckjake closed this as completed in #2157 Feb 27, 2020

beckjake added a commit that referenced this issue Feb 27, 2020

Merge pull request #2157 from fishtown-analytics/feature/threadpool-s…

5d4d4f3

…tartup Use threadpools for filling the cache and listing schemas (#2127)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate schema discovery queries that happen at startup #2127

Evaluate schema discovery queries that happen at startup #2127

cmcarthur commented Feb 12, 2020 •

edited by drewbanin

Loading

beckjake commented Feb 24, 2020

Evaluate schema discovery queries that happen at startup #2127

Evaluate schema discovery queries that happen at startup #2127

Comments

cmcarthur commented Feb 12, 2020 • edited by drewbanin Loading

beckjake commented Feb 24, 2020

cmcarthur commented Feb 12, 2020 •

edited by drewbanin

Loading