
Feature: faster snowflake catalogs (#2009) #2037

Merged
merged 4 commits into dev/barbara-gittings from feature/faster-snowflake-catalogs on Feb 6, 2020

Conversation

@beckjake (Contributor) commented Jan 9, 2020

(Tentatively) fix #2009

  • Thread catalog queries across information schemas
  • The get_catalog macro now takes an information schema and a list of schemas in that information schema
  • Filtering happens in the database instead of in Python

To be honest, I'm not totally sure about the BigQuery impact of these changes :( The BigQuery code does some funky stuff to handle the fact that the information schema is sometimes per-schema on that database, and I could easily have screwed it up, in which case we'd issue too many queries.
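As a rough sketch of the approach (illustrative only — schema_map maps each information schema to its set of schemas, and _get_one_catalog stands in for the per-information-schema query; neither name is guaranteed to match the final code exactly):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def build_catalog(adapter, schema_map, manifest, max_workers=4):
        # Run one catalog query per information schema concurrently,
        # rather than a single query unioned across every schema.
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(adapter._get_one_catalog, info, schemas, manifest)
                for info, schemas in schema_map.items()
                if len(schemas) > 0
            ]
            # Each result is one catalog table; the caller merges them.
            return [future.result() for future in as_completed(futures)]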

@beckjake requested a review from drewbanin January 9, 2020 21:11
@cla-bot added the cla:yes label Jan 9, 2020
@beckjake force-pushed the feature/faster-snowflake-catalogs branch from 3f09b30 to c6f4434 on January 9, 2020 21:18
and sch.nspname not like 'pg\_%' -- avoid postgres system schemas, '_' is a wildcard so escape it
where (
{%- for schema in schemas -%}
sch.nspname = '{{ schema }}'{%- if not loop.last %} or {% endif -%}
Contributor

should these predicates be case-insensitive on pg/redshift? I think that would mirror the existing behavior

join columns using ("table_database", "table_schema", "table_name")
where (
{%- for schema in schemas -%}
"table_schema" = '{{ schema }}'{%- if not loop.last %} or {% endif -%}
Contributor

same here - we'd want to make this case-insensitive, right?

executor.submit(self._get_one_catalog, info, schemas, manifest)
for info, schemas in schema_map.items() if len(schemas) > 0
]
for future in as_completed(futures):
Contributor

what would happen here if one of these futures fails (eg. with a database error)?

Contributor Author

Well, this is my expectation:

  • future.result() will raise the DatabaseException
  • the exception will go through get_catalog, etc
  • eventually dbt will hit the context manager that cancels connections
  • ultimately dbt exits with an error about the DatabaseException

If we wanted, we could change it to check if the result was an exception, and if so store it until the end so all the threads are done. Then we could just raise the first one, or something.
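A rough sketch of that alternative (illustrative only, not the code in this PR): drain every future, hold on to any exceptions, and only raise once all of the threads have finished.

    from concurrent.futures import as_completed

    def collect_catalogs(futures):
        results = []
        exceptions = []
        for future in as_completed(futures):
            exc = future.exception()
            if exc is None:
                results.append(future.result())
            else:
                exceptions.append(exc)
        if exceptions:
            # every thread has completed by this point; surface the first failure
            raise exceptions[0]
        return results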

Contributor

If we wanted, we could change it to check if the result was an exception, and if so store it until the end so all the threads are done. Then we could just raise the first one, or something.

Yeah - I like this idea! I was initially worried that an exception here could cause dbt to hang, but that's definitely not the case. I was getting mixed up with how we do exception handling in the multithreaded job runner.

I think it would be great if the docs generate command wrote out a catalog.json file if any of the catalog queries succeed. This has proven to be a gnarly issue on BigQuery in particular, in which obscure additional permissions are required to fetch data from information schema tables.

Check out some relevant threads from dbt Slack:

[Three screenshots of dbt Slack threads from January 31, 2020]

I'm picturing something where we catch DatabaseExceptions raised from these futures and log an error message out, but continue waiting for the other futures to return. If any of them return successfully, then we can proceed on to writing out the catalog from the task. After the catalog is written, the task should exit with a nonzero exit code.

Do you buy that?
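For the sake of discussion, the proposed flow might look something like this (a hedged sketch with made-up names; it assumes the catalog call returns both the merged table and the list of exceptions it caught, and the logger and writer are stand-ins for dbt's own machinery):

    def run_generate(adapter, manifest, logger, write_catalog):
        catalogs, exceptions = adapter.get_catalog(manifest)
        for exc in exceptions:
            logger.error('Encountered an error while generating catalog: {}'.format(exc))
        # still write out whatever succeeded
        write_catalog(catalogs)
        # report failure (nonzero exit) if any catalog query errored
        return len(exceptions) == 0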

Contributor Author

I'm picturing something where we catch DatabaseExceptions raised from these futures and log an error message out, but continue waiting for the other futures to return. If any of them return successfully, then we can proceed on to writing out the catalog from the task. After the catalog is written, the task should exit with a nonzero exit code.

Sure, we can do that. We should probably make it a warning I guess, so people can opt-in to it being fatal with --strict.

Contributor

Yeah, good idea!

Jacob Beck added 2 commits January 31, 2020 10:53
- Thread information schema queries, one per db
- Pass schema list into catalog queries and filter on that in SQL
- break the existing interface for get_catalog (sorry, not sorry)
@beckjake force-pushed the feature/faster-snowflake-catalogs branch from c6f4434 to 12f1188 on January 31, 2020 18:34
@beckjake force-pushed the feature/faster-snowflake-catalogs branch from 12f1188 to c1af3ab on January 31, 2020 18:52
@beckjake marked this pull request as ready for review January 31, 2020 18:53
@beckjake requested a review from drewbanin February 3, 2020 18:00
@drewbanin (Contributor) left a comment

This is looking really, really slick. Some unfortunate agate behaviors are preventing me from doing more complete timing tests. Let me know what you think about the comments in here.

if exceptions:
logger.error(
'dbt encountered {} failure{} while writing the catalog'
.format(len(exceptions), (len(exceptions) == 1) * 's')
Contributor

Do we want != 1 here?

Contributor

also, this is exceedingly clever

Contributor

also, this whole flow looks great in practice!

Encountered an error while generating catalog: Database Error
  Access Denied: Table the-psf:INFORMATION_SCHEMA.SCHEMATA: User does not have permission to query table the-psf:INFORMATION_SCHEMA.SCHEMATA.
21:39:33 | Catalog written to /Users/drew/fishtown/clients/bq/target/catalog.json
dbt encountered 1 failures while writing the catalog

Can we just move this log line to be up above the Catalog written to... line? I want to make it clear that the catalog has still been written despite the errors.

Contributor Author

Hah, yes, it should be !=1. And it is exceedingly clever, just a stupid pluralization hack I remember from a long time ago.
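For reference, the trick works because bool is an int subclass, so the comparison multiplies the suffix by 0 or 1; the only bug is that == was used where != is needed:

    count = 1
    '{} failure{}'.format(count, (count != 1) * 's')   # '1 failure'
    count = 3
    '{} failure{}'.format(count, (count != 1) * 's')   # '3 failures'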

# calculate the possible schemas for a given schema name
all_schema_names: Set[str] = set()
for schema in schemas:
all_schema_names.update({schema, schema.lower(), schema.upper()})
Contributor

It looks to me like we're doing case-insensitive schema filtering in the catalog queries for every plugin -- do we need to pass in a set of the lower/upper cased variants of the schema name here? In dev, I see catalog SQL like this (pg):

    where (upper(sch.nspname) = upper('SNAPSHOTS') or upper(sch.nspname) = upper('TEST_SCHEMA') or upper(sch.nspname) = upper('snapshots') or upper(sch.nspname) = upper('test_schema'))

Contributor Author

Ooh, you're right - this is why the comparison was case-sensitive in catalog.sql before. I'll remove this.

Contributor

that makes a ton of sense - i didn't put those two things together!

# we want to re-raise on ctrl+c and BaseException
if exc is None:
catalog = future.result()
catalogs = agate.Table.merge([catalogs, catalog])
Contributor

Agate seems unhappy about this on Snowflake. I see:

2020-02-04 02:07:41,316840 (MainThread): Tables contain columns with the same names, but different types.
2020-02-04 02:07:41,322568 (MainThread): Traceback (most recent call last):
  File "/Users/drew/fishtown/dbt/core/dbt/main.py", line 81, in main
    results, succeeded = handle_and_check(args)
  File "/Users/drew/fishtown/dbt/core/dbt/main.py", line 159, in handle_and_check
    task, res = run_from_args(parsed)
  File "/Users/drew/fishtown/dbt/core/dbt/main.py", line 212, in run_from_args
    results = task.run()
  File "/Users/drew/fishtown/dbt/core/dbt/task/generate.py", line 214, in run
    catalog_table, exceptions = adapter.get_catalog(self.manifest)
  File "/Users/drew/fishtown/dbt/core/dbt/adapters/base/impl.py", line 1061, in get_catalog
    catalogs, exceptions = catch_as_completed(futures)
  File "/Users/drew/fishtown/dbt/core/dbt/adapters/base/impl.py", line 1151, in catch_as_completed
    catalogs = agate.Table.merge([catalogs, catalog])
  File "/Users/drew/fishtown/dbt/env/lib/python3.7/site-packages/agate/table/merge.py", line 45, in merge
    raise DataTypeError('Tables contain columns with the same names, but different types.')
agate.exceptions.DataTypeError: Tables contain columns with the same names, but different types.

In this case, the error is super annoying. One information_schema contains a table with a clustering key defined (of type string). The other information_schema that gets merged in does not have any clustering keys on any tables, so every value in the stats:clustering_key:value column is None, which I guess Agate assigned to an integer-y type?

I ran some debug code to see what was going wrong here, and I saw that the types for catalog and catalogs were:

'stats:clustering_key:value': {<agate.data_types.number.Number object at 0x109d26a10>,
                               <agate.data_types.text.Text object at 0x10a02c390>}

While I saw this on Snowflake, I have to imagine we'd see similar things on other databases. Any good ideas about what to do about this one? My (not ideal) thinking is that we could push the raw results into a list, then convert that raw list of lists back into a Table...
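A minimal reproduction of the clash, assuming two catalog tables whose stats:clustering_key:value column ends up typed differently (Text where real values exist, and Number standing in for whatever agate inferred over the all-None column):

    import agate

    has_values = agate.Table(
        [['model_a', 'LINEAR(ID)']],
        ['table_name', 'stats:clustering_key:value'],
        [agate.Text(), agate.Text()],
    )
    all_null = agate.Table(
        [['model_b', None]],
        ['table_name', 'stats:clustering_key:value'],
        [agate.Text(), agate.Number()],  # stand-in for the inferred type
    )

    # raises agate.exceptions.DataTypeError: Tables contain columns with the
    # same names, but different types.
    merged = agate.Table.merge([has_values, all_null])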

Contributor Author

I have a fix for this (roll my own merge function that tracks null-ness). I think it's actually pretty ok.
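A rough sketch of that kind of merge (not the exact implementation that landed): when a column is entirely null in one table, let the merged table take whatever type the other tables report for that column.

    import agate

    def merge_null_tolerant(tables):
        column_names = list(tables[0].column_names)
        column_types = []
        for name in column_names:
            # prefer the type from a table where this column has real values;
            # a column that is null everywhere falls back to Text
            chosen = agate.Text()
            for table in tables:
                if any(v is not None for v in table.columns[name].values()):
                    chosen = table.column_types[list(table.column_names).index(name)]
                    break
            column_types.append(chosen)

        rows = [tuple(row) for table in tables for row in table.rows]
        return agate.Table(rows, column_names, column_types)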

Added a custom table merge implementation that tracks if a row is all null and merges those as "any type".
 - added unit tests for that!
Removed some schema casing things
fixed pluralization (it was reversed)
@beckjake force-pushed the feature/faster-snowflake-catalogs branch from 5b36b7d to 04bc2a8 on February 4, 2020 14:48
@beckjake requested a review from drewbanin February 4, 2020 16:07
@drewbanin (Contributor) left a comment

LGTM! I tested this out on our Snowflake... not scientific at all, but it looks like the information schema queries on Snowflake went from 23s to 11s using this branch :)

The much, much bigger impact of this change, though, is that individual catalog queries can fail without bricking the generation of the catalog.json file. This will be great on BigQuery where:

  1. the previous union over all datasets exhausted the compilation memory that BQ allocated (and this query would fail), or
  2. an individual catalog query would fail b/c of bad permissions, bricking the docs generation!

@beckjake merged commit 0df49c5 into dev/barbara-gittings Feb 6, 2020
@beckjake deleted the feature/faster-snowflake-catalogs branch February 6, 2020 14:37
@mferryRV

@beckjake @drewbanin - is there any way to run this prior to your 0.16.0 release? We just did a load of work to add sources to dbt and now docs won't run 😅

@drewbanin
Contributor

Hey @mferryRV - I'm just going through some GitHub issue notifications I've received and wanted to make sure you got an answer here too (I do think we discussed on Slack!)

Try building a virtualenv and installing a pre-release of dbt:

python3 -m venv dbt-env
source dbt-env/bin/activate
pip install dbt==0.16.0b1

Successfully merging this pull request may close these issues: Reduce memory consumption in docs generation