Saner approaches to getting metadata for Relations #49

Closed
jtcohen6 opened this issue Jan 29, 2020 · 7 comments


@jtcohen6
Contributor

jtcohen6 commented Jan 29, 2020

Up to now, the dbt-spark plugin has leveraged a handful of metadata commands:

-- find all current databases
show databases

-- find all current tables in a database
show tables in my_db

-- determine if a relational object is a view or table
show tblproperties my_db.my_relation ('view.default.database')

The main issue with running one statement per relation is that it's very slow. This is justifiable for the get_catalog method of dbt docs generate, but not so at the start of a dbt run. Most databases have more powerful, accessible troves of metadata, often stored in an information_schema. Spark offers nothing so convenient; describe database returns only information about the database; describe table [extended] must be run for every relation.
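For illustration, here's a rough sketch of the one-statement-per-relation pattern this implies, assuming a DB-API-style cursor (e.g. from PyHive); the variable names and the classification step are hypothetical, not the plugin's actual code:

# Sketch only: one metadata statement per relation, assuming a DB-API-style
# cursor (e.g. PyHive). Names are illustrative, not dbt-spark's actual code.
cursor.execute("show tables in my_db")
tables = cursor.fetchall()  # one row per relation: (database, tableName, isTemporary)

for database, table_name, _is_temporary in tables:
    # one extra round trip per relation just to classify it as view vs. table
    cursor.execute(
        f"show tblproperties {database}.{table_name} ('view.default.database')"
    )
    result = cursor.fetchall()
    # ...inspect `result` to decide whether this relation is a view or a table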

@drewbanin ended up finding a semi-documented statement in the Spark source code that does most of what we want:

show table extended in my_db like '*'

It returns the same three columns as show tables in my_db, for all relations in my_db, plus a bonus information column that packs in a lot of useful metadata:

Database: my_db
Table: view_model
Owner: root
Created Time: Wed Jan 29 01:58:46 UTC 2020
Last Access: Thu Jan 01 00:00:00 UTC 1970
Created By: Spark 2.4.4
Type: VIEW
View Text: select * from my_db.seed
View Default Database: default
View Query Output Columns: [id, first_name, last_name, email, gender, ip_address]
Table Properties: [transient_lastDdlTime=1580263126, view.query.out.col.3=email, view.query.out.col.0=id, view.query.out.numCols=6, view.query.out.col.4=gender, view.default.database=default, view.query.out.col.1=first_name, view.query.out.col.5=ip_address, view.query.out.col.2=last_name]
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [serialization.format=1]
Schema: root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)

The same command also exists in Hive. This is a "big if true" find that could immediately clean up our workarounds for relation types. We'll want to check that it's supported by all vendors/implementations of Spark before committing to this approach, so we'll benefit from folks testing in other environments.
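For context, here's a minimal sketch of consuming that one statement, again assuming a hypothetical DB-API-style cursor and that the bonus information column comes last:

# Sketch only: a single statement covers every relation in the database.
cursor.execute("show table extended in my_db like '*'")
for _database, table_name, _is_temporary, information in cursor.fetchall():
    # `information` is the newline-delimited blob shown above; its "Type:" line
    # (TABLE vs. VIEW) is what we currently need a per-relation statement for.
    relation_type = next(
        line.partition(": ")[2]
        for line in information.splitlines()
        if line.startswith("Type:")
    )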

@aaronsteers
Contributor

aaronsteers commented Jan 29, 2020

Great find! This looks promising, and I imagine it could yield significant performance benefits. Thank you for the link to the source code. I traced the blame on that line, and it looks like the latest commit was >3 years ago, with even the prior version of the line appearing to support the same syntax.

With that said, my guess is that support for this correlates with the Spark version number more so than with the vendor, and it appears this has been in place since at least Spark 2.2, and likely longer. (Someone else jump in if you have additional/different info.)

For my part, I think this is a relatively safe bet and likely worth the performance boost. However, given the noted lack of documentation, I also think some type of safe fallback or feature flag might be advisable.

@aaronsteers
Contributor

Related: this feature is a long way out (at best), but here's the most recent reference I could find to the feature request to add INFORMATION_SCHEMA support natively. You can vote on and review that Spark Jira issue here: https://issues.apache.org/jira/browse/SPARK-16452. We can't block on this feature, but I thought it would be helpful to flag it here for general awareness.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 29, 2020

Thanks for the pointer to that, @aaronsteers!

I have updated my original issue comment to reflect the issue on my end (faulty/outdated JDBC driver) that was causing me to encounter errors with variants of describe. That said, show table extended in my_db like '*' is still the closest thing we have to an information schema that we can access all at once.

If it works across the board, I think it offers a more performant approach to the get_catalog updates in #39 and #41, versus running describe table extended for every relation in the project. The difficulty there is in parsing the information column, which is a big string delimited by \n, rather than additional rows per property as in describe table extended.
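To make that concrete, here's one possible (hypothetical) shape for splitting the blob into its top-level key/value pairs; the nested Schema block at the bottom would still need separate handling, as discussed below:

# Hypothetical helper: split the `information` blob into its top-level
# "Key: value" lines; the nested "Schema:" section is handled separately.
def parse_information(information: str) -> dict:
    metadata = {}
    for line in information.splitlines():
        if line.startswith(" |--"):
            continue  # column lines under "Schema:" -- see the regex below
        key, sep, value = line.partition(": ")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata

# e.g. parse_information(information)["Type"] == "VIEW" for the sample above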

@jtcohen6 jtcohen6 reopened this Jan 29, 2020
@jtcohen6 jtcohen6 mentioned this issue Jan 30, 2020
@aaronsteers
Contributor

The difficulty there is in parsing the information column, which is a big string delimited by \n, rather than additional rows per property as in describe table extended.

I think a regex approach could probably make quick work of the column names and data types. I will take a quick stab at that and post back here.

@aaronsteers
Contributor

This regex search string seems to work on the sample output from above:

\|-- (.*): (.*) \(nullable = (.*)\b

This regex string captures three groups:

  • column name
  • column type
  • nullable (true/false)

Link to test results and demo of this regex: https://regex101.com/r/E5YHCs/1
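For illustration, here's a small Python sketch applying that pattern to the Schema section of the sample output above (the string here is abbreviated):

import re

# The pattern posted above, applied to an abbreviated copy of the sample output
PATTERN = re.compile(r"\|-- (.*): (.*) \(nullable = (.*)\b")

information = (
    "Schema: root\n"
    " |-- id: long (nullable = true)\n"
    " |-- first_name: string (nullable = true)\n"
)

for name, dtype, nullable in PATTERN.findall(information):
    print(name, dtype, nullable)
# id long true
# first_name string true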

@jtcohen6
Contributor Author

jtcohen6 commented Feb 3, 2020

Very nice!

As far as the

more performant approach to the get_catalog updates

that I mentioned above, I think a natural fit is the _get_one_catalog method introduced in dbt-labs/dbt-core#2037, which will hopefully ship in the 0.16.0 release of dbt Core.

Between now and then, we can still try to ship the proposed enhancements to get_catalog as written (looping across all relations and running describe ... extended) in the 0.15.0 release of this plugin.

@jtcohen6
Contributor Author

I'm going to close this and open a more specific issue that suggests reimplementing _get_one_catalog to operate more efficiently on one schema at a time, instead of one relation at a time.
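For the record, here's a rough sketch of what that per-schema shape could look like, reusing the hypothetical parsing helpers sketched earlier in this thread; the names are illustrative, not a final design:

# Sketch only: one metadata statement per schema instead of one per relation.
# `parse_information` and `PATTERN` are the hypothetical helpers from above.
def get_catalog_for_schema(cursor, schema):
    cursor.execute(f"show table extended in {schema} like '*'")
    catalog = []
    for _database, table_name, _is_temporary, information in cursor.fetchall():
        metadata = parse_information(information)  # top-level key/value pairs
        columns = PATTERN.findall(information)     # (name, type, nullable) triples
        catalog.append((schema, table_name, metadata.get("Type"), columns))
    return catalog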
