Saner approaches to getting metadata for Relations #49
Comments
Great find! This looks promising, and I imagine it could yield significant performance benefits. And thank you for the link to the source code. I traced the blame on this line, and it looks like the latest commit was >3 years ago, with even the prior version of that line appearing to support the same syntax. With that said, my guess is that support for this correlates with Spark version number more so than with vendor, and it appears to have been supported since at least Spark 2.2, and likely longer. (Someone else jump in if you have additional/different info.) For my part, I think this is a relatively safe bet and likely worth the performance boost, although given the noted lack of documentation, some type of safe failover or feature flag might be advisable.
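A minimal sketch of that failover idea, assuming PySpark and an existing `SparkSession` — the function and variable names here are illustrative, not from the plugin:

```python
# Hedged sketch of the failover idea: prefer the bulk statement, and fall back
# to one describe per relation if the engine rejects it. Assumes an existing
# SparkSession named `spark`; all names are illustrative.
from pyspark.sql.utils import AnalysisException


def list_relations(spark, database):
    try:
        # One round trip for every relation in the schema
        return spark.sql(f"show table extended in {database} like '*'").collect()
    except AnalysisException:
        # Fallback: one statement per relation -- slow, but widely supported
        tables = spark.sql(f"show tables in {database}").collect()
        return [
            spark.sql(f"describe table extended {database}.{row.tableName}").collect()
            for row in tables
        ]
```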
Related: This feature is a long way out (at best), but here's the last and best reference I could find to the feature request to add `information_schema` support to Spark.
Thanks for the pointer to that, @aaronsteers! I have updated my original issue comment to reflect the issue on my end (a faulty/outdated JDBC driver) that was causing me to encounter errors with variants of `show table extended`. If it works across the board, I think it offers a more performant approach to gathering relation metadata.
I think a regex approach could probably make quick work of the column names and data types. I will take a quick stab at that and post back here.
This regex search string seems to work on the sample schema output from the issue description: `\|-- (.*): (.*) \(nullable = (.*)\b`. For each schema line, it captures three groups: the column name, the data type, and the nullable flag.
Link to test results and demo of this regex: https://regex101.com/r/E5YHCs/1
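For concreteness, a quick Python check of that pattern against a hand-written sample of the schema lines (the sample rows are illustrative, not the actual output from the issue):

```python
import re

# The regex proposed above, unchanged
pattern = re.compile(r"\|-- (.*): (.*) \(nullable = (.*)\b")

# Illustrative stand-in for the Schema section of the `information` column
sample = """
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- created_at: timestamp (nullable = false)
"""

for column, dtype, nullable in pattern.findall(sample):
    print(column, dtype, nullable)
# id integer true
# name string true
# created_at timestamp false
```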
Very nice! As far as the `information_schema` feature request that I mentioned above, I think a natural fit here is the `get_catalog` method. Between now and then, we can still try to ship the proposed enhancements to the plugin's metadata macros.
I'm going to close this and open a more specific issue that suggests reimplementing `list_relations_without_caching` on top of `show table extended`.
Up to now, the dbt-spark plugin has leveraged a handful of metadata commands: `show tables in [database]` to list relations, and `describe table extended [database].[table]` for details on each one.

The main issue with running one statement per relation is that it's very slow. This is justifiable for the `get_catalog` method of `dbt docs generate`, but not so at the start of a `dbt run`. Most databases have more powerful, accessible troves of metadata, often stored in an `information_schema`. Spark offers nothing so convenient: `describe database` returns only information about the database itself, and `describe table [extended]` must be run for every relation.

@drewbanin ended up finding a semi-documented statement in the Spark source code that does most of what we want: `show table extended in [database] like '*'`
It returns the same three columns as `show tables in my_db`, for all relations in `my_db`, with a bonus column `information` that packs a lot of good stuff:
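As a rough illustration (field names and ordering vary by Spark version and table provider), the `information` column packs output along these lines:

```
Database: my_db
Table: my_table
Type: MANAGED
Provider: parquet
Schema: root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
```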
The same command also exists in Hive. This is a "big if true" find that could immediately clean up our workarounds for relation types. We'll want to check that it's supported by all vendors/implementations of Spark before committing to this approach, so we'll benefit from folks testing in other environments.
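As a hedged sketch of what that cleanup could look like, assuming the `Type:` field shown in the illustration above (the field name may differ across Spark and Hive versions):

```python
import re

# Illustrative helper: infer the relation type from the `information` column.
# The `Type:` field name follows the sample above and is an assumption here.
TYPE_FIELD = re.compile(r"^Type: (\w+)", re.MULTILINE)


def relation_type(information: str) -> str:
    match = TYPE_FIELD.search(information)
    if match and match.group(1).upper() == "VIEW":
        return "view"
    # MANAGED / EXTERNAL (or a missing field) are treated as tables
    return "table"
```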