Skip to content

Conversation

@jackyhu-db
Copy link
Contributor

@jackyhu-db jackyhu-db commented Jun 11, 2025

Optimize GetColumnsExtendedAsync in Databricks via DESC TABLE EXTENDED

Motivation

Currently, HiveServer2Statement.GetColumnsExtendedAsync will make 3 thrift calls GetColumns, GetPrimaryKeys and GetCrossReference to get all the information and then join them into one result set. Now Databricks introduces a SQL DESC TABLE EXTENDED <table> AS JSON that will return all of these information in one query, this can save 2 extra roundtrips and improve the performance.

Description

Override HiveServer2Statement.GetColumnsExtendedAsync in DatabricksStatement by executing single SQL DESC TABLE EXTENDED <table> AS JSON to get all the required column info for GetColumnsExtendedAsync then combine and join these info into the result. As this SQL DESC TABLE EXTENDED <table> AS JSON is only available in Databricks Runtime 16.2 or above, it will check the ServerProtocolVersion. if it does not meet the requirement, it will fallback the implementation of base class.

Change

  • Added DescTableExtendedResult that represents the result of DESC TABLE EXTENDED <table> AS JSON (see format here). It also
    • Parses the string-based table_constraints to the structured primary key and foreign key constraints
    • Adds calculated properties DataType, ColumnSize, FullTypeName which are calculated from the column type and type specific properties
  • Changed the access modifiers of some properties and methods in HiveServer2Statement to protected so that they can be used/overridden in subclass DatabricksStatement
    • PrimaryKeyFields
    • ForeignKeyFields
    • PrimaryKeyPrefix
    • ForeignKeyPrefix
    • GetColumnsExtendedAsync
    • CreateEmptyExtendedColumnsResult
  • Updated DatabricksStatement with the changes below
    • Added GetColumnsExtendedAsync in DatabricksStatement, it executes query DESC TABLE EXTENDED <table> AS JSON to get all the required info and then json them into the QueryResult
    • Moved the column metadata schema creation logic from GetColumnsAsync to a reusable method CreateColumnMetadataSchema
    • Moved the column metadata data array creation logic from GetColumnsAsync to a reusable method CreateColumnMetadataEmptyArray
  • Added a Databricks connection Parameter adbc.databricks.use_desc_table_extended
  • Added an internal calculated property CanUseDescTableExtended in DatabricksConnection, DatabricksStatement will call it to decide if it will override GetColumnsExtendedAsync
  • Improved the Databricks driver integration test StatementTest:CanGetColumnsExtended
    • Setup the required table resources during the test instead of depending on manual setup before the test
    • Support running multiple test cases from the inputs
    • Add a deep result check to make sure all the properties of all the columns are set correctly

Testing

  • Added Test class DescTableExtendedResultTest to cover all the deserialization and parsing cases from raw result of DESC TABLE EXTENDED <table> AS JSON
  • Tested all the ColumnsExtended relevant test cases in Databricks/StatementTest

@github-actions github-actions bot modified the milestone: ADBC Libraries 19 Jun 11, 2025
@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from 81b326f to 7d83b75 Compare June 11, 2025 02:05
@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from 49b3f6e to 8c73c0b Compare June 11, 2025 04:36
* limitations under the License.
*/

using Apache.Arrow.Adbc.Drivers.Databricks.Result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mocify/improve the existing GetColumnExtended tests to verify that:
With the flag on/off, they get same number of columns, same schema, and same data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to support test for both flag on/off as below, they are almost same except for the IS_NULLABLE on primary key.

        [InlineData("all_column_types", "Resources/create_table_all_types.sql", "Resources/result_get_column_extended_all_types.json", true, new[] { "PK_IS_NULLABLE:NO" })]
        [InlineData("all_column_types", "Resources/create_table_all_types.sql", "Resources/result_get_column_extended_all_types.json", false, new[] { "PK_IS_NULLABLE:YES" })]
        public async Task CanGetColumnsExtended(string tableName,string createTableSqlLocation, string resultLocation, bool useDescTableExtended, string[] ?extraPlaceholdsInResult = null)

@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch 2 times, most recently from 4d4af63 to 4a0d05a Compare June 12, 2025 03:53
@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from 4a0d05a to bb93562 Compare June 12, 2025 04:19
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good! I've just got a few nits about formatting and casing of method names.

@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from ba704ef to aa7a763 Compare June 24, 2025 11:39
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The changes look good but there's a merge conflict. Could you rebase on top of the latest sourced?

@jackyhu-db
Copy link
Contributor Author

Thanks! The changes look good but there's a merge conflict. Could you rebase on top of the latest sourced?

Done.

Copy link
Contributor

@eric-wang-1990 eric-wang-1990 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@CurtHagenlocher CurtHagenlocher merged commit 0000e8d into apache:main Jun 24, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants