Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/adbc/driver/snowflake: Without the DB name the GetObjects call fails #1332

Closed
ryan-syed opened this issue Nov 29, 2023 · 9 comments
Closed

Comments

@ryan-syed
Copy link
Contributor

Currently the Snowflake Go driver needs the database name to make the GetObjects call. However, ODBC based driver just requires the server name and warehouse. To make the experience more inline with other existing snowflake drivers, should the GetObjects be able get the information without needing a database being passed as a default.

@ryan-syed
Copy link
Contributor Author

ryan-syed commented Dec 7, 2023

To elaborate when a connection is being made using the current driver without a database name, no USE Database call is made and therefore any call to INFORMATION_SCHEMA needs to be prefixed with the database name. Without the prefix as implemented currently they will fail with the following error: Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name

We were able to reduce the time it takes for the GetObjects call by removing the cursor implementation, however, this inadvertently relies on the fact that a initial database name is provided so that INFORMATION_SCHEMA calls can be made. There is also a bug, where INFORMATION_SCHEMA.Tables without the db name prefixed will only return tables in the intial database name that is set. Same is the case for INFORMATION_SCHEMA.Columns.

I was looking at adding the cursor back to have INFORMATION_SCHEMA calls with the db name prefix and it would slow the calls down, however, there are still other optimizations that can be made as discussed in the proposal below.

A bit about the current GetObjects API:

It fetches the catalogs, schemas, tables, table constraints (currently not implemented for Snowflake Go Driver), and columns. It also does this by making the following calls:

SQL Notes Depth
SHOW TERSE DATABASES This is used to get the matching catalogNames and the driver does the filtering using PatternsToRegexp and catalogPatternMatch ObjectDepthCatalogs
SELECT CATALOG_NAME, SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA This is used to get the catalogName and corresponding schemaNames ObjectDepthDBSchemas
SELECT table_catalog, table_schema, table_name, table_type FROM INFORMATION_SCHEMA.TABLES This is used to get the tableName and tableType for the corresponding catalogs, and schemas. ObjectDepthTables
SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS This is used to get the columns metadata. ObjectDepthAll or ObjectDepthColumns

and missing implementation:

SQL Notes
SELECT * FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS This will be used to get the table constraints like primary key and foreign key

Constrast to ODBC:

The following analysis is based on SIMBA ODBC Snowflake driver, ODBC Test and QueryHistory for client generated calls in Snowflake.
I believe the GetObjects call is equivalent to the following ODBC APIs:

ODBC API SQL Notes Equivalent ADBC
SQLTables show objects /* ODBC:TableMetadataSource */in account Gets the names, catalogName, schemaName, kind (VIEW or TABLE), comment, and filters within the driver for the given patterns for catalog, schema, and tables GetObjects with ObjectDepthTables
SQLColumns show /* ODBC:ColumnMetadataSource */ columns in database "foo_db" or show /* ODBC:ColumnMetadataSource */ columns in schema "foo_db"."bar_schema" The call was based on whether the schema name was a pattern It is different from ADBC as catalogName for SQLColumns can't be a pattern. However, it is roughly equivalent to GetObjects with ObjectDepthAll or ObjectDepthColumns
SQLPrimaryKeys show primary keys /* ODBC:PrimaryKeysMetadataSource */ in table "foo_db"."bar_schema" Gets the primary keys GetObjects with ObjectDepthTables, though not currently implemented for Snowflake Go Driver.
SQLForeignKeys show exported keys /* ODBC:ForeignKeysMetadataSource */ in account Get the foreign keys GetObjects with ObjectDepthTables, though not currently implemented for Snowflake Go Driver.

Therefore, the GetObjects API masquerades as a single API and performs the functionality of four ODBC APIs based on the depth provided. However, due to this if you wish to only get the foreign keys using the GetObjects you would end up making several other calls.

Proposal:

Instead of making four separate calls, the design could be changed to just making one SQL call that is for the highest depth and inferring all the other information for the previous depths from it.

Example: For the worst case scenario with the depth as ObjectDepthAll or ObjectDepthColumns we can make only SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS call and derive information for tables, schemas and catalogs based on the data from this single call.

Now, to decide the underlying API, there are two ways we can go about this:

Underlying API Proposal 1:

Make show calls similar to ODBC Driver

Pros:

  1. SHOW calls are faster
    • For my instance, It takes 1-2s for SHOW to get all the information till the ObjectDepthTables depth whereas to get the same information using SELECT takes ~ 5s
    • For my instance, It is takes 1-2s for SHOW to get all the information till the ObjectDepthColumns depth whereas to get the same information using SELECT takes ~ 8s
  2. The command does not require a running warehouse to execute.. Therefore probably no cost is incurred for metadata calls

Cons:

  1. we need to filter (patterns) and sort within the driver
    • filtering is already done though using PatternsToRegexp and catalogPatternMatch
    • This will have to be done in-memory, and there will be a performance hit for consuming all the data then filtering/sorting. This may not be that significant when I did some testing as show objects in account returned only ~600 entries. However, I need to test for show columns in account which return ~8K entries.
  2. The command returns a maximum of 10K records for the specified object type, as dictated by the access privileges for the role used to execute the command; any records above the 10K limit are not returned, even with a filter applied.
    • this may become a limitation where there are way too many objects in the DB and something to be aware of.

Underlying API Proposal 2:

Continue with the SELECT calls

Pros:

  1. Existing implementation
  2. No limit for the data returned
    • Not sure, why this is not an issue for the ODBC Driver (probably a known limitation)

Cons:

  1. SELECT calls are slower than SHOW calls and metadata fetch will be slower than ODBC.
    • In the current API spec anyways ADBC calls will be slower than ODBC because GetObjects is equivalent to four APIs in ODBC as previously mentioned.

Please let me know your thoughts about the changes in design and which API would be better among SHOW and SELECT for metadata calls.

Underlying API Proposal 3:

Hybrid approach, where we let the caller pick which API should be used based on the tradeoffs.

Additional Questions:

  1. What is the use case for GetObjects API to be some dense in terms of data being fetched?
  2. Should we break it up to several APIs similar to ODBC? I notice another thread on it: format: expose other metadata in GetObjects or add new metadata function #621.

@ryan-syed
Copy link
Contributor Author

ryan-syed commented Dec 7, 2023

@davidhcoe @zeroshade @jduo @lidavidm Please review and suggest your thoughts about the proposal.

TL;DR

GetObjects makes fours SQL calls at the highest depth of ObjectDepthAll or ObjectDepthColumns. Reduce that call to a single call for the depth provided and infer data for the previous depths based on that call. Also, considering that we need to refactor the SQL calls, should we consider using SHOW instead of SELECT for getting the metadata similar to what ODBC does? We could also have a hybrid approach to allow the caller to toggle between the implementations (though that adds more complexity and maintenance overhead). I have mentioned the trade-offs associated with each.

@zeroshade
Copy link
Member

I think proposal 3 is probably best, allowing the consumer to choose the pros and cons. I'm not a fan of the fact that it would truncate the results if we go with the first proposal, but the performance benefits seem significant. So it's a tough call.

@lidavidm thoughts?

@ryan-syed
Copy link
Contributor Author

Created a draft pull request for reducing the SQL calls: #1351

I will create another draft PR with removing the cursor call and generating the necessary SQL in Go.

@lidavidm
Copy link
Member

So long as the maintenance burden is not excessive. We should also default to the 'correct' method and only use the 'fast' method if desired by the caller.

@ryan-syed
Copy link
Contributor Author

Created a draft PR with reduced SQL calls (though this makes 2 and doesn't use cursor): #1352

@ryan-syed
Copy link
Contributor Author

So long as the maintenance burden is not excessive. We should also default to the 'correct' method and only use the 'fast' method if desired by the caller.

Yeah it makes sense to choose the one that's better in the long run. The code without cursor makes an extra SQL call but is slightly faster in the conditions that I have tested. Need to see which would be better consider all scenarios.

@davidhcoe
Copy link
Contributor

@ryan-syed - I think this can be closed now, yes?

@ryan-syed
Copy link
Contributor Author

Closing this as the number of SQL calls have been reduced and the performance is reasonable.
Approach 3 is used.
Show is used for getting the databases and then only one SELECT call is used to get all the relevant data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants