[destination-snowflake] executes excessive metadata queries #37311

Closed
barakavra opened this issue Apr 14, 2024 · 12 comments

Labels: area/connectors, community, connectors/destination/snowflake, team/destinations, type/bug

barakavra commented Apr 14, 2024

Connector Name: destination-snowflake

Connector Version: 3.7.0

What step did the error happen in? None

Relevant information

I was monitoring the Snowflake queries issued by the Airbyte platform, and there is a significant performance improvement that could be made in the Snowflake destination code. Three queries, all related to destination-table metadata, are executed many times during each sync even though their results could be cached instead. For example, instead of executing the following query 409,860 times a week (<NUM_OF_STREAMS> * <NUM_OF_CONNECTIONS> * <NUM_OF_REFRESHES> = 409,860), it could run only <NUM_OF_REFRESHES> * <NUM_OF_CONNECTIONS> times if we cache the result at the beginning of each refresh, per connector:

SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema IN (?)
  AND table_name IN (?)
ORDER BY table_schema, table_name, ordinal_position;
by altering it to:
SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema IN (?)
ORDER BY table_schema, table_name, ordinal_position;

As it's only a small dataset, the extra rows are worth the cost, and wrapping the lookup in exception handling will guarantee that even if a schema change happens during the refresh, the load will still execute as planned.
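
A minimal sketch of the proposed per-refresh cache. All names here (ColumnInfo, MetadataCache, fetchColumnsForSchemas) are hypothetical illustrations, not Airbyte's actual API:

// Hypothetical sketch: cache information_schema results once per refresh,
// then answer per-table lookups from memory instead of re-querying Snowflake.
data class ColumnInfo(
    val schema: String,
    val table: String,
    val column: String,
    val dataType: String,
    val isNullable: Boolean,
)

class MetadataCache(
    // One schema-wide information_schema query, supplied by the caller.
    private val fetchColumnsForSchemas: (Set<String>) -> List<ColumnInfo>
) {
    private var cache: Map<Pair<String, String>, List<ColumnInfo>> = emptyMap()

    // Populate once at the start of each refresh: one metadata query per
    // connector instead of one per stream.
    fun warm(schemas: Set<String>) {
        cache = fetchColumnsForSchemas(schemas).groupBy { it.schema to it.table }
    }

    // Serve lookups from the cache; a table created after warming simply
    // misses, so the caller can fall back to a direct query if needed.
    fun columnsFor(schema: String, table: String): List<ColumnInfo> =
        cache[schema to table] ?: emptyList()
}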

[three screenshots attached]

Relevant log output: No response

Contribute: Yes, I want to contribute
barakavra added the area/connectors, needs-triage, and type/bug labels on Apr 14, 2024
marcosmarxm changed the title from "Snowflake executes excessive metadata queries" to "[destination-snowflake] executes excessive metadata queries" on Apr 15, 2024
marcosmarxm (Member) commented:

Thanks for reporting this, @barakavra. I added the suggestion to the team backlog for further discussion.

barakavra (Author) commented Apr 30, 2024

Thanks @marcosmarxm. Just to stress this: altering this behaviour could spare Snowflake users around 30 Snowflake credits daily (roughly $30K per year).

marcosmarxm (Member) commented:

Thanks for the info, @barakavra. @airbytehq/destinations, can you take a look at this issue?

pcorbel commented Jun 17, 2024

Hello @marcosmarxm, do you have any news regarding this issue?
It prevents us from upgrading to the latest version of the Snowflake destination (we have ~2k connectors relying on it)

evantahler (Contributor) commented:

Since this issue was filed, we did reduce the number of 'metadata queries' to one-per-stream per sync. This can probably be reduced even further to a single query for all streams in the sync.

barakavra (Author) commented:

Indeed, the following query is still running for hours (and we have only ~20 connectors):

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema = ?
  AND table_name = ?
ORDER BY ordinal_position;

Although I am green in the Airbyte platform backend, my thought is: since Airbyte is the sole owner of the metadata of these tables/stages (as it should be), I don't see any reason not to store this data when settings are saved and then query it only once, from Airbyte's internal database (Postgres), when a connection is triggered.

This is from just yesterday:
[screenshot attached]

evantahler (Contributor) commented Jun 18, 2024

Grooming notes:

  • We can also issue metadata-only queries that don't require the warehouse to be running (e.g. SHOW TABLES, not a SELECT from information_schema); see the sketch after this list.
    • To test this, don't use a warehouse in the Snowflake UI and try to execute the query.
  • We should keep the query per stream.
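
A minimal sketch of such a metadata-only lookup over JDBC. The account URL, credentials, and schema/table names are placeholders, and the Snowflake JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

fun main() {
    // Placeholder connection details; note that no warehouse is set in the URL.
    val url = "jdbc:snowflake://myaccount.snowflakecomputing.com/?db=MY_DB"
    DriverManager.getConnection(url, "user", "password").use { conn ->
        conn.createStatement().use { stmt ->
            // SHOW is answered by Snowflake's cloud-services layer, so it does
            // not need a running warehouse, unlike a SELECT on information_schema.
            stmt.executeQuery("SHOW TABLES LIKE 'USERS_FINAL' IN SCHEMA MY_DB.AIRBYTE_INTERNAL")
                .use { rs ->
                    while (rs.next()) {
                        println("${rs.getString("schema_name")}.${rs.getString("name")}")
                    }
                }
        }
    }
}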

barakavra (Author) commented:

@evantahler kindly research the cost of metadata-only queries: they consume Snowflake's "cloud services credits", and at a high volume of queries they can cost more than the warehouse itself. Attached are our approximate daily Snowflake costs. For example, the JDBC Kafka Connect connector issued a SELECT from DUAL command before each flush; removing that command saved us about 30% of the costs.

dpolman07 commented Jul 9, 2024

Hi @evantahler @marcosmarxm - Is there any timeline on when this might be resolved? We're running into the same problem when using Snowflake as the destination. The queries below ran for over 7 hours the other day:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema = ?
  AND table_name = ?
ORDER BY ordinal_position;

SELECT schema_name
FROM information_schema.schemata
WHERE schema_name = ?
  AND catalog_name = ?;

We're trying to finalize Airbyte as our ELT solution, and a resolution for this would be great to see! Thank you!

evantahler (Contributor) commented:

Some additional notes from @stephane-airbyte:

Basically there are two types of compute: warehouse cost and "cloud services" cost. SELECT (almost) always uses a warehouse, while DESC/SHOW only use cloud-services cost (there's not even a need for a running warehouse). In our case, there's a significant difference between doing a SHOW TABLES and a SHOW TABLES LIKE 'xxx' IN SCHEMA 'bbb'. Think of it in terms of index utilization: the first one will just bring all the data (up to 10k rows), after doing some internal joins (in FoundationDB), and process it in cloud services, while the second one (because it's not a range query, only an exact match) should do 2-3 index lookups (one for the database, one for the schema, and one for the table) and bring that data directly into cloud services, which has nothing left to do after that. If we were to use a '%abc' in the LIKE clause, all the filtering would be done in cloud compute, which can add up in terms of cost.
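
For illustration, the three SHOW variants discussed, as a sketch with placeholder identifiers; the cost notes paraphrase the explanation above:

// Unfiltered: pulls up to 10k rows into cloud services after internal joins.
val showAll = "SHOW TABLES IN DATABASE MY_DB"

// Exact match: roughly 2-3 point lookups (database, schema, table); cheapest.
val showExactMatch = "SHOW TABLES LIKE 'USERS_FINAL' IN SCHEMA MY_DB.MY_SCHEMA"

// Leading wildcard: all filtering happens in cloud services, which adds up.
val showWildcard = "SHOW TABLES LIKE '%USERS' IN SCHEMA MY_DB.MY_SCHEMA"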

evantahler (Contributor) commented:

The code in question can be found here:

@Throws(SQLException::class)
fun findExistingTables(
    database: JdbcDatabase,
    databaseName: String,
    streamIds: List<StreamId>
): LinkedHashMap<String, LinkedHashMap<String, TableDefinition>> {
    val existingTables = LinkedHashMap<String, LinkedHashMap<String, TableDefinition>>()
    // convert list stream to array
    val namespaces = streamIds.map { it.finalNamespace }.toTypedArray()
    val names = streamIds.map { it.finalName }.toTypedArray()
    val query =
        """
        |SELECT table_schema, table_name, column_name, data_type, is_nullable
        |FROM information_schema.columns
        |WHERE table_catalog = ?
        |AND table_schema IN (${IntRange(1, streamIds.size).joinToString { "?" }})
        |AND table_name IN (${IntRange(1, streamIds.size).joinToString { "?" }})
        |ORDER BY table_schema, table_name, ordinal_position;
        |""".trimMargin()

stephane-airbyte (Contributor) commented:

Closing, as this should be fixed by #45422.
