[destination-snowflake] executes excessive metadata queries #37311

Closed
barakavra opened this issue Apr 14, 2024 · 12 comments

Labels: area/connectors, community, connectors/destination/snowflake, team/destinations, type/bug

barakavra commented Apr 14, 2024

Connector Name: destination-snowflake

Connector Version: 3.7.0

What step did the error happen in? None

Relevant information

I was monitoring the Snowflake queries issued by the Airbyte platform, and there is a significant performance improvement that could be made in the Snowflake destination code. Three queries, all related to destination-table metadata, are executed many times during each sync even though their results could be cached instead. For example, instead of executing the following query 409,860 times a week (<NUM_OF_STREAMS> * <NUM_OF_CONNECTIONS> * <NUM_OF_REFRESHES> = 409,860), it could run only <NUM_OF_REFRESHES> * <NUM_OF_CONNECTIONS> times if we cache the result at the beginning of each refresh, per connector:

SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema IN (?)
  AND table_name IN (?)
ORDER BY table_schema, table_name, ordinal_position;
by altering it to:
SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema IN (?)
ORDER BY table_schema, table_name, ordinal_position;

As it's only a small dataset, the extra rows are worth the cost, and wrapping the lookup in exception handling will guarantee that even if a schema change happens during the refresh, the load will still execute as planned.
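
A minimal sketch of the proposed per-refresh cache. All names here (ColumnInfo, MetadataCache, fetchColumnsForSchemas) are hypothetical illustrations, not Airbyte's actual API:

// Hypothetical sketch: cache information_schema results once per refresh,
// then answer per-table lookups from memory instead of re-querying Snowflake.
data class ColumnInfo(
    val schema: String,
    val table: String,
    val column: String,
    val dataType: String,
    val isNullable: Boolean,
)

class MetadataCache(
    // One schema-wide information_schema query, supplied by the caller.
    private val fetchColumnsForSchemas: (Set<String>) -> List<ColumnInfo>
) {
    private var cache: Map<Pair<String, String>, List<ColumnInfo>> = emptyMap()

    // Populate once at the start of each refresh: one metadata query per
    // connector instead of one per stream.
    fun warm(schemas: Set<String>) {
        cache = fetchColumnsForSchemas(schemas).groupBy { it.schema to it.table }
    }

    // Serve lookups from the cache; a table created after warming simply
    // misses, so the caller can fall back to a direct query if needed.
    fun columnsFor(schema: String, table: String): List<ColumnInfo> =
        cache[schema to table] ?: emptyList()
}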

[three screenshots attached]

Relevant log output: No response

Contribute: Yes, I want to contribute
barakavra added the area/connectors, needs-triage, and type/bug labels on Apr 14, 2024
marcosmarxm changed the title from "Snowflake executes excessive metadata queries" to "[destination-snowflake] executes excessive metadata queries" on Apr 15, 2024
marcosmarxm (Member) commented:

Thanks for reporting this, @barakavra. I added the suggestion to the team backlog for further discussion.

barakavra (Author) commented Apr 30, 2024

Thanks @marcosmarxm. Just to stress this: altering this behaviour could spare Snowflake users around 30 Snowflake credits daily (roughly $30K per year).

marcosmarxm (Member) commented:

Thanks for the info, @barakavra. @airbytehq/destinations, can you take a look at this issue?

pcorbel commented Jun 17, 2024

Hello @marcosmarxm, do you have any news regarding this issue?
It prevents us from upgrading to the latest version of the Snowflake destination (we have ~2k connectors relying on it)

evantahler (Contributor) commented:

Since this issue was filed, we did reduce the number of 'metadata queries' to one-per-stream per sync. This can probably be reduced even further to a single query for all streams in the sync.

barakavra (Author) commented:

Indeed, the following query is still running for hours (and we have only ~20 connectors):

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema = ?
  AND table_name = ?
ORDER BY ordinal_position;

Although I am green in the Airbyte platform backend, my thought is: since Airbyte is the sole owner of the metadata of these tables/stages (as it should be), I don't see any reason not to store this data when settings are saved and then query it only once, from Airbyte's internal database (Postgres), when a connection is triggered.

This is from just yesterday:
[screenshot attached]

evantahler (Contributor) commented Jun 18, 2024

Grooming notes:

  • We can also issue metadata-only queries that don't require the warehouse to be running (e.g. SHOW TABLES, not a SELECT from information_schema); see the sketch after this list.
    • To test this, don't use a warehouse in the Snowflake UI and try to execute the query.
  • We should keep the query per stream.
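
A minimal sketch of such a metadata-only lookup over JDBC. The account URL, credentials, and schema/table names are placeholders, and the Snowflake JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

fun main() {
    // Placeholder connection details; note that no warehouse is set in the URL.
    val url = "jdbc:snowflake://myaccount.snowflakecomputing.com/?db=MY_DB"
    DriverManager.getConnection(url, "user", "password").use { conn ->
        conn.createStatement().use { stmt ->
            // SHOW is answered by Snowflake's cloud-services layer, so it does
            // not need a running warehouse, unlike a SELECT on information_schema.
            stmt.executeQuery("SHOW TABLES LIKE 'USERS_FINAL' IN SCHEMA MY_DB.AIRBYTE_INTERNAL")
                .use { rs ->
                    while (rs.next()) {
                        println("${rs.getString("schema_name")}.${rs.getString("name")}")
                    }
                }
        }
    }
}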

barakavra (Author) commented:

@evantahler kindly research the cost of metadata-only queries: they consume Snowflake's "cloud services credits", and at a high volume of queries they can cost more than the warehouse itself. Attached are our approximate daily Snowflake costs. For example, the JDBC Kafka Connect connector issued a SELECT from DUAL command before each flush; removing that command saved us about 30% of the costs.

dpolman07 commented Jul 9, 2024

Hi @evantahler @marcosmarxm - Is there any timeline on when this might be resolved? We're running into the same problem when using Snowflake as the destination. The queries below ran for over 7 hours the other day:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema = ?
  AND table_name = ?
ORDER BY ordinal_position;

SELECT schema_name
FROM information_schema.schemata
WHERE schema_name = ?
  AND catalog_name = ?;

We're trying to finalize Airbyte as our ELT solution, and a resolution for this would be great to see! Thank you!

evantahler (Contributor) commented:

Some additional notes from @stephane-airbyte:

Basically there are two types of compute: warehouse cost and "cloud services" cost. SELECT (almost) always uses a warehouse, while DESC/SHOW only use cloud-services cost (there's not even a need for a running warehouse). In our case, there's a significant difference between doing a SHOW TABLES and a SHOW TABLES LIKE 'xxx' IN SCHEMA 'bbb'. Think of it in terms of index utilization: the first one will just bring all the data (up to 10k rows), after doing some internal joins (in FoundationDB), and process it in cloud services, while the second one (because it's not a range query, only an exact match) should do 2-3 index lookups (one for the database, one for the schema, and one for the table) and bring that data directly into cloud services, which has nothing left to do after that. If we were to use a '%abc' in the LIKE clause, all the filtering would be done in cloud compute, which can add up in terms of cost.
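
For illustration, the three SHOW variants discussed, as a sketch with placeholder identifiers; the cost notes paraphrase the explanation above:

// Unfiltered: pulls up to 10k rows into cloud services after internal joins.
val showAll = "SHOW TABLES IN DATABASE MY_DB"

// Exact match: roughly 2-3 point lookups (database, schema, table); cheapest.
val showExactMatch = "SHOW TABLES LIKE 'USERS_FINAL' IN SCHEMA MY_DB.MY_SCHEMA"

// Leading wildcard: all filtering happens in cloud services, which adds up.
val showWildcard = "SHOW TABLES LIKE '%USERS' IN SCHEMA MY_DB.MY_SCHEMA"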

evantahler (Contributor) commented:

The code in question can be found here:

@Throws(SQLException::class)
fun findExistingTables(
    database: JdbcDatabase,
    databaseName: String,
    streamIds: List<StreamId>
): LinkedHashMap<String, LinkedHashMap<String, TableDefinition>> {
    val existingTables = LinkedHashMap<String, LinkedHashMap<String, TableDefinition>>()
    // convert list stream to array
    val namespaces = streamIds.map { it.finalNamespace }.toTypedArray()
    val names = streamIds.map { it.finalName }.toTypedArray()
    val query =
        """
        |SELECT table_schema, table_name, column_name, data_type, is_nullable
        |FROM information_schema.columns
        |WHERE table_catalog = ?
        |AND table_schema IN (${IntRange(1, streamIds.size).joinToString { "?" }})
        |AND table_name IN (${IntRange(1, streamIds.size).joinToString { "?" }})
        |ORDER BY table_schema, table_name, ordinal_position;
        |""".trimMargin()

stephane-airbyte (Contributor) commented:

Closing, as this should be fixed by #45422.
