Snowflake Connector - Reduce computing resources used for metadata queries #43452
base: master
Conversation
@@ -44,4 +46,6 @@ integrationTestJava {
dependencies {
    implementation 'net.snowflake:snowflake-jdbc:3.14.1'
    implementation 'org.apache.commons:commons-text:1.10.0'
    implementation 'org.json:json:20210307'
We don't need this. We already have a Jackson dependency in our dependency chain.
return JdbcDatabase.Companion.toUnsafeStream<T>(
var connection = dataSource.connection

try {
connection.use. Also, keep connection a val instead of a var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I have changed connection to a val. The code won't be changed to connection.use as we discussed, since the connection needs to remain open when the result set is returned.
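The lifetime issue behind this exchange can be sketched generically (a made-up stand-in class, not the connector's code): `use` closes the resource as soon as its block returns, which breaks any lazily-consumed sequence handed back to the caller.

```kotlin
// Hypothetical stand-in for a JDBC connection, to illustrate the lifetime issue.
class FakeConnection : AutoCloseable {
    var closed = false
        private set
    override fun close() { closed = true }

    // Rows are produced lazily, so the connection must still be open
    // at the moment the caller iterates.
    fun rows(): Sequence<Int> = sequence {
        for (i in 1..3) {
            check(!closed) { "connection used after close" }
            yield(i)
        }
    }
}

fun main() {
    val conn = FakeConnection()
    // Wrong: `use` closes the connection before the sequence is consumed.
    val lazyRows = conn.use { it.rows() }
    val failed = runCatching { lazyRows.toList() }.isFailure
    println(failed) // true: the rows cannot be read once the connection is closed
}
```

This is why the connection here is closed by the stream's cleanup path rather than wrapped in `use`.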
@@ -5,7 +5,7 @@ data:
connectorSubtype: database
connectorType: destination
definitionId: 424892c4-daac-4491-b35d-c6688ba547ba
dockerImageTag: 3.11.9
Why the change? If it's unneeded, I'd rather keep this outside of the current PR.
@@ -293,4 +293,8 @@ object SnowflakeDatabaseUtils {
        AirbyteProtocolType.UNKNOWN -> "VARIANT"
    }
}

fun fromIsNullableSnowflakeString(isNullable: String?): Boolean {
    return "true".equals(isNullable, ignoreCase = true)
remove and use String.toBoolean()
Removed the function and used String.toBoolean
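For reference, Kotlin's built-in conversion does the same case-insensitive comparison the removed helper did, and since Kotlin 1.4 there is also a nullable-receiver overload that returns false for null:

```kotlin
fun main() {
    println("TRUE".toBoolean())     // true: the comparison ignores case
    println("yes".toBoolean())      // false: only the literal "true" matches
    val isNullable: String? = null
    println(isNullable.toBoolean()) // false: nullable receiver handles null safely
}
```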
rowCount

val tableRowCountsFromShowQuery = LinkedHashMap<String, LinkedHashMap<String, Int>>()
var showColumnsResult: List<JsonNode> = listOf()
remove this and set the val inside your try block
val tableName = result["name"].asText()
val rowCount = result["rows"].asText()

tableRowCountsFromShowQuery
The indentation is super confusing here (probably enforced by our format command). Any way to change that, or is our formatter going to bark at you?
Also, you can simplify with map.computeIfAbsent(tableSchema) { LinkedHashMap() }
I think we can also use the map's withDefault, which would simplify this further
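Both suggestions above can be sketched as follows (schema and table names are made up for illustration):

```kotlin
fun main() {
    val counts = LinkedHashMap<String, LinkedHashMap<String, Int>>()

    // computeIfAbsent creates the inner map on first access, replacing an
    // explicit containsKey/put dance.
    counts.computeIfAbsent("PUBLIC") { LinkedHashMap() }["USERS"] = 42
    counts.computeIfAbsent("PUBLIC") { LinkedHashMap() }["ORDERS"] = 7
    println(counts) // {PUBLIC={USERS=42, ORDERS=7}}

    // Alternative: withDefault supplies a fallback for reads via getValue,
    // though it does not insert the default into the map.
    val withFallback = counts.withDefault { LinkedHashMap() }
    println(withFallback.getValue("MISSING").size) // 0
}
```

`computeIfAbsent` mutates the map on first access, while `withDefault` only affects reads through `getValue`, so the two are not interchangeable when the inner map is later mutated.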
)
    .isNotEmpty()

showSchemaResult = database.queryJsons(
val here. There's no need for a var. I'd expect there's no need to close the stream either
WHERE schema_name = ?
  AND catalog_name = ?;

SHOW SCHEMAS LIKE '%s' IN DATABASE "%s";
use kotlin templates
ORDER BY ordinal_position;

""".trimIndent(),
SHOW COLUMNS IN TABLE "%s"."%s"."%s";
use kotlin templates
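The %s placeholders above can be replaced with Kotlin string templates; a minimal sketch with illustrative variable names (not the connector's actual ones):

```kotlin
fun main() {
    val database = "MY_DB"
    val schema = "MY_SCHEMA"
    val table = "USERS"

    // String.format style with positional %s placeholders:
    val formatted =
        String.format("""SHOW COLUMNS IN TABLE "%s"."%s"."%s";""", database, schema, table)

    // Kotlin template style: variables are interpolated in place,
    // which avoids argument-order mistakes.
    val templated = """SHOW COLUMNS IN TABLE "$database"."$schema"."$table";"""

    println(formatted == templated) // true: both produce the same query text
}
```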
showColumnsResult = database.queryJsons(
    showColumnsQuery
)
val columnsFromShowQuery = showColumnsResult
    .stream()
remove the call to stream() and you don't need to close it
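The suggestion can be sketched generically: Kotlin collection operations work directly on the List and need no explicit closing, unlike a Java stream pipeline (the sample data below is made up):

```kotlin
import java.util.stream.Collectors

fun main() {
    val showColumnsResult = listOf("ID", "NAME", "id", "name")

    // Java-stream style (what the review asks to remove):
    val viaStream = showColumnsResult.stream()
        .map { it.uppercase() }
        .distinct()
        .collect(Collectors.toList())

    // Idiomatic Kotlin: operates directly on the List, nothing to close.
    val viaKotlin = showColumnsResult.map { it.uppercase() }.distinct()

    println(viaStream == viaKotlin) // true: same result either way
    println(viaKotlin)              // [ID, NAME]
}
```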
row["DATA_TYPE"].asText(),
row["column_name"].asText(),
//row["data_type"].asText(),
JSONObject(row["data_type"].asText()).getString("type"),
is there a difference introduced by the move to SHOW COLUMNS (similar to the call above)?
Yes, there are differences between the output of the information_schema query and the SHOW query.
row["COLUMN_NAME"].asText(),
row["DATA_TYPE"].asText(),
row["column_name"].asText(),
changeDataTypeFromShowQuery(ObjectMapper().readTree(row["data_type"].asText()).path("type").asText()),
Why the double conversion with .asText() and then readTree? row["data_type"] should already return a JsonNode, so cast it to ObjectNode if it is a struct. If the returned value is a serialized JSON string, then use Jsons.<relevantMethod> to construct the ObjectNode. Avoid re-init'ing an ObjectMapper.
yeah, this seems odd
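The "avoid re-init'ing an ObjectMapper" point can be sketched with plain Jackson (the function and sample value are illustrative, not the connector's code): ObjectMapper is safe to share once configured, so one instance serves all rows instead of constructing a new mapper per row.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper

// A single shared mapper; ObjectMapper is thread-safe after configuration,
// so there is no need to construct one per row.
private val MAPPER = ObjectMapper()

// Hypothetical helper: extracts the "type" field from a serialized data_type value.
fun dataTypeOf(serializedDataType: String): String =
    MAPPER.readTree(serializedDataType).path("type").asText()

fun main() {
    // Illustrative value only; real SHOW COLUMNS output has more fields.
    println(dataTypeOf("""{"type":"TEXT","length":16777216}""")) // TEXT
}
```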
}
return existingTablesFromShowQuery
} catch (e: SnowflakeSQLException) {
if(e.message != null && e.message!!.contains("does not exist")) {
I'm not sure we should catch the exception outside of the loop that iterates over streams.
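The restructuring this comment asks for can be sketched like so (the table names, fetch function, and exception type are hypothetical stand-ins): catching per iteration lets a single missing table be skipped without aborting the whole loop.

```kotlin
// Hypothetical sketch: per-item error handling instead of one catch around the loop.
fun collectExisting(tables: List<String>, fetch: (String) -> Int): Map<String, Int> {
    val found = LinkedHashMap<String, Int>()
    for (table in tables) {
        try {
            found[table] = fetch(table)
        } catch (e: IllegalStateException) {
            // Stand-in for SnowflakeSQLException: skip only the table that
            // "does not exist" and keep processing the rest.
            if (e.message?.contains("does not exist") != true) throw e
        }
    }
    return found
}

fun main() {
    val result = collectExisting(listOf("A", "B", "C")) { t ->
        if (t == "B") error("Table B does not exist") else t.length
    }
    println(result) // {A=1, C=1}: table B was skipped, the rest survived
}
```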
import java.nio.file.Path
import java.nio.file.Paths
this should break format
What
The Snowflake destination connector uses metadata queries to get the details of schemas, tables, and columns. Customers have highlighted that these queries add up to multiple hours of compute time per day on the Snowflake cloud. Reducing the computing resources used for metadata queries by improving how the connector fetches metadata would therefore be helpful.
For more detailed context about the issue, please see the GitHub ticket:
[destination-snowflake] executes excessive metadata queries #37311
How
To reduce the computing resources used for Snowflake metadata queries, this PR replaces the information_schema queries with SHOW queries.
After reviewing this improvement, additional optimizations may be implemented to determine the minimum number of queries needed. This PR description will be updated after the improvements to the Snowflake metadata queries are reviewed.
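The substitution can be sketched as follows (database and schema names are illustrative, and the query text is a simplified example rather than the PR's exact queries). Snowflake serves SHOW commands from its metadata layer without requiring a running warehouse, whereas information_schema queries execute on a warehouse and consume compute credits, which is where the savings come from:

```kotlin
fun main() {
    val database = "MY_DB"   // illustrative names, not from the PR
    val schema = "MY_SCHEMA"

    // information_schema query: runs on a warehouse, consuming compute credits.
    val infoSchemaQuery = """
        SELECT table_name, row_count
        FROM "$database".information_schema.tables
        WHERE table_schema = '$schema';
    """.trimIndent()

    // SHOW query: served from Snowflake's metadata layer and does not
    // require a running warehouse.
    val showQuery = """SHOW TABLES IN SCHEMA "$database"."$schema";"""

    check(infoSchemaQuery.isNotEmpty())
    println(showQuery)
}
```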
Review guide
Note: This PR is still in draft mode and is not ready for a formal review yet. It is an initial PR created to provide a preview of the type of changes being made.
User Impact
Users would see fewer information_schema queries being executed and a corresponding reduction in the computing resources consumed by metadata queries.
Can this PR be safely reverted and rolled back?