Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake Connector - Reduce computing resources used for metadata queries #43452

Open
wants to merge 76 commits into
base: master
Choose a base branch
from

Conversation

Vee7574
Copy link
Contributor

@Vee7574 Vee7574 commented Aug 9, 2024

What

The Snowflake destination connector uses metadata queries to get the details of schemas, tables and columns. Customers have highlighted that the total computing resources used by these queries are adding up to multiple hours of compute time used per day on Snowflake cloud. It would be helpful to reduce the computing resources used for metadata queries by improving the way the connector gets metadata.

For more detailed context about the issue, please check the github ticket:

[destination-snowflake] executes excessive metadata queries #37311
#37311

How

To reduce the computing resources used for snowflake metadata queries, this PR includes changes to use SHOW queries to replace the information_schema queries.

After reviewing the improvement, additional optimizations may be implemented to review the minimum number of queries needed. This PR description will get updated after reviewing the improvements in the snowflake metadata queries.

Review guide

Note: This PR is still in draft mode, it is not ready for a formal review yet. Creating an initial PR to provide a preview into the type of changes being done.

User Impact

Users would see a reduction in the number of information_schema queries being executed and also see a reduction in the amount of computing resources used by metadata queries.

Can this PR be safely reverted and rolled back?

  • [ YES] YES 💚
  • NO ❌

@Vee7574 Vee7574 requested a review from a team as a code owner August 9, 2024 19:21
Copy link

vercel bot commented Aug 9, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Sep 6, 2024 7:03pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Aug 19, 2024
@@ -44,4 +46,6 @@ integrationTestJava {
dependencies {
implementation 'net.snowflake:snowflake-jdbc:3.14.1'
implementation 'org.apache.commons:commons-text:1.10.0'
implementation 'org.json:json:20210307'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this. We already have Jackson dependency in our dependency chain.

return JdbcDatabase.Companion.toUnsafeStream<T>(
var connection = dataSource.connection

try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connection.use. Also, keep connection a val instead of a var

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I have changed the connection to a val. The code won't be changed to connection.use as we discussed since the connection needs to be open when the result set is returned

@@ -5,7 +5,7 @@ data:
connectorSubtype: database
connectorType: destination
definitionId: 424892c4-daac-4491-b35d-c6688ba547ba
dockerImageTag: 3.11.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the change? If unneeded, I'd rather keep this out side of the current PR

@@ -293,4 +293,8 @@ object SnowflakeDatabaseUtils {
AirbyteProtocolType.UNKNOWN -> "VARIANT"
}
}

fun fromIsNullableSnowflakeString(isNullable: String?): Boolean {
return "true".equals(isNullable, ignoreCase = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove and use String.toBoolean()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the function and used String.toBoolean

rowCount

val tableRowCountsFromShowQuery = LinkedHashMap<String, LinkedHashMap<String, Int>>()
var showColumnsResult: List<JsonNode> = listOf()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this and set the val inside your try block

val tableName = result["name"].asText()
val rowCount = result["rows"].asText()

tableRowCountsFromShowQuery
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation is super confusing here (probably enforced by our format command). Any way to change that, or is our formatter going to bark at you?
Also, you can simplify with map.computeIfAbsent(tableSchema) { LinkedHashMap() }

Copy link
Contributor

@stephane-airbyte stephane-airbyte Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also use a linkedlist.withDefault, which would simplify this further

)
.isNotEmpty()

showSchemaResult = database.queryJsons(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val here. There's no need for a var. I'd expect there's no need to close the stream either

WHERE schema_name = ?
AND catalog_name = ?;

SHOW SCHEMAS LIKE '%s' IN DATABASE "%s";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use kotlin templates

ORDER BY ordinal_position;

""".trimIndent(),
SHOW COLUMNS IN TABLE "%s"."%s"."%s";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use kotlin templates

showColumnsResult = database.queryJsons(
showColumnsQuery
)
val columnsFromShowQuery = showColumnsResult
.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the call to stream() and you don't need to close it

row["DATA_TYPE"].asText(),
row["column_name"].asText(),
//row["data_type"].asText(),
JSONObject(row["data_type"].asText()).getString("type"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a difference introduced by the move to SHOW COLUMNS (similar to the call above)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there are differences in the output of the information_schema query and SHOW query

row["COLUMN_NAME"].asText(),
row["DATA_TYPE"].asText(),
row["column_name"].asText(),
changeDataTypeFromShowQuery(ObjectMapper().readTree(row["data_type"].asText()).path("type").asText()),
Copy link
Contributor

@gisripa gisripa Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the double conversion with .asText() and then readTree ? row["data_type"] should already return a JsonNode so cast it to ObjectNode if it a struct. If the returned value is a serialized Json string, then use Jsons.<relevantMethod> to construct ObjectNode. avoid re-init'ing an ObjectMapper

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this seems odd

}
return existingTablesFromShowQuery
} catch (e: SnowflakeSQLException) {
if(e.message != null && e.message!!.contains("does not exist")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should catch the exception outside of the loop that iterates over streams.

import java.nio.file.Path
import java.nio.file.Paths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should break format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants