Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-mysql: datatype fixes #49918

Merged
merged 3 commits into from
Dec 20, 2024
Merged

Conversation

postamar
Copy link
Contributor

@postamar postamar commented Dec 18, 2024

What

This PR tries to get source-mysql's datatypes correct by leveraging the improved MySqlDatatypeIntegrationTest.

How

n/a

Review guide

n/a

User Impact

Source-mysql will handle some datatypes differently:

  • very large DECIMAL values are now rendered properly without loss of precision
  • BIT(n) where n > 1 are now properly handled as byte arrays instead of as integers
  • FLOAT(?), FLOAT(?,?), DOUBLE(?), DOUBLE(?,?) might render very slightly differently

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Dec 18, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 20, 2024 5:35pm

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/source/mysql area/documentation Improvements or additions to documentation labels Dec 18, 2024
with("${key}.type", converterClass.canonicalName)
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntactic sugar for registering converters

airbyte-integrations/connectors/source-mysql/metadata.yaml Outdated Show resolved Hide resolved
MysqlType.ENUM,
MysqlType.SET -> StringFieldType
MysqlType.JSON -> JsonStringFieldType
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

were json values always rendered as strings? or is this a recent regression?

Copy link
Contributor

@rodireich rodireich Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in old mysql's MySqlSourceOperations.copyToJsonField:

case … JSON, … -> putString(json, columnName, resultSet, colIndex);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that we should fix this in a different PR

}
MysqlType.VECTOR,
MysqlType.UNKNOWN,
null -> PokemonFieldType
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might as well list all values explicitly for safety

// This is to make sure that numbers are represented as strings.
.with("decimal.handling.mode", "string")
// This is to make sure that temporal data is represented without loss of precision.
.with("time.precision.mode", "adaptive_time_microseconds")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the default value for this setting, but I prefer to declare it explicitly

@@ -373,6 +391,10 @@ class MySqlDebeziumOperations(
// This to make sure that binary data represented as a base64-encoded String.
// https://debezium.io/documentation/reference/2.2/connectors/mysql.html#mysql-property-binary-handling-mode
.with("binary.handling.mode", "base64")
// This is to make sure that numbers are represented as strings.
.with("decimal.handling.mode", "string")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wraps large numbers in quotes, which will get turned into BigDecimal again by the snippet I added further above

@@ -175,7 +174,8 @@ object MySqlDatatypeTestOperations :
val dateTimeValues =
mapOf(
"'2024-09-13 14:30:00'" to """"2024-09-13T14:30:00.000000"""",
"'2024-09-13T14:40:00+00:00'" to """"2024-09-13T14:40:00.000000""""
"'2024-09-13T14:40:00+00:00'" to """"2024-09-13T14:40:00.000000"""",
"'1752-09-01 14:30:00'" to """"1752-09-01T14:30:00.000000"""",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have enough edge cases in these temporal test cases.
In any case mysql doesn't do BC DATEs, and doesn't do pre-year-1000 DATETIMEs and doesn't do TIMESTAMPs before unix epoch 0.

@postamar postamar force-pushed the postamar/cute-debezium-converters branch from b464f37 to 2e13b9a Compare December 18, 2024 17:13
@postamar postamar changed the base branch from master to postamar/fix-json-deserialization-bug-in-cdk December 18, 2024 17:13
@postamar postamar force-pushed the postamar/fix-json-deserialization-bug-in-cdk branch 2 times, most recently from 66a505d to 233cabf Compare December 18, 2024 18:32
@postamar postamar force-pushed the postamar/cute-debezium-converters branch 2 times, most recently from e1a82df to 4576060 Compare December 18, 2024 18:38
@postamar postamar changed the base branch from postamar/fix-json-deserialization-bug-in-cdk to master December 18, 2024 18:40
@postamar postamar force-pushed the postamar/cute-debezium-converters branch from 4576060 to 7a9c03f Compare December 18, 2024 19:48
@postamar postamar changed the base branch from master to postamar/fix-json-deserialization-bug-in-cdk December 18, 2024 19:48
@postamar postamar changed the base branch from postamar/fix-json-deserialization-bug-in-cdk to postamar/cdk-cute-debezium-converters December 18, 2024 19:54
@postamar postamar changed the title bulk-cdk-extract,source-mysql: better CustomConverter implementations source-mysql: datatype fixes Dec 18, 2024
@postamar postamar force-pushed the postamar/cute-debezium-converters branch from 7a9c03f to ebbbb9d Compare December 18, 2024 22:57
@postamar postamar force-pushed the postamar/cdk-cute-debezium-converters branch from fd09283 to 1805e08 Compare December 18, 2024 22:58
@octavia-squidington-iii octavia-squidington-iii removed the area/documentation Improvements or additions to documentation label Dec 18, 2024
@postamar postamar marked this pull request as ready for review December 19, 2024 12:09
@postamar postamar requested a review from a team as a code owner December 19, 2024 12:09
@postamar postamar marked this pull request as draft December 19, 2024 21:01
@postamar postamar force-pushed the postamar/cdk-cute-debezium-converters branch from 1805e08 to 3f30b8c Compare December 19, 2024 21:14
@postamar postamar force-pushed the postamar/cute-debezium-converters branch from ebbbb9d to 0aa2030 Compare December 19, 2024 21:21
@octavia-squidington-iii octavia-squidington-iii added area/documentation Improvements or additions to documentation and removed CDK Connector Development Kit labels Dec 19, 2024
@postamar postamar force-pushed the postamar/cdk-cute-debezium-converters branch from 3f30b8c to 9818c9a Compare December 19, 2024 21:25
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Dec 19, 2024
@postamar postamar marked this pull request as ready for review December 19, 2024 22:09
Base automatically changed from postamar/cdk-cute-debezium-converters to master December 20, 2024 00:27
data.set<JsonNode>(field.id, Jsons.readTree(textNode.textValue()))
}
else -> continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like a bad abstraction here, the fact that we (the implementor of DebeziumOperations) have to iterate through the schema types. Just some food for thoughts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not great. Perhaps Stream should have pre-computed groupings of Fields per Airbyte type. Or per FieldType. Not sure.

@postamar
Copy link
Contributor Author

@rodireich I looked into the debezium-connector-mysql code and all of the weird edge cases in the legacy converter logic apply only to debezium snapshots, as far as I can tell, which we just don't do. When using debezium in regular, non-snapshot mode the Debezium engine stays on the happy path with no weird java types.

@postamar postamar force-pushed the postamar/cute-debezium-converters branch from e81d3dc to 38639e5 Compare December 20, 2024 14:45
@octavia-squidington-iii octavia-squidington-iii removed the CDK Connector Development Kit label Dec 20, 2024
@postamar
Copy link
Contributor Author

@rodireich actually there is one edge case which this PR doesn't handle correctly: default values! I'm working on adding test cases for these.

@postamar postamar force-pushed the postamar/cute-debezium-converters branch from 38639e5 to 1ec2728 Compare December 20, 2024 17:29
@postamar
Copy link
Contributor Author

I added test cases for default values, which proved very useful.

@postamar postamar enabled auto-merge (squash) December 20, 2024 17:33
@postamar postamar merged commit 1d6b34c into master Dec 20, 2024
30 checks passed
@postamar postamar deleted the postamar/cute-debezium-converters branch December 20, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/mysql
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants