🎉 Extend logic for JDBC connectors to provide additional properties in JSON schema #7859
Conversation
LGTM.
But please make sure you pass all the PR checkpoints (for example, matching the PR naming convention), and it would be good to check some destinations to make sure nothing will be broken by such a schema change.
Currently destinations cannot handle JSON schema types with keywords and just skip the "format", so everything should work as before.
This is nice! Such a concise approach.
I have two questions / comments:
- JsonSchemaPrimitive is widely used in so many connectors. It's surprising that this change does not affect any of them. Even though theoretically no destination will be affected, it's still safer to run integration tests on a few sources and destinations.
- Have you thought about any alternative approaches and compared them? E.g. replace JsonSchemaPrimitive with a class, which allows more flexible specifications. One concern about the enum approach is that to support more formats, there will be many more enums to add, and it does not handle things like pattern, max, etc. Using a class seems more flexible and extensible.
STRING_DATE(ImmutableMap.of("type", "string", "format", "date")),
STRING_TIME(ImmutableMap.of("type", "string", "format", "time")),
STRING_TIMESTAMP(ImmutableMap.of("type", "string", "format", "date-time")),
I recommend changing this to STRING_DATETIME as it matches the JSON primitive it represents.
done
@@ -4,11 +4,28 @@

package io.airbyte.protocol.models;

import com.google.common.collect.ImmutableMap;

public enum JsonSchemaPrimitive {
Might make sense to call this JsonSchemaType, as these no longer represent primitives.
case DATE -> JsonSchemaPrimitive.STRING;
case TIME -> JsonSchemaPrimitive.STRING;
case TIMESTAMP -> JsonSchemaPrimitive.STRING;
case DATE -> JsonSchemaPrimitive.STRING_DATE;
I think we need to be super careful about how MySQL/Postgres/MSSQL/Oracle handle timezones. For example, from the MySQL docs:
MySQL converts TIMESTAMP values from the current time zone to UTC for storage, and back from UTC to the current time zone for retrieval. (This does not occur for other types such as DATETIME.) By default, the current time zone for each connection is the server's time. The time zone can be set on a per-connection basis. As long as the time zone setting remains constant, you get back the same value you store. If you store a TIMESTAMP value, and then change the time zone and retrieve the value, the retrieved value is different from the value you stored. This occurs because the same time zone was not used for conversion in both directions. The current time zone is available as the value of the time_zone system variable. For more information, see Section 5.1.15, “MySQL Server Time Zone Support”.
So for MySQL to express timestamps correctly in UTC, we would need to set the connection time zone to UTC whenever we're reading these date types (we should also have tests for this).
wdyt?
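(For illustration only, not something from this PR: one way to pin the session to UTC with MySQL Connector/J is the serverTimezone connection property; the URL, credentials, and table below are placeholders.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UtcReadExample {

  public static void main(final String[] args) throws Exception {
    // serverTimezone=UTC makes Connector/J treat the session time zone as UTC when converting
    // TIMESTAMP values, so retrieved values don't shift with the server's default time zone.
    final String url = "jdbc:mysql://localhost:3306/mydb?serverTimezone=UTC";
    try (Connection connection = DriverManager.getConnection(url, "user", "password");
        Statement statement = connection.createStatement();
        ResultSet resultSet = statement.executeQuery("SELECT created_at FROM users")) {
      while (resultSet.next()) {
        // With the session pinned to UTC, the emitted value is stable across server time zones.
        System.out.println(resultSet.getTimestamp("created_at").toInstant());
      }
    }
  }

}
```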
Good point. But currently JDBC destinations cannot handle different date-time datatypes for databases and just write the value as a string.
> currently jdbc destinations can not handle different date-time datatypes for databases and just write value as string

I believe this is inaccurate - normalization writes appropriately typed date/time types depending on the value in the format field.
@sherifnada, what do you think about changing JsonSchemaPrimitive to a class?
@tuliren good idea. We can even have the current enum become a set of public static constants in that class, but still make it flexible enough to accept any valid fields in an object's definition.
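A rough sketch of what that could look like (all names here are illustrative assumptions, not the design that was merged): the common cases become public static constants, and a builder accepts arbitrary JSON Schema keywords such as pattern or maximum.

```java
import com.google.common.collect.ImmutableMap;
import java.util.HashMap;
import java.util.Map;

public class JsonSchemaType {

  // The current enum values become plain constants of the class.
  public static final JsonSchemaType STRING = builder("string").build();
  public static final JsonSchemaType STRING_DATE = builder("string").withFormat("date").build();
  public static final JsonSchemaType STRING_DATETIME = builder("string").withFormat("date-time").build();

  private final Map<String, Object> jsonSchemaTypeMap;

  private JsonSchemaType(final Map<String, Object> jsonSchemaTypeMap) {
    this.jsonSchemaTypeMap = ImmutableMap.copyOf(jsonSchemaTypeMap);
  }

  public Map<String, Object> getJsonSchemaTypeMap() {
    return jsonSchemaTypeMap;
  }

  public static Builder builder(final String type) {
    return new Builder(type);
  }

  public static class Builder {

    private final Map<String, Object> map = new HashMap<>();

    private Builder(final String type) {
      map.put("type", type);
    }

    public Builder withFormat(final String format) {
      map.put("format", format);
      return this;
    }

    // Any other valid JSON Schema keyword (pattern, maximum, ...) can be set the same way.
    public Builder with(final String keyword, final Object value) {
      map.put(keyword, value);
      return this;
    }

    public JsonSchemaType build() {
      return new JsonSchemaType(map);
    }

  }

}
```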
Given the potential complexity of changing JsonSchemaPrimitive, that can be addressed separately. But it is still important to run integration tests on a few sources and destinations to make sure nothing is broken.
/test connector=connectors/source-oracle
/test connector=connectors/source-cockroachdb
@tuliren I ran some tests for Oracle Source, CockroachDB Source, MySQL Source, Postgres Destination and MSSQL Destination, and also created a separate ticket for changing JsonSchemaPrimitive to a class: #7944. Can I merge this PR?
@VitaliiMaltsev, yes. Thank you for running the tests and creating the issue. Feel free to merge it.
.withSupportedSyncModes(Lists.newArrayList(SyncMode.FULL_REFRESH, SyncMode.INCREMENTAL))
.withSourceDefinedPrimaryKey(
List.of(List.of(COL_FIRST_NAME), List.of(COL_LAST_NAME)))));
}

private JsonSchemaPrimitive resolveJsonSchemaType() { |
Shouldn't the Oracle test override this method instead of hardcoding a connector-specific edge case in the abstract test class? Also, the name of this method should be clearer: resolveJsonSchemaType does not reflect that it is only concerned with date types.
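As a sketch of that suggestion (class and method names are assumptions, not the actual test classes): the abstract JDBC source test could expose a protected hook with a shared default, and the Oracle test would override it instead of the base class branching on the connector.

```java
import io.airbyte.protocol.models.JsonSchemaPrimitive;

// Hypothetical base test class: provides the mapping that most JDBC sources share.
abstract class AbstractJdbcSourceDateTypeTest {

  protected JsonSchemaPrimitive expectedDateType() {
    return JsonSchemaPrimitive.STRING_DATE;
  }

}

// The connector-specific expectation lives in the Oracle test, not in the base class.
// (Oracle's DATE type also stores a time component, which is why it may differ.)
class OracleSourceDateTypeTest extends AbstractJdbcSourceDateTypeTest {

  @Override
  protected JsonSchemaPrimitive expectedDateType() {
    return JsonSchemaPrimitive.STRING_DATETIME;
  }

}
```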
@VitaliiMaltsev there are a few points we need to make sure are taken into account here:
- If I'm reading this correctly, the current tests don't cover all newly introduced types, i.e. we must test that all three newly introduced types (date, time, datetime) are used appropriately. It's crucial that we maximize test coverage, otherwise customers will find issues before we do.
- JDBC connectors were not published in this PR; is this intended? In general, whenever we make a change to a connector, we should publish it as soon as possible to make sure release notes are accurate and issues are caught quickly.
- I don't think we resolved the discussion around timezones. It is critical that we ensure (via comprehensive testing) there are no issues here. If I'm understanding this correctly, the MySQL example I mentioned could result in inaccurate date information being persisted in the destination: depending on the timezone used in the JDBC connection, we might format the output value incorrectly, and when writing normalized tables the date could be misinterpreted as UTC when it is not (which would change the actual value of the date object).
@sherifnada this PR has already been merged into the master branch. Is it necessary to revert the changes in this PR before these 3 points are resolved?
@sherifnada please advise. I'm not sure - do we need to publish all JDBC connectors if the changes were not in the connectors' code?
…n JSON schema (airbytehq#7859)
* add date-time formats to json schema creation
* add jsonSchemaMap to enum
* fix tests and checkstyle
* remove python changes from PR
* remove star import
* Rename String_timestamp to String_Datetime
* Rename String_timestamp to String_Datetime
* fix checkstyle
* fix jdbc source tests

Co-authored-by: vmaltsev <vitalii.maltsev@globallogic.com>
… properties in JSON schema (airbytehq#7859)" (airbytehq#7969)
This reverts commit 2fe927b.
What
Currently all JDBC connectors can provide only primitive JSON types (STRING, NUMBER, OBJECT, ARRAY, BOOLEAN, NULL), so the JSON schema looks like
{ "type": "string" },
{ "type": "number" }
How
The JsonSchemaPrimitive enum was updated with a jsonSchemaTypeMap, so each value now carries its full JSON schema definition (a type plus an optional format).
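Based on the diff fragments quoted in the review above (and the STRING_TIMESTAMP to STRING_DATETIME rename requested there), the updated enum looks roughly like the sketch below; the accessor name and the exact set of base constants are assumptions.

```java
package io.airbyte.protocol.models;

import com.google.common.collect.ImmutableMap;
import java.util.Map;

public enum JsonSchemaPrimitive {

  // Existing primitive JSON types, now backed by their schema fragment.
  STRING(ImmutableMap.of("type", "string")),
  NUMBER(ImmutableMap.of("type", "number")),
  OBJECT(ImmutableMap.of("type", "object")),
  ARRAY(ImmutableMap.of("type", "array")),
  BOOLEAN(ImmutableMap.of("type", "boolean")),
  NULL(ImmutableMap.of("type", "null")),
  // New values that add a "format" keyword on top of the primitive type.
  STRING_DATE(ImmutableMap.of("type", "string", "format", "date")),
  STRING_TIME(ImmutableMap.of("type", "string", "format", "time")),
  STRING_DATETIME(ImmutableMap.of("type", "string", "format", "date-time"));

  private final Map<String, String> jsonSchemaTypeMap;

  JsonSchemaPrimitive(final Map<String, String> jsonSchemaTypeMap) {
    this.jsonSchemaTypeMap = jsonSchemaTypeMap;
  }

  public Map<String, String> getJsonSchemaTypeMap() {
    return jsonSchemaTypeMap;
  }

}
```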