Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Extend logic for JDBC connectors to provide additional properties in JSON schema #7859

Merged
merged 10 commits into from
Nov 13, 2021

Conversation

VitaliiMaltsev
Copy link
Contributor

@VitaliiMaltsev VitaliiMaltsev commented Nov 11, 2021

What

Currently all jdbc connectors can provide only primitive json types
( STRING, NUMBER, OBJECT, ARRAY, BOOLEAN, NULL)

and json schema looks like

{ "type": "string" },
{ "type": "number" }

How

JsonSchemaPrimitive class was updated with jsonSchemaTypeMap

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Connector Generator

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


vmaltsev seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions bot added area/connectors Connector related issues area/platform issues related to the platform area/protocol labels Nov 11, 2021
@VitaliiMaltsev VitaliiMaltsev temporarily deployed to more-secrets November 11, 2021 10:19 Inactive
@VitaliiMaltsev VitaliiMaltsev temporarily deployed to more-secrets November 11, 2021 10:24 Inactive
@VitaliiMaltsev VitaliiMaltsev temporarily deployed to more-secrets November 11, 2021 10:26 Inactive
@VitaliiMaltsev VitaliiMaltsev changed the title Vmaltsev/7777 extend json schema Extend logic for JDBC connectors to provide additional properties in JSON schema Nov 11, 2021
Copy link
Contributor

@etsybaev etsybaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
But please make sure you pass all PR's check-points (for example to match the PR's name convention) and it would be good to check some destinations to make sure nothing will be broken by such schema changing.

@VitaliiMaltsev
Copy link
Contributor Author

LGTM. But please make sure you pass all PR's check-points (for example to match the PR's name convention) and it would be good to check some destinations to make sure nothing will be broken by such schema changing.

currently destinations can not handle json schemas type with keywords and just skip the "format". So everything should be as previously

@VitaliiMaltsev VitaliiMaltsev marked this pull request as ready for review November 11, 2021 10:40
@VitaliiMaltsev VitaliiMaltsev changed the title Extend logic for JDBC connectors to provide additional properties in JSON schema 🎉 Extend logic for JDBC connectors to provide additional properties in JSON schema Nov 11, 2021
@VitaliiMaltsev VitaliiMaltsev temporarily deployed to more-secrets November 11, 2021 10:47 Inactive
Copy link
Contributor

@tuliren tuliren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! Such a concise approach.

I have two questions / comments:

  1. JsonSchemaPrimitive is widely used in so many connectors. It's surprising that this change does not affect any of them. Even though theoretically no destination will be affected, it's still safer to run integration tests on a few sources and destinations.
  2. Have you thought about any alternative approach and compared them? E.g. replace JsonSchemaPrimitive with a class, which allows more flexible specifications. One concern about this approach is that to support more formats, there will be many more enums to add, and it does not handle things like pattern, max, etc. Using a class seems more flexible and extensible.

@tuliren tuliren requested a review from sherifnada November 12, 2021 08:22

STRING_DATE(ImmutableMap.of("type", "string", "format", "date")),
STRING_TIME(ImmutableMap.of("type", "string", "format", "time")),
STRING_TIMESTAMP(ImmutableMap.of("type", "string", "format", "date-time")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recommend changing this to STRING_DATETIME as it matches the json primitive it represents

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recommend changing this to STRING_DATETIME as it matches the json primitive it represents

done

@@ -4,11 +4,28 @@

package io.airbyte.protocol.models;

import com.google.common.collect.ImmutableMap;

public enum JsonSchemaPrimitive {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might make sense to call this JsonSchemaType as these no longer represent primitives

case DATE -> JsonSchemaPrimitive.STRING;
case TIME -> JsonSchemaPrimitive.STRING;
case TIMESTAMP -> JsonSchemaPrimitive.STRING;
case DATE -> JsonSchemaPrimitive.STRING_DATE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be super careful about how MySQL/Postgres/MSSQL/Oracle handle timezones. For example, from the MySQL docs:

MySQL converts TIMESTAMP values from the current time zone to UTC for storage, and back from UTC to the current time zone for retrieval. (This does not occur for other types such as DATETIME.) By default, the current time zone for each connection is the server's time. The time zone can be set on a per-connection basis. As long as the time zone setting remains constant, you get back the same value you store. If you store a TIMESTAMP value, and then change the time zone and retrieve the value, the retrieved value is different from the value you stored. This occurs because the same time zone was not used for conversion in both directions. The current time zone is available as the value of the time_zone system variable. For more information, see Section 5.1.15, “MySQL Server Time Zone Support”.

So for MySQL to correctly express timestamps we would need to set the connection timezone to UTC whenever we're reading the date types in UTC (we should also have tests for this)

wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. But currently jdbc destinations can not handle different date-time datatypes for databases and just write value as string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitaliiMaltsev

currently jdbc destinations can not handle different date-time datatypes for databases and just write value as string

I believe this is inaccurate - normalization writes appropriately typed date/time types depending on the value in the format field

@tuliren
Copy link
Contributor

tuliren commented Nov 12, 2021

@sherifnada, what do you think about changing JsonSchemaPrimitive from an enum to a class? It looks like enum won't be able to support more formats. I think eventually we need to switch to a class.

@sherifnada
Copy link
Contributor

@tuliren good idea, we can even have the current enum be a set of public static constants in that class, but still make it flexible enough to have any valid fields in an object's definition

@tuliren
Copy link
Contributor

tuliren commented Nov 13, 2021

Given the potential complicity of changing JsonSchemaPrimitive to a class, I am fine with merging this PR as is, and do that in a separate PR.

But it is still important to run integration tests on a few sources and destinations to make sure nothing is broken.

@jrhizor jrhizor temporarily deployed to more-secrets November 13, 2021 12:46 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets November 13, 2021 17:42 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets November 13, 2021 17:42 Inactive
@VitaliiMaltsev VitaliiMaltsev temporarily deployed to more-secrets November 13, 2021 19:47 Inactive
@VitaliiMaltsev
Copy link
Contributor Author

VitaliiMaltsev commented Nov 13, 2021

/test connector=connectors/source-oracle

🕑 connectors/source-oracle https://github.com/airbytehq/airbyte/actions/runs/1457260058
✅ connectors/source-oracle https://github.com/airbytehq/airbyte/actions/runs/1457260058
No Python unittests run

@jrhizor jrhizor temporarily deployed to more-secrets November 13, 2021 20:07 Inactive
@VitaliiMaltsev
Copy link
Contributor Author

VitaliiMaltsev commented Nov 13, 2021

/test connector=connectors/source-cockroachdb

🕑 connectors/source-cockroachdb https://github.com/airbytehq/airbyte/actions/runs/1457298477
✅ connectors/source-cockroachdb https://github.com/airbytehq/airbyte/actions/runs/1457298477
No Python unittests run

@jrhizor jrhizor temporarily deployed to more-secrets November 13, 2021 20:29 Inactive
@VitaliiMaltsev
Copy link
Contributor Author

Given the potential complicity of changing JsonSchemaPrimitive to a class, I am fine with merging this PR as is, and do that in a separate PR.

But it is still important to run integration tests on a few sources and destinations to make sure nothing is broken.

@tuliren i ran some tests for Oracle Source, CockroachDB Source, MySQL Source, Postgres Destination and MSSQL Destination. And also created a separate ticket for changing JsonSchemaPrimitive to a class #7944. Can i merge this PR ?

@tuliren
Copy link
Contributor

tuliren commented Nov 13, 2021

@VitaliiMaltsev, yes. Thank you for running the tests, and creating the issue. Feel free to merge it.

@VitaliiMaltsev VitaliiMaltsev merged commit 2fe927b into master Nov 13, 2021
@VitaliiMaltsev VitaliiMaltsev deleted the vmaltsev/7777-extend-json-schema branch November 13, 2021 20:56
.withSupportedSyncModes(Lists.newArrayList(SyncMode.FULL_REFRESH, SyncMode.INCREMENTAL))
.withSourceDefinedPrimaryKey(
List.of(List.of(COL_FIRST_NAME), List.of(COL_LAST_NAME)))));
}

private JsonSchemaPrimitive resolveJsonSchemaType() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the oracle test instead override this method, instead of hardcoding a connector-specific edge case in the abstract test class? also, the name of this method should be clearer as resolveJsonSchemaType does not reflect it is only concerned with date types

case DATE -> JsonSchemaPrimitive.STRING;
case TIME -> JsonSchemaPrimitive.STRING;
case TIMESTAMP -> JsonSchemaPrimitive.STRING;
case DATE -> JsonSchemaPrimitive.STRING_DATE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitaliiMaltsev

currently jdbc destinations can not handle different date-time datatypes for databases and just write value as string

I believe this is inaccurate - normalization writes appropriately typed date/time types depending on the value in the format field

Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitaliiMaltsev there's a few points we need to make sure are taken into account here:

  1. If I'm reading this correctly, the current tests don't cover all newly introduced types i.e: we must test that all 3 newly introduced types (date, time, datetime) are used appropriately. It's crucial that we maximize test coverage, otherwise customers will find issues before we do.
  2. JDBC connectors were not published in this PR, is this intended? in general whenever we make a change to a connector, we should publish it asap to make sure release notes are accurate and issues are caught quickly
  3. I don't think we resolved the discussion around timezone. It is critical that we ensure (via comprehensive testing) there are no issues here. If I'm understanding this correctly, the MySQL example I mentioned could result in inaccurate date information being persisted in the destination, because depending on the timezone used in the JDBC connection, we might be formatting the output value incorrectly, which when writing normalized tables could result in the date format being misinterpreted as UTC when it is not (which would change the actual value of the date object)

@VitaliiMaltsev VitaliiMaltsev restored the vmaltsev/7777-extend-json-schema branch November 15, 2021 08:05
@VitaliiMaltsev VitaliiMaltsev temporarily deployed to more-secrets November 15, 2021 08:07 Inactive
VitaliiMaltsev added a commit that referenced this pull request Nov 15, 2021
… properties in JSON schema (#7859)"

This reverts commit 2fe927b.
@VitaliiMaltsev
Copy link
Contributor Author

VitaliiMaltsev commented Nov 15, 2021

@VitaliiMaltsev there's a few points we need to make sure are taken into account here:

  1. If I'm reading this correctly, the current tests don't cover all newly introduced types i.e: we must test that all 3 newly introduced types (date, time, datetime) are used appropriately. It's crucial that we maximize test coverage, otherwise customers will find issues before we do.
  2. JDBC connectors were not published in this PR, is this intended? in general whenever we make a change to a connector, we should publish it asap to make sure release notes are accurate and issues are caught quickly
  3. I don't think we resolved the discussion around timezone. It is critical that we ensure (via comprehensive testing) there are no issues here. If I'm understanding this correctly, the MySQL example I mentioned could result in inaccurate date information being persisted in the destination, because depending on the timezone used in the JDBC connection, we might be formatting the output value incorrectly, which when writing normalized tables could result in the date format being misinterpreted as UTC when it is not (which would change the actual value of the date object)

@sherifnada this PR has already been merged with the master branch. Is it necessary to revert the changes in this PR before these 3 points are resolved?

@VitaliiMaltsev
Copy link
Contributor Author

@sherifnada please advise. I'm not sure - do we need to publish all jdbc connectors if changes were not in connectors code?
Changes in this PR for airbyte-protocol and airbyte-db modules

@sherifnada
Copy link
Contributor

JdbcSourceOperations (and others in this class) are leveraged by all JDBC connectors which means changing any such common class should trigger a publication of all JDBC connectors

VitaliiMaltsev added a commit that referenced this pull request Nov 15, 2021
… properties in JSON schema (#7859)" (#7969)

This reverts commit 2fe927b.
schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
…n JSON schema (airbytehq#7859)

* add date-time formats to json schema creation

* add jsonSchemaMap to enum

* fix tests and checkstyle

* remove python changes from PR

* remove star import

* Rename String_timestamp to String_Datetime

* Rename String_timestamp to String_Datetime

* fix checkstyle

* fix jdbc source tests

Co-authored-by: vmaltsev <vitalii.maltsev@globallogic.com>
schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/platform issues related to the platform area/protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend SourceOperations logic for JDBC connectors to provide additional properties in JSON schema
7 participants