Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛Source-mssql: added support for some datatypes #14449

Conversation

etsybaev
Copy link
Contributor

@etsybaev etsybaev commented Jul 6, 2022

What

Currently, we have 2 different classes for integration datatype tests CdcMssqlSourceDatatypeTest and MssqlSourceDatatypeTest. They have the same input args, but differ "expected result" for some types.
Some of Cdc types were not supported at all.

How

Aligned tests for normal and CDC datatype tests to make those tests ready for merging.
This PR is basically the first part of splitting this #14379 into 2 PRs as it had been requested. Once it appears to be merged, I will create a second PR for tests merging to have the same tests for both sync ways (to verify that output is the same for both).

Normal sync:

  • Aligned time format to what we have in other connectors.

CDC:

  • Added proper handling: binary, varbinary, geometry, geography,
  • Fixed: time, datetimeoffset,

Recommended reading order

  1. airbyte-integrations/bases/debezium-v1-4-2/src/main/java/io/airbyte/integrations/debezium/internals/MSSQLConverter.java

🚨 User Impact 🚨

Shouldn't introduce any breaking changes.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

Tests

Unit

Put your unit tests output here.

Integration

Put your integration tests output here.

Acceptance

Put your acceptance tests output here.

@etsybaev etsybaev requested a review from grishick July 6, 2022 14:13
@github-actions github-actions bot added the area/connectors Connector related issues label Jul 6, 2022
@etsybaev
Copy link
Contributor Author

etsybaev commented Jul 6, 2022

/test connector=connectors/source-mssql

🕑 connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/2623415776
✅ connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/2623415776
No Python unittests run

Build Passed

Test summary info:

All Passed

Copy link
Contributor

@grishick grishick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify which data types are added in this PR as part of the commit message and PR description.

I see that you removed trailing "0"s from one of the tests - why?

@etsybaev etsybaev temporarily deployed to more-secrets July 7, 2022 08:37 Inactive
@etsybaev etsybaev requested a review from grishick July 7, 2022 08:48
@etsybaev
Copy link
Contributor Author

etsybaev commented Jul 7, 2022

@grishick the initial idea was to do not transform the user's data, i.e. if user had time with millis - then we would also add it, but if user's data was without millis - then also not add it. But I got your concern and updated PR to have millis by default (re-used pattern that we already use in other connectors).

Also updated description as per your request.
Thanks

@etsybaev
Copy link
Contributor Author

etsybaev commented Jul 7, 2022

/test connector=connectors/source-mssql

🕑 connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/2628350936
✅ connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/2628350936
No Python unittests run

Build Passed

Test summary info:

All Passed

@gsmith-schlesinger
Copy link

Hey, so this is actually related to work I'm currently trying to do in a fork.

I know your PR doesn't capture "MONEY", "NUMERIC", and "DECIMAL". But could we tack those changes in?

#12949

private final Set<String> BINARY = Set.of("VARBINARY", "BINARY");
private static final String DATETIMEOFFSET = "DATETIMEOFFSET";
private static final String TIME_TYPE = "TIME";
private static final String SMALLMONEY_TYPE = "SMALLMONEY";
Copy link

@gsmith-schlesinger gsmith-schlesinger Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static final String SMALLMONEY_TYPE = "SMALLMONEY";
private final Set<String> DECIMAL_TYPES = Set.of("SMALLMONEY", "MONEY", "NUMERIC", "DECIMAL");

and change the if else to use contains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gsmith-schlesinger .
Could you please clarify what's wrong with those types?
If we update it a is then some dataType tests started failing.
Also if we update something for CDD migration we need to make sure we do the same for normal sync. In the nearest future we'll have the same tests for both Normal and CDC syncs.

Ex.:
org.opentest4j.AssertionFailedError: Returned value '999.0' from stream dbo_1_decimal is not in the expected list: [999.00, 5.10, 0.00, null] ==> expected:

Thanks

Copy link

@gsmith-schlesinger gsmith-schlesinger Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently as mentioned in #12949 and from this issues #5609 from are exactly the same issue we're currently experiencing the difference is the column type from the database we're reading from is use a MONEY.

There's a PR which refers to this area of code that "fixed" the issue for SMALLMONEY but it didn't encapsulate all decimal type columns from MSSQL.

You'll see the error logs in both tickets, and those error logs are the same thing we're seeing in ours.

Failed to convert JSON to Avro: Could not evaluate union, field CTR is expected to be one of these: NULL, DOUBLE. If this is a complex type, check if offending field (path: CTR) adheres to schema: 0.00

There's a workaround using CSV or JSON. But that's not sufficient for our use case.

I could be wrong, that this isn't the right area of the code. But I'm still attempting to get airbyte running in and fighting intellij to verify this "fixes" our issue.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should note as well this issue comes up only on CDC / Incremental airbyte tasks. The Full Refresh / Overwrite task works successfully.

Copy link
Contributor Author

@etsybaev etsybaev Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have any PR where you've fixed and aligned it with regular sync?
Just asking as this particular PR basically is the first step of merging tests for normal and CDC sync's to make sure that behavior (output) are the same - #14379.
I just had been asked to split it into 2 different PRs.

Thanks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not yet, but I'll honestly have to wait for your PR to be merged before I can submit mine anyways as you've made a lot of changes I was already planning on making in my own fork. I only started working towards this two days ago.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main idea here is that if you change anything in CDC then you also need to make sure that normal sync will return exactly the same. Unfortunately, there are completely different approaches used here, so it may be pretty tricky to alight it. Especially when it comes to alight zero decimal values (ex. 10 vs 10.0 vs 10.00 etc)

@@ -34,11 +46,61 @@ public void converterFor(final RelationalColumn field,
registerDate(field, registration);
} else if (SMALLMONEY_TYPE.equalsIgnoreCase(field.typeName())) {
Copy link

@gsmith-schlesinger gsmith-schlesinger Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (SMALLMONEY_TYPE.equalsIgnoreCase(field.typeName())) {
} else if (DECIMAL_TYPES.contains(field.typeName().toUpperCase())) {

@grishick grishick requested a review from subodh1810 July 9, 2022 00:00
@etsybaev etsybaev requested a review from a team as a code owner July 11, 2022 06:38
@etsybaev etsybaev temporarily deployed to more-secrets July 11, 2022 06:41 Inactive
Copy link
Contributor

@subodh1810 subodh1810 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 versions of MSSQLConverter, can you replicate your changes to airbyte-integrations/bases/debezium-v1-9-2/src/main/java/io/airbyte/integrations/debezium/internals/MSSQLConverter.java as well.

Also please take a look at the issue #14628, we want to avoid mutating the data types in any way possible so please verify that its not happening in this PR

@@ -391,32 +391,27 @@ protected void initTests() {
.sourceType("datetimeoffset")
.airbyteType(JsonSchemaType.STRING)
.addInsertValues("'0001-01-10 00:00:00 +01:00'", "'9999-01-10 00:00:00 +01:00'", "null")
.addExpectedValues("0001-01-10 00:00:00 +01:00", "9999-01-10 00:00:00 +01:00", null)
.addExpectedValues("0001-01-10 00:00:00.0000000 +01:00",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we expect to have milliseconds when insert values do not have them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was your concern in the comment above #14449 (review)
Previously we had millis in any case. I removed it for cases when source didn't have it. I thought you had a concern about it so I added it back, but aligned it to the format that we have for other connectors. And also aligned between normal and CDC cync. Because for now we have different behavior even inside mssql connector. If it's not OK, kindly ask to specify your expected behavior. Thanks

@@ -225,7 +225,7 @@ protected void initTests() {
.sourceType("time")
.airbyteType(JsonSchemaType.STRING)
.addInsertValues("null", "'13:00:01'", "'13:00:04Z'", "'13:00:04.123456Z'")
.addExpectedValues(null, "13:00:01.0000000", "13:00:04.0000000", "13:00:04.1234560")
.addExpectedValues(null, "13:00:01.000000", "13:00:04.000000", "13:00:04.123456")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - is it the correct behavior to expect milliseconds when inserting data w/o milliseconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was your concern in the comment above #14449 (review)
Previously we had millis in any case. I removed it for cases when source didn't have it. I thought you had a concern about it so I added it back, but aligned it to the format that we have for other connectors. If it's not OK, kindly ask to specify your expected behavior. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a misunderstanding. I wasn't expressing a concern, but asking for the underlying logic/reason for removing trailing zeros in some of the test cases. Generally speaking, we should not be changing source data. If source has time w/o millis we should not be adding them, if source has time with millis, we should not be removing them.

@@ -225,7 +225,7 @@ protected void initTests() {
.sourceType("time")
.airbyteType(JsonSchemaType.STRING)
.addInsertValues("null", "'13:00:01'", "'13:00:04Z'", "'13:00:04.123456Z'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we expect to lose "Z" - is this the correct behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just aligned it to what we already use in other connectors (re-used existing time format). Should I add an additional format to add it back? Thanks

@grishick grishick requested a review from subodh1810 July 13, 2022 18:50
@etsybaev etsybaev temporarily deployed to more-secrets July 14, 2022 14:55 Inactive
@@ -381,8 +381,8 @@ protected void initTests() {
TestDataHolder.builder()
.sourceType("time")
.airbyteType(JsonSchemaType.STRING)
.addInsertValues("'00:00:00.0000000'", "'23:59:59.9999999'", "'00:00:00'", "'23:58'", "null")
.addExpectedValues("00:00:00", "23:59:59.9999999", "00:00:00", "23:58:00", null)
.addInsertValues("'00:00:00.000000'", "'23:59:59.999999'", "'00:00:00'", "'23:58'", "null")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'm missing something - do we no longer support 7-decimal-digit precision in timestamps?

@duytp
Copy link

duytp commented Jul 15, 2022

@gsmith-schlesinger, you save my days.
I got the same errors when replicating MSSQL to S3 (incremental | append).

Your sugesstion resovled the issue.


Additional Failure Information: tech.allegro.schema.json2avro.converter.AvroConversionException: Failed to convert JSON to Avro: Could not evaluate union, field TotalAmount is expected to be one of these: NULL, DOUBLE. If this is a complex type, check if offending field (path: TotalAmount) adheres to schema: 0.00

A PR is recommended.

@etsybaev etsybaev deleted the etsybaev/13606-source-mssql-cdc-added-support-for-some-datatypes branch December 10, 2022 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants