Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 JDBC sources: emit cursor counts #15535

Merged
merged 49 commits into from
Oct 14, 2022
Merged

🎉 JDBC sources: emit cursor counts #15535

merged 49 commits into from
Oct 14, 2022

Conversation

tuliren
Copy link
Contributor

@tuliren tuliren commented Aug 11, 2022

What

🚨 User Impact 🚨

  • This PR fixes a bug when new data is inserted at the same time as the sync starts, (non-dedup) incremental syncs can miss the records whose cursor value fall on the boundary.
  • The downside is such incremental syncs will now duplicate records with the boundary cursor values.

Review order

  • db_models.yaml
  • AbstractJdbcSource.java
  • AbstractJdbcSourceTest.java - specifically testIncrementalWithConcurrentInsertion
  • StateManager.java and implementations of the abstract state manager
  • StateDecoratingIterator.java
  • The reset of the changes

TODO

  • Switch count field in CursorInfo to long to avoid the type conversion.
  • Update changelogs and mark this change as unpublished.
  • Update MS SQL which currently does not sort incremental streams based on the cursor field (because it overrides the query method).
  • Think about more unit tests.

Integration tests

  • source-scaffold-java-jdbc
    • Already broken. Will not fix.
  • source-tidb
  • source-jdbc
  • source-mongodb-v2
  • source-mongodb-strict-encrypt
  • source-clickhouse
  • source-clickhouse-strict-encrypt
  • source-mysql
  • source-mysql-strict-encrypt
  • source-mssql
  • source-mssql-strict-encrypt
  • source-db2
  • source-db2-strict-encrypt
  • source-postgres
  • source-postgres-strict-encrypt
  • source-alloydb
  • source-alloydb-strict-encrypt
  • source-oracle
  • source-oracle-strict-encrypt
  • source-cockroachdb
  • source-cockroachdb-strict-encrypt
  • source-bigquery
  • source-redshift
  • source-snowflake

@tuliren tuliren force-pushed the liren/emit-record-count branch from 16b9263 to c80eaee Compare August 11, 2022 22:11
@tuliren
Copy link
Contributor Author

tuliren commented Aug 12, 2022

/test connector=connectors/source-postgres

🕑 connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/2843898744
✅ connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/2843898744
No Python unittests run

Build Passed

Test summary info:

All Passed

@tuliren tuliren force-pushed the liren/emit-record-count branch from 3a2d8c9 to 7b2f949 Compare October 10, 2022 13:20
@airbytehq airbytehq deleted a comment from github-actions bot Oct 10, 2022
@tuliren tuliren temporarily deployed to more-secrets October 14, 2022 06:50 Inactive
@tuliren
Copy link
Contributor Author

tuliren commented Oct 14, 2022

/test connector=connectors/source-oracle-strict-encrypt

🕑 connectors/source-oracle-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3247985833
✅ connectors/source-oracle-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3247985833
No Python unittests run

Build Passed

Test summary info:

All Passed

@airbytehq airbytehq deleted a comment from github-actions bot Oct 14, 2022
@tuliren
Copy link
Contributor Author

tuliren commented Oct 14, 2022

/publish connector=connectors/source-postgres run-tests=false

🕑 Publishing the following connectors:
connectors/source-postgres
https://github.com/airbytehq/airbyte/actions/runs/3248058630


Connector Did it publish? Were definitions generated?
connectors/source-postgres

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@tuliren
Copy link
Contributor Author

tuliren commented Oct 14, 2022

/publish connector=connectors/source-postgres-strict-encrypt run-tests=false

🕑 Publishing the following connectors:
connectors/source-postgres-strict-encrypt
https://github.com/airbytehq/airbyte/actions/runs/3248077130


Connector Did it publish? Were definitions generated?
connectors/source-postgres-strict-encrypt

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@tuliren
Copy link
Contributor Author

tuliren commented Oct 14, 2022

/test connector=connectors/source-mssql

🕑 connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/3248121570
✅ connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/3248121570
No Python unittests run

Build Passed

Test summary info:

All Passed

@tuliren tuliren merged commit 888347a into master Oct 14, 2022
@tuliren tuliren deleted the liren/emit-record-count branch October 14, 2022 08:09
letiescanciano added a commit that referenced this pull request Oct 14, 2022
…vation

* master: (98 commits)
  🐛 Source Bing Ads - Fix Campaigns stream misses Audience and Shopping (#17873)
  Source S3 - fix schema inference (#17991)
  🎉 JDBC sources: store cursor record count in db state (#15535)
  Introduce webhook configs into workspace api and persistence (#17950)
  ci: upload test results to github for analysis (#17953)
  Trigger the connectors build if there are worker changes. (#17976)
  Add additional sync timing information (#17643)
  Use page_token_option instead of page_token (#17892)
  capture metrics around json messages size (#17973)
  🐛 Correct kube annotations variable as per the docs. (#17972)
  🪟 🎉 Add /connector-builder page with embedded YAML editor (#17482)
  fix `est_num_metrics_emitted_by_reporter` not being emitted (#17929)
  Update schema dumps (#17960)
  Remove the bump in the value.yml (#17959)
  Ensure database initialization in test container (#17697)
  Remove typo line from incremental reads docs (#17920)
  DocS: Update authentication.md (#17931)
  Use MessageMigration for Source Connection Check. (#17656)
  fixed links (#17949)
  remove usages of YamlSeedConfigPersistence (#17895)
  ...
@gosusnp gosusnp mentioned this pull request Oct 14, 2022
jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022
* Add cursor_record_count to db stream state

* Add cursor record count to cursor info

* Emit max cursor record count

* Add original cursor record count

* Unify logging format

* Add backward compatible methods

* Update unit tests for state decorating iterator

* Update test (not done yet)

* Fix one more unit test

* Change where clause operator according to record count

* Add branch for null cursor

* Skip saving record count when it is 0

* Fix log wording

* Set mock record count in test

* Check cursor value instead of cursor info

* Fix source jdbc test

* Read record count from state

* Fix tests

* Add an acceptance test case

* Fix npe

* Change record count from int to long to avoid type conversion

* Fix references

* Fix oracle container

* Use uppercase for snowflake

* Use uppercase for db2

* Fix and use uppercase

* Update test case to include the edge case

* Format code

* Remove extra assertion in clickhouse

* Merge ms sql incremental query method

* Log query for debugging

* Clean up name_and_timestamp table

* Fix db2 tests

* Fix mssql tests

* Fix oracle tests

* Fix oracle tests

* Fix cockroachdb tests

* Fix snowflake tests

* Add changelog

* Fix mssql tests

* Fix db2-strict-encrypt tests

* Fix oracle-strict-encrypt tests

* Bump postgres version

* Fix oracle-strict-encrypt tests

* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
@Kopiczek
Copy link

Kopiczek commented Nov 3, 2022

@jdbranham I think this missed a bunch of version updates? I only see an update to postgres. Meaning all the other connectors are actually missing this change until a new version is deployed?

@Kopiczek
Copy link

Kopiczek commented Nov 3, 2022

@tuliren ^

@tuliren
Copy link
Contributor Author

tuliren commented Nov 3, 2022

all the other connectors are actually missing this change until a new version is deployed?

@Kopiczek Yes. All the other connectors are tested, but not published. The change will be available with the next version bump.

Do you need this for another connector? We can publish that connector immediately.

This is intentional because I think this record count is not a critical feature. It is only useful for a subset of an edge case.

  • The edge case is that when some records are inserted into the database at the same time as the sync kicks off, and the insertion timestamp is used as the cursor.
  • It only solves a subset of the edge case because if there are deletions together with insertions, the solution won't work.

So at the end of the day, only a small number of users will benefit from it, and does not worth the effort of publishing all the connectors.

To guarantee zero data loss, CDC is a better way.

@Kopiczek
Copy link

Kopiczek commented Nov 3, 2022

@tuliren
Yes, we would need this for: BigQuery, Snowflake and RedShift connectors
We have some unfortunate tables that are syncs that have a very bad cursor field (some with hour granularity and some even a DATE rather then DATETIME). We currently support those doing a more sparse full syncs on them rather then incremental ones (since current versions simply skip data on cursor clash) but want to move to incremental as it is cheaper.

@tuliren
Copy link
Contributor Author

tuliren commented Nov 4, 2022

@Kopiczek, got it. Thank you for this feedback. So this feature may be more useful than I thought. I am publishing these connectors in #18954 and #18955.

akashkulk added a commit that referenced this pull request Nov 10, 2022
Changes were made in this PR to fix a bug when new data is inserted at the same time the sync starts. However, they were never published

#15535
akashkulk added a commit that referenced this pull request Nov 10, 2022
* Bump up snowflake source version

Changes were made in this PR to fix a bug when new data is inserted at the same time the sync starts. However, they were never published

#15535

* auto-bump connector version

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
akashkulk added a commit that referenced this pull request Dec 2, 2022
* Bump up snowflake source version

Changes were made in this PR to fix a bug when new data is inserted at the same time the sync starts. However, they were never published

#15535

* auto-bump connector version

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment