Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDK: AbstractSource.read() skips syncing stream if its unavailable (add AvailabilityStrategy concept) #19977

Merged
merged 51 commits into from
Dec 12, 2022

Conversation

erohmensing
Copy link
Contributor

@erohmensing erohmensing commented Dec 1, 2022

What

How

  • Adds a check_availability method to the Stream class to check availability
  • Adds an AvailabilityStrategy class to the Stream class to perform this check
  • Adds an HttpAvailabilityStrategy class to the HTTPStream class, and sets it as the default AvailabilityStrategy for this class
    • This strategy tries to read the first record, excepts a 403:Forbidden HTTPError, and returns that the stream is unavailable, why, and where to learn more to fix this issue
  • Adds logic to AbstractSource.read() to skip syncing the stream if the stream is unavailable
  • Updates CheckStream ConnectionChecker class to first try using the availability strategy, if there is one

Recommended reading order

Implementation:

  1. airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py
  2. airbyte-cdk/python/airbyte_cdk/sources/streams/core.py
  3. airbyte-cdk/python/airbyte_cdk/sources/availability_strategy.py
  4. airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py
  5. airbyte-cdk/python/airbyte_cdk/sources/streams/http/availability_strategy.py

Testing:

  1. airbyte-cdk/python/unit_tests/sources/streams/test_availability_strategy.py
  2. airbyte-cdk/python/unit_tests/sources/streams/http/test_availability_strategy.py
  3. airbyte-cdk/python/unit_tests/sources/test_source.py

Refactoring/small improvements:

  1. airbyte-cdk/python/airbyte_cdk/sources/utils/stream_helpers.py
  2. Remaining files (typos, deprecation warnings etc)

🚨 User Impact 🚨

End users:

  • For a given source, if CDK version is updated and source streams subclasses HttpStream, the HTTPAvailabilityStrategy will now be in effect for that stream, which means:
    • Syncs that fail due to an expected HTTP error (403) while reading a stream will no longer fail. Information about the stream that is skipped is currently only listed as a warning in the logs. The skipped stream will not sync, the other streams will.

Source developers:

  • Source developers will now be able to write AvailabilityStrategies if they update their CDK version.
  • If CDK version is updated and source streams only subclasses Stream, no change.
  • If CDK version is updated and source streams subclasses HttpStream, the HTTPAvailabilityStrategy will now be in effect for that stream.

Note: Sources that already handle 403 errors in their read() methods should move that handling to their own AvailabilityStrategy (if handling is more specific than the default implementation). That will be addressed in #17853

@octavia-squidington-iv octavia-squidington-iv added the CDK Connector Development Kit label Dec 1, 2022
@erohmensing erohmensing requested a review from a team December 5, 2022 19:41
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work here and great readability 👏

@erohmensing erohmensing changed the title CDK: AvailabilityStrategy source property class - check stream's available before syncing CDK: AvailabilityStrategy stream property class - check stream's available before syncing Dec 7, 2022
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👏 I left minor comments. LGTM, but please address @girarda questions before publishing the CDK and merging 😄

@erohmensing
Copy link
Contributor Author

erohmensing commented Dec 12, 2022

/publish-cdk dry-run=true

🕑 https://github.com/airbytehq/airbyte/actions/runs/3678891792
https://github.com/airbytehq/airbyte/actions/runs/3678891792

@erohmensing
Copy link
Contributor Author

erohmensing commented Dec 12, 2022

/publish-cdk dry-run=false

🕑 https://github.com/airbytehq/airbyte/actions/runs/3679077080
https://github.com/airbytehq/airbyte/actions/runs/3679077080

@erohmensing erohmensing merged commit 55a3288 into master Dec 12, 2022
@erohmensing erohmensing deleted the ella/availabilitystrategy branch December 12, 2022 19:32
erohmensing added a commit that referenced this pull request Dec 15, 2022
…ailable (add `AvailabilityStrategy` concept) (#19977)"

This reverts commit 55a3288.
erohmensing added a commit that referenced this pull request Dec 15, 2022
…hanges (#20523)

* Revert "source-github: move known error handling to GithubAvailabilityStrategy (#19978)"

This reverts commit f97db17.

* Revert "🐛 Python CDK: fix `StopIteration` error for `check_availability` (#20429)"

This reverts commit 4e9b014.

* Revert "CDK: `AbstractSource.read()` skips syncing stream if its unavailable (add `AvailabilityStrategy` concept) (#19977)"

This reverts commit 55a3288.

* Restore changelog entries

* bump CDK version

* Bump Github version

* Re-add removed dependencies

* auto-bump connector version

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
erohmensing added a commit that referenced this pull request Dec 27, 2022
…(add `AvailabilityStrategy` concept) (#19977)

* Rough first implememtation of AvailabilityStrategy s

* Basic unit tests for AvailabilityStrategy and ScopedAvailabilityStrategy

* Make availability_strategy a property, separate out tests

* Remove from DeclarativeSource, remove Source parameter from methods, make default no AvailabilityStrategy

* Add skip stream if not available to read()

* Changes to CDK to get source-github working using AvailabilityStrategy, flakecheck

* reorganize cdk class, add HTTPAvailabilityStrategy test

* cleanup, docstrings

* pull out error handling into separate method

* Pass source and logger to check_connection method

* Add documentation links, handle 403 specifically

* Fix circular import

* Add AvailabilityStrategy to Stream and HTTPStream classes

* Remove AS from abstract_source, add to Stream, HTTPStream, AvailabilityStrategy unit tests passing for per-stream strategies

* Modify MockHttpStream to set no AvailabilityStrategy since source test mocking doesn't support this

* Move AvailabilityStrategy class to sources.streams

* Move HTTPAvailabilityStrategy to http module

* Use pascal case for HttpAvailabilityStrategy

* Remove docs message method :( and default to True availability on unhandled HTTPErrors

* add check_availability method to stream class

* Add optional source parameter

* Add test for connector-specific documentation, small tests refactor

* Add test that performs the read() function for stream with default availability strategy

* Add test for read function behavior when stream is unavailable

* Add 403 info in logger message

* Don't return error for other HTTPErrors

* Split up error handling into methods 'unavailable_error_codes' and 'get_reason_for_error'

* rework overrideable list of status codes to be a dict with reasons, to enforce that users provide reasons for all listed errors

* Fix incorrect typing

* Move HttpAvailability to its own module, fix flake errors

* Fix ScopedAvailabilityStrategy, docstrings and types for streams/availability_strategy.py

* Docstrings and types for core.py and http/availability_strategy.py

* Move _get_stream_slices to a StreamHelper class

* Docstrings + types for stream_helpers.py, cleanup test_availability.py

* Clean up test_source.py

* Move logic of getting the initial record from a stream to StreamHelper class

* Add changelog and bump minor version

* change 'is True' and 'is False' behavior

* use mocker.MagicMock

* Remove ScopedAvailabilityStrategy

* Don't except non-403 errors, check_stream uses availability_strategy if possible

* CDK: pass error to reasons_for_error_codes

* make get_stream_slice public

* Add tests for raising unhandled errors and retries are handled

* Add tests for CheckStream via AvailabilityStrategy

* Add documentation for stream availability of http streams

* Move availability unit tests to correct modules, report error message if possible

* Add test for reporting specific error if available
erohmensing added a commit that referenced this pull request Jan 17, 2023
…(add `AvailabilityStrategy` concept) (#19977)

* Rough first implememtation of AvailabilityStrategy s

* Basic unit tests for AvailabilityStrategy and ScopedAvailabilityStrategy

* Make availability_strategy a property, separate out tests

* Remove from DeclarativeSource, remove Source parameter from methods, make default no AvailabilityStrategy

* Add skip stream if not available to read()

* Changes to CDK to get source-github working using AvailabilityStrategy, flakecheck

* reorganize cdk class, add HTTPAvailabilityStrategy test

* cleanup, docstrings

* pull out error handling into separate method

* Pass source and logger to check_connection method

* Add documentation links, handle 403 specifically

* Fix circular import

* Add AvailabilityStrategy to Stream and HTTPStream classes

* Remove AS from abstract_source, add to Stream, HTTPStream, AvailabilityStrategy unit tests passing for per-stream strategies

* Modify MockHttpStream to set no AvailabilityStrategy since source test mocking doesn't support this

* Move AvailabilityStrategy class to sources.streams

* Move HTTPAvailabilityStrategy to http module

* Use pascal case for HttpAvailabilityStrategy

* Remove docs message method :( and default to True availability on unhandled HTTPErrors

* add check_availability method to stream class

* Add optional source parameter

* Add test for connector-specific documentation, small tests refactor

* Add test that performs the read() function for stream with default availability strategy

* Add test for read function behavior when stream is unavailable

* Add 403 info in logger message

* Don't return error for other HTTPErrors

* Split up error handling into methods 'unavailable_error_codes' and 'get_reason_for_error'

* rework overrideable list of status codes to be a dict with reasons, to enforce that users provide reasons for all listed errors

* Fix incorrect typing

* Move HttpAvailability to its own module, fix flake errors

* Fix ScopedAvailabilityStrategy, docstrings and types for streams/availability_strategy.py

* Docstrings and types for core.py and http/availability_strategy.py

* Move _get_stream_slices to a StreamHelper class

* Docstrings + types for stream_helpers.py, cleanup test_availability.py

* Clean up test_source.py

* Move logic of getting the initial record from a stream to StreamHelper class

* Add changelog and bump minor version

* change 'is True' and 'is False' behavior

* use mocker.MagicMock

* Remove ScopedAvailabilityStrategy

* Don't except non-403 errors, check_stream uses availability_strategy if possible

* CDK: pass error to reasons_for_error_codes

* make get_stream_slice public

* Add tests for raising unhandled errors and retries are handled

* Add tests for CheckStream via AvailabilityStrategy

* Add documentation for stream availability of http streams

* Move availability unit tests to correct modules, report error message if possible

* Add test for reporting specific error if available
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip stream syncing when stream is unavailable
6 participants