✨ concurrent cdk: Read multiple streams concurrently #32411
Conversation
self._raise_exception_on_missing_stream = raise_exception_on_missing_stream

@property
def message_repository(self) -> MessageRepository:
Do we need this now? I think it might have been necessary before because it inherited from AbstractSource
but it seems like it has been removed.
I feel like a snippet of code of how Stripe or Salesforce would look like using this source would help understand the usage and end-to-end setup
this is still needed since it is passed to the ConcurrentStreamProcessor
So I see that it is being passed to ConcurrentStreamProcessor
here but it's done using the private field. Do we need to expose it publicly using a method?
oh derp. no, the repository shouldn't be public. deleted the property method
I checked the test this morning and added a couple comments. Very thorough and clean code! I think I'm just missing the "how it'll be configured for a source" to make sure everything makes sense
    self._message_repository,
    self._partition_reader,
)
handler._streams_currently_generating_partitions = {_STREAM_NAME}
I feel these tests should be driven by the public interface only when possible. Could we do something like:
def test_handle_partition_done_no_other_streams_to_generate_partitions_for(self):
    stream_instances_to_read_from = [self._stream]
    handler = ConcurrentStreamProcessor(
        stream_instances_to_read_from,
        self._partition_enqueuer,
        self._thread_pool_manager,
        self._logger,
        self._slice_logger,
        self._message_repository,
        self._partition_reader,
    )
    handler.start_next_partition_generator()
    handler.on_partition(self._stream_partition)
    handler._streams_currently_generating_partitions = {_STREAM_NAME}
    handler._streams_to_partitions = {_STREAM_NAME: {self._an_open_partition}}
    sentinel = PartitionGenerationCompletedSentinel(self._stream)
    messages = list(handler.on_partition_generation_completed(sentinel))
    expected_messages = []
    assert expected_messages == messages
My concerns are:
- When a user of this class interacts with ConcurrentStreamProcessor, it'll use the public interface. Hence, also using the public interface in the tests will mimic real usage
- As a tester, I need to know the exact internal steps and the dependencies between private variables. This is more involved and error-prone than calling public methods
- If the implementation changes, the tests will probably break. This is not true if we use the public interface, as this is the contract we try to uphold and it should therefore be more stable
agreed. done!
@maxi297 here's an example usage with the stripe connector https://github.com/airbytehq/airbyte/pull/32438/files#diff-e3651091c66723be70e9ab4b702a078c07f0e18fdbf895bba1b287a161576f3fR38-R51
LGTM! I'm eager to see this in action
Awesome @girarda, looks really nice! Just a few comments and suggestions.
""" | ||
:param threadpool: The threadpool to submit tasks to | ||
:param message_repository: The repository to emit messages to | ||
:param max_number_of_partition_generator_in_progress: The initial number of concurrent partition generation tasks. Limiting this number ensures will limit the latency of the first records emitted. While the latency is not critical, emitting the records early allows the platform and the destination to process them as early as possible. |
Nit: some of the params are missing and/or have been renamed since.
Regarding the comment for max_number_of_partition_generator_in_progress
/ initial_number_partitions_to_generate
- looks like the idea is that we'll generate one partition per stream, and so this variable basically is there to say "you can start processing records after <initial_number_partitions_to_generate>
partitions have been created"; if there are tons of streams, it means that we're allowed to start processing records before all of the stream partitions have been generated. Is that right?
It's not obvious to me that we'd ever want this to be more than one - did you have a use case in mind or is this here for configurability just in case?
this variable basically is there to say "you can start processing records after <initial_number_partitions_to_generate> partitions have been created"
that's not quite what it says. Record processing can start after N tasks to generate partitions were submitted to the threadpool.
If N == 1 and generating the partitions takes a long time (e.g. if we need to read all records from a parent stream), then the workers might be idle for some time.
If N > the number of workers, it might take some time for the first records to be polled from the queue, since the first many items will be partitions to process. I mostly want to avoid partition generation tasks using all the workers and preventing them from actually processing the partitions.
An alternative implementation might be to submit the first partition generation before polling from the queue, and submit the others when we read the first record.
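As a rough illustration of that alternative, the sketch below submits a single partition-generation task up front and defers the rest until the first item is polled from the queue. All names and the queue/sentinel plumbing here are hypothetical stand-ins, not the actual CDK code:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel a generator puts on the queue when it is finished

def generate_partitions(stream_name, out_queue):
    # Stand-in for real partition generation (e.g. reading a parent stream).
    for i in range(3):
        out_queue.put(f"{stream_name}-partition-{i}")
    out_queue.put(_DONE)

def read_all(stream_names):
    """Submit one partition-generation task first; submit the rest only after
    the first item is polled, so generators can't hog every worker."""
    q = queue.Queue()
    items = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        remaining = list(stream_names)
        pool.submit(generate_partitions, remaining.pop(0), q)
        deferred_submitted = False
        done_count = 0
        while done_count < len(stream_names):
            item = q.get()
            if not deferred_submitted:
                # First item observed: now enqueue the remaining generators.
                for name in remaining:
                    pool.submit(generate_partitions, name, q)
                deferred_submitted = True
            if item is _DONE:
                done_count += 1
            else:
                items.append(item)
    return items

partitions = read_all(["users", "invoices"])
```

The trade-off is the same one discussed above: deferring submissions keeps the workers free for record processing early on, at the cost of extra bookkeeping in the poll loop.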
Okay thanks for the explanation. Tbh I'm not entirely sold on the extra complexity that this requires. If we were using coroutines/asyncio here we wouldn't have this issue because the executor would be able to flit between tasks while I/O is happening, whereas the standard threads are blocking. Even though your code helps alleviate some of this pressure we're always going to have the problem of threads idling during I/O as long as we're using standard threads. We could revisit using asyncio, though I recognize it might require significant changes. Alternatively, I think we could get a similar efficiency boost by replacing standard threads with green threads, as they offer context switching during I/O operations with less overhead. IMO either of those are preferable to adding complexity that doesn't entirely solve the problem.
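For comparison, here is a minimal asyncio sketch of the point being made: because coroutines yield to the event loop during I/O, all partition generators can make progress concurrently without reserving a worker thread each. This is a simplified illustration, not the CDK's API:

```python
import asyncio

async def generate_partitions(stream_name, q):
    # Each await is a stand-in for I/O (e.g. an HTTP call); the event loop
    # switches to another coroutine whenever one is waiting.
    for i in range(3):
        await asyncio.sleep(0)
        await q.put(f"{stream_name}-partition-{i}")
    await q.put(None)  # per-stream completion sentinel

async def read_all(stream_names):
    q = asyncio.Queue()
    # All generators run "at once"; none of them blocks the others.
    tasks = [asyncio.create_task(generate_partitions(s, q)) for s in stream_names]
    records = []
    done = 0
    while done < len(stream_names):
        item = await q.get()
        if item is None:
            done += 1
        else:
            records.append(item)
    await asyncio.gather(*tasks)
    return records

records = asyncio.run(read_all(["users", "invoices"]))
```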
We synced offline.
The crux of the issue comes from our threadpool based approach. If using N workers, and they are all processing long partition generation tasks, no workers are available for emitting records from those partitions, which affects the overall sync performance.
@clnoll's suggestion of using asyncio is great as it should allow us to simplify the flow and might also improve the performance. We'll prioritize a spike to evaluate the effort and go from there.
    streams: List[AbstractStream],
) -> Iterator[AirbyteMessage]:
    self._logger.info("Starting syncing")
    stream_instances_to_read_from = self._get_streams_to_read_from(streams)
Given that availability checks can be time-consuming, is there any reason not to do this part concurrently? We can just put all the streams in the queue and let them be filtered out by the worker that's processing them.
That would make the cost of the initial enqueuing of streams negligible and would allow us to simplify this method - we wouldn't need a separate _submit_initial_partition_generators step.
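A toy sketch of that idea, with a hypothetical worker function and availability check (not the actual CDK availability-strategy API): every stream is submitted, the potentially slow check runs on the worker, and unavailable streams are simply dropped:

```python
from concurrent.futures import ThreadPoolExecutor

def is_available(stream):
    # Stand-in for a potentially slow availability check (e.g. an API probe).
    return stream != "broken_stream"

def process_stream(stream, results):
    # Hypothetical worker: run the availability check here instead of
    # up front, and skip streams that aren't available.
    if not is_available(stream):
        return
    results.append(f"read:{stream}")

streams = ["users", "invoices", "broken_stream"]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(process_stream, s, results) for s in streams]
    for f in futures:
        f.result()  # surface any worker exception
```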
Running the checks on the worker is a good idea. I'll do in a separate PR
What

Move the read logic from the AbstractStream to the Source, which is what ThreadBasedConcurrentStream used to do.

How

Concurrent Source

ThreadPoolManager

The ThreadPoolManager class is a wrapper on top of a threadpool executor. It is responsible for submitting tasks, throttling if there are too many pending tasks, and raising an exception if any of the tasks fails. The logic is very similar to what we used to do in ThreadBasedConcurrentStream, but I extracted it into its own class so it's easier to test.
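A minimal sketch of what such a wrapper can look like; the method names and throttling threshold below are illustrative, not the actual CDK class:

```python
import time
from concurrent.futures import ThreadPoolExecutor

class ThreadPoolManager:
    """Sketch of a wrapper over ThreadPoolExecutor: submits tasks, throttles
    when too many are pending, and surfaces the first task failure."""

    def __init__(self, max_workers=4, max_pending_tasks=10):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._max_pending_tasks = max_pending_tasks
        self._futures = []

    def submit(self, fn, *args):
        # Throttle: wait until the number of unfinished futures drops
        # below the threshold before accepting more work.
        while sum(1 for f in self._futures if not f.done()) >= self._max_pending_tasks:
            time.sleep(0.01)
        self._futures.append(self._executor.submit(fn, *args))

    def check_for_errors_and_shutdown(self):
        # Wait for all tasks, then re-raise the first recorded failure.
        self._executor.shutdown(wait=True)
        exceptions = [f.exception() for f in self._futures if f.exception()]
        if exceptions:
            raise exceptions[0]

manager = ThreadPoolManager()
results = []
for i in range(5):
    manager.submit(results.append, i)
manager.check_for_errors_and_shutdown()
```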
ConcurrentStreamProcessor

The ConcurrentSource polls items from the queue and sends them to the ConcurrentStreamProcessor, which acts on them. It's responsible for submitting new tasks and creating the AirbyteMessages. I'm not a huge fan of the name and am open to any suggestion, but I found it important to extract the logic out of ConcurrentSource for testability reasons.
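A stripped-down sketch of that poll-and-dispatch flow; the class and method names here are simplified stand-ins for the CDK types:

```python
class PartitionGenerationCompletedSentinel:
    """Marks that a stream has finished generating partitions."""
    def __init__(self, stream):
        self.stream = stream

class Processor:
    # Stand-in for ConcurrentStreamProcessor: each handler reacts to one
    # kind of queue item and returns the messages to emit.
    def on_partition(self, partition):
        return [f"submitted reader for {partition}"]

    def on_partition_generation_completed(self, sentinel):
        return [f"stream {sentinel.stream} has no more partitions"]

def dispatch(item, processor):
    # The source polls an item off the queue and hands it to the
    # processor, which decides what to do based on the item's type.
    if isinstance(item, PartitionGenerationCompletedSentinel):
        return processor.on_partition_generation_completed(item)
    return processor.on_partition(item)

messages = dispatch("users-partition-0", Processor())
```

Keeping the dispatch logic in a separate object like this is what makes it testable without spinning up the whole source.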
Partition and AbstractStream

The interfaces of Partition and AbstractStream changed a bit:
- Partition is no longer responsible for closing itself
- AbstractStream is no longer responsible for reading records. Instead, it generates Partitions that can read records
- ThreadBasedConcurrentStream was stripped of its read method and renamed to DefaultStream, since nothing about it is concurrent
- Record holds its stream name. An alternative would be to give it a handle on its AbstractStream or Partition
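The split of responsibilities above could be sketched like this; the signatures are illustrative, not the exact CDK definitions:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class Partition(ABC):
    @abstractmethod
    def read(self) -> Iterable[dict]:
        """A partition knows how to read its own records."""

class AbstractStream(ABC):
    @abstractmethod
    def generate_partitions(self) -> Iterable[Partition]:
        """A stream only generates partitions; it no longer reads records."""

# Hypothetical concrete implementations to show the flow end to end.
class InMemoryPartition(Partition):
    def __init__(self, records):
        self._records = records

    def read(self):
        return iter(self._records)

class DefaultStream(AbstractStream):
    def __init__(self, partitions):
        self._partitions = partitions

    def generate_partitions(self):
        return iter(self._partitions)

stream = DefaultStream([InMemoryPartition([{"id": 1}, {"id": 2}])])
records = [r for p in stream.generate_partitions() for r in p.read()]
```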
Recommended reading order
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_source.py
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/thread_pool_manager.py
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/partitions/types.py
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_stream_processor.py
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/partition_generation_completed_sentinel.py
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/partitions/partition.py
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/abstract_stream.py
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/default_stream.py
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/partitions/record.py
Note: source-stripe needs to be updated in a follow-up PR