-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connectors: Add Kinesis Source and Sink #234
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
jacksonrnewhouse
force-pushed
the
kinesis
branch
3 times, most recently
from
August 9, 2023 00:17
68d4789
to
a091165
Compare
jacksonrnewhouse
force-pushed
the
kinesis
branch
from
August 9, 2023 23:06
76fa112
to
df21c77
Compare
mwylde
reviewed
Aug 10, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review of just the source operator. I'll spent some more time later today trying to fully understand the control flow there, and digging into the sink.
jacksonrnewhouse
force-pushed
the
kinesis
branch
from
August 10, 2023 02:19
4daa688
to
df7b8b8
Compare
mwylde
reviewed
Aug 11, 2023
jacksonrnewhouse
force-pushed
the
kinesis
branch
from
August 14, 2023 16:46
20d8884
to
9b78e0c
Compare
jacksonrnewhouse
force-pushed
the
kinesis
branch
from
August 14, 2023 19:26
6bbf93a
to
29510e8
Compare
mwylde
approved these changes
Aug 14, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This adds support for Kinesis via a source and sink. Sources and Sinks are configured via the name of the stream.
Source also takes a "source.offset" of either "Earliest" or "Latest", similar to kafka.
Kinesis Source
Execution Flow
The source operates by having an active future for every open shard the operator owns. These futures are all advanced via a FuturesUnordered, and the futures are all just labeled bits of compute, without any data that should be owned by the overall control loop. This lets us take accurate snapshots without waiting for the futures to complete. Unfortunately, the futures produced by calling something like
async fn my_async_function(&self)
are always owned byself
. TheBoxedFuture
gets around this, at the cost of some complexity.In order to find the affiliated shard state, the futures are returned with a
name
, which corresponds to the shard_id.This approach is similar to what I implemented for the Filesystem Sink. Might be worth figuring out how to better structure it or if there's a replacement we prefer.
Semantics
The operator consumes each shard of data in order, similar to the Kafka source. The set of shards that a subtask is responsible for is determined solely by the hash of the shard_id. Flink defaults to this behavior, but also has an option to evenly divide the hash space.
There are no order guarantees for how data is read off of different shards. While we eagerly fetch from all live shards, it is possible that one might fall behind. Since watermarks are at the subtask level, it is possible that a shard that falls behind will have its data dropped. Finally, Kinesis has the notion of "parent" and "child" shards, with the child shard having data after the parents. Ensuring this while letting child shards be on different subtasks is not currently possible, and no effort was made to ensure this.
Kinesis Sink
Execution Flow
The Kinesis sink operates in batches. Right now it waits until one of the following conditions: 500 messages, 4.5MB of data, or 1s has passed. The first two are dictated by the constraints on the PutRecords method, while the timeout is there to ensure we don't have messages linger in the sink.
The PutRecords method allows for partial success of writes, often because of capacity limits on some but not all of the shards in the stream. Once we decide to flush we currently try repeatedly to finish the batch. While this is happening no new messages are consumed, leading to back-pressure.
Semantics
Because the batch puts can succeed only some of the records, the output order is not guaranteed. Additionally, because Kinesis requires a key, if there isn't one in the Pipeline DAG we add a UUID4.
Right now if there's a key over 256 bytes we won't be able to write. I think our SQL solution will, in general, use the UUID4, but when we add partitioning support, we should have a plan for this.