Flink 1.17: Support Partition Commit notification #7638
Conversation
Co-authored-by: quanyingxue <yxuequan@163.com>
@hililiwei I thought PR #6253 is less intrusive than this PR and can achieve similar goals. The approach in this PR introduces a lot of complexity. We had some discussion before in PR #6253. We should probably start some high-level discussion on the direction. Can you list the pros and cons of these two approaches? Maybe create a design doc to describe the problems we are trying to solve, the different approaches, and the pros and cons of each approach.
@stevenzwu Thank you for taking the time to review. The purpose of this PR is to implement a partition commit policy based on Iceberg tables, which has the following differences and advantages compared to proposal #6253:

- Proposal #6253 only writes the watermark of the Flink job to the table's summary, rather than actually committing partitions. This makes it impossible for downstream applications to directly determine which partitions are visible; they need to calculate it themselves from the watermark and each partition's time. Moreover, the watermark value may decrease due to a Flink job restart or data replay, making partitions that were already visible invisible again, which is unacceptable in a production environment. This PR commits partitions based on the watermark and the event time of the partition, so downstream applications can directly see which partitions are available without extra calculation and judgment. This improves query efficiency and the experience of downstream applications. In scenarios such as BI and ad hoc queries, it also avoids the problem of developers forgetting to filter data based on the watermark; not every developer knows that they need to use the watermark to filter data, which greatly increases the chance of reading invalid data.
- This PR allows users to customize the partition commit policy, which can perform high-level custom operations when committing partitions. In some scenarios, downstream applications need not only to process newly committed partitions, but also to deal with late-arriving data for old partitions, and to perform custom operations on the table, such as remote API calls, event notifications, data deletion, or even file merging. This is very useful for table management and task flow customization. Users can choose the appropriate commit strategy according to their own business needs and scenarios, achieving more flexible and efficient data processing. Our internal business developers have used this feature to develop custom commit policies, which has brought us a lot of convenience.
- This PR maintains high compatibility with the Flink ecosystem.

Thank you again for your review and feedback.
This can be fixed by checkpointing the watermark that is written to the snapshot summary, or during restore the committer can retrieve the latest committed watermark from the table.
I agree that a little bit of logic is needed to determine which partitions have complete data based on the watermark published in the snapshot summary.
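A minimal sketch of what the restore step could look like, assuming the watermark lives in the snapshot summary under a hypothetical key (`flink.watermark` here; the actual key would be whatever PR #6253 writes):

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class WatermarkRestore {
  // Sketch only: "flink.watermark" is a hypothetical summary key, and this
  // assumes the latest watermark is present on the current snapshot.
  static long latestCommittedWatermark(Table table) {
    Snapshot current = table.currentSnapshot();
    if (current == null) {
      return Long.MIN_VALUE; // empty table: no watermark committed yet
    }
    String value = current.summary().get("flink.watermark");
    return value == null ? Long.MIN_VALUE : Long.parseLong(value);
  }
}
```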
I am not sure this is a fair comparison. The Flink filesystem connector stores files on a distributed file system (or object store like S3) directly; there is no table format abstraction, hence a success file is the only option. Can you also explain what exactly "partition commit" means?
Where are those success files stored? How do downstream consumers find them? How does this work with entropy enabled?
Partition commit means that in the current job, we consider the data of a partition to be ready and open it to downstream applications. Here we can define what actions to perform when committing a partition: submit the partition to the metadata, create a marker file, or perform some other custom operation. For example, for a table partitioned by hour, when the watermark based on event time reaches 02:00:00 we consider the data of the 01:00 partition to be complete.
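A minimal sketch of that decision for an hourly partition (the names and the delay parameter are illustrative, not from this PR):

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class CommitCheck {
  // An hourly partition starting at partitionTime covers
  // [partitionTime, partitionTime + 1h); it is considered complete once the
  // event-time watermark passes the partition end plus an optional delay.
  static boolean readyToCommit(LocalDateTime partitionTime, Duration commitDelay, LocalDateTime watermark) {
    LocalDateTime partitionEnd = partitionTime.plusHours(1);
    return !watermark.isBefore(partitionEnd.plus(commitDelay));
  }
}
```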
Of course, we can store the watermark, but this does not fundamentally solve the data visibility problem. Writing the watermark to the Iceberg table is simple, but it increases the complexity of the whole task chain, and in many scenarios downstream does not have a suitable place to implement this processing logic, for example ad hoc or OLAP queries. As for "determine which partitions have complete data", isn't this exactly what Flink (with watermarks) is better at and should do? When the data of a partition is complete, Flink commits the partition and makes it visible to downstream.
When we process data, we often have not only streaming jobs but also many Spark batch jobs downstream of the streaming jobs. Flink + Hive + Spark is a very common combination. Similarly, when we switch from Hive to Iceberg, we need the streaming job to tell the scheduling system when to start the Spark job. We use FileSystem very rarely.
This file is stored under the partition directory, such as `/path/iceberg_table/data/dt=20220101/_SUCCESS`.
In the same way as with HDFS, we use the following method to create this file:
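Roughly, using the Hadoop FileSystem API (a sketch, not the exact snippet from our job):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessFileWriter {
  // Creates an empty _SUCCESS marker under the partition directory, e.g.
  // /path/iceberg_table/data/dt=20220101/_SUCCESS.
  static void writeSuccessFile(String partitionLocation) throws IOException {
    Path successFile = new Path(partitionLocation, "_SUCCESS");
    FileSystem fs = successFile.getFileSystem(new Configuration());
    fs.createNewFile(successFile); // zero-length marker file
  }
}
```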
I may not have fully understood what you mean.
@hililiwei can you put everything in a quip doc? Then we can send an email to dev@ and seek broader feedback.
I concur with @stevenzwu on the necessity for a design document that delves into the issue and potential solutions. The stream-to-batch use case is prevalent among data lake users, and it warrants exploring diverse approaches. We can foster community consensus on the most suitable solution, paving the way for the subsequent coding phase.
With entropy enabled, the file path will contain a random prefix.
The success file pattern assumes a well-known / predefined path, which may not be true with the S3 object store. Hive is not much different from a distributed file system (like HDFS), since it assumes a file layout following a folder pattern.
https://docs.google.com/document/d/1Sobv8XbvsyPzHi1YWy_jSet1Wy7smXKDKeQrNZSFYCg
@hililiwei can you start a discussion thread on dev@iceberg? A lot more people follow dev@ discussions than PR reviews/discussions.
Ok, I thought maybe we could try to get rid of the …
Partition Commit
The Iceberg Flink writer's default data commit relies on `Checkpoint`. When a `Checkpoint` completes, the newly written data is committed to the metadata, regardless of which partition the new data belongs to. However, after writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to metadata or writing a `_SUCCESS` file in the partition directory (even if the data is stored in object storage, downstream applications may still need to rely on this file as a flag to drive the progress of the entire job flow).

The current default partition commit mode, which depends on `Checkpoint`, can be understood as using processing time to determine table partition commit (this commit mode still has room for optimization, because developers may want to decouple it from the checkpoint cycle). The partition commit strategy in this PR can be understood as using event time to decide whether to commit table partitions.

NOTE: Partition commit only works with dynamic partition inserting.
Partition commit trigger
To define when to commit a partition, a partition commit trigger is provided:
Partition committing is enabled when `sink.partition-commit.enabled` is set to 'true'.

The time zone used to interpret the watermark must also be configured correctly: if, for example, the source rowtime is defined on a TIMESTAMP_LTZ column but this config is left unset, users may see partitions committed only after a few hours. The default value is 'UTC', which means the watermark is defined on a TIMESTAMP column or not defined at all. If the watermark is defined on a TIMESTAMP_LTZ column, the time zone of the watermark is the session time zone. The option value is either a full name such as 'America/Los_Angeles' or a custom timezone id such as 'GMT-08:00'.

Partitions are committed according to the time extracted from partition values and the watermark. This requires that your job generates watermarks and that the table is partitioned by time, such as hourly or daily partitions.
If you want downstream to see a partition only when its data is complete, your job generates watermarks, and the time can be extracted from partition values, configure the trigger as in the sketch below.
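A hedged configuration sketch: only `sink.partition-commit.enabled` and `partition.time-extractor.timestamp-formatter` appear in this document; the remaining keys are assumptions modeled on Flink's filesystem connector options, shown just to illustrate the shape of the configuration.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionCommitConfig {
  static Map<String, String> example() {
    Map<String, String> props = new HashMap<>();
    props.put("sink.partition-commit.enabled", "true");
    props.put("partition.time-extractor.timestamp-formatter", "yyyyMMdd");
    // The keys below are assumptions borrowed from Flink's filesystem
    // connector, not options confirmed by this PR.
    props.put("sink.partition-commit.delay", "1 h");
    props.put("sink.partition-commit.policy.kind", "metastore,success-file");
    return props;
  }
}
```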
Late data processing: when a record is supposed to be written into a partition that has already been committed, the record will still be written into that partition, and the commit of the partition will then be triggered again.
Partition Time Extractor
Time extractors define how the time is extracted from partition values.
The default extractor is based on a timestamp pattern composed of your partition fields. If a partition value uses a compact format such as '20220101', set `partition.time-extractor.timestamp-formatter` to 'yyyyMMdd'. By default the formatter is 'yyyy-MM-dd HH:mm:ss'. The timestamp formatter is compatible with Java's DateTimeFormatter. A custom extractor can also be provided, as in the sketch below.
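A sketch of a custom extractor; the interface name and signature here are assumptions modeled on Flink's filesystem-connector `PartitionTimeExtractor`, not necessarily the contract this PR defines.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical contract, modeled on Flink's filesystem connector.
interface PartitionTimeExtractor {
  LocalDateTime extract(List<String> partitionKeys, List<String> partitionValues);
}

// Maps a partition value like dt=20220101 to the start of that day.
class DayPartitionTimeExtractor implements PartitionTimeExtractor {
  private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

  @Override
  public LocalDateTime extract(List<String> partitionKeys, List<String> partitionValues) {
    String dt = partitionValues.get(partitionKeys.indexOf("dt"));
    return LocalDate.parse(dt, DAY).atStartOfDay();
  }
}
```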
Partition Commit Policy
The partition commit policy defines what action is taken when partitions are committed.
You can also extend the commit policy with your own implementation; a custom commit policy looks like the sketch below.
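The SPI shown here is an assumption modeled on Flink's `PartitionCommitPolicy`, and the notification endpoint is a made-up placeholder; the actual interface is defined by this PR.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

// Hypothetical SPI, modeled on Flink's filesystem connector.
interface PartitionCommitPolicy {
  void commit(String partitionPath, List<String> partitionValues) throws Exception;
}

// Notifies an external scheduler that a partition is ready, so a downstream
// batch job (e.g. Spark) can be started for it.
class NotifySchedulerPolicy implements PartitionCommitPolicy {
  @Override
  public void commit(String partitionPath, List<String> partitionValues) throws Exception {
    URL url = new URL("http://scheduler.internal/partition-ready?path=" + partitionPath);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    int code = conn.getResponseCode(); // fire the call; real code should retry on failure
    conn.disconnect();
    if (code >= 300) {
      throw new IllegalStateException("Scheduler notification failed: HTTP " + code);
    }
  }
}
```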
Full-example
To do