
Flink: add more sink shuffling support #6303

Open
stevenzwu opened this issue Nov 28, 2022 · 5 comments
@stevenzwu
Contributor

Feature Request / Improvement

Today, the Flink Iceberg sink only supports a simple keyBy hash distribution on partition columns. In practice, keyBy shuffling on partition values does not work very well.

We can make the following shuffling enhancements in the Flink streaming writer. More details can be found in the design doc. This is an umbrella issue for tracking purposes. Here are the rough phases.

  1. [hash distribution] Custom partitioner on bucket values. PR 4228 demonstrated that keyBy on low-cardinality partitioning buckets results in skewed traffic distribution. The Flink sink can add a custom partitioner that directly maps the bucket value (an integer) to the downstream writer tasks (integers) in a round-robin fashion (mod). This is a relatively simple case.

This is the case when write.distribution-mode=hash and there is a bucketing partition column. Other partition columns (like an hourly partition) are ignored for shuffling; the assumption is that the bucketing column is where we want to distribute/cluster the rows.
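The mapping idea can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python, not the Flink sink API; the function name and signature are hypothetical:

```python
def bucket_partitioner(bucket_value: int, num_writer_tasks: int) -> int:
    """Map a bucket value directly to a downstream writer subtask.

    Round-robin via mod: bucket 0 -> task 0, bucket 1 -> task 1, ...,
    wrapping around when there are more buckets than writer tasks.
    Unlike keyBy hashing, two distinct bucket values can only land on
    the same task when buckets outnumber tasks.
    """
    return bucket_value % num_writer_tasks


# With 8 buckets and 4 writer tasks, each task receives exactly 2 buckets.
counts = {}
for bucket in range(8):
    task = bucket_partitioner(bucket, 4)
    counts[task] = counts.get(task, 0) + 1
```

With keyBy, `hash(bucket) % parallelism` can collide for several low-cardinality bucket values and leave some writer subtasks idle; the direct mod mapping guarantees an even bucket-to-task assignment.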

  2. [hash distribution] Bin packing based on traffic distribution statistics. This works well for skewed data on partition columns (like event time). It requires calculating traffic distribution statistics across partition columns and using those statistics to guide shuffling decisions.

This is the case when write.distribution-mode=hash and there is NO bucketing partition column.
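One way to turn traffic statistics into a shuffling decision is greedy bin packing: assign the heaviest partition keys first, each to the currently least-loaded writer task. This is a hypothetical sketch of the idea, not the actual implementation:

```python
import heapq


def bin_pack(traffic: dict, num_tasks: int) -> dict:
    """Assign each partition key to a writer task, balancing total weight.

    traffic maps partition key -> observed row count (or byte count).
    Returns partition key -> writer task index.
    """
    # Min-heap of (current_load, task_index), so heappop yields the
    # least-loaded task.
    heap = [(0, t) for t in range(num_tasks)]
    heapq.heapify(heap)
    assignment = {}
    # Heaviest keys first gives the greedy algorithm a better balance.
    for key, weight in sorted(traffic.items(), key=lambda kv: -kv[1]):
        load, task = heapq.heappop(heap)
        assignment[key] = task
        heapq.heappush(heap, (load + weight, task))
    return assignment
```

For skewed hourly partitions, e.g. `{"h00": 8, "h01": 5, "h02": 3, "h03": 2}` over 2 writer tasks, the hot partition gets a task mostly to itself while the lighter partitions are packed together, instead of keyBy possibly sending the two heaviest partitions to the same task.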

  3. [range distribution] Range partitioning based on traffic distribution statistics. This is a variant of item 2 above. It works well for "sorting" on non-partition columns (e.g. country code, event type). It can improve data clustering by creating data files with narrow value ranges. Note that the Flink streaming writer probably won't sort rows within a file, as that would be very expensive (though not impossible). Even without rows sorted within a file, the improved data clustering can result in effective file pruning. We just don't get the additional benefit of row-group-level skipping (for Parquet) that sorting rows within a file would provide.

This is the case when write.distribution-mode=range and a SortOrder is defined on non-partition columns. Partition columns are ignored for range shuffling; the assumption is that the non-partition sort columns are what matter here.
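Range partitioning can be sketched as two steps: derive split points from the traffic statistics so each range carries roughly equal weight, then route each row by binary search on its sort key. This is an illustrative sketch with hypothetical names, assuming the statistics are available as a sorted sample of sort-key values:

```python
import bisect


def compute_range_bounds(sorted_sample: list, num_tasks: int) -> list:
    """Pick num_tasks - 1 split points from a sorted sample of sort keys,
    so each writer task owns a range with roughly equal traffic."""
    n = len(sorted_sample)
    return [sorted_sample[i * n // num_tasks] for i in range(1, num_tasks)]


def range_partitioner(sort_key, bounds: list) -> int:
    """Route a row to the writer task owning the range containing its key."""
    return bisect.bisect_right(bounds, sort_key)
```

For example, with a sample of country codes where "US" accounts for half the traffic, two writer tasks split at "US": one task receives the "DE"/"FR" rows, the other the "US" rows, and each written file covers a narrow value range, which enables file pruning.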

  4. [high-cardinality columns] Items 2 and 3 above are mostly for low-cardinality columns (e.g. hundreds of unique values), where a simple dictionary of counts per value can track the traffic distribution statistics. For high-cardinality columns (like device or user IDs), we would need a probabilistic data-sketch algorithm to estimate the traffic distribution.
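The point of a sketch is bounded memory: an exact count-per-value dictionary grows with the number of distinct keys, while a probabilistic summary stays fixed-size. Production code would likely use a sketch library (for example, Apache DataSketches offers quantile and reservoir sketches); as a stand-in, plain reservoir sampling (Algorithm R) illustrates keeping a fixed-size, uniform summary of an unbounded key stream, from which range bounds can then be derived as in item 3:

```python
import random


def reservoir_sample(stream, k: int, rng=None) -> list:
    """Keep a uniform random sample of size k from a stream of unknown
    length, using O(k) memory regardless of stream size (Algorithm R)."""
    rng = rng or random.Random(42)  # fixed seed only for reproducibility
    sample = []
    for i, item in enumerate(stream):
        if len(sample) < k:
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample
```

A 100-element reservoir over millions of user IDs uses constant memory, and sorting the reservoir approximates the quantiles of the full key distribution well enough to guide shuffling.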

Query engine

Flink

@stevenzwu
Contributor Author

Created a new project as this is a relatively large scope overall: https://github.com/apache/iceberg/projects/27

@hililiwei
Contributor

Great design! I think we can continue adding new issues so that people can pick the tasks they want to work on.


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Aug 24, 2024
@bendevera

@stevenzwu I was wondering about the status of this project. We have faced performance issues with the default HASH distribution mode. This project looked promising, and I saw that some progress has been made on various related tasks.

@stevenzwu
Contributor Author

@bendevera Range distribution has been added to the main branch and will be part of the next 1.7 release. You can also see the doc here: https://iceberg.apache.org/docs/nightly/flink-writes/#range-distribution-experimental

@github-actions github-actions bot removed the stale label Sep 11, 2024