-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flink: add more sink shuffling support #6303
Comments
Created a new project as this is a relatively large scope overall: https://github.com/apache/iceberg/projects/27 |
Great design! I think we can continue adding new issues so that guys can choose the tasks they want to work on. |
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
@stevenzwu was wondering the status of this project. We have faced issues with the performance of the default |
@bendevera range distribution has been added to the |
Feature Request / Improvement
Today, Flink Iceberg sink only supports simple keyBy hash distribution on partition columns. In practice, keyBy shuffle on partition values doesn't work very well.
We can make the following shuffling enhancements in Flink streaming writer. More details can be found in the design doc. This is an uber issue for tracking purpose. Here are the rough phases.
This is a case when
write.distribution-mode=hash
and there is a bucketing partition column. Other partition columns (like hourly partition) will be ignored regarding shuffling. The assumption is that bucketing column is where we want to distribute/cluster the rows.This is a case when
write.distribution-mode=hash
and there is NO bucketing partition column.This is a case when
write.distribution-mode=range
andSortOrder
is defined for non-partition columns. partition columns will be ignored for range shuffling as the assumption is that non-partition sort columns are what matter here.Query engine
Flink
The text was updated successfully, but these errors were encountered: