Flink: Custom partitioner for bucket partitions #7161
Conversation
…titionSpec with 1 and only 1 bucket.
- Factoring out the new partitioning utilities into a BucketPartitionerUtils.java class
- Relying on the PartitionSpecVisitor pattern/approach to get bucket information (no more regex-based extraction)
- Parameterized the test, adding support to evaluate schemas with different bucket scenarios
- Logic simplification of the TestBucketPartitionerFlinkIcebergSink
- Migration to JUnit5
- Clarifying the BucketPartitioner logic via Javadoc and better variable names.
- Cleaning up the test and general comments/javadoc
- Migrating the sink unit test to JUnit5
…talogExtension.java, now handled internally.
… TestBucketPartitioner and TestBucketPartitionKeySelector.
- Refactored and simplified the TestBucketPartitionerUtils.
if (rowDataWrapper == null) {
  rowDataWrapper = new RowDataWrapper(flinkSchema, schema.asStruct());
}
return rowDataWrapper;
nit: iceberg style adds an empty line after the end of a control block }
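With the reviewer's nit applied, the accessor would read as below. This is only a self-contained sketch: the class wrapper, and Object standing in for RowDataWrapper, are illustrative additions so the snippet compiles on its own.

```java
// Sketch of the lazy-init accessor above, with the blank line that
// Iceberg style asks for after the closing brace of a control block.
// Object stands in for RowDataWrapper here, purely for illustration.
class LazyWrapperSketch {
  private Object rowDataWrapper;

  Object lazyRowDataWrapper() {
    if (rowDataWrapper == null) {
      rowDataWrapper = new Object(); // new RowDataWrapper(flinkSchema, schema.asStruct())
    }

    return rowDataWrapper;
  }
}
```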
thanks @kengtin for the contribution
Hi @stevenzwu @kengtin, this PR restricts the partition spec to only one bucket transform. Do you think it would be beneficial to support multiple bucket transforms, or even the multi-arg bucket transform (#8259), since it's common to have multiple keys forming a primary key?
Hi @advancedxy, thanks for bringing it up. If I recall correctly, @stevenzwu and I discussed that point during the early stages, and at least back then we determined this would be much simpler. However, I see your point about not restricting the user's design to one bucket only. Perhaps we could allow multi-bucket/key definitions but still restrict the routing/distribution to be based only on the first bucket (or something like that) to not overcomplicate it. Thoughts?
Ah, yeah. It's definitely much simpler to route/distribute by one bucket partition (the first one, or the one with the largest number of buckets, etc., a.k.a. the most significant bucket transform). While the multi-arg bucket transform is not supported, perhaps we could allow multiple bucket transforms in the table but choose only one of them. Once the multi-arg bucket transform is supported, it should probably be selected first.
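The selection rule proposed here ("the one with the largest number of buckets") could be sketched as follows. This is a hypothetical helper, not code from the PR; it works on the bucket widths already extracted from a partition spec rather than on Iceberg types, to stay self-contained.

```java
import java.util.List;

// Hypothetical selection of the "most significant" bucket transform:
// given the bucket widths found in a partition spec, return the index
// of the transform with the largest number of buckets.
class MostSignificantBucket {
  static int mostSignificantIndex(List<Integer> bucketWidths) {
    int best = -1;
    int bestWidth = -1;
    for (int i = 0; i < bucketWidths.size(); i++) {
      if (bucketWidths.get(i) > bestWidth) {
        bestWidth = bucketWidths.get(i);
        best = i;
      }
    }
    return best; // -1 when the spec has no bucket transform at all
  }
}
```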
@advancedxy I promise to get up to speed with the multi-arg bucket transform (not yet familiar with it), but your suggestion makes sense. I'll also let @stevenzwu weigh in.
@advancedxy thanks a lot for raising the multi-dimensional bucketing. If there are enough use cases, we should be able to adapt the implementation to support it under the assumption that data is evenly distributed. Let's use an example with 2 bucketing columns (m and n). The assumption would be that data is evenly distributed in the …
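One way to read the two-column example (m and n bucket widths) is to flatten the pair of bucket values into a single index over the m×n cells, which only distributes evenly under the stated assumption. The helper below is an illustrative sketch, not part of the PR.

```java
// Illustrative combination of two bucket values (widths m and n) into
// one index in [0, m*n). Even distribution over these cells is the
// assumption the comment above describes, not something this code checks.
class CombinedBucket {
  static int combinedIndex(int bucketM, int m, int bucketN, int n) {
    if (bucketM < 0 || bucketM >= m || bucketN < 0 || bucketN >= n) {
      throw new IllegalArgumentException("bucket value out of range");
    }
    return bucketM * n + bucketN;
  }
}
```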
Hi @stevenzwu @kengtin, this PR can create too many small files when partitioning with dt, hour, minute and bucket(id). Suppose parallelism is 120 and the bucket number is 8; then 15 writers can write into the same bucket. But there is a problem: data from the previous few hours can end up in one commit because of data latency, so there can be 15,000 or more data files if the number of changed partitions is up to 1,000. Can we use the complete partition name instead of just the bucket?
@chenwyi2 Is your point that we shouldn't only consider the bucketing column (like this PR does), and that you just want a plain keyBy in this case? That would be a fair point. Do you get balanced traffic distribution among write tasks with a simple keyBy? I am also wondering if the partition spec of dt, hour, minute and bucket(id) is the best option, especially the minute column as a partition. Do you really need minute-level partition granularity? You are creating very fine-grained partitions; even with the most optimal data distribution/shuffle, there are still a lot of partitions and data files. You used 8 for the bucket number, which seems quite small for bucketing. What's the reason for using 8 buckets?
Yes, I am creating very fine-grained partitions, because I want to query and compute some business metrics between minutes as fast as possible. As for the bucket number, I use the formula QPS * 500 B per record / bucket number / 1024 / 1024 = 10 MB/s (the write traffic of one bucket). Because I found that when the write traffic of one bucket is too large, the writer can OOM or backpressure, I target 10 MB/s for each bucket.
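The sizing formula in this comment can be written out as code. This is just the commenter's arithmetic made explicit (record size and target rate are their numbers, not fixed constants of anything):

```java
// Worked version of the sizing formula above:
// per-bucket rate (MB/s) = QPS * bytesPerRecord / numBuckets / 1024 / 1024,
// plus the inverse: the bucket count that keeps each bucket at a target rate.
class BucketSizing {
  static double perBucketMbPerSec(long qps, long bytesPerRecord, int numBuckets) {
    return (double) qps * bytesPerRecord / numBuckets / 1024 / 1024;
  }

  static int bucketsForTarget(long qps, long bytesPerRecord, double targetMbPerSec) {
    return (int) Math.ceil(qps * bytesPerRecord / (targetMbPerSec * 1024 * 1024));
  }
}
```

For example, at 160,000 QPS and 500 B records, 8 buckets lands close to the 10 MB/s per-bucket target.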
Is the partition time an event time or an ingestion/processing time? Or, asking in a different way: how many active minutes does the Flink writer job process in every commit cycle? I feel this ongoing work might work better for you. Using a bucketing column seems like a temporary solution to handle (1) skewed data distribution across hours and minutes and (2) skewed partition distribution from the simple keyBy hash distribution. On the other hand, I also recognize this change imposed a behavior change that doesn't work for your use case. We can revert the distribution mode change in the …
@kengtin can you create a PR for reverting the FlinkSink change?
In normal conditions, only the data of the current minute will be written. However, if the data is delayed — for example, at 11:50 the data has still not been written by 11:55 — then the commit at 11:56 will contain data from 11:51, 11:52, ..., 11:56; in this situation some small files can be created.
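The arithmetic behind the small-files concern in the comments above (parallelism 120, 8 buckets, up to 1,000 changed partitions per commit) can be made explicit. This is only the back-of-the-envelope estimate from the discussion, assuming each writer opens roughly one file per changed partition:

```java
// Back-of-the-envelope estimate of files per commit under the
// bucket-only shuffle: with parallelism P and B buckets, about P / B
// writers share each bucket, and each writer may open one file per
// changed partition during a commit cycle.
class SmallFilesEstimate {
  static int writersPerBucket(int parallelism, int numBuckets) {
    return parallelism / numBuckets;
  }

  static long filesPerCommit(int parallelism, int numBuckets, int changedPartitions) {
    return (long) writersPerBucket(parallelism, numBuckets) * changedPartitions;
  }
}
```

With 120 writers and 8 buckets, each bucket is shared by 15 writers, so 1,000 changed partitions gives roughly 15,000 files — matching the numbers in the report.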
@stevenzwu I see defaulting to … We've found performance issues with … You mentioned above that to use the bucketing partitioner, users will need to manually apply …
It was reverted because there are users depending on the previous behavior of keyBy on all partition columns: #7161 (comment). We were assuming that if there is a bucket column, users only want to shuffle by the bucketing column; that is not the case in the user report linked in the above comment, so we decided to roll back for backward compatibility. @bendevera you are right that, before that, you may have to copy the code and manually apply the bucketing shuffling.
Note that bucketing partitioning is the simplest form of the range partitioning / smart shuffling we want to achieve, since we can assume each bucket has the same weight. We don't need to dynamically calculate traffic distribution stats for a skewed column (like event-time hour, country code, event type, etc.).
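The bucket-to-writer routing discussed in this thread can be sketched in two regimes. This is an illustrative simplification, not the merged BucketPartitioner (which keeps per-bucket round-robin counters as mutable state); all names here are hypothetical.

```java
// Sketch of bucket-aware routing in its two regimes. In the merged code
// the round-robin counter is per-bucket mutable state; here it is passed
// in explicitly so the functions stay pure and testable.
class BucketRoutingSketch {
  // At least as many buckets as writer subtasks: fold buckets onto
  // writers with a modulo.
  static int routeFewerWriters(int bucketId, int numPartitions) {
    return bucketId % numPartitions;
  }

  // More writers than buckets: each bucket owns a contiguous group of
  // writers; spread its records over that group round-robin.
  static int routeMoreWriters(int bucketId, int numBuckets, int numPartitions, int counter) {
    int groupSize = numPartitions / numBuckets;  // writers fully dedicated per bucket
    int extra = numPartitions % numBuckets;      // buckets that get one extra writer
    int start = bucketId * groupSize + Math.min(bucketId, extra);
    int size = groupSize + (bucketId < extra ? 1 : 0);
    return start + (counter % size);
  }
}
```

With 120 writers and 8 buckets, bucket 7 owns writers 105–119, which is exactly the "15 writers per bucket" situation described earlier in the thread.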
@stevenzwu thank you for the quick response! Okay, will run some …
@bendevera here is our presentation: https://www.youtube.com/watch?v=GJplmOO7ULA&t=18s. here is the design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/edit#heading=h.o4q8a61sahkq |
@stevenzwu Is there any plan to reapply this change to the main branch? Has there been any follow up since #8848 ? |
@binshuohu Currently, there is no plan to reapply this change to the main branch. We have a more general range distribution available now (guided by statistics collection): https://iceberg.apache.org/docs/nightly/flink-writes/#range-distribution-experimental. It is more general than this change (bucketing only), and range distribution also handles different parallelisms and partitions well. Range distribution has one disadvantage: it performs statistics collection and aggregation to guide the range split, which adds a little overhead. The bucketing partitioner here assumes traffic is evenly distributed across buckets, which should be true (hash % nBuckets). cc @pvary
References
Description
- FlinkSink: overrides the Hash mode when a bucket is detected in a PartitionSpec
- A BucketPartitionKeySelector was written, identical to the PartitionKeySelector but it extracts and returns an Integer BucketId as the Key
- HadoopCatalogExtension, a ported implementation of the HadoopCatalogResource, for JUnit5
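A self-contained sketch of the key-selector idea from the description: the selector returns the Integer bucket id as the key, so a keyBy routes all rows of the same bucket to the same writer. Note that real Iceberg bucketing hashes the serialized value with Murmur3; the plain hashCode below is only a stand-in to keep the sketch dependency-free.

```java
// Sketch of a key selector that returns an Integer bucket id as the key.
// Object.hashCode is a stand-in here; Iceberg's bucket transform actually
// uses a Murmur3 hash of the serialized value.
class BucketKeySketch {
  static Integer bucketKey(Object partitionValue, int numBuckets) {
    int hash = partitionValue.hashCode() & Integer.MAX_VALUE; // force non-negative
    return hash % numBuckets;
  }
}
```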