Flink: The Data Skew Problem on FlinkSink #4228
Conversation
There is a recent Slack thread about the same issue, where hash distribution leads to skewed shuffling (for bucketing partitions, and probably other partition specs too): https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645676203340179. I don't necessarily agree with the solution provided in this PR; we could provide a more general solution in FlinkSink. For bucketing partitions, we can implement a custom partitioner to shuffle data by the bucketing value to the downstream tasks/channels. I don't know if it makes sense to add something that specific, though.
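Here is a minimal Scala sketch of that idea (illustrative only, not the actual FlinkSink API; rowStream and bucketOf are hypothetical, with bucketOf assumed to compute the same value as Iceberg's bucket[N] transform for the row's partition column):

import org.apache.flink.api.common.functions.Partitioner

// Route each bucket value to a fixed downstream channel. With enough write
// parallelism every bucket gets its own subtask; otherwise buckets are
// spread evenly across the available subtasks.
val bucketPartitioner = new Partitioner[Int] {
  override def partition(bucketId: Int, numPartitions: Int): Int =
    bucketId % numPartitions
}

// bucketOf(row) is a hypothetical helper returning the row's bucket value.
val shuffled = rowStream.partitionCustom(bucketPartitioner, row => bucketOf(row))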
Hi Steven, glad to be having this discussion with you! I think there are three key aspects to this problem:
I agree with you that we need to provide a new way to map partition data to tasks evenly, rather than a hash function. A new KeySelector logic may apply to the bucket partition case but not fit other partition specs (identity or truncate); or we could make this config apply only to FlinkSink on a 'bucket' partition spec? BTW, could you please help add me to the Slack discussion mentioned above? Thanks very much!
@zhengchar please follow the instructions for a Slack invite: https://iceberg.apache.org/community/
I agree with @stevenzwu; I don't necessarily think this is the appropriate solution to the concern. Please do join the Iceberg Slack, @zhengchar - as Steven mentioned, this has come up very recently there. I wouldn't add something so specific, though. I'd love to continue the discussion on Slack (or on the dev list, but Slack is great for casual async discussion and the Iceberg Slack is very active).
Thanks @zhengchar for bringing this interesting issue here, and thanks @stevenzwu and @kbendick for providing the Slack context. In essence, the current keyBy distribution applies a second hash on top of the bucket hash. The value range of the first hash is narrow, resulting in a large number of row conflicts in the same bucket for the second hash.
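A toy illustration of the effect (this is not Flink's actual key-group assignment, just a demonstration that re-hashing a narrow key space leaves many channels empty):

import com.google.common.hash.Hashing

// Hash 128 bucket ids onto 128 channels and count how many distinct
// channels actually receive a bucket; collisions leave the rest idle.
val hash = Hashing.murmur3_32()
val channels = (0 until 128)
  .map(b => (Integer.MAX_VALUE & hash.hashInt(b).asInt()) % 128)
  .distinct
println(s"128 buckets land on only ${channels.size} of 128 channels")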
I agree with @stevenzwu that we need a general solution to fix this data skew issue.
I just copied the solution from Slack here for future reference:
import java.nio.charset.StandardCharsets
import com.google.common.hash.Hashing
import org.apache.flink.api.common.functions.Partitioner
import org.apache.iceberg.DistributionMode
import org.apache.iceberg.flink.sink.FlinkSink

// Shuffle rows by a murmur3 hash of the bucketing column so that rows of the
// same bucket always land on the same writer subtask.
val finalStream = kafkaStream.partitionCustom(new Partitioner[String]() {
  val hash = Hashing.murmur3_32()
  override def partition(key: String, numPartitions: Int): Int = {
    val res = hash.hashString(key, StandardCharsets.UTF_8).asInt()
    (Integer.MAX_VALUE & res) % numPartitions
  }
}, value => value.getField(0).toString)

// Iceberg's own hash distribution is disabled because the shuffle is done above.
FlinkSink.forRow(finalStream.javaStream, tableSchema)
  .tableLoader(loader)
  .writeParallelism(sinkParallelism)
  .distributionMode(DistributionMode.NONE)
  .build()
Hi @openinx, thanks for your explanation. I have tried the solution above, but I found the data skew problem is still there in the IcebergWriter stage. In my opinion, there are two points:
@zhengchar We all agree that HASH distribution is not a good fit for bucketing tables. In the sample code from the Slack thread, hash distribution is disabled and a custom partitioner is used to shuffle the data, matching the bucketing function in Iceberg. How is the output/destination table partitioned? I'm trying to understand why the custom partitioner doesn't work for your scenario.
After talking with Steven, I think we should go ahead and detect the case where we can use bucket values for distribution directly. There should be no need to add a mode to the table property for that. |
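A rough sketch of what that detection could look like (an assumption about the eventual design, not actual FlinkSink code):

import org.apache.iceberg.Table

// Only switch to direct bucket-value routing when the partition spec
// consists of a single bucket[N] field.
def isSingleBucketSpec(table: Table): Boolean = {
  val fields = table.spec().fields()
  fields.size() == 1 && fields.get(0).transform().toString.startsWith("bucket[")
}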
Hi Steven, according to my description above, my destination table is just a bucket[64] table.
@zhengchar can you share your code snippet? I thought the custom partitioner (with NONE distribution mode) should work for your case as well. I'm not saying it is the general solution; I'm trying to understand why it doesn't work for you, or whether there are any unique conditions in your use case that we missed.
Closing this obsolete PR as it is superseded by PR #7161.
Hi,
I tried to load 1 TB of data from TiDB into Iceberg with Flink. The Iceberg table is partitioned into 128 buckets.
I found a data skew problem in the Flink SQL IcebergWriter stage. We set the parallelism of this stage to 128, but only 49 taskmanagers had data to process; the others finished very quickly.
The data partition operator for a bucket partition table in Flink is 'keyBy', and its hash policy can cause data skew. I wrote a custom partition function that distributes each table partition's data to a taskmanager evenly (a short sketch follows below).
According to my testing, in batch mode without a deletion process, this function gives every taskmanager a task to process and cuts the data load time from 96 min to 38 min with a parallelism of 64.
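For reference, a minimal sketch of how such an even mapping spreads the work (hypothetical, not the exact code used in the test): with 128 buckets and a parallelism of 64, bucketId % 64 assigns exactly two buckets to every subtask, so no taskmanager is left idle.

// Every subtask 0..63 receives exactly two of the 128 buckets.
val assignment = (0 until 128).groupBy(bucketId => bucketId % 64)
assert(assignment.size == 64 && assignment.values.forall(_.size == 2))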