Increase weight of splits with delete files in iceberg #23058
Conversation
Splits with deletes require more work to process than a comparable split without deletes. This should be accounted for in the Iceberg split weight.
Makes sense to me, approved.
One thing I'm wondering about: are we allowed to use weights above
double weight = dataWeight;
if (task.deletes().stream().anyMatch(deleteFile -> deleteFile.content() == POSITION_DELETES)) {
    // Presence of each data position is looked up in a combined bitmap of deleted positions
    weight += dataWeight;
}
Doubling the weight whenever any positional deletes are present seems a bit heavy. We could also use the length of the delete files to approximate the additional work they cause.
It does seem heavy, but since the calculation is clamped to [minimumAssignedSplitWeight, 1.0], this really only matters if the split is small in the first place. In practice, it turns a split that would be "half-weighted" by its data size into a "whole-weighted" one.
As a worked example, assuming the default value of 128MiB for read.split.target-size, this would make a 64MiB scan with positional deletes be considered equal to a 128MiB scan without them, or a 32MiB scan with them equal to a 64MiB scan without them.
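A minimal, self-contained sketch of that arithmetic, assuming the 128MiB default target size and a hypothetical 0.05 minimum split weight (the constant names here are illustrative, not the PR's actual code):

public final class SplitWeightExample
{
    private static final long TARGET_SPLIT_SIZE_BYTES = 128L * 1024 * 1024; // read.split.target-size default
    private static final double MINIMUM_ASSIGNED_SPLIT_WEIGHT = 0.05;       // hypothetical minimum

    static double splitWeight(long splitSizeBytes, boolean hasPositionDeletes)
    {
        double dataWeight = (double) splitSizeBytes / TARGET_SPLIT_SIZE_BYTES;
        double weight = hasPositionDeletes ? 2 * dataWeight : dataWeight;
        // Clamp to [minimumAssignedSplitWeight, 1.0] as described above
        return Math.min(Math.max(weight, MINIMUM_ASSIGNED_SPLIT_WEIGHT), 1.0);
    }

    public static void main(String[] args)
    {
        long mib = 1024L * 1024;
        System.out.println(splitWeight(64 * mib, true));   // 1.0 -> same as a 128MiB split without deletes
        System.out.println(splitWeight(32 * mib, true));   // 0.5 -> same as a 64MiB split without deletes
        System.out.println(splitWeight(128 * mib, false)); // 1.0
    }
}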
The length of the delete file wouldn't work well because compression/encoding reduces it to a small size. The presence of any delete position results in significant extra work: read the delete file, convert it to a roaring bitmap, produce a row positions block, look up every row position in the bitmap, and produce a filtered page.
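To make that per-row lookup concrete, here is a minimal sketch of the filtering step; it is not Trino's actual code, and java.util.BitSet stands in for the combined roaring bitmap to keep the example dependency-free:

import java.util.BitSet;
import java.util.stream.IntStream;

public final class PositionDeleteFilterSketch
{
    public static void main(String[] args)
    {
        // Pretend these positions were decoded from the split's position delete files
        BitSet deletedPositions = new BitSet();
        deletedPositions.set(3);
        deletedPositions.set(7);

        // Every row position produced for the data file is looked up in the bitmap
        int rowCount = 10;
        int[] survivingPositions = IntStream.range(0, rowCount)
                .filter(position -> !deletedPositions.get(position))
                .toArray();

        // The surviving positions are what a filtered page would be built from
        System.out.println(survivingPositions.length + " of " + rowCount + " rows survive");
    }
}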
This was discussed in #12832. At the end of that conversation, @sopel39 pointed out that the behavior present in
That said, I don't know that you really need to express weights larger than "standard". The mechanism is really there to make sure that workers have enough splits to stay busy between scheduling batches, but as a mechanism it can't guarantee that CPU load is balanced, for a number of reasons. Basically, it's important to capture "this work is cheap, send more splits to make sure the worker can stay busy"; "this work is expensive and will take a long time" is handled by the scheduler backing off and waiting for the worker split queue to drain.
@raunaqmorarka how did you find the issue?
We had a customer issue where we encountered
This table had a lot of small splits with many delete files. Increasing the split weight for delete files should help to reduce the chances of this occurring. An alternative is to manually lower
How does
There were a lot of splits with many delete files; these contribute to the size of the task update JSON. Increasing split weights is a way to force smaller task updates. Also, we should queue fewer splits with deletes, since we know these are not as cheap as "normal" splits.
There is a mechanism in
Also, there might just be small splits without delete files. Isn't that a problem too?
Splits are serialized before adjustment to get the initial size of the body. As a fallback, we could estimate the size from its representation in Java multiplied by some factor.
But why do they take 2GB in the first place? It seems like it might kill coordinator <-> worker communication anyway. We could also have a slow-start mechanism (with exponential ramp-up).
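The slow-start idea is only mentioned in passing; as a rough illustration, an exponential ramp-up of a per-update split cap could look like the following (all names and constants are hypothetical, not an existing Trino mechanism):

public final class SlowStartSketch
{
    private static final int INITIAL_SPLITS_PER_UPDATE = 16;
    private static final int MAX_SPLITS_PER_UPDATE = 1024;

    private int currentCap = INITIAL_SPLITS_PER_UPDATE;

    // Returns the cap for the next task update and doubles it for the one after,
    // so the number of splits sent per update ramps up exponentially to the maximum
    int nextCap()
    {
        int cap = currentCap;
        currentCap = Math.min(currentCap * 2, MAX_SPLITS_PER_UPDATE);
        return cap;
    }
}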
It is related to the Iceberg delete file data included at the split level. The adjustment could kick in once we get the first body size estimate, or on failure fall back to the minimum guaranteed splits per task.
Ok, but how do we end up with 2GB of serialized data in the first place? Is there something terribly wrong with the delete file serde?
Description
Splits with deletes require more work to process than a comparable split without deletes.
This should be accounted for in the Iceberg split weight.
Additional context and related issues
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: