[Improvement] Add the trigger condition of combine multiple position delete file for minor (#2516)

* [Improvement] Add the trigger condition of combine multiple position delete file for minor

* Update docs/user-guides/configurations.md

Co-authored-by: ZhouJinsong <zhoujinsong0505@163.com>

---------

Co-authored-by: ZhouJinsong <zhoujinsong0505@163.com>
zhongqishang and zhoujinsong authored Jan 30, 2024
1 parent 71680f2 commit 36fab23
Showing 2 changed files with 26 additions and 23 deletions.
@@ -59,6 +59,7 @@ public class CommonPartitionEvaluator implements PartitionEvaluator {
   protected int undersizedSegmentFileCount = 0;
   protected long undersizedSegmentFileSize = 0;
   protected int rewritePosSegmentFileCount = 0;
+  protected int combinePosSegmentFileCount = 0;
   protected long rewritePosSegmentFileSize = 0L;
   protected long min1SegmentFileSize = Integer.MAX_VALUE;
   protected long min2SegmentFileSize = Integer.MAX_VALUE;
@@ -214,11 +215,12 @@ public boolean fileShouldRewrite(DataFile dataFile, List<ContentFile<?>> deletes

   public boolean segmentShouldRewritePos(DataFile dataFile, List<ContentFile<?>> deletes) {
     Preconditions.checkArgument(!isFragmentFile(dataFile), "Unsupported fragment file.");
-    return deletes.stream().anyMatch(delete -> delete.content() == FileContent.EQUALITY_DELETES)
-        || deletes.stream()
-                .filter(delete -> delete.content() == FileContent.POSITION_DELETES)
-                .count()
-            >= 2;
+    if (deletes.stream().filter(delete -> delete.content() == FileContent.POSITION_DELETES).count()
+        >= 2) {
+      combinePosSegmentFileCount++;
+      return true;
+    }
+    return deletes.stream().anyMatch(delete -> delete.content() == FileContent.EQUALITY_DELETES);
   }

   protected boolean isFullOptimizing() {
@@ -317,7 +319,8 @@ public boolean isMajorNecessary() {
   public boolean isMinorNecessary() {
     int smallFileCount = fragmentFileCount + equalityDeleteFileCount;
     return smallFileCount >= config.getMinorLeastFileCount()
-        || (smallFileCount > 1 && reachMinorInterval());
+        || (smallFileCount > 1 && reachMinorInterval())
+        || combinePosSegmentFileCount > 0;
   }

   protected boolean reachMinorInterval() {
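
Reviewer note: the net effect of the two Java hunks above can be sketched in isolation. The class below is a simplified stand-in, not the project's `CommonPartitionEvaluator`: `MinorTriggerSketch`, `DeleteKind`, the constructor arguments, and the `add*` helpers are hypothetical names introduced only for illustration, and the interval check is reduced to a boolean flag. It assumes the counters behave as the diff shows, that is, a segment file carrying two or more position delete files increments `combinePosSegmentFileCount`, and any non-zero value makes minor optimizing necessary.

```java
import java.util.List;

// Simplified, hypothetical model of the minor-optimizing trigger after this commit.
public class MinorTriggerSketch {
    // Stand-in for the delete file content types used in the diff (assumption).
    public enum DeleteKind { POSITION, EQUALITY }

    private final int minorLeastFileCount;      // models self-optimizing.minor.trigger.file-count
    private final boolean reachedMinorInterval; // models reachMinorInterval() as a fixed flag
    private int fragmentFileCount = 0;
    private int equalityDeleteFileCount = 0;
    private int combinePosSegmentFileCount = 0;

    public MinorTriggerSketch(int minorLeastFileCount, boolean reachedMinorInterval) {
        this.minorLeastFileCount = minorLeastFileCount;
        this.reachedMinorInterval = reachedMinorInterval;
    }

    public void addFragmentFiles(int n) { fragmentFileCount += n; }

    public void addEqualityDeleteFiles(int n) { equalityDeleteFileCount += n; }

    // Mirrors the new segmentShouldRewritePos: two or more position deletes on one
    // segment file are counted so that they can trigger minor optimizing on their own.
    public boolean segmentShouldRewritePos(List<DeleteKind> deletes) {
        long posDeletes = deletes.stream().filter(d -> d == DeleteKind.POSITION).count();
        if (posDeletes >= 2) {
            combinePosSegmentFileCount++;
            return true;
        }
        return deletes.contains(DeleteKind.EQUALITY);
    }

    // Mirrors the new isMinorNecessary: the third clause is the one added by the commit.
    public boolean isMinorNecessary() {
        int smallFileCount = fragmentFileCount + equalityDeleteFileCount;
        return smallFileCount >= minorLeastFileCount
            || (smallFileCount > 1 && reachedMinorInterval)
            || combinePosSegmentFileCount > 0;
    }

    public static void main(String[] args) {
        MinorTriggerSketch evaluator = new MinorTriggerSketch(12, false);
        System.out.println(evaluator.isMinorNecessary()); // false: nothing to optimize yet
        evaluator.segmentShouldRewritePos(List.of(DeleteKind.POSITION, DeleteKind.POSITION));
        System.out.println(evaluator.isMinorNecessary()); // true: position deletes to combine
    }
}
```

Under these assumptions, a partition with no fragment files and no equality delete files still qualifies for minor optimizing once a single segment file has accumulated two position delete files, which the old condition would not have caught before the file-count or interval threshold was reached.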
34 changes: 17 additions & 17 deletions docs/user-guides/configurations.md
@@ -27,23 +27,23 @@ modified through [Alter Table](../using-tables/#modify-table) operations.

 Self-optimizing configurations are applicable to both Iceberg Format and Mixed streaming Format.

-| Key | Default | Description |
-|-----------------------------------------------|------------------|----------------------------------------------------------------------------------------------------------------------------------|
-| self-optimizing.enabled | true | Enables Self-optimizing |
-| self-optimizing.group | default | Optimizer group for Self-optimizing |
-| self-optimizing.quota | 0.1 | Quota for Self-optimizing, indicating the CPU resource the table can take up |
-| self-optimizing.execute.num-retries | 5 | Number of retries after failure of Self-optimizing |
-| self-optimizing.target-size | 134217728(128MB) | Target size for Self-optimizing |
-| self-optimizing.max-file-count | 10000 | Maximum number of files processed by a Self-optimizing process |
-| self-optimizing.max-task-size-bytes | 134217728(128MB) | Maximum file size bytes in a single task for splitting tasks |
-| self-optimizing.fragment-ratio | 8 | The fragment file size threshold. We could divide self-optimizing.target-size by this ratio to get the actual fragment file size |
-| self-optimizing.min-target-size-ratio | 0.75 | The undersized segment file size threshold. Segment files under this threshold will be considered for rewriting |
-| self-optimizing.minor.trigger.file-count | 12 | The minimum numbers of fragment files to trigger minor optimizing |
-| self-optimizing.minor.trigger.interval | 3600000(1 hour) | The time interval in milliseconds to trigger minor optimizing |
-| self-optimizing.major.trigger.duplicate-ratio | 0.1 | The ratio of duplicate data of segment files to trigger major optimizing |
-| self-optimizing.full.trigger.interval | -1(closed) | The time interval in milliseconds to trigger full optimizing |
-| self-optimizing.full.rewrite-all-files | true | Whether full optimizing rewrites all files or skips files that do not need to be optimized |
-| self-optimizing.min-plan-interval | 60000 | The minimum time interval between two self-optimizing planning action |
+| Key | Default | Description |
+|-----------------------------------------------|------------------|------------------------------------------------------------------------------------------------------------------------------------------|
+| self-optimizing.enabled | true | Enables Self-optimizing |
+| self-optimizing.group | default | Optimizer group for Self-optimizing |
+| self-optimizing.quota | 0.1 | Quota for Self-optimizing, indicating the CPU resource the table can take up |
+| self-optimizing.execute.num-retries | 5 | Number of retries after failure of Self-optimizing |
+| self-optimizing.target-size | 134217728(128MB) | Target size for Self-optimizing |
+| self-optimizing.max-file-count | 10000 | Maximum number of files processed by a Self-optimizing process |
+| self-optimizing.max-task-size-bytes | 134217728(128MB) | Maximum file size bytes in a single task for splitting tasks |
+| self-optimizing.fragment-ratio | 8 | The fragment file size threshold. We could divide self-optimizing.target-size by this ratio to get the actual fragment file size |
+| self-optimizing.min-target-size-ratio | 0.75 | The undersized segment file size threshold. Segment files under this threshold will be considered for rewriting |
+| self-optimizing.minor.trigger.file-count | 12 | The minimum number of files to trigger minor optimizing is determined by the sum of fragment file count and equality delete file count |
+| self-optimizing.minor.trigger.interval | 3600000(1 hour) | The time interval in milliseconds to trigger minor optimizing |
+| self-optimizing.major.trigger.duplicate-ratio | 0.1 | The ratio of duplicate data of segment files to trigger major optimizing |
+| self-optimizing.full.trigger.interval | -1(closed) | The time interval in milliseconds to trigger full optimizing |
+| self-optimizing.full.rewrite-all-files | true | Whether full optimizing rewrites all files or skips files that do not need to be optimized |
+| self-optimizing.min-plan-interval | 60000 | The minimum time interval between two self-optimizing planning action |

## Data-cleaning configurations
