Adds SortRewriteStrategy #2609
Conversation
@jackye1995 + @openinx - Sort Based PR (Abstract Strategy Only)
Force-pushed f14ed8f to 53ff8be
core/src/test/java/org/apache/iceberg/actions/TestSortStrategy.java (outdated, resolved)
SreeramGarlapati left a comment
Thanks for the contribution @RussellSpitzer
  LOG.info("Sort Strategy for table {} set to rewrite all data files", table().name());
  return dataFiles;
} else {
  FluentIterable filesWithCorrectOrder =
I think the current implementation as it stands today will always rewrite everything as we never currently propagate the sort order id to data files.
@aokolnychyi Just updated this, I think this is the next step towards getting our Sort Implementation in as well.
A rewrite strategy for data files which aims to reorder data within data files to optimally lay them out
in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
by column x and originally has files File A (x: 0 - 50), File B (x: 10 - 40) and File C (x: 30 - 60),
this strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40),
File C' (x: 41 - 60).
Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
is true (default). If this property is disabled, any files with the incorrect sort order as well as any files
that would be chosen by {@link BinPackStrategy} will be rewrite candidates.
In the future, other algorithms for determining files to rewrite will be provided.
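For illustration, here is a minimal, self-contained sketch of the candidate-selection behavior described above; the method and parameter names are assumptions for this sketch, not the PR's actual API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

public class SortSelectionSketch {
  // With rewrite-all enabled, every file is rewritten; otherwise files whose recorded sort
  // order does not match the table's current order are added to whatever the bin-pack rules
  // already selected. <T> stands in for Iceberg's FileScanTask.
  static <T> List<T> selectCandidates(Collection<T> allFiles,
                                      boolean rewriteAll,
                                      Set<T> wrongSortOrder,
                                      Set<T> binPackCandidates) {
    if (rewriteAll) {
      return new ArrayList<>(allFiles);
    }

    List<T> candidates = new ArrayList<>();
    for (T file : allFiles) {
      if (wrongSortOrder.contains(file) || binPackCandidates.contains(file)) {
        candidates.add(file);
      }
    }
    return candidates;
  }
}
```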
Also expose sortOrder so extending classes can use it
Force-pushed 2795a42 to d754675
* this Strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40),
* File C' (x: 41 - 60).
* <p>
* Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
Have we thought about how we will support clustering in the future? Is it going to be:
- another strategy that extends this SortStrategy
- additional configurations in this strategy
- or something else?
I am just wondering whether this approach of extending BinPackStrategy is extensible in the long run, as different strategies might have inter-dependencies with each other. If this is just one config, or maybe a few additional configs for clustering in the future, would it be better to just make them configurations in BinPackStrategy directly?
I was thinking (2)
Just trying to rethink clustering here. In Hive, people need to consider clustering because bucketing is not part of the partitioning strategy and you always partition by column value. But in Iceberg, when people use something like partition by category, bucket(16, id) in the partition strategy, it is already equivalent to Hive's
PARTITIONED BY (category)
CLUSTERED BY (id) INTO 16 BUCKETS
In this sense, does that mean we actually don't really need clustering detection in Iceberg's rewrite strategy? Or did I miss any other use cases when you say clustering?
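As a side note, a minimal sketch of the point above, assuming a simple two-column schema; it uses Iceberg's PartitionSpec builder to express the Hive-style clustering directly in the partition spec (the field ids and names are made up for this example).

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class BucketSpecExample {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.required(2, "category", Types.StringType.get()));

    // Roughly equivalent to Hive's PARTITIONED BY (category) CLUSTERED BY (id) INTO 16 BUCKETS
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .identity("category")
        .bucket("id", 16)
        .build();

    System.out.println(spec);
  }
}
```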
This clustering is within a partition; perhaps the naming is not great here, but I'm talking about whether we have several data files which cover the same region of the sort column. Hopefully in the future we can detect that files A, B, and C have significant overlap so we should rewrite those 3 files.
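To make the "file-overlap" idea concrete, here is a minimal, self-contained sketch (FileRange and the overlap rule are illustrative assumptions, not Iceberg code) that flags files whose sort-column ranges overlap, using the A/B/C example from the Javadoc:

```java
import java.util.Arrays;
import java.util.List;

public class FileOverlapSketch {
  // Illustrative [min, max] range of the sort column for one data file.
  static class FileRange {
    final String name;
    final long min;
    final long max;

    FileRange(String name, long min, long max) {
      this.name = name;
      this.min = min;
      this.max = max;
    }
  }

  // Two closed ranges overlap when each one starts before the other ends.
  static boolean overlaps(FileRange a, FileRange b) {
    return a.min <= b.max && b.min <= a.max;
  }

  public static void main(String[] args) {
    List<FileRange> files = Arrays.asList(
        new FileRange("A", 0, 50),
        new FileRange("B", 10, 40),
        new FileRange("C", 30, 60));

    // A/B, A/C and B/C all overlap, so ideally all three files would be rewrite candidates.
    for (int i = 0; i < files.size(); i++) {
      for (int j = i + 1; j < files.size(); j++) {
        if (overlaps(files.get(i), files.get(j))) {
          System.out.println(files.get(i).name + " and " + files.get(j).name + " overlap");
        }
      }
    }
  }
}
```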
I see, that makes much more sense; we should probably make it clear in the documentation. Apart from that, I am good with this!
Fixed, switched to "file-overlap" instead of "clustering"
@Test
public void testSelectAll() {
  List<FileScanTask> invalid = ImmutableList.<FileScanTask>builder()
      .addAll(tasksForSortOrder(-1, 500, 500, 500, 500))
what is order id -1? If it's default UNSORTED_ORDER, it should be 0.
Currently it doesn't matter because of the note I wrote below, since we normally do not set this value on our writers
* <p>
* Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
* is true (default: false). If this property is disabled any files that would be chosen by
* {@link BinPackStrategy} will be rewrite candidates.
I thought it would be any files that would be chosen by BinPackStrategy + any file with the wrong sort order. Why is part 2 excluded?
Currently we don't ever actually record the sort order that a file is written with, so that means it will never match atm.
sort_order_id is an optional field in the data file; in the current process the sort order is always null. I think we can distinguish that case from a wrong sort order, which means a non-null id that does not match the given order id.
Once in v2, it will also be beneficial to rewrite data files with a null sort order id to backfill the order for tables written by a v1 writer. It seems like we could add boolean options for something like rewrite-missing-order and rewrite-wrong-order?
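A minimal sketch of the distinction being discussed, assuming hypothetical rewrite-missing-order / rewrite-wrong-order flags (none of this is existing Iceberg configuration):

```java
public class SortOrderIdSketch {
  // A null id means the writer never recorded a sort order (e.g. current v1 writers);
  // a non-null id that differs from the table's current order means the file is stale.
  static boolean shouldRewrite(Integer fileSortOrderId, int tableSortOrderId,
                               boolean rewriteMissingOrder, boolean rewriteWrongOrder) {
    if (fileSortOrderId == null) {
      return rewriteMissingOrder;
    }
    return rewriteWrongOrder && fileSortOrderId != tableSortOrderId;
  }
}
```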
That's fine, I just think it's OK to hold off on options for code we don't have the ability to use yet.
Cool, I am good with having it later
Ok, if we don't have any other points I think we can get this in so I can rebase the implementation and sort order PRs. I'll wait a while to see if anyone has any last opinions and merge tonight.
Thanks everyone for the review! Please check out the implementation PR as well: #2829
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR mainly solve?
Merges upstream/master to pick up recent bug fixes and optimizations.
## What are the changes in this MR?
Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: writing an empty dataset no longer errors, it is simply skipped, apache#2960
> Flink: add uidPrefix to operators so multiple Iceberg sink jobs are easier to track in the UI, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, gradually supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring hadoop properties for a catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel: add schema-id (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415
## How was this MR tested?
UT