
Conversation

@RussellSpitzer (Member)

A rewrite strategy for data files which aims to reorder data within data files to optimally lay them out
in relation to a column. For example, if the Sort strategy is used on a set of files ordered
by column x, and the original files are File A (x: 0 - 50), File B (x: 10 - 40) and File C (x: 30 - 60),
this strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40),
and File C' (x: 41 - 60).

Currently there is no clustering detection, and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
is true (default). If this property is disabled, any files with the incorrect sort order, as well as any files
that would be chosen by {@link BinPackStrategy}, will be rewrite candidates.

In the future other algorithms for determining files to rewrite will be provided.
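The candidate-selection rule described above can be sketched as follows. This is a hedged illustration only; the `DataFile` record and `selectFiles` helper are hypothetical and not part of the actual Iceberg `SortStrategy` implementation:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the candidate-selection rule described above.
// The DataFile record and selectFiles method are illustrative only,
// not the real Iceberg SortStrategy API.
public class SelectionSketch {
  record DataFile(String name, int sortOrderId, boolean binPackCandidate) {}

  static List<String> selectFiles(List<DataFile> files, boolean rewriteAll, int tableSortOrderId) {
    if (rewriteAll) {
      // REWRITE_ALL = true (default): every file is a candidate
      return files.stream().map(DataFile::name).collect(Collectors.toList());
    }
    // Otherwise: files with the wrong sort order, plus BinPack candidates
    return files.stream()
        .filter(f -> f.sortOrderId() != tableSortOrderId || f.binPackCandidate())
        .map(DataFile::name)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<DataFile> files = List.of(
        new DataFile("A", 1, false),  // correct order, well sized
        new DataFile("B", 0, false),  // wrong sort order
        new DataFile("C", 1, true));  // undersized, a BinPack candidate
    System.out.println(selectFiles(files, true, 1));   // all files
    System.out.println(selectFiles(files, false, 1));  // only B and C
  }
}
```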

@RussellSpitzer (Member, Author)

@jackye1995 + @openinx - Sort Based PR (Abstract Strategy Only)

@RussellSpitzer force-pushed the SortRewriteStrategy branch 2 times, most recently from f14ed8f to 53ff8be on May 19, 2021 00:25
@SreeramGarlapati (Collaborator) left a comment

Thanks for the contribution @RussellSpitzer

LOG.info("Sort Strategy for table {} set to rewrite all data files", table().name());
return dataFiles;
} else {
FluentIterable filesWithCorrectOrder =
(Contributor)

I think the current implementation as it stands today will always rewrite everything, as we never currently propagate the sort order id to data files.

@RussellSpitzer (Member, Author)

@aokolnychyi Just updated this, I think this is the next step towards getting our Sort Implementation in as well.

Also expose sortOrder so extending classes can use it
* this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
* File C' (x: 41 - 60).
* <p>
 * Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
(Contributor)

Have we thought about how we will support clustering in the future? Is it going to be:

  1. another strategy that extends this SortStrategy
  2. additional configurations in this strategy
  3. or something else?

I am just wondering whether this approach of extending BinPackStrategy is extensible in the long run, as different strategies might have inter-dependencies with each other. If this is just one config, or maybe a few additional configs for clustering in the future, would it be better to just make them configurations in BinPackStrategy directly?

(Member, Author)

I was thinking (2)

@jackye1995 (Contributor) commented Jul 14, 2021

Just trying to rethink clustering here. In Hive, people need to consider clustering because bucketing is not part of the partitioning strategy and you always partition by column value. But in Iceberg, when people use something like partition by category, bucket(16, id) in the partition spec, it is already equivalent to Hive's

PARTITIONED BY (category)
CLUSTERED BY (id) INTO 16 BUCKETS

In this sense, does that mean we don't really need clustering detection in Iceberg's rewrite strategy? Or did I miss any other use cases when you say clustering?
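One rough way to see the equivalence: a bucket transform hashes the column value and takes it modulo the bucket count, which is essentially what Hive's CLUSTERED BY does at write time. The sketch below is illustrative only; Iceberg's real bucket transform uses a Murmur3 hash, while this stand-in uses `Long.hashCode`:

```java
// Illustrative sketch only: Iceberg's real bucket transform hashes values
// with Murmur3 and takes (hash & Integer.MAX_VALUE) % N; the hash function
// below is a stand-in, not the actual transform.
public class BucketSketch {
  static int bucket(long id, int numBuckets) {
    int hash = Long.hashCode(id);  // placeholder for Murmur3
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    // partition by (category, bucket(16, id)) plays the role of
    // Hive's CLUSTERED BY (id) INTO 16 BUCKETS
    for (long id : new long[] {1L, 2L, 42L}) {
      System.out.println("id " + id + " -> bucket " + bucket(id, 16));
    }
  }
}
```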

(Member, Author)

This clustering is within a partition; perhaps the naming is not great here, but I'm talking about whether we have several data files which cover the same region of the sort column. Hopefully in the future we can detect that files A, B, and C have significant overlap, so we should rewrite those 3 files.
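The "file-overlap" idea can be sketched as a pairwise range-intersection check over per-file min/max statistics of the sort column. Everything here (the `FileRange` record, the `overlappingFiles` helper) is hypothetical, not Iceberg code:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the file-overlap detection described above: flag files
// whose sort-column [min, max] ranges intersect another file's range.
// Types and names are hypothetical, not Iceberg APIs.
public class OverlapDetector {
  record FileRange(String name, long min, long max) {}

  static List<String> overlappingFiles(List<FileRange> files) {
    List<String> candidates = new ArrayList<>();
    for (int i = 0; i < files.size(); i++) {
      for (int j = 0; j < files.size(); j++) {
        // Two closed ranges intersect iff each starts before the other ends
        if (i != j
            && files.get(i).min() <= files.get(j).max()
            && files.get(j).min() <= files.get(i).max()
            && !candidates.contains(files.get(i).name())) {
          candidates.add(files.get(i).name());
        }
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<FileRange> files = List.of(
        new FileRange("A", 0, 50), new FileRange("B", 10, 40),
        new FileRange("C", 30, 60), new FileRange("D", 100, 120));
    // A, B, C overlap each other; D stands alone and is not a candidate
    System.out.println(overlappingFiles(files));
  }
}
```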

(Contributor)

I see, that makes much more sense; we should probably make it clear in the documentation. Apart from that, I am good with this!

(Member, Author)

Fixed, switched to "file-overlap" instead of "clustering"

@Test
public void testSelectAll() {
List<FileScanTask> invalid = ImmutableList.<FileScanTask>builder()
.addAll(tasksForSortOrder(-1, 500, 500, 500, 500))
(Contributor)

What is order id -1? If it's the default UNSORTED_ORDER, it should be 0.

(Member, Author)

Currently it doesn't matter because of the note I wrote below: we normally do not set this value on our writers.

* <p>
 * Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
* is true (default: false). If this property is disabled any files that would be chosen by
* {@link BinPackStrategy} will be rewrite candidates.
(Contributor)

I thought it would be any files that would be chosen by BinPackStrategy plus any file with the wrong sort order. Why is the second part excluded?

(Member, Author)

Currently we don't ever actually record the sort order that a file is written with, so it will never match at the moment.

(Contributor)

sort_order_id is an optional field in the data file; in the current process the sort order is always null. I think we can distinguish that case from a wrong sort order, which means non-null and not matching the given order id.

Once in v2, it will also be beneficial to rewrite data files with a null sort order id to backfill the order for tables written by v1 writers. It seems like we could add boolean options, something like rewrite-missing-order and rewrite-wrong-order?
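The distinction being proposed could look roughly like this. The `OrderStatus` enum and `classify` helper are illustrative only, and the rewrite-missing-order / rewrite-wrong-order options are just the reviewer's suggestion, not existing Iceberg properties:

```java
// Hedged sketch of the proposed distinction: a null sort_order_id ("missing",
// e.g. written by a v1 writer) versus a non-null id that differs from the
// table's current order ("wrong"). Names are hypothetical, not Iceberg APIs.
public class SortOrderCheck {
  enum OrderStatus { MATCHES, MISSING, WRONG }

  static OrderStatus classify(Integer fileSortOrderId, int tableSortOrderId) {
    if (fileSortOrderId == null) {
      return OrderStatus.MISSING;  // candidate under a rewrite-missing-order option
    }
    return fileSortOrderId == tableSortOrderId
        ? OrderStatus.MATCHES
        : OrderStatus.WRONG;       // candidate under a rewrite-wrong-order option
  }

  public static void main(String[] args) {
    System.out.println(classify(null, 1));
    System.out.println(classify(2, 1));
    System.out.println(classify(1, 1));
  }
}
```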

(Member, Author)

That's fine, I just think it's ok to hold off on options for code we don't have the ability to use yet.

(Contributor)

Cool, I am good with having it later

@RussellSpitzer (Member, Author) commented Jul 20, 2021

Ok, if we don't have any other points, I think we can get this in so I can rebase the implementation and sort order PRs. I'll wait a while to see if anyone has any last opinions and merge tonight.

@RussellSpitzer RussellSpitzer merged commit 2a39712 into apache:master Jul 21, 2021
@RussellSpitzer RussellSpitzer deleted the SortRewriteStrategy branch July 21, 2021 13:47
@RussellSpitzer (Member, Author)

Thanks everyone for the review! Please check out the implementation PR as well #2829

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR mainly solve?

Merges upstream/master to pull in recent bug fixes and optimizations.

## What are the changes in this MR?

Key PRs of interest:
> Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: fix the error when writing an empty dataset by simply skipping the write, apache#2960
> Flink UI: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs, apache#288
> Spark: fix nested struct pruning, apache#2877
> Allow creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, gradually supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring hadoop properties for a catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: add validation for row-level deletes with rewrites, apache#2865
> Schema time travel related: add schema-id; Core: add schema id to snapshot
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: update to 1.12.0, apache#2441
> Hive: vectorized ORC reads for Hive, apache#2613
> Spark: add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests.