
Conversation

@RussellSpitzer (Member)

A rewrite strategy for data files which aims to reorder data within data files to optimally lay them out
in relation to a column. For example, if the Sort strategy is used on a set of files ordered
by column x, and the original files are File A (x: 0 - 50), File B (x: 10 - 40) and File C (x: 30 - 60),
this strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40),
and File C' (x: 41 - 60).

Currently there is no clustering detection, and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
is true (default). If this property is disabled, any files with the incorrect sort order, as well as any files
that would be chosen by {@link BinPackStrategy}, will be rewrite candidates.

In the future other algorithms for determining files to rewrite will be provided.
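The candidate-selection rule described above can be sketched as follows. This is a hedged illustration only; the `DataFile` record and `selectFiles` helper are hypothetical and not part of the actual Iceberg `SortStrategy` implementation:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the candidate-selection rule described above.
// The DataFile record and selectFiles method are illustrative only,
// not the real Iceberg SortStrategy API.
public class SelectionSketch {
  record DataFile(String name, int sortOrderId, boolean binPackCandidate) {}

  static List<String> selectFiles(List<DataFile> files, boolean rewriteAll, int tableSortOrderId) {
    if (rewriteAll) {
      // REWRITE_ALL = true (default): every file is a candidate
      return files.stream().map(DataFile::name).collect(Collectors.toList());
    }
    // Otherwise: files with the wrong sort order, plus BinPack candidates
    return files.stream()
        .filter(f -> f.sortOrderId() != tableSortOrderId || f.binPackCandidate())
        .map(DataFile::name)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<DataFile> files = List.of(
        new DataFile("A", 1, false),  // correct order, well sized
        new DataFile("B", 0, false),  // wrong sort order
        new DataFile("C", 1, true));  // undersized, a BinPack candidate
    System.out.println(selectFiles(files, true, 1));   // all files
    System.out.println(selectFiles(files, false, 1));  // only B and C
  }
}
```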

@RussellSpitzer (Member, Author)

@jackye1995 + @openinx - Sort Based PR (Abstract Strategy Only)

@RussellSpitzer force-pushed the SortRewriteStrategy branch 2 times, most recently from f14ed8f to 53ff8be on May 19, 2021 00:25
@SreeramGarlapati (Collaborator) left a comment

Thanks for the contribution @RussellSpitzer

LOG.info("Sort Strategy for table {} set to rewrite all data files", table().name());
return dataFiles;
} else {
FluentIterable filesWithCorrectOrder =
(Contributor)

I think the current implementation as it stands today will always rewrite everything, as we never currently propagate the sort order id to data files.

@RussellSpitzer (Member, Author)

@aokolnychyi Just updated this, I think this is the next step towards getting our Sort Implementation in as well.

Also expose sortOrder so extending classes can use it
* this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
* File C' (x: 41 - 60).
* <p>
 * Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
(Contributor)

Have we thought about how we will support clustering in the future? Is it going to be:

  1. another strategy that extends this SortStrategy
  2. additional configurations in this strategy
  3. or something else?

I am just wondering whether this approach of extending BinPackStrategy is extensible in the long run, as different strategies might have inter-dependencies with each other. If this is just one config, or maybe a few additional configs for clustering in the future, would it be better to just make them configurations in BinPackStrategy directly?

(Member, Author)

I was thinking (2)

@jackye1995 (Contributor) commented Jul 14, 2021

Just trying to rethink clustering here. In Hive, people need to consider clustering because bucketing is not part of the partitioning strategy and you always partition by column value. But in Iceberg, when people use something like partition by category, bucket(16, id) in the partition spec, it is already equivalent to Hive's

PARTITIONED BY (category)
CLUSTERED BY (id) INTO 16 BUCKETS

In this sense, does that mean we don't really need clustering detection in Iceberg's rewrite strategy? Or did I miss any other use cases when you say clustering?
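One rough way to see the equivalence: a bucket transform hashes the column value and takes it modulo the bucket count, which is essentially what Hive's CLUSTERED BY does at write time. The sketch below is illustrative only; Iceberg's real bucket transform uses a Murmur3 hash, while this stand-in uses `Long.hashCode`:

```java
// Illustrative sketch only: Iceberg's real bucket transform hashes values
// with Murmur3 and takes (hash & Integer.MAX_VALUE) % N; the hash function
// below is a stand-in, not the actual transform.
public class BucketSketch {
  static int bucket(long id, int numBuckets) {
    int hash = Long.hashCode(id);  // placeholder for Murmur3
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    // partition by (category, bucket(16, id)) plays the role of
    // Hive's CLUSTERED BY (id) INTO 16 BUCKETS
    for (long id : new long[] {1L, 2L, 42L}) {
      System.out.println("id " + id + " -> bucket " + bucket(id, 16));
    }
  }
}
```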

(Member, Author)

This clustering is within a partition; perhaps the naming is not great here, but I'm talking about whether we have several data files which cover the same region of the sort column. Hopefully in the future we can detect that files A, B, and C have significant overlap, so we should rewrite those 3 files.
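The "file-overlap" idea can be sketched as a pairwise range-intersection check over per-file min/max statistics of the sort column. Everything here (the `FileRange` record, the `overlappingFiles` helper) is hypothetical, not Iceberg code:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the file-overlap detection described above: flag files
// whose sort-column [min, max] ranges intersect another file's range.
// Types and names are hypothetical, not Iceberg APIs.
public class OverlapDetector {
  record FileRange(String name, long min, long max) {}

  static List<String> overlappingFiles(List<FileRange> files) {
    List<String> candidates = new ArrayList<>();
    for (int i = 0; i < files.size(); i++) {
      for (int j = 0; j < files.size(); j++) {
        // Two closed ranges intersect iff each starts before the other ends
        if (i != j
            && files.get(i).min() <= files.get(j).max()
            && files.get(j).min() <= files.get(i).max()
            && !candidates.contains(files.get(i).name())) {
          candidates.add(files.get(i).name());
        }
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<FileRange> files = List.of(
        new FileRange("A", 0, 50), new FileRange("B", 10, 40),
        new FileRange("C", 30, 60), new FileRange("D", 100, 120));
    // A, B, C overlap each other; D stands alone and is not a candidate
    System.out.println(overlappingFiles(files));
  }
}
```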

(Contributor)

I see, that makes much more sense; we should probably make it clear in the documentation. Apart from that, I am good with this!

(Member, Author)

Fixed, switched to "file-overlap" instead of "clustering"

@Test
public void testSelectAll() {
List<FileScanTask> invalid = ImmutableList.<FileScanTask>builder()
.addAll(tasksForSortOrder(-1, 500, 500, 500, 500))
(Contributor)

What is order id -1? If it's the default UNSORTED_ORDER, it should be 0.

(Member, Author)

Currently it doesn't matter because of the note I wrote below: we normally do not set this value on our writers.

* <p>
 * Currently there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
* is true (default: false). If this property is disabled any files that would be chosen by
* {@link BinPackStrategy} will be rewrite candidates.
(Contributor)

I thought it would be any files that would be chosen by BinPackStrategy plus any file with the wrong sort order. Why is the second part excluded?

(Member, Author)

Currently we don't ever actually record the sort order that a file is written with, so it will never match at the moment.

(Contributor)

sort_order_id is an optional field in the data file; in the current process the sort order is always null. I think we can distinguish that case from a wrong sort order, which means non-null and not matching the given order id.

Once in v2, it will also be beneficial to rewrite data files with a null sort order id to backfill the order for tables written by v1 writers. It seems like we could add boolean options, something like rewrite-missing-order and rewrite-wrong-order?
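The distinction being proposed could look roughly like this. The `OrderStatus` enum and `classify` helper are illustrative only, and the rewrite-missing-order / rewrite-wrong-order options are just the reviewer's suggestion, not existing Iceberg properties:

```java
// Hedged sketch of the proposed distinction: a null sort_order_id ("missing",
// e.g. written by a v1 writer) versus a non-null id that differs from the
// table's current order ("wrong"). Names are hypothetical, not Iceberg APIs.
public class SortOrderCheck {
  enum OrderStatus { MATCHES, MISSING, WRONG }

  static OrderStatus classify(Integer fileSortOrderId, int tableSortOrderId) {
    if (fileSortOrderId == null) {
      return OrderStatus.MISSING;  // candidate under a rewrite-missing-order option
    }
    return fileSortOrderId == tableSortOrderId
        ? OrderStatus.MATCHES
        : OrderStatus.WRONG;       // candidate under a rewrite-wrong-order option
  }

  public static void main(String[] args) {
    System.out.println(classify(null, 1));
    System.out.println(classify(2, 1));
    System.out.println(classify(1, 1));
  }
}
```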

(Member, Author)

That's fine, I just think it's ok to hold off on options for code we don't have the ability to use yet.

(Contributor)

Cool, I am good with having it later

@RussellSpitzer (Member, Author) commented Jul 20, 2021

Ok, if we don't have any other points, I think we can get this in so I can rebase the implementation and sort order PRs. I'll wait a while to see if anyone has any last opinions and merge tonight.

@RussellSpitzer RussellSpitzer merged commit 2a39712 into apache:master Jul 21, 2021
@RussellSpitzer RussellSpitzer deleted the SortRewriteStrategy branch July 21, 2021 13:47
@RussellSpitzer (Member, Author)

Thanks everyone for the review! Please check out the implementation PR as well #2829

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR mainly solve?

Merges upstream/master to pull in recent bug fixes and optimizations.

## What are the changes in this MR?

Key PRs of interest:
> Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: fix the error when writing an empty dataset by simply skipping the write, apache#2960
> Flink UI: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs, apache#288
> Spark: fix nested struct pruning, apache#2877
> Allow creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, gradually supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring hadoop properties for a catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: add validation for row-level deletes with rewrites, apache#2865
> Schema time travel related: add schema-id; Core: add schema id to snapshot
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: update to 1.12.0, apache#2441
> Hive: vectorized ORC reads for Hive, apache#2613
> Spark: add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests.