-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark: Don't create empty partition replace operations #2960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark: Don't create empty partition replace operations #2960
Conversation
|
@binhnv Could please review as well |
When attempting to insert overwrite with an empty dataset we would previously throw an error. This patch causes spark to skip any no-op partition replacement operations.
5d65f8e to
50d1aa5
Compare
| @Override | ||
| public void commit(WriterCommitMessage[] messages) { | ||
| Iterable<DataFile> files = files(messages); | ||
| if (Iterables.size(files) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using !files.hasNext instead? I'm not sure we want to assume that the iterable can be consumed multiple times. Plus there's no need to consume the entire iterable just to check whether it is empty.
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a minor comment, but this looks good to me.
79e63aa to
975d811
Compare
|
Thank you for fixing this. The change looks good to me and it is also consistent with Hive's behavior. |
Merge remote-tracking branch 'upstream/merge-master-20210816' into master ## 该MR主要解决什么? merge upstream/master,引入最近的一些bugFix和优化 ## 该MR的修改是什么? 核心关注PR: > Predicate PushDown 支持,https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files > Spark场景写入空dataset 报错问题,直接skip掉即可, apache#2960 > Flink UI补充uidPrefix到operator方便跟踪多个iceberg sink任务, apache#288 > Spark 修复nested Struct Pruning问题, apache#2877 > 可以使用Table Properties指定创建v2 format表,apache#2887 > 补充SortRewriteStrategy框架,逐步支持不同rewrite策略, apache#2609 (WIP:apache#2829) > Spark 为catalog配置hadoop属性支持, apache#2792 > Spark 针对timestamps without timezone读写支持, apache#2757 > Spark MicroBatch支持配置属性skip delete snapshots, apache#2752 > Spark V2 RewriteDatafilesAction 支持 > Core: Add validation for row-level deletes with rewrites, apache#2865 > schema time travel 功能相关,补充schema-id, Core: add schema id to snapshot > Spark Extension支持identifier fields操作, apache#2560 > Parquet: Update to 1.12.0, apache#2441 > Hive: Vectorized ORC reads for Hive, apache#2613 > Spark: Add an action to remove all referenced files, apache#2415 ## 该MR是如何测试的? UT
When attempting to insert overwrite with an empty dataset
we would previously throw an error. This patch causes spark
to skip any no-op partition replacement operations.