Spark: Don't create empty partition replace operations #2960

RussellSpitzer · 2021-08-11T14:52:51Z

Previously, attempting an insert overwrite with an empty dataset
would throw an error. This patch makes Spark skip any no-op
partition replacement operations.

RussellSpitzer · 2021-08-11T15:00:00Z

Solves #2895

#2895 is caused by attempting to build a dynamic replace operation that contains no files, which is currently not allowed. We can either turn this into a no-op in Spark or allow it in Iceberg. This PR turns the operation into a no-op in Spark.
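To illustrate the approach, here is a minimal sketch of skipping the commit when no files were written. `ReplaceOperation`, `replacePartitions`, and the error message are hypothetical stand-ins invented for this example; they are not the Iceberg API and not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

class NoOpReplaceSketch {
  // Hypothetical stand-in for a partition replace operation (not the Iceberg API).
  // It rejects a commit with no files, mirroring the restriction described above.
  static class ReplaceOperation {
    private final List<String> files = new ArrayList<>();

    void addFile(String file) {
      files.add(file);
    }

    void commit() {
      if (files.isEmpty()) {
        throw new IllegalStateException("replace operation contains no files");
      }
      // A real implementation would produce a new table snapshot here.
    }
  }

  // Returns true if a replace was committed, false if it was skipped as a no-op.
  static boolean replacePartitions(List<String> writtenFiles) {
    if (writtenFiles.isEmpty()) {
      return false; // nothing was written, so there is nothing to replace
    }
    ReplaceOperation op = new ReplaceOperation();
    writtenFiles.forEach(op::addFile);
    op.commit();
    return true;
  }

  public static void main(String[] args) {
    System.out.println(replacePartitions(List.of()));
    System.out.println(replacePartitions(List.of("part-0.parquet")));
  }
}
```

The guard sits on the caller side, so the operation itself keeps rejecting empty commits, which corresponds to the "no-op in Spark" option rather than the "allow it in Iceberg" option.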

RussellSpitzer · 2021-08-11T15:00:43Z

@binhnv Could you please review as well?

rdblue · 2021-08-11T17:42:08Z

spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java

    @Override
    public void commit(WriterCommitMessage[] messages) {
+      Iterable<DataFile> files = files(messages);
+      if (Iterables.size(files) == 0) {


What about using !files.iterator().hasNext() instead? I'm not sure we want to assume that the iterable can be consumed multiple times. Plus, there's no need to consume the entire iterable just to check whether it is empty.
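A small sketch of the emptiness check suggested here, using only the JDK. The class and method names are hypothetical, and plain strings stand in for `DataFile`. The check asks the iterator for at most one element instead of counting all of them.

```java
import java.util.Collections;
import java.util.List;

class EmptyIterableCheck {
  // Returns true when the iterable yields no elements.
  // Only hasNext() is called, so no elements are consumed or counted.
  static <T> boolean isEmpty(Iterable<T> iterable) {
    return !iterable.iterator().hasNext();
  }

  public static void main(String[] args) {
    List<String> none = Collections.emptyList();
    List<String> some = List.of("a.parquet", "b.parquet");
    System.out.println(isEmpty(none));
    System.out.println(isEmpty(some));
  }
}
```

Guava's `Iterables.isEmpty(iterable)` performs an equivalent check. Note that this still calls `iterator()` once, so a strictly single-pass iterable would need the resulting iterator to be reused for the later traversal.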

rdblue

I left a minor comment, but this looks good to me.

binhnv · 2021-08-11T18:41:02Z

Thank you for fixing this. The change looks good to me and it is also consistent with Hive's behavior.

RussellSpitzer · 2021-08-11T19:31:47Z

Thanks @rdblue + @binhnv for reviews! Will merge

Merge remote-tracking branch 'upstream/merge-master-20210816' into master

## What does this MR mainly address?

Merges upstream/master, bringing in recent bug fixes and optimizations.

## What does this MR change?

Key PRs to watch:

> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: writing an empty dataset raised an error; such writes are now simply skipped, apache#2960
> Flink UI: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs, apache#288
> Spark: fix the nested struct pruning issue, apache#2877
> Table properties can be used to create v2 format tables, apache#2887
> Add the SortRewriteStrategy framework, to progressively support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties for catalogs, apache#2792
> Spark: support reading and writing timestamps without timezone, apache#2757
> Spark MicroBatch: support a config property to skip delete snapshots, apache#2752
> Spark V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Related to schema time travel, adds schema-id: Core: add schema id to snapshot
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests.

github-actions bot added the spark label Aug 11, 2021

RussellSpitzer requested a review from rdblue August 11, 2021 15:00

Spark: Don't create empty partition replace operations

50d1aa5

When attempting to insert overwrite with an empty dataset we would previously throw an error. This patch causes spark to skip any no-op partition replacement operations.

RussellSpitzer force-pushed the AllowEmptyReplace branch from 5d65f8e to 50d1aa5 Compare August 11, 2021 15:02

rdblue reviewed Aug 11, 2021

rdblue approved these changes Aug 11, 2021

Fix Spark2 As Well

975d811

RussellSpitzer force-pushed the AllowEmptyReplace branch from 79e63aa to 975d811 Compare August 11, 2021 18:04

RussellSpitzer merged commit e4df91e into apache:master Aug 11, 2021

RussellSpitzer deleted the AllowEmptyReplace branch August 11, 2021 19:33

hankfanchiu added a commit to hankfanchiu/iceberg that referenced this pull request Aug 27, 2021

Revert apache#2960 and commit no-op partition replacement operations

77f33db

hankfanchiu mentioned this pull request Aug 27, 2021

Revert #2960 and commit no-op partition replacement operations #3043

Closed

This was referenced Feb 15, 2022

Spark: skip empty dataset in append and overwrite mode #3971

Closed

Got exception if overwrite partitions with empty dataset by spark #3969

Closed
