Skip to content

Conversation

@daksha121
Copy link
Contributor

@daksha121 daksha121 commented Jun 29, 2021

Building on top of #2660, this PR introduces the ability to skip deletes in Micro Batch read path. It also ignores replace snapshots.
Spark option introduced:

  • To skip deletes: streaming-skip-delete-snapshots
    Additional contributor to this PR: @SreeramGarlapati

SreeramGarlapati and others added 3 commits June 25, 2021 19:25
* implement skipDelete and skipReplace options
* revert changes in SnapshotUtil
@github-actions github-actions bot added the spark label Jun 29, 2021
@daksha121 daksha121 changed the title Spark MicroBatch read - implement skipReplace and skipDelete Spark MicroBatch read - Ignore replace snapshots and add Spark option to skip delete snapshots Jun 30, 2021
Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I had a few questions on the skip/current offset logic. Let me know when I should take another look

@daksha121
Copy link
Contributor Author

Looks good to me, I had a few questions on the skip/current offset logic. Let me know when I should take another look

Thanks for the review @RussellSpitzer! I addressed the comments and it's ready for another look

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had tiny minor comment on debug method but this looks good to me

@RussellSpitzer
Copy link
Member

Hitting some style check failures

Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java:194:7: 'if' is not followed by whitespace. [WhitespaceAfter]

Be sure to either enable the style checker in your IDE or run the gradle based tests

g build -x test  -x integrationTest

@daksha121 daksha121 requested a review from rdblue June 30, 2021 20:45
@daksha121
Copy link
Contributor Author

Addressed your comments, @rdblue this is ready for another pass. Thanks!

@RussellSpitzer
Copy link
Member

@daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter

@daksha121
Copy link
Contributor Author

@daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter

I'm not clear on which parameter needs to be renamed.. Are you referring to the new spark option (skip-deletes-on-stream-read) that this PR introduces?

@daksha121
Copy link
Contributor Author

@daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter

I'm not clear on which parameter needs to be renamed.. Are you referring to the new spark option (skip-deletes-on-stream-read) that this PR introduces?

@RussellSpitzer how about just:

  • skip-deletes OR
  • ignore-deletes

@RussellSpitzer
Copy link
Member

@rdblue what do you think? I'm fine with streaming.ignore-deletes or anything like that.

@rdblue
Copy link
Contributor

rdblue commented Jul 9, 2021

We try not to use namespaces in the read options to keep them simple. This one is a bit odd because ignore-deletes could be misinterpreted easily -- we don't want anyone to think they can read a table without applying delete files, for example. So I want to make it clear that this is streaming and that this not delete files. The best I'm coming up with is skip-delete-snapshots, which is clear that it is referring to snapshot operations. I don't think it would be misinterpreted if it were used in a batch context since it wouldn't make sense to run a select that doesn't do anything if the latest snapshot is a delete.

@daksha121
Copy link
Contributor Author

We try not to use namespaces in the read options to keep them simple. This one is a bit odd because ignore-deletes could be misinterpreted easily -- we don't want anyone to think they can read a table without applying delete files, for example. So I want to make it clear that this is streaming and that this not delete files. The best I'm coming up with is skip-delete-snapshots, which is clear that it is referring to snapshot operations. I don't think it would be misinterpreted if it were used in a batch context since it wouldn't make sense to run a select that doesn't do anything if the latest snapshot is a delete.

Thanks @rdblue, that makes sense. Can you help understand why we wouldn't want to put the word streaming in like streaming-skip-deletes? We could perhaps eliminate the risk of misinterpretation. Maybe I'm missing something here

@rdblue
Copy link
Contributor

rdblue commented Jul 10, 2021

I'd be fine adding streaming-. I was just trying to keep it as small as possible. We definitely need -snapshots because it should be clear that this doesn't affect row-level deletes.

@daksha121
Copy link
Contributor Author

I'd be fine adding streaming-. I was just trying to keep it as small as possible. We definitely need -snapshots because it should be clear that this doesn't affect row-level deletes.

Thanks @rdblue. Renamed the option to streaming-skip-delete-snapshots

@SreeramGarlapati
Copy link
Collaborator

@rdblue - pl. drop me as a contributor for this work. @daksha121 - was nice enough to mention me as one. I was trying to pay back her help in work: #2660.

@rdblue rdblue merged commit 40e626a into apache:master Jul 13, 2021
@rdblue
Copy link
Contributor

rdblue commented Jul 13, 2021

Thanks, @daksha121! I merged this.

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## 该MR主要解决什么?

merge upstream/master,引入最近的一些bugFix和优化

## 该MR的修改是什么?

核心关注PR:
> Predicate PushDown 支持,https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark场景写入空dataset 报错问题,直接skip掉即可, apache#2960
> Flink UI补充uidPrefix到operator方便跟踪多个iceberg sink任务, apache#288
> Spark 修复nested Struct Pruning问题, apache#2877
> 可以使用Table Properties指定创建v2 format表,apache#2887
> 补充SortRewriteStrategy框架,逐步支持不同rewrite策略, apache#2609 (WIP:apache#2829)
> Spark 为catalog配置hadoop属性支持, apache#2792
> Spark 针对timestamps without timezone读写支持, apache#2757
> Spark MicroBatch支持配置属性skip delete snapshots, apache#2752
> Spark V2 RewriteDatafilesAction 支持
> Core: Add validation for row-level deletes with rewrites, apache#2865 > schema time travel 功能相关,补充schema-id, Core: add schema id to snapshot 
> Spark Extension支持identifier fields操作, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## 该MR是如何测试的?

UT
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants