-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark MicroBatch read - Ignore replace snapshots and add Spark option to skip delete snapshots #2752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark MicroBatch read - Ignore replace snapshots and add Spark option to skip delete snapshots #2752
Conversation
* implement skipDelete and skipReplace options * revert changes in SnapshotUtil
…b.com/daksha121/iceberg into stream.read.ignore.delete.and.replace
spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java
Outdated
Show resolved
Hide resolved
spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java
Outdated
Show resolved
Hide resolved
spark3/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java
Outdated
Show resolved
Hide resolved
spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Outdated
Show resolved
Hide resolved
spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Outdated
Show resolved
Hide resolved
spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Show resolved
Hide resolved
spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Show resolved
Hide resolved
spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Outdated
Show resolved
Hide resolved
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I had a few questions on the skip/current offset logic. Let me know when I should take another look
Thanks for the review @RussellSpitzer! I addressed the comments and it's ready for another look |
spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Outdated
Show resolved
Hide resolved
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had tiny minor comment on debug method but this looks good to me
|
Hitting some style check failures Be sure to either enable the style checker in your IDE or run the gradle based tests |
|
Addressed your comments, @rdblue this is ready for another pass. Thanks! |
|
@daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter |
I'm not clear on which parameter needs to be renamed.. Are you referring to the new spark option (skip-deletes-on-stream-read) that this PR introduces? |
@RussellSpitzer how about just:
|
|
@rdblue what do you think? I'm fine with |
|
We try not to use namespaces in the read options to keep them simple. This one is a bit odd because |
Thanks @rdblue, that makes sense. Can you help understand why we wouldn't want to put the word streaming in like |
|
I'd be fine adding |
Thanks @rdblue. Renamed the option to |
|
@rdblue - pl. drop me as a contributor for this work. @daksha121 - was nice enough to mention me as one. I was trying to pay back her help in work: #2660. |
|
Thanks, @daksha121! I merged this. |
Merge remote-tracking branch 'upstream/merge-master-20210816' into master ## 该MR主要解决什么? merge upstream/master,引入最近的一些bugFix和优化 ## 该MR的修改是什么? 核心关注PR: > Predicate PushDown 支持,https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files > Spark场景写入空dataset 报错问题,直接skip掉即可, apache#2960 > Flink UI补充uidPrefix到operator方便跟踪多个iceberg sink任务, apache#288 > Spark 修复nested Struct Pruning问题, apache#2877 > 可以使用Table Properties指定创建v2 format表,apache#2887 > 补充SortRewriteStrategy框架,逐步支持不同rewrite策略, apache#2609 (WIP:apache#2829) > Spark 为catalog配置hadoop属性支持, apache#2792 > Spark 针对timestamps without timezone读写支持, apache#2757 > Spark MicroBatch支持配置属性skip delete snapshots, apache#2752 > Spark V2 RewriteDatafilesAction 支持 > Core: Add validation for row-level deletes with rewrites, apache#2865 > schema time travel 功能相关,补充schema-id, Core: add schema id to snapshot > Spark Extension支持identifier fields操作, apache#2560 > Parquet: Update to 1.12.0, apache#2441 > Hive: Vectorized ORC reads for Hive, apache#2613 > Spark: Add an action to remove all referenced files, apache#2415 ## 该MR是如何测试的? UT
Building on top of #2660, this PR introduces the ability to skip deletes in Micro Batch read path. It also ignores replace snapshots.
Spark option introduced:
streaming-skip-delete-snapshotsAdditional contributor to this PR: @SreeramGarlapati