Spark MicroBatch read - Ignore replace snapshots and add Spark option to skip delete snapshots #2752

daksha121 · 2021-06-29T05:09:18Z

Building on top of #2660, this PR introduces the ability to skip delete snapshots in the micro-batch read path. It also ignores replace snapshots.
Spark option introduced:

To skip deletes: streaming-skip-delete-snapshots
Additional contributor to this PR: @SreeramGarlapati
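
A minimal sketch of the behaviour described above, not the code in this PR. It models snapshots as plain operation strings and shows replace snapshots being ignored unconditionally, while delete snapshots are skipped only when the option is enabled. The class `SnapshotFilterSketch`, the methods `shouldProcess` and `snapshotsToRead`, and the choice to throw `IllegalStateException` when a delete snapshot is met without the option are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SnapshotFilterSketch {
  // Hypothetical stand-ins for snapshot operation names.
  static final String APPEND = "append";
  static final String REPLACE = "replace";
  static final String DELETE = "delete";

  // Returns true if a snapshot with this operation should be read by the stream.
  static boolean shouldProcess(String operation, boolean skipDeleteSnapshots) {
    switch (operation) {
      case APPEND:
        return true;
      case REPLACE:
        // Replace snapshots are always ignored.
        return false;
      case DELETE:
        if (skipDeleteSnapshots) {
          return false;
        }
        // Assumption: without the option, a delete snapshot is treated as an error.
        throw new IllegalStateException("Delete snapshot found; set streaming-skip-delete-snapshots to skip it");
      default:
        throw new IllegalStateException("Unsupported operation: " + operation);
    }
  }

  // Returns the positions of the snapshots that the stream would read.
  static List<Integer> snapshotsToRead(List<String> operations, boolean skipDeleteSnapshots) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < operations.size(); i++) {
      if (shouldProcess(operations.get(i), skipDeleteSnapshots)) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    List<String> operations = Arrays.asList(APPEND, REPLACE, DELETE, APPEND);
    System.out.println(snapshotsToRead(operations, true)); // prints [0, 3]
  }
}
```

With the option disabled, the same input would fail at the delete snapshot in this sketch instead of silently dropping it.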

spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java

spark3/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java

spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

RussellSpitzer

Looks good to me, I had a few questions on the skip/current offset logic. Let me know when I should take another look

daksha121 · 2021-06-30T19:58:59Z

> Looks good to me, I had a few questions on the skip/current offset logic. Let me know when I should take another look

Thanks for the review @RussellSpitzer! I addressed the comments and it's ready for another look

spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

RussellSpitzer

Had one minor comment on the debug method, but this looks good to me.

RussellSpitzer · 2021-06-30T20:32:12Z

Hitting some style check failures

Error: [checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java:194:7: 'if' is not followed by whitespace. [WhitespaceAfter]

Be sure to either enable the style checker in your IDE or run the gradle based tests

./gradlew build -x test -x integrationTest

daksha121 · 2021-07-01T20:51:59Z

Addressed your comments, @rdblue this is ready for another pass. Thanks!

RussellSpitzer · 2021-07-08T23:30:45Z

@daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter

daksha121 · 2021-07-09T00:17:29Z

> @daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter

I'm not clear on which parameter needs to be renamed. Are you referring to the new Spark option (skip-deletes-on-stream-read) that this PR introduces?

daksha121 · 2021-07-09T15:33:43Z

> @daksha121 I think we are still waiting on a change to the parameter name. I'm not good on names so I generally am fine with almost anything but I believe we are still hoping for a better name for the parameter

> I'm not clear on which parameter needs to be renamed. Are you referring to the new Spark option (skip-deletes-on-stream-read) that this PR introduces?

@RussellSpitzer how about just:

skip-deletes OR
ignore-deletes

RussellSpitzer · 2021-07-09T15:35:53Z

@rdblue what do you think? I'm fine with streaming.ignore-deletes or anything like that.

rdblue · 2021-07-09T19:37:45Z

We try not to use namespaces in the read options to keep them simple. This one is a bit odd because ignore-deletes could be misinterpreted easily -- we don't want anyone to think they can read a table without applying delete files, for example. So I want to make it clear that this is streaming and that this is not about delete files. The best I'm coming up with is skip-delete-snapshots, which is clear that it is referring to snapshot operations. I don't think it would be misinterpreted if it were used in a batch context since it wouldn't make sense to run a select that doesn't do anything if the latest snapshot is a delete.

daksha121 · 2021-07-09T21:39:33Z

> We try not to use namespaces in the read options to keep them simple. This one is a bit odd because ignore-deletes could be misinterpreted easily -- we don't want anyone to think they can read a table without applying delete files, for example. So I want to make it clear that this is streaming and that this is not about delete files. The best I'm coming up with is skip-delete-snapshots, which is clear that it is referring to snapshot operations. I don't think it would be misinterpreted if it were used in a batch context since it wouldn't make sense to run a select that doesn't do anything if the latest snapshot is a delete.

Thanks @rdblue, that makes sense. Can you help me understand why we wouldn't want to put the word streaming in, like streaming-skip-deletes? We could perhaps eliminate the risk of misinterpretation. Maybe I'm missing something here.

rdblue · 2021-07-10T23:06:50Z

I'd be fine adding streaming-. I was just trying to keep it as small as possible. We definitely need -snapshots because it should be clear that this doesn't affect row-level deletes.

daksha121 · 2021-07-12T16:06:52Z

> I'd be fine adding streaming-. I was just trying to keep it as small as possible. We definitely need -snapshots because it should be clear that this doesn't affect row-level deletes.

Thanks @rdblue. Renamed the option to streaming-skip-delete-snapshots
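
For illustration only, a boolean read option like this one could be looked up from the string-keyed options a reader receives. The class `ReadOptionSketch` and the method `skipDeleteSnapshots` are hypothetical, and the default of `false` is an assumption, since the thread does not state the default.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadOptionSketch {
  // Option name settled on in this thread.
  static final String STREAMING_SKIP_DELETE_SNAPSHOTS = "streaming-skip-delete-snapshots";

  // Assumption: the option is off unless it is explicitly set to "true".
  static boolean skipDeleteSnapshots(Map<String, String> readOptions) {
    return Boolean.parseBoolean(readOptions.getOrDefault(STREAMING_SKIP_DELETE_SNAPSHOTS, "false"));
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    System.out.println(skipDeleteSnapshots(options)); // prints false

    options.put(STREAMING_SKIP_DELETE_SNAPSHOTS, "true");
    System.out.println(skipDeleteSnapshots(options)); // prints true
  }
}
```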

SreeramGarlapati · 2021-07-12T18:42:25Z

@rdblue - please drop me as a contributor for this work. @daksha121 was nice enough to mention me as one. I was trying to pay back her help with #2660.

rdblue · 2021-07-13T00:17:59Z

Thanks, @daksha121! I merged this.

Merge remote-tracking branch 'upstream/merge-master-20210816' into master

## What does this MR mainly address?
Merge upstream/master to bring in recent bug fixes and optimizations.

## What changes does this MR make?
Key PRs to note:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: fix the error raised when writing an empty dataset by simply skipping the write, apache#2960
> Flink UI: add uidPrefix to operators so that multiple Iceberg sink jobs are easier to track, apache#288
> Spark: fix the nested struct pruning issue, apache#2877
> Allow creating v2 format tables through table properties, apache#2887
> Add the SortRewriteStrategy framework to gradually support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties for a catalog, apache#2792
> Spark: read and write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the option to skip delete snapshots, apache#2752
> Spark V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel related: add schema-id, Core: add schema id to snapshot
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?
UT

SreeramGarlapati and others added 3 commits June 25, 2021 19:25

implement skipDelete and skipReplace (#2)

51df3d3

* implement skipDelete and skipReplace options
* revert changes in SnapshotUtil

Spark3 Streaming Read: Adding options to skip delete and replace snapshots

7bb4f90

Fixed spacing

8a6b56e

github-actions bot added the spark label Jun 29, 2021

daksha121 added 3 commits June 29, 2021 08:01

Spark3 Streaming Read: Adding options to skip delete and replace snapshots

c39065a

Fixed spacing

c3fa8ad

Merge branch 'stream.read.ignore.delete.and.replace' of https://github.com/daksha121/iceberg into stream.read.ignore.delete.and.replace

f5d00f5

SreeramGarlapati reviewed Jun 29, 2021

spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java

rdblue reviewed Jun 30, 2021

spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java

rdblue reviewed Jun 30, 2021

spark3/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java

rdblue reviewed Jun 30, 2021

spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

rdblue reviewed Jun 30, 2021

spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

Removing skipReplace option - skipping replaces by default

910849e

daksha121 changed the title ~~Spark MicroBatch read - implement skipReplace and skipDelete~~ Spark MicroBatch read - Ignore replace snapshots and add Spark option to skip delete snapshots Jun 30, 2021

Fix checkstyle errors

4b61c56