Skip processing snapshots of type Overwrite during readStream #3517
Conversation
https://github.com/apache/iceberg/projects/2

I stumbled on the exact same change made by @kbendick: #3267

Hi @SreeramGarlapati @kbendick,
| "Cannot process delete snapshot : %s. Set read option %s to allow skipping snapshots of type delete", | ||
| snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS); | ||
| return false; | ||
| case DataOperations.OVERWRITE: |
I don't think that this should use the same configuration to skip deletes and overwrites. Overwrites are different and I think that we should at a minimum have a different property. I would also prefer to have some additional clarity on how we plan to eventually handle this. We could skip overwrites, but what about use cases where they are probably upserts? What about when they're created by copy-on-write MERGE operations?
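For concreteness, here is a minimal sketch of what separate handling per snapshot type could look like. The class name and boolean fields are made up for illustration, and `STREAMING_SKIP_OVERWRITE_SNAPSHOTS` is the separate option this thread converges on; this is not the merged implementation:

```java
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.spark.SparkReadOptions;

// Sketch only: one boolean per snapshot type, each parsed from its own
// read option, so opting into skipping deletes does not silently skip
// overwrites as well.
class SnapshotFilterSketch {
  private final boolean skipDelete;     // from STREAMING_SKIP_DELETE_SNAPSHOTS
  private final boolean skipOverwrite;  // from STREAMING_SKIP_OVERWRITE_SNAPSHOTS

  SnapshotFilterSketch(boolean skipDelete, boolean skipOverwrite) {
    this.skipDelete = skipDelete;
    this.skipOverwrite = skipOverwrite;
  }

  boolean shouldProcess(Snapshot snapshot) {
    switch (snapshot.operation()) {
      case DataOperations.APPEND:
        return true;
      case DataOperations.DELETE:
        Preconditions.checkState(skipDelete,
            "Cannot process delete snapshot: %s. Set read option %s to allow skipping snapshots of type delete",
            snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
        return false;
      case DataOperations.OVERWRITE:
        Preconditions.checkState(skipOverwrite,
            "Cannot process overwrite snapshot: %s. Set read option %s to allow skipping snapshots of type overwrite",
            snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_OVERWRITE_SNAPSHOTS);
        return false;
      default:
        throw new IllegalStateException("Unsupported snapshot operation: " + snapshot.operation());
    }
  }
}
```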
+1. I would prefer they be two separate configs, but also that we have a plan for the longer term to handle sending out these row deltas.
I'd be ok with getting a PR in to ignore OVERWRITE, but this isn't something we should ignore in the longer term (or even really the near-to-medium term) as others have mentioned.
Personally, I would consider using a schema similar to the delta.io change data feed: a dataframe with the before image / after image (the row before and after the update) plus the type of operation for each row (insert, delete, update_before, update_after).
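For illustration only, a change-feed schema in that style might look like the following. The column names follow delta.io's convention, and none of this is existing Iceberg API:

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ChangeFeedSchemaSketch {
  public static void main(String[] args) {
    // Hypothetical change-feed schema: the table's own columns plus one
    // metadata column naming the per-row operation (delta.io's change data
    // feed uses insert / delete / update_preimage / update_postimage).
    StructType changeFeedSchema = new StructType()
        .add("id", DataTypes.LongType)        // example table column
        .add("data", DataTypes.StringType)    // example table column
        .add("_change_type", DataTypes.StringType);
    System.out.println(changeFeedSchema.treeString());
  }
}
```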
I connected with @SreeramGarlapati to contribute on this PR.
I have added a separate config to skip overwrites. I will discuss with @SreeramGarlapati and update this thread with the plan for eventually handling upserts.
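For reference, a usage sketch with the two separate opt-ins. The option keys are assumed to be the kebab-case string values behind `SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS` and `STREAMING_SKIP_OVERWRITE_SNAPSHOTS`, and the table name is made up:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SkipSnapshotsReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("skip-snapshots-sketch")
        .getOrCreate();

    // Each snapshot type now has its own opt-in; leaving one unset still
    // fails fast when a snapshot of that type is encountered.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .option("streaming-skip-delete-snapshots", "true")
        .option("streaming-skip-overwrite-snapshots", "true")
        .load("db.table");

    stream.printSchema();
  }
}
```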
All in all, there are 2 options for reading upserts:

- For updates written with copy-on-write, a new data file is created that contains both the old rows and the new updated rows. In this case, we can take a Spark option from the user as consent that they are okay with data replay.
- For updates written with merge-on-read, we will expose an option to read a change data feed, which will include a metadata column indicating whether a record is an INSERT vs. a DELETE.

Did this make sense, @rdblue & @kbendick?
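A small sketch of why copy-on-write updates imply data replay for a streaming reader; the table, rows, and predicate here are invented for illustration:

```java
import org.apache.spark.sql.SparkSession;

public class CopyOnWriteReplaySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Suppose data file A holds rows (1, 'a') and (2, 'b').
    // A copy-on-write update rewrites file A into a new file B holding
    // (1, 'a') and (2, 'b2'), committed as an OVERWRITE snapshot:
    spark.sql("UPDATE db.table SET data = 'b2' WHERE id = 2");

    // A stream that processes that OVERWRITE snapshot re-emits (1, 'a')
    // even though it never changed - this is the "data replay" the user
    // would be consenting to via the proposed read option.
  }
}
```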
| "Cannot process overwrite snapshot : %s. Set read option %s to allow skipping snapshots of type overwrite", | ||
| snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS); | ||
| return false; | ||
| default: |
Should add a test for this new conf option
I have added the unit test.
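A minimal sketch of the kind of test meant here, assuming the harness in TestStructuredStreamingRead3. Every helper method below is a hypothetical stand-in, stubbed out so the sketch compiles:

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.junit.Assert;
import org.junit.Test;

public class SkipOverwriteOptionTestSketch {

  @Test
  public void testReadStreamSkipsOverwriteSnapshots() throws Exception {
    appendRows();      // commits an APPEND snapshot
    overwriteRows();   // commits an OVERWRITE snapshot
    appendRows();      // commits another APPEND snapshot

    // With the new option set, the stream should return only the appended
    // rows instead of failing on the OVERWRITE snapshot.
    Dataset<Row> stream = startStreamWithOption("streaming-skip-overwrite-snapshots", "true");
    List<Row> actual = collectAvailableRows(stream);
    Assert.assertEquals(expectedAppendedRows(), actual);
  }

  // Hypothetical harness helpers, standing in for the table-writing and
  // stream-collecting utilities in TestStructuredStreamingRead3.
  private void appendRows() { throw new UnsupportedOperationException("sketch"); }
  private void overwriteRows() { throw new UnsupportedOperationException("sketch"); }
  private Dataset<Row> startStreamWithOption(String key, String value) {
    throw new UnsupportedOperationException("sketch");
  }
  private List<Row> collectAvailableRows(Dataset<Row> stream) {
    throw new UnsupportedOperationException("sketch");
  }
  private List<Row> expectedAppendedRows() { throw new UnsupportedOperationException("sketch"); }
}
```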
Hi @rdblue @kbendick @RussellSpitzer, requesting your review.
Merged. Thanks @SreeramGarlapati and @rajarshisarkar!
Co-authored-by: Rajarshi Sarkar <srajars@amazon.com>
Delete operations on an Iceberg table can translate to snapshots of type DataOperations.DELETE or DataOperations.OVERWRITE:

- DataOperations.DELETE - when the delete operation on the table translates to full file deletes (for example, deletes wiping out partitions using partition-key-level predicates).
- DataOperations.OVERWRITE - the most common case of delete.

@rdblue / @aokolnychyi / @RussellSpitzer / @kbendick - please let me know what you folks think; once we achieve principle-level alignment, I will add unit tests.
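To make the distinction concrete, a hedged illustration with an invented table: a partition-aligned delete can drop whole data files, while a row-level delete on a copy-on-write table rewrites the surviving rows into a new file:

```java
import org.apache.spark.sql.SparkSession;

public class DeleteSnapshotTypesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // A delete aligned with partition boundaries can drop whole data files,
    // committing a DataOperations.DELETE snapshot:
    spark.sql("DELETE FROM db.events WHERE event_date = '2021-11-01'");

    // A row-level delete inside a file forces the remaining rows to be
    // rewritten into a new file, committing a DataOperations.OVERWRITE
    // snapshot instead:
    spark.sql("DELETE FROM db.events WHERE user_id = 42");
  }
}
```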
DataOperations.DELETEorDataOperations.OVERWRITE.DataOperations.DELETE- when the delete operation on the table translates to full file deletes (for ex: deletes wiping out partitions using partitionKey level predicates)DataOperations.OVERWRITE- is the most common case of Delete.@rdblue / @aokolnychyi / @RussellSpitzer / @kbendick - pl. lemme know what you folks think & once we achieve principle level alignment - I will add unittests.