[1105] Change Data Feed - MERGE command #1155

allisonport-db · 2022-05-27T01:07:53Z

See the project plan at #1105.

This PR adds CDF to the MERGE command.

Merge is implemented in two ways.

- Insert-only merges. For these we don't need to do anything special, since we only write AddFiles with the new rows.
  - However, our current implementation of insert-only merges doesn't correctly update the metric numTargetRowsInserted, which is used to check for data changes in CDCReader. This PR fixes that.
- For all other merges, we generate CDF rows for inserts, updates, and deletions. We do this by generating expression sequences for CDF outputs (e.g. preimage, insert, etc.) on a clause-by-clause basis. We apply these to the rows in our joinedDF in addition to our existing main data output sequences.
  - Changes made to JoinedRowProcessor make column ROW_DELETED_COL unnecessary, so this PR removes it.

Tests are added in MergeCDCSuite.
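
For readers skimming the description, the non-insert-only path can be pictured with a small model. This is only a sketch, not the PR's code: `Action`, `OutRow`, and `outputsFor` are hypothetical names, plain Scala values stand in for Catalyst expressions, and using `None` to mark rows destined for the main table is an assumption. The change-type strings are the ones Change Data Feed exposes in its `_change_type` column.

```scala
object MergeCdfSketch {
  // Hypothetical stand-ins for the merge clause that fired on a joined row.
  sealed trait Action
  case object Insert extends Action
  case object Update extends Action
  case object Delete extends Action

  // changeType = None marks a main-table row; Some(...) marks a CDF row.
  final case class OutRow(value: Option[Int], changeType: Option[String])

  // For one joined row, emit the main-table output plus the CDF outputs.
  def outputsFor(action: Action, target: Option[Int], source: Option[Int]): Seq[OutRow] =
    action match {
      case Insert =>
        Seq(OutRow(source, None), OutRow(source, Some("insert")))
      case Update =>
        Seq(
          OutRow(source, None),
          OutRow(target, Some("update_preimage")),
          OutRow(source, Some("update_postimage")))
      case Delete =>
        // In this model no main-table row survives a delete.
        Seq(OutRow(target, Some("delete")))
    }
}
```

In this model an update yields one main-table row and two CDF rows, which is why a single input row can map to several output rows.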


tdas · 2022-06-02T08:29:06Z

core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala

+    // performing the final write, and the increment column will always be dropped after executing
+    // the metrics UDF.
+
+    // We produce both rows for the CDC_TYPE_NOT_CDC partition to be written to the main table,


It's hard to understand what "partition" means in the immediate context of this code. Please rewrite it differently.

Updated. LMK if it's more clear.

GitOrigin-RevId: 5a9774228ff1c6943483868d8e4ddf5bc27aff45

GitOrigin-RevId: 0d05065e70aeceb537c03c14d85bb56dcce174c1

GitOrigin-RevId: 068f4a35a2266715c9776638eb8220e5976b6324

allisonport-db · 2022-06-02T21:38:17Z

core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala

+        row.getBoolean(
+          outputRowEncoder.schema.getFieldIndex(ROW_DROPPED_COL)
+            .getOrElse(outputRowEncoder.schema.fields.size)
+        )


Note, this is simply mimicking the prior implementation when CDC is disabled.

Another solution is to have outputRowEncoder include ROW_DROPPED_COL when CDC is disabled. It will be dropped on line 684 regardless. I'm not sure about the tradeoff with respect to decoding a column we don't need.

This is a moot discussion now, right? You have to use ROW_DROPPED_COL to get the metrics right... right?

This is just about how we get the index of ROW_DROPPED_COL. This function could be simplified to

row.getBoolean(outputRowEncoder.schema.fieldIndex(ROW_DROPPED_COL))

if we always include ROW_DROPPED_COL in outputRowEncoder
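
To make the two lookups in this thread concrete, here is a minimal sketch. It does not use Spark: `Schema` is a hypothetical stand-in that only mirrors the shape of `StructType.getFieldIndex` (returns an `Option`) and `StructType.fieldIndex` (throws when the field is missing), and the idea that the flag sits just past the encoder's schema when it is not part of it is an assumption read off the fallback in the diff.

```scala
object FieldIndexSketch {
  // Stand-in for a schema: just the ordered column names.
  final case class Schema(names: Seq[String]) {
    // Mirrors the Option-returning lookup used in the diff.
    def getFieldIndex(name: String): Option[Int] = {
      val i = names.indexOf(name)
      if (i >= 0) Some(i) else None
    }
    // Mirrors the direct lookup, which fails when the column is absent.
    def fieldIndex(name: String): Int =
      getFieldIndex(name).getOrElse(
        throw new IllegalArgumentException(s"$name does not exist"))
  }

  val RowDroppedCol = "_row_dropped_"

  // Current approach: fall back to the position just past the schema.
  def droppedIndexWithFallback(schema: Schema): Int =
    schema.getFieldIndex(RowDroppedCol).getOrElse(schema.names.size)

  // Proposed simplification: valid only if the column is always in the schema.
  def droppedIndexDirect(schema: Schema): Int =
    schema.fieldIndex(RowDroppedCol)
}
```

The direct form is shorter but throws if the column is ever left out of the schema, so it only works once ROW_DROPPED_COL is always included.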

allisonport-db · 2022-06-02T21:45:54Z

core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala

      notMatchedConditions: Seq[Expression],
-      notMatchedOutputs: Seq[Seq[Expression]],
+      notMatchedOutputs: Seq[Seq[Seq[Expression]]],
      noopCopyOutput: Seq[Expression],
      deleteRowOutput: Seq[Expression],


We don't actually need this. It can be done the way it is in https://github.com/allisonport-db/delta/blob/02a238e6666e31cc74ea1dbda12842ce929de4d6/core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L860, such that we simply do not create an output row. Not sure which is clearer to readers.

I don't get what you are referring to here. Did the thread get lost?

Sorry, this might be hard to explain in writing.

Basically, since processRow now returns an Iterator[InternalRow] instead of just an InternalRow, we no longer need an expression that creates a deleteRowOutput row that we later delete. We could simply omit that inputRow from the returned iterator.

It's implemented that way in the above linked commit, before I added back ROW_DROPPED_COL

This is more a question of readability I think... not sure if either way is preferred to the other
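
The two options discussed in this thread can be compared in a toy model. This is a sketch under assumptions, not the PR's code: `In`, `Out`, and both `processRow...` functions are hypothetical, and a Boolean field stands in for ROW_DROPPED_COL.

```scala
object DropRowSketch {
  final case class In(id: Int, delete: Boolean)
  final case class Out(id: Int, dropped: Boolean)

  // Option A: always emit a row, flag deletions, and filter them out later.
  def processRowWithFlag(in: In): Iterator[Out] =
    Iterator(Out(in.id, dropped = in.delete))

  // Option B: emit nothing at all for a deleted row.
  def processRowOmitting(in: In): Iterator[Out] =
    if (in.delete) Iterator.empty else Iterator(Out(in.id, dropped = false))

  // Surviving ids under option A (explicit filter on the flag).
  def runA(rows: Seq[In]): List[Int] =
    rows.iterator.flatMap(processRowWithFlag).filterNot(_.dropped).map(_.id).toList

  // Surviving ids under option B (no filter needed).
  def runB(rows: Seq[In]): List[Int] =
    rows.iterator.flatMap(processRowOmitting).map(_.id).toList
}
```

Both produce the same surviving rows. The difference is that with the flag a later step can still see the dropped row, which may be why the column was added back.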

GitOrigin-RevId: e9dad844f161fabb6b2f59725eaa6a64c3380b90


tdas · 2022-06-03T03:10:05Z

core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala

@@ -748,14 +837,16 @@ object MergeIntoCommand {
  val FILE_NAME_COL = "_file_name_"
  val SOURCE_ROW_PRESENT_COL = "_source_row_present_"
  val TARGET_ROW_PRESENT_COL = "_target_row_present_"
+  val ROW_DROPPED_COL = "_row_dropped_"
+  val INCR_ROW_COUNT_COL = "_incr_row_count_"

  class JoinedRowProcessor(


Missed this last time. Definitely add param docs. The triple sequence is hella confusing. Honestly, I should have added param docs when I originally implemented this.

Added param docs. An overview of what JoinedRowProcessor is doing may also be helpful; what do you think? I can add it tomorrow.
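
One way to read the triple sequence is with type aliases. This is a sketch of an assumed reading, not the actual param docs: the alias names are hypothetical, a plain function stands in for a Catalyst `Expression`, and the nesting order (clause, then output row, then column) is inferred from the PR description.

```scala
object NestedOutputsSketch {
  type Row = Map[String, Int]
  type Expr = Row => Int               // stand-in for a Catalyst Expression
  type OutputRow = Seq[Expr]           // one expression per output column
  type ClauseOutputs = Seq[OutputRow]  // every row a single clause may emit
  type AllOutputs = Seq[ClauseOutputs] // one entry per clause, in clause order

  // Evaluate every output row of the chosen clause against one input row.
  def evaluate(outputs: AllOutputs, clauseIndex: Int, input: Row): Seq[Seq[Int]] =
    outputs(clauseIndex).map(_.map(expr => expr(input)))
}
```

Under this reading, `Seq[Seq[Seq[Expression]]]` is "for each clause, for each row it emits, the expressions for each column".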


GitOrigin-RevId: 201a8e10b7a1c9fd76a1ad24f4d9312574ad7e5d

tdas

This looks great now!

See the project plan at delta-io#1105.

This PR adds CDF to the `MERGE` command.

Merge is implemented in two ways.

- Insert-only merges. For these we don't need to do anything special, since we only write `AddFile`s with the new rows.
  - However, our current implementation of insert-only merges doesn't correctly update the metric `numTargetRowsInserted`, which is used to check for data changes in [CDCReader](https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/commands/cdc/CDCReader.scala#L313). This PR fixes that.
- For all other merges, we generate CDF rows for inserts, updates, and deletions. We do this by generating expression sequences for CDF outputs (e.g. preimage, insert, etc.) on a clause-by-clause basis. We apply these to the rows in our joinedDF in addition to our existing main data output sequences.
  - Changes made to `JoinedRowProcessor` make column `ROW_DELETED_COL` unnecessary, so this PR removes it.

Tests are added in `MergeCDCSuite`.

Closes delta-io#1155

GitOrigin-RevId: 0386c6ff811abe433644b5f5f46a3c7d51001740

allisonport-db added 8 commits May 26, 2022 14:47

initial impl

a709567

GitOrigin-RevId: eb88ef7d39632e6559f7b14fd71fc93a40fcf901

cdc enabled

eed7556

GitOrigin-RevId: d9b7f8a1e0844f9ce97f0aaf5533a14c5e4e712b

minor updates

d775507

GitOrigin-RevId: f926e5913b622933ddf2e69b243498d91ce27695

comments/docs

5b980a6

GitOrigin-RevId: 3f03422b890f065efe30b6552adc4f98cb123f8c

remove unnecessary ROW_DROPPED_COL

0ba5787

GitOrigin-RevId: 2a8d18e0b14177db9418807c3ef5e99ce3042442

fix inserted row count in insert only merges

497abc9

GitOrigin-RevId: 3fe9b5daa0ed1c5db004a7f0302c94b7291fc726

import order

92bf666

GitOrigin-RevId: eb0a83a0fad86e65f8ddbb3adc21e247f2a17820

scalastyle

cac8bfd

GitOrigin-RevId: d6a54de385b0a50cc8687d6bf945083aecbfef93

allisonport-db requested review from tdas and scottsand-db May 27, 2022 01:08

don't include cdc column when cdc disabled

a6ecfb5

GitOrigin-RevId: cda01c70baa0a3cf0df8040774ea2e3e151cf97a

scottsand-db requested changes Jun 1, 2022


numTargetChangeFileBytes

874569a

GitOrigin-RevId: d5233be73d337eab9225e957351d57e3ca49c3a4