[1105] Change Data Feed - PR 2 - DELETE command #1125
Conversation
- correctly calculate numCopiedRows metric in DeleteCommand
- add test in SchemaUtilsSuite
- add metrics/stats to OptTxn and DeleteCmd
- added committer.changeFiles to OptTxn
- Add DeleteCDCSuite; add CDC codes in SchemaUtils
- add cdc deletes to DeleteCommand.scala
- perform CDC partitioning in TransactionalWrite

GitOrigin-RevId: 7465d9aee506b29cfce5ed6cd9b8f86288b34f11
core/src/main/scala/org/apache/spark/sql/delta/OptimisticTransaction.scala
core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteCommand.scala
```scala
    numFilesToRewrite: Long): Seq[FileAction] = {
  val writeCdc = DeltaConfigs.CHANGE_DATA_FEED.fromMetaData(txn.metadata)
  // ...
  val numTouchedRows = metrics("numTouchedRows")
```
what does "touched" mean in this case?
The total number of rows we have seen, i.e. rows that we are either copying or deleting (the sum of both).
Please add an inline comment; this is not obvious.
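Such a comment might look like the following sketch; the wording and the derived-metric helper are illustrative assumptions, not the committed code:

```scala
// Sketch only: "touched" rows are all rows read from the files being
// rewritten, i.e. rows copied unchanged into new files plus rows deleted.
// The copied-row count can therefore be derived from the other two metrics.
object DeleteMetricsSketch {
  def numCopiedRows(numTouchedRows: Long, numDeletedRows: Long): Long =
    numTouchedRows - numDeletedRows
}
```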
core/src/main/scala/org/apache/spark/sql/delta/schema/SchemaUtils.scala
respond to PR comments
core/src/main/scala/org/apache/spark/sql/delta/files/TransactionalWrite.scala
GitOrigin-RevId: 316cb8b05e9f87c3c4c377b6608f0eb237cd60ca
See the project plan at #1105.

This PR adds CDF write functionality to the DELETE command, along with a test suite, `DeleteCDCSuite`. At a high level, when the DELETE command needs to delete some rows in a file (but not the entire file), then instead of creating a new DataFrame that contains just the non-deleted rows (thus, in the new Delta version, the previous rows are logically deleted), we partition the DataFrame into CDF-deleted rows and non-deleted rows.

Then, when writing out the parquet files, we write the CDF-deleted rows into their own CDF parquet file, and we write the non-deleted rows into a standard main-table parquet file (same as usual). We then also add an extra `AddCDCFile` action to the transaction log; see the sketch after this description.

Closes delta-io#1125.

GitOrigin-RevId: 7934de886589bf3d70ce81dcf9d7de598e35fb2e
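To make that flow concrete, here is a minimal, hedged sketch of both steps, assuming a `_change_type` marker column as in the CDF protocol. The object, the method names, and the stand-in action types below are hypothetical simplifications, not the real Delta classes:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{lit, not}

// Simplified stand-ins for the Delta Lake file actions (the real classes in
// org.apache.spark.sql.delta.actions carry partition values, stats, etc.).
sealed trait FileAction { def path: String }
case class AddFile(path: String, size: Long) extends FileAction
case class AddCDCFile(path: String, size: Long) extends FileAction

object DeleteCdcSketch {
  // Step 1 (hypothetical): split the rows of a file being rewritten into
  // CDF "delete" rows and rows to copy, tagged via a change-type column.
  def partitionForCdc(fileData: DataFrame, deleteCondition: Column): DataFrame = {
    val cdcDeletedRows = fileData
      .where(deleteCondition)
      .withColumn("_change_type", lit("delete"))
    val copiedRows = fileData
      .where(not(deleteCondition))
      .withColumn("_change_type", lit(null).cast("string")) // not change data
    cdcDeletedRows.union(copiedRows)
  }

  // Step 2 (hypothetical): after the writer lands each parquet file, commit
  // change-data files as AddCDCFile so main-table readers never see them.
  def toActions(writtenFiles: Seq[(String, Long, Boolean)]): Seq[FileAction] =
    writtenFiles.map { case (path, size, isCdcFile) =>
      if (isCdcFile) AddCDCFile(path, size) else AddFile(path, size)
    }
}
```

In the actual PR the split happens inside `DeleteCommand` and the CDC-aware partitioning lives in `TransactionalWrite`, per the commit list above; the sketch only illustrates the shape of the data flow.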