Support for Change Data Feed in Delta Lake #1105
Labels: enhancement (New feature or request)
Comments
scottsand-db added a commit that referenced this issue on May 12, 2022:
See the project plan at #1105. This PR adds CDF write functionality to the DELETE command, as well as a test suite `DeleteCDCSuite`. At a high level, during the DELETE command, when we realize that we need to delete some rows in some files (but not the entire file), then instead of creating a new DataFrame which contains only the non-deleted rows (so that, in this new Delta version, the removed rows are logically deleted), we partition the DataFrame into CDF-deleted rows and non-deleted rows. Then, when writing out the parquet files, we write the CDF-deleted rows into their own CDF parquet file, and we write the non-deleted rows into a standard main-table parquet file (same as usual). We also add an extra `AddCDCFile` action to the transaction log. Closes #1125. GitOrigin-RevId: 7934de886589bf3d70ce81dcf9d7de598e35fb2e
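A minimal, hedged sketch of the row-splitting idea described above. The function name, the string-typed condition, and the `_change_type` column name are illustrative assumptions, not the actual implementation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{expr, lit, not}

// Split the rows of a file touched by a DELETE into the rows that survive
// (rewritten into a normal main-table parquet file) and the rows that were
// removed (tagged and written into a separate CDF parquet file).
def splitForDeleteCdf(fileRows: DataFrame, deleteCondition: String): (DataFrame, DataFrame) = {
  val remainingRows = fileRows.filter(not(expr(deleteCondition)))
  val deletedCdfRows = fileRows
    .filter(expr(deleteCondition))
    .withColumn("_change_type", lit("delete"))
  (remainingRows, deletedCdfRows)
}
```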
scottsand-db added a commit that referenced this issue on May 18, 2022:
See the project plan at #1105. This PR adds the DataFrame API for CDF as well as a new test suite to test this API. The API includes the options "startingVersion", "startingTimestamp", "endingVersion", "endingTimestamp", and "readChangeFeed". It also includes miscellaneous other CDF improvements, like extra schema checks during OptTxn write and returning a CDF relation in the DeltaLog::createRelation method. Closes #1132. GitOrigin-RevId: 7ffafc6772fc314064971d65d9e7946b7a01de64 GitOrigin-RevId: b901d21804fe7aaecd6bb2e03cb33c76e19ae2ad
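A hedged usage example of these options using the timestamp variants; the table name "source", the timestamp literals, and the `spark` SparkSession are illustrative assumptions:

```scala
// Read the change feed between two timestamps (both bounds inclusive).
val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingTimestamp", "2022-05-01 00:00:00")
  .option("endingTimestamp", "2022-05-18 00:00:00")
  .table("source")
```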
scottsand-db added a commit that referenced this issue on May 25, 2022:
See the project plan at #1105. This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data. This PR also adds UpdateCDCSuite, which adds basic tests for these CDF changes. As a high-level overview of how this CDF-update operation is performed: when we find a row that satisfies the update condition, we `explode` an array containing the pre-image, post-image, and main-table updated rows. The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE `null`. Thus, the first two rows are written to the CDF parquet file, while the latter is written to the standard main-table data parquet file. Closes #1146. GitOrigin-RevId: 47413c5345bb97c0e1303a7f4d4d06b89c35ab7a
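A minimal, hedged sketch of that explode-based expansion, assuming a table with columns `id` and `value`, an update that rewrites `value`, and a `_change_type` column for the CDF type; these names are illustrative, not the actual internal ones:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Expand each matched row into three rows: pre-image, post-image, and the
// main-table updated row (CDF type null). `newValue` must have the same data
// type as the existing `value` column so the array() of structs lines up.
def expandUpdatedRows(matchedRows: DataFrame, newValue: Column): DataFrame = {
  matchedRows
    .select(explode(array(
      struct(col("id"), col("value"), lit("update_preimage").as("_change_type")),
      struct(col("id"), newValue.as("value"), lit("update_postimage").as("_change_type")),
      struct(col("id"), newValue.as("value"), lit(null).cast("string").as("_change_type"))
    )).as("row"))
    .select("row.*")
}
```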
allisonport-db added a commit that referenced this issue on Jun 4, 2022:
See the project plan at #1105. This PR adds CDF to the `MERGE` command. Merge is implemented in two ways.
- Insert-only merges. For these we don't need to do anything special, since we only write `AddFile`s with the new rows.
- However, our current implementation of insert-only merges doesn't correctly update the metric `numTargetRowsInserted`, which is used to check for data changes in [CDCReader](https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/commands/cdc/CDCReader.scala#L313). This PR fixes that.
- For all other merges, we generate CDF rows for inserts, updates, and deletions. We do this by generating expression sequences for CDF outputs (i.e. preimage, insert, etc.) on a clause-by-clause basis. We apply these to the rows in our joinedDF in addition to our existing main data output sequences.
- Changes made to `JoinedRowProcessor` make column `ROW_DELETED_COL` unnecessary, so this PR removes it.

Tests are added in `MergeCDCSuite`. Closes #1155. GitOrigin-RevId: 0386c6ff811abe433644b5f5f46a3c7d51001740
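A hedged illustration of what a clause-by-clause CDF output could look like for a single "when matched then update" clause, assuming target columns `id` and `value` and a source-provided `newValue`; the sequence and column names are illustrative, not the ones used by the merge code:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Main data output for the clause: the updated row, written to the data file.
val mainDataOutput: Seq[Column] =
  Seq(col("id"), col("newValue").as("value"))

// Extra CDF outputs for the same clause: each joined row matching the clause
// is also projected through these sequences to produce the change rows.
val preimageOutput: Seq[Column] =
  Seq(col("id"), col("value"), lit("update_preimage").as("_change_type"))
val postimageOutput: Seq[Column] =
  Seq(col("id"), col("newValue").as("value"), lit("update_postimage").as("_change_type"))
```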
Hi @scottsand-db, I noticed that batch reads are already supported on the Databricks platform. Is the limitation of the TableValueFunctions class mentioned above solved, or is that an exclusive Databricks feature?
Overview
This is the official issue to track interest, feature requests, and progress being made on Change Data Feed in Delta Lake. This feature is part of the Operation Enhancements section of the Delta Lake 2022 H1 Roadmap with a target of 2022 Q3.
Requirements
The aim of this project is to allow Delta tables to produce change data capture (CDC), a standard pattern for reading changing data from a table. CDC data consists of an underlying row plus metadata indicating whether the row was added, deleted, updated to, or updated from. That is, an update to a row will produce two CDC events: one containing the row preimage from before the update and one containing the row postimage after the update is applied.
When a CDC read is requested, we'll need to look at the Delta log entry for every DML operation and output the data as CDC events rather than raw data.
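A minimal, self-contained sketch of that per-commit decision; the case classes below are simplified stand-ins for the Delta log actions, not the real classes, and the derivation rule is an assumption about the general approach:

```scala
// Simplified stand-ins for Delta log actions (illustrative only).
sealed trait Action
case class AddFile(path: String, dataChange: Boolean) extends Action
case class RemoveFile(path: String, dataChange: Boolean) extends Action
case class AddCDCFile(path: String) extends Action

// Where the CDC events for one commit come from: explicit CDC files if the
// command wrote them, otherwise events derived from added/removed data files.
sealed trait CdcSource
case class ExplicitCdcFiles(paths: Seq[String]) extends CdcSource
case class DerivedFromDataFiles(inserts: Seq[String], deletes: Seq[String]) extends CdcSource

def cdcSourceForCommit(actions: Seq[Action]): CdcSource = {
  val cdcFiles = actions.collect { case a: AddCDCFile => a.path }
  if (cdcFiles.nonEmpty) {
    ExplicitCdcFiles(cdcFiles)
  } else {
    DerivedFromDataFiles(
      inserts = actions.collect { case a: AddFile if a.dataChange => a.path },
      deletes = actions.collect { case r: RemoveFile if r.dataChange => r.path })
  }
}
```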
Design Sketch
Please see the official design doc here.
API
We're adding new syntax to existing Spark interfaces for reading Delta tables, so there are no additional considerations involved here: CDC just reflects a different view of data that was already available through the same interfaces.
Enabling CDC for a Delta table
To enable CDC for a table, a table property is set on that table. Versions written after the property is set will start recording change data.
This property can be set at the point of table creation as well.
Additionally, this property can be set for all new tables by default.
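A hedged sketch of what enabling this could look like, assuming the property name `delta.enableChangeDataFeed` and the session conf `spark.databricks.delta.properties.defaults.enableChangeDataFeed` (both taken from later Delta Lake releases, not from this design doc) and an existing SparkSession named `spark`:

```scala
// Enable CDF on an existing table; versions written after this start recording change data.
spark.sql("ALTER TABLE source SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

// The property can also be set at table creation time.
spark.sql(
  """CREATE TABLE new_table (id INT, value STRING)
    |USING DELTA
    |TBLPROPERTIES (delta.enableChangeDataFeed = true)""".stripMargin)

// Or enabled by default for all new tables via a session conf.
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")
```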
APIs for accessing Change Data
This section describes the APIs that will allow a user to access the changed data.
DataFrame API (Scala/Python)
The user provides startingVersion and endingVersion as options, and also sets readChangeFeed as an option.
spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", startingVersion)
  .option("endingVersion", endingVersion)
  .table("source")
Note:
For the timestamp variants, we would provide startingTimestamp and endingTimestamp instead.
The starting and ending versions and timestamps are inclusive, to be in line with the other time travel APIs.
The same API can be used with the DataStreamReader as well:
spark.readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", startingVersion)
  .table("source")
For streaming use cases, endingVersion is not required.
If startingVersion is not provided, the table should load from the earliest available version. A `latest` starting version should also be supported.
SQL API
Currently, we do not plan to support a SQL API due to a limitation in Apache Spark's TableValueFunctions class. Eventually, we'd like the API to be the following.
Project Plan
- readChangeFeed
- streaming API