Update PROTOCOL to include change data spec #1300
Looking good. Requested minor changes. Cheers.
@@ -331,6 +347,31 @@ The following is an example `remove` action.
}
```

### Add CDC File
The protocol version information is not here.
Added two subsections in this area to clarify writer/reader requirements as you've asked. Let me know if you'd like to move it elsewhere. Change Data Feed is unlike the other features (like Column Mapping) so there's no clear precedent on how to organize it within this doc.
@tdas any last comments?
PROTOCOL.md
Outdated
@@ -100,6 +103,19 @@ By default, the reference implementation stores data files in directories that a
This directory format is only used to follow existing conventions and is not required by the protocol.
Actual partition values for a file must be read from the transaction log.

### Change Data Files
Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (e.g. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files.
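
To illustrate the layout described in this hunk, here is a minimal Python sketch of where a change data file's directory would sit. The function name `change_data_dir`, the constant `CHANGE_DATA_DIR`, and the example table root are hypothetical, and the sketch assumes the same `column=value` partition directory convention shown in the example path, without any escaping of special characters.

```python
from pathlib import PurePosixPath
from typing import Mapping

# Directory name taken from the spec text above.
CHANGE_DATA_DIR = "_change_data"


def change_data_dir(table_root: str, partition_values: Mapping[str, str]) -> str:
    """Return the directory that would hold change data files for a partition.

    Hypothetical helper: partition directories are written as column=value,
    in the order given, and values are not escaped.
    """
    path = PurePosixPath(table_root) / CHANGE_DATA_DIR
    for column, value in partition_values.items():
        path = path / f"{column}={value}"
    return str(path)


# Unpartitioned table: files go directly under _change_data.
unpartitioned = change_data_dir("/tables/events", {})
# Partitioned table: files go under _change_data/part1=value1.
partitioned = change_data_dir("/tables/events", {"part1": "value1"})
```

As with regular data files, a reader should not rely on this directory naming to recover partition values.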
Operations that only add new data should not produce separate change files.
If an operation adds no change data files, it must only add new data without deleting or updating any existing data.
Not clear on what the suggested change is. It appears to be a restatement of the same thing... also shouldn't the causality be thought of in the other direction?
Instead of
If an operation adds no change data files, it must only add new data without deleting or updating any existing data.
Isn't it:
If an operation adds only new data, without deleting or updating any existing data, it should not produce any change data files
@zsxwing let me know what you think about this last one! resolved the other comments in the latest commit otherwise.
…ge field back to the schema
PROTOCOL.md
Outdated
@@ -331,6 +347,43 @@ The following is an example `remove` action.
}
```

### Add CDC File
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
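
The reader rule in this paragraph can be sketched in Python as follows. This is only an illustration, not the reference implementation: it assumes the actions of one table version have already been parsed into single-key dictionaries (for example `{"cdc": {"path": ...}}`), and the function name `plan_change_read` and the labels `"cdc"`, `"insert"`, and `"delete"` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Action = Dict[str, Dict[str, Any]]


def plan_change_read(actions: List[Action]) -> List[Tuple[str, str]]:
    """Decide which files a change data reader should read for one version.

    Returns (path, label) pairs. The labels are hypothetical markers:
    "cdc" means the file already holds row-level changes, while "insert"
    and "delete" mean every row of the file is treated that way.
    """
    cdc_files = [(a["cdc"]["path"], "cdc") for a in actions if "cdc" in a]
    if cdc_files:
        # The version has cdc actions: use them exclusively and skip
        # the add and remove actions of this version.
        return cdc_files

    plan = []
    for action in actions:
        if "add" in action:
            plan.append((action["add"]["path"], "insert"))
        elif "remove" in action:
            plan.append((action["remove"]["path"], "delete"))
    return plan


# A version that rewrote a file and also recorded change data.
with_cdc = plan_change_read([
    {"add": {"path": "new.parquet"}},
    {"cdc": {"path": "_change_data/changes.parquet"}},
    {"remove": {"path": "old.parquet"}},
])

# A version with no cdc action.
without_cdc = plan_change_read([
    {"add": {"path": "new.parquet"}},
    {"remove": {"path": "old.parquet"}},
])
```

Other actions in the version, such as commit metadata, are ignored by this sketch.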
PROTOCOL.md
Outdated
_commit_version|`Long`| The Delta log or table version containing the change.
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created.
These two columns are not in data change files.
Should we still mention them somewhere since they are inferred by the reader at runtime? It's not clear if this should be an actual requirement of readers...
@nkarpov left a few minor comments. We are still missing a section to explain how change data readers generate results. We can do that in a follow-up PR.
Specifically, to read the row-level changes made in a version, the following strategy should be used:
1. If there are `cdc` actions in this version, then read only those to get the row-level changes, and skip the remaining `add` and `remove` actions in this version.
2. Otherwise, if there are no `cdc` actions in this version, read and treat all the rows in the `add` and `remove` actions as inserted and deleted rows, respectively.
3. The following extra columns should also be generated:
Field Name | Data Type | Description
-|-|-
_commit_version|`Long`| The table version containing the change. This can be obtained from the name of the Delta log file that contains the actions.
_commit_timestamp|`Timestamp`| The timestamp at which the commit was created. This can be obtained from the modification time of the Delta log file that contains the actions.
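
As a rough illustration of this suggestion, the two columns could be derived from a log file on a local filesystem as sketched below. The helper name `commit_metadata` is hypothetical, and the sketch assumes the log file name is the zero-padded version number followed by `.json` and that the file's modification time is an acceptable stand-in for the commit time.

```python
import os
from datetime import datetime, timezone
from typing import Tuple


def commit_metadata(log_file_path: str) -> Tuple[int, datetime]:
    """Derive (_commit_version, _commit_timestamp) from a Delta log file.

    Hypothetical helper: the version is parsed from the file name and the
    timestamp is the file's modification time, expressed in UTC.
    """
    file_name = os.path.basename(log_file_path)
    # Assumed naming: "00000000000000000010.json" is version 10.
    version = int(file_name.split(".")[0])
    modified = os.path.getmtime(log_file_path)
    timestamp = datetime.fromtimestamp(modified, tz=timezone.utc)
    return version, timestamp
```

On object stores the modification time would come from the store's listing API rather than `os.path.getmtime`, so this part would need to be adapted.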
Description
Update PROTOCOL.md to include change data file spec.
I think it's possible to consider these new change files as "data files", but I've documented them as their own file type to start because they do not represent the actual table data the same way `add` and `remove` files do.
How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
Yes. This PR introduces changes to the documentation of the Delta Lake protocol.