Update PROTOCOL to include change data spec #1300

Closed
wants to merge 11 commits into from

Collaborator

@nkarpov nkarpov commented Jul 29, 2022

Description

Update PROTOCOL.md to include change data file spec.

I think it's possible to consider these new change files as "data files", but I've documented them as their own file type to start because they do not represent the actual table data the same way add and remove files do.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

Yes. This PR introduces changes to the documentation of the Delta Lake protocol.

@nkarpov nkarpov requested review from tdas and scottsand-db July 29, 2022 22:00
Collaborator

@scottsand-db scottsand-db left a comment

Looking good. Requested minor changes. Cheers.

@@ -331,6 +347,31 @@ The following is an example `remove` action.
}
```

### Add CDC File
Contributor

The protocol version information is not here.

Collaborator Author

Added two subsections in this area to clarify writer/reader requirements as you've asked. Let me know if you'd like to move it elsewhere. Change Data Feed is unlike the other features (like Column Mapping) so there's no clear precedent on how to organize it within this doc.

@scottsand-db
Collaborator

@tdas any last comments?

@scottsand-db scottsand-db requested a review from tdas September 13, 2022 17:56
PROTOCOL.md Outdated
@@ -100,6 +103,19 @@ By default, the reference implementation stores data files in directories that a
This directory format is only used to follow existing conventions and is not required by the protocol.
Actual partition values for a file must be read from the transaction log.

### Change Data Files
Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files.
Member

> Operations that only add new data should not produce separate change files.

If an operation adds no change data files, it must only add new data without deleting or updating any existing data.

Collaborator Author

Not clear on what the suggested change is. It appears to be a restatement of the same thing. Also, shouldn't the causality run in the other direction?

Instead of

> If an operation adds no change data files, it must only add new data without deleting or updating any existing data.

isn't it:

> If an operation adds only new data, without deleting or updating any existing data, it should not produce any change data files.

Collaborator Author

@zsxwing let me know what you think about this last one! Otherwise, I resolved the other comments in the latest commit.

PROTOCOL.md Outdated
@@ -331,6 +347,43 @@ The following is an example `remove` action.
}
```

### Add CDC File
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
Member

Suggested change:
- The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
+ The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
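
As a rough illustration of where the file added by a `cdc` action would live, the sketch below builds a table-relative path under the `_change_data` directory, following the partition layout from the Change Data Files section of this PR (`_change_data/part1=value1/...`). The helper name `change_data_path` and the example file name are hypothetical, and the sketch assumes partition values need no escaping.

```python
from pathlib import PurePosixPath

# Directory name taken from the spec text under review.
CHANGE_DATA_DIR = "_change_data"


def change_data_path(partition_values, file_name):
    """Return the table-relative path of a change data file.

    partition_values: mapping of partition column -> value, in partition
        order (empty for unpartitioned data).
    file_name: hypothetical file name; the spec text quoted in this PR
        does not prescribe a naming scheme.
    """
    path = PurePosixPath(CHANGE_DATA_DIR)
    for column, value in partition_values.items():
        # One directory level per partition column, e.g. part1=value1.
        path = path / f"{column}={value}"
    return str(path / file_name)
```

For example, `change_data_path({"part1": "value1"}, "cdc-0001.parquet")` returns `_change_data/part1=value1/cdc-0001.parquet`.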

PROTOCOL.md Outdated
Comment on lines 114 to 115
_commit_version|`Long`| The Delta log or table version containing the change.
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created.
Member

These two columns are not in change data files.

Collaborator Author

Should we still mention them somewhere since they are inferred by the reader at runtime? It's not clear if this should be an actual requirement of readers...

@zsxwing
Member

zsxwing commented Sep 26, 2022

@nkarpov I left a few minor comments. We are still missing a section explaining how change data readers generate results; we can add that in a follow-up PR.

@nkarpov nkarpov requested a review from zsxwing September 29, 2022 16:49
Specifically, to read the row-level changes made in a version, the following strategy should be used:
1. If there are `cdc` actions in this version, then read only those to get the row-level changes, and skip the remaining `add` and `remove` actions in this version.
2. Otherwise, if there are no `cdc` actions in this version, read and treat all the rows in the `add` and `remove` actions as inserted and deleted rows, respectively.
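
The two-step strategy above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `row_level_change_files`, the single-key dict shape of each action, and the `"insert"` / `"delete"` labels are hypothetical and are not part of the spec text.

```python
def row_level_change_files(actions):
    """Pick the files a change data reader should scan for one table version.

    `actions` is a list of single-key dicts such as {"add": {"path": ...}},
    an assumed in-memory form of the actions in one commit.
    Returns (path, change_type) pairs. change_type is None for `cdc` files
    because this sketch does not inspect their contents.
    """
    cdc_paths = [a["cdc"]["path"] for a in actions if "cdc" in a]
    if cdc_paths:
        # Step 1: cdc actions exist, so read only those and skip the
        # add/remove actions in this version.
        return [(path, None) for path in cdc_paths]

    # Step 2: no cdc actions, so treat rows in add/remove files as
    # inserted and deleted rows, respectively.
    changes = []
    for action in actions:
        if "add" in action:
            changes.append((action["add"]["path"], "insert"))
        elif "remove" in action:
            changes.append((action["remove"]["path"], "delete"))
    return changes
```

A version containing `add`, `remove`, and `cdc` actions yields only the `cdc` paths, while a version with only `add` and `remove` actions yields every file tagged as an insert or a delete.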

Member

@zsxwing zsxwing Sep 29, 2022

3. The following extra columns should also be generated:

Field Name | Data Type | Description
-|-|-
_commit_version|`Long`| The table version containing the change. This can be obtained from the name of the Delta log file that contains the actions.
_commit_timestamp|`Timestamp`| The timestamp at which the commit was created. This can be obtained from the file modification time of the Delta log file that contains the actions.
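
A minimal sketch of how a reader might derive these two columns, assuming commit files are named with a zero-padded version number followed by `.json` and live on a local filesystem (an object store would need its own metadata call). Both function names are hypothetical.

```python
import os
from datetime import datetime, timezone


def commit_version_from_log_name(log_file_name):
    """Parse the table version from a commit file name.

    Assumes names like '00000000000000000012.json'.
    """
    stem, ext = os.path.splitext(os.path.basename(log_file_name))
    if ext != ".json" or not stem.isdigit():
        raise ValueError(f"not a commit file name: {log_file_name!r}")
    return int(stem)


def commit_timestamp_from_mtime(log_file_path):
    """Use the log file's modification time as the commit timestamp (UTC)."""
    mtime = os.path.getmtime(log_file_path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc)
```

A reader would attach the same `_commit_version` and `_commit_timestamp` values to every row it emits for that version, since neither column is stored in the change data files themselves.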

@nkarpov nkarpov requested a review from zsxwing September 29, 2022 20:35
4 participants