Update PROTOCOL to include change data spec #1300
Changes from 8 commits
820666e
9f5058b
f18e40e
0db4b9d
23355f1
765bd5f
ea25eb8
d8c88d9
39450da
2a1689f
7a5a4ba
@@ -6,6 +6,7 @@ | |||||
- [Delta Table Specification](#delta-table-specification) | ||||||
- [File Types](#file-types) | ||||||
- [Data Files](#data-files) | ||||||
- [Change Data Files](#change-data-files) | ||||||
- [Delta Log Entries](#delta-log-entries) | ||||||
- [Checkpoints](#checkpoints) | ||||||
- [Last Checkpoint File](#last-checkpoint-file) | ||||||
|
@@ -15,6 +16,7 @@ | |||||
- [Change Metadata](#change-metadata) | ||||||
- [Format Specification](#format-specification) | ||||||
- [Add File and Remove File](#add-file-and-remove-file) | ||||||
- [Add CDC File](#add-cdc-file) | ||||||
- [Transaction Identifiers](#transaction-identifiers) | ||||||
- [Protocol Evolution](#protocol-evolution) | ||||||
- [Commit Provenance Information](#commit-provenance-information) | ||||||
|
@@ -82,7 +84,7 @@ The state of a table at a given version is called a _snapshot_ and is defined by | |||||
- **Set of application-specific transactions** that have been successfully committed to the table
|
||||||
## File Types | ||||||
-A Delta table is stored within a directory and is composed of four different types of files.
+A Delta table is stored within a directory and is composed of the following different types of files.
|
||||||
Here is an example of a Delta table with three entries in the commit log, stored in the directory `mytable`. | ||||||
``` | ||||||
|
@@ -91,6 +93,7 @@ Here is an example of a Delta table with three entries in the commit log, stored | |||||
/mytable/_delta_log/00000000000000000003.json | ||||||
/mytable/_delta_log/00000000000000000003.checkpoint.parquet | ||||||
/mytable/_delta_log/_last_checkpoint | ||||||
/mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet | ||||||
/mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet | ||||||
``` | ||||||
|
||||||
|
@@ -100,6 +103,19 @@ By default, the reference implementation stores data files in directories that a | |||||
This directory format is only used to follow existing conventions and is not required by the protocol. | ||||||
Actual partition values for a file must be read from the transaction log. | ||||||
|
||||||
### Change Data Files | ||||||
Change data files are stored in a directory named `_change_data` at the root of the table, and represent the changes for the table version they belong to. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (e.g. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, such as `UPDATE`, `DELETE`, and `MERGE` operations on a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files.
|
||||||
|
||||||
In addition to the data columns, change data files contain additional columns that identify the type of change event: | ||||||
|
||||||
Field Name | Data Type | Description
-|-|-
_change_type|`String`| `insert`, `update_preimage`, `update_postimage`, `delete` __(1)__
_commit_version|`Long`| The Delta log or table version containing the change.
_commit_timestamp|`Timestamp`| The timestamp at which the commit was created.
Review comment: These two columns are not in data change files.
Review comment: Should we still mention them somewhere, since they are inferred by the reader at runtime? It's not clear if this should be an actual requirement of readers...
||||||
|
||||||
__(1)__ `preimage` is the value before the update, `postimage` is the value after the update. | ||||||
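As a rough illustration of the column layout above, the following sketch separates a change row into its data columns and its change metadata. The row-as-dictionary representation and the names `split_change_row`, `CHANGE_TYPES`, and `CDC_COLUMNS` are assumptions made for this example, not part of the protocol. The two commit columns are treated as optional here, which is also an assumption.

```python
# Change types listed in the table above.
CHANGE_TYPES = {"insert", "update_preimage", "update_postimage", "delete"}

# Metadata columns listed in the table above (constant name is hypothetical).
CDC_COLUMNS = ("_change_type", "_commit_version", "_commit_timestamp")


def split_change_row(row):
    """Split one change row (a dict) into (data_columns, change_metadata).

    Raises ValueError if `_change_type` is missing or not a known value.
    """
    change_type = row.get("_change_type")
    if change_type not in CHANGE_TYPES:
        raise ValueError(f"unknown _change_type: {change_type!r}")
    # Keep only the metadata columns that are actually present in the row.
    meta = {name: row[name] for name in CDC_COLUMNS if name in row}
    # Everything else is an ordinary data column of the table.
    data = {name: value for name, value in row.items() if name not in CDC_COLUMNS}
    return data, meta
```

An update would typically surface as two such rows, one with `update_preimage` carrying the old values and one with `update_postimage` carrying the new values.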
|
||||||
### Delta Log Entries | ||||||
Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table. | ||||||
|
||||||
|
@@ -331,6 +347,39 @@ The following is an example `remove` action. | |||||
} | ||||||
``` | ||||||
|
||||||
### Add CDC File | ||||||
Review comment: The protocol version information is not here.
Review comment: Added two subsections in this area to clarify writer/reader requirements as you've asked. Let me know if you'd like to move it elsewhere. Change Data Feed is unlike the other features (like Column Mapping), so there's no clear precedent on how to organize it within this doc.
||||||
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. | ||||||
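The reader rule in the paragraph above can be sketched roughly as follows. This is a simplified illustration, not the reference implementation: the function name `changes_for_version` and the in-memory representation of a commit (a list of single-key dictionaries, one per log line) are assumptions for the example, and other protocol details are ignored.

```python
def changes_for_version(actions):
    """Return (change_type, path) pairs a CDC reader would scan for one commit.

    `actions` is a list of single-key dicts such as {"add": {...}}.
    A change_type of None means the rows in that file carry their own
    `_change_type` column.
    """
    cdc_paths = [a["cdc"]["path"] for a in actions if "cdc" in a]
    if cdc_paths:
        # The version contains cdc actions: read changes exclusively from them
        # and ignore the add/remove actions of the same version.
        return [(None, path) for path in cdc_paths]

    changes = []
    for action in actions:
        if "add" in action:
            # No cdc action in this version: added files are read as inserts.
            changes.append(("insert", action["add"]["path"]))
        elif "remove" in action:
            # Removed files are read as deletes.
            changes.append(("delete", action["remove"]["path"]))
    return changes
```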
|
||||||
|
||||||
The schema of the `cdc` action is as follows: | ||||||
|
||||||
Field Name | Data Type | Description
-|-|-
path| String | A relative path to a change data file from the root of the table or an absolute path to a change data file that should be added to the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path.
partitionValues| Map[String, String] | A map from partition column to value for this file. See also [Partition Value Serialization](#Partition-Value-Serialization)
size| Long | The size of this file in bytes
tags | Map[String, String] | Map containing metadata about this file
|
||||||
The following is an example `cdc` action.
|
||||||
```
{
  "cdc": {
    "path": "_change_data/cdc-00001-c…..snappy.parquet",
    "partitionValues": {},
    "size": 1213,
    "dataChange": false
  }
}
```
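A minimal sketch of how a reader might pull a `cdc` action out of one line of a commit file is shown below. The function name `parse_cdc_action` is hypothetical, and using percent-decoding via `urllib.parse.unquote` is an assumption about what decoding the URI means for relative paths. Absolute paths with a scheme would need additional handling.

```python
import json
from urllib.parse import unquote


def parse_cdc_action(line):
    """Parse one JSON line of a commit; return the cdc fields or None.

    The returned path has been percent-decoded (an assumption of this sketch).
    """
    action = json.loads(line)
    cdc = action.get("cdc")
    if cdc is None:
        # Some other action type (add, remove, metaData, ...).
        return None
    return {
        "path": unquote(cdc["path"]),
        "partitionValues": cdc.get("partitionValues", {}),
        "size": cdc["size"],
        "tags": cdc.get("tags"),
    }
```

For instance, a line whose `path` is the made-up value `_change_data/cdc-example%20file.parquet` would yield a decoded path containing a space.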
|
||||||
#### Writer Requirements for AddCDCFile | ||||||
|
||||||
As of [Writer Version 4](#Writer-Version-Requirements), all writers must respect the `delta.enableChangeDataFeed` configuration flag in the metadata of the table. When `delta.enableChangeDataFeed` is `true`, writers must produce the relevant `AddCDCFile`s for any operation that changes data, as specified in [Change Data Files](#change-data-files).
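The writer-side check could look roughly like the sketch below. It assumes the table metadata is available as a dictionary with a string-to-string `configuration` map, and that the flag is stored as the string `"true"`. The function names and the case-insensitive comparison are assumptions for illustration only.

```python
def change_data_feed_enabled(metadata_action):
    """Return True if `delta.enableChangeDataFeed` is set to true.

    `metadata_action` is assumed to be a dict with a `configuration`
    map of string keys to string values.
    """
    configuration = metadata_action.get("configuration") or {}
    value = configuration.get("delta.enableChangeDataFeed", "false")
    # Case-insensitive comparison is an assumption of this sketch.
    return str(value).lower() == "true"


def must_write_cdc(metadata_action, changes_existing_data):
    """Decide whether a writer has to emit change data files for an operation.

    `changes_existing_data` is True for operations that update or delete
    existing rows and False for operations that only add new data.
    """
    return change_data_feed_enabled(metadata_action) and bool(changes_existing_data)
```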
|
||||||
#### Reader Requirements for AddCDCFile | ||||||
|
||||||
When available, change data readers should use the `AddCDCFile`s in a given table version instead of computing changes from the underlying data files referenced by the `add` and `remove` actions. | ||||||
|
||||||
|
||||||
|
||||||
### Transaction Identifiers | ||||||
Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write. | ||||||
Transaction identifiers allow this information to be recorded atomically in the transaction log of a delta table along with the other actions that modify the contents of the table. | ||||||
|
Review comment: If an operation adds no change data files, it must only add new data without deleting or updating any existing data.
Review comment: It's not clear what the suggested change is. It appears to be a restatement of the same thing. Also, shouldn't the causality be thought of in the other direction?
Instead of
Isn't it:
Review comment: @zsxwing let me know what you think about this last one! I resolved the other comments in the latest commit.