Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update PROTOCOL to include change data spec #1300

Closed
wants to merge 11 commits into from
51 changes: 50 additions & 1 deletion PROTOCOL.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
- [Delta Table Specification](#delta-table-specification)
- [File Types](#file-types)
- [Data Files](#data-files)
- [Change Data Files](#change-data-files)
- [Delta Log Entries](#delta-log-entries)
- [Checkpoints](#checkpoints)
- [Last Checkpoint File](#last-checkpoint-file)
Expand All @@ -15,6 +16,7 @@
- [Change Metadata](#change-metadata)
- [Format Specification](#format-specification)
- [Add File and Remove File](#add-file-and-remove-file)
- [Add CDC File](#add-cdc-file)
- [Transaction Identifiers](#transaction-identifiers)
- [Protocol Evolution](#protocol-evolution)
- [Commit Provenance Information](#commit-provenance-information)
Expand Down Expand Up @@ -82,7 +84,7 @@ The state of a table at a given version is called a _snapshot_ and is defined by
- **Set of applications-specific transactions** that have been successfully committed to the table

## File Types
A Delta table is stored within a directory and is composed of four different types of files.
A Delta table is stored within a directory and is composed of the following different types of files.

Here is an example of a Delta table with three entries in the commit log, stored in the directory `mytable`.
```
Expand All @@ -91,6 +93,7 @@ Here is an example of a Delta table with three entries in the commit log, stored
/mytable/_delta_log/00000000000000000003.json
/mytable/_delta_log/00000000000000000003.checkpoint.parquet
/mytable/_delta_log/_last_checkpoint
/mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet
/mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet
```

Expand All @@ -100,6 +103,19 @@ By default, the reference implementation stores data files in directories that a
This directory format is only used to follow existing conventions and is not required by the protocol.
Actual partition values for a file must be read from the transaction log.

### Change Data Files
Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operations that only add new data should not produce separate change files.

If an operation adds no change data files, it must only add new data without deleting or updating any existing data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear on what the suggested change is. It appears to be a restatement of the same thing... also shouldn't the causality be thought of in the other direction?

Instead of

If an operation adds no change data files, it must only add new data without deleting or updating any existing data.

Isn't it:

If an operation adds only new data, without deleting or updating any existing data, it should not produce any change data files

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zsxwing let me know what you think about this last one! resolved the other comments in the latest commit otherwise.

nkarpov marked this conversation as resolved.
Show resolved Hide resolved

In addition to the data columns, change data files contain additional columns that identify the type of change event:

Field Name | Data Type | Description
-|-|-
_change_type|`String`| `insert`, `update_preimage` , `update_postimage`, `delete` __(1)__
_commit_version|`Long`| The Delta log or table version containing the change.
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two columns are not in data change files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we still mention them somewhere since they are inferred by the reader at runtime? It's not clear if this should be an actual requirement of readers...


__(1)__ `preimage` is the value before the update, `postimage` is the value after the update.

### Delta Log Entries
Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.

Expand Down Expand Up @@ -331,6 +347,39 @@ The following is an example `remove` action.
}
```

### Add CDC File
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protocol version information is not here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two subsections in this area to clarify writer/reader requirements as you've asked. Let me know if you'd like to move it elsewhere. Change Data Feed is unlike the other features (like Column Mapping) so there's no clear precedent on how to organize it within this doc.

The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.


The schema of the `cdc` action is as follows:

Field Name | Data Type | Description
-|-|-
path| String | A relative path to a change data file from the root of the table or an absolute path to a change data file that should be added to the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path.
partitionValues| Map[String, String] | A map from partition column to value for this file. See also [Partition Value Serialization](#Partition-Value-Serialization)
size| Long | The size of this file in bytes
tags | Map[String, String] | Map containing metadata about this file

The following is an example of `cdc` action.

```
{
"cdc": {
"path": "_change_data/cdc-00001-c…..snappy.parquet",
"partitionValues": {},
"size": 1213,
"dataChange": false
nkarpov marked this conversation as resolved.
Show resolved Hide resolved
}
}
```

#### Writer Requirements for AddCDCFile

As of [Writer Version 4](#Writer-Version-Requirements), all writers must respect the `delta.enableChangeDataFeed` configuration flag in the metadata of the table. When `delta.enableChangeDataFeed` is `true`, writers must produce the relevant `AddCDCFile`'s for any operation that changes data, as specified in [Change Data Files](#change-data-files)

#### Reader Requirements for AddCDCFile

When available, change data readers should use the `AddCDCFile`s in a given table version instead of computing changes from the underlying data files referenced by the `add` and `remove` actions.
nkarpov marked this conversation as resolved.
Show resolved Hide resolved
nkarpov marked this conversation as resolved.
Show resolved Hide resolved

Copy link
Member

@zsxwing zsxwing Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. The following extra columns should also be generated:

Field Name | Data Type | Description
-|-|-
_commit_version|`Long`| The table version containing the change. This can be got from the name of the Delta log file that contains actions.
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. This can be got from the file modification time of the Delta log file that contains actions.

### Transaction Identifiers
Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write.
Transaction identifiers allow this information to be recorded atomically in the transaction log of a delta table along with the other actions that modify the contents of the table.
Expand Down