Update PROTOCOL to include change data spec #1300

Closed
wants to merge 11 commits into from
48 changes: 47 additions & 1 deletion PROTOCOL.md
@@ -6,19 +6,24 @@
- [Delta Table Specification](#delta-table-specification)
- [File Types](#file-types)
- [Data Files](#data-files)
- [Change Data Files](#change-data-files)
- [Delta Log Entries](#delta-log-entries)
- [Checkpoints](#checkpoints)
- [Last Checkpoint File](#last-checkpoint-file)
- [JSON checksum](#json-checksum)
- [How to URL encode keys and string values](#how-to-url-encode-keys-and-string-values)
- [Actions](#actions)
- [Change Metadata](#change-metadata)
- [Format Specification](#format-specification)
- [Add File and Remove File](#add-file-and-remove-file)
- [Add CDC File](#add-cdc-file)
- [Transaction Identifiers](#transaction-identifiers)
- [Protocol Evolution](#protocol-evolution)
- [Commit Provenance Information](#commit-provenance-information)
- [Action Reconciliation](#action-reconciliation)
- [Column Mapping](#column-mapping)
- [Writer Requirements for Column Mapping](#writer-requirements-for-column-mapping)
- [Reader Requirements for Column Mapping](#reader-requirements-for-column-mapping)
- [Requirements for Writers](#requirements-for-writers)
- [Creation of New Log Entries](#creation-of-new-log-entries)
- [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
@@ -29,7 +34,9 @@
- [Append-only Tables](#append-only-tables)
- [Column Invariants](#column-invariants)
- [Generated Columns](#generated-columns)
- [Identity Columns](#identity-columns)
- [Writer Version Requirements](#writer-version-requirements)
- [Requirements for Readers](#requirements-for-readers)
- [Appendix](#appendix)
- [Per-file Statistics](#per-file-statistics)
- [Partition Value Serialization](#partition-value-serialization)
@@ -77,7 +84,7 @@ The state of a table at a given version is called a _snapshot_ and is defined by
- **Set of applications-specific transactions** that have been successfully committed to the table

## File Types
A Delta table is stored within a directory and is composed of four different types of files.
A Delta table is stored within a directory and is composed of five different types of files.

Here is an example of a Delta table with three entries in the commit log, stored in the directory `mytable`.
```
@@ -86,6 +93,7 @@ Here is an example of a Delta table with three entries in the commit log, stored
/mytable/_delta_log/00000000000000000003.json
/mytable/_delta_log/00000000000000000003.checkpoint.parquet
/mytable/_delta_log/_last_checkpoint
/mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet
/mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet
```

@@ -95,6 +103,19 @@ By default, the reference implementation stores data files in directories that a
This directory format is only used to follow existing conventions and is not required by the protocol.
Actual partition values for a file must be read from the transaction log.

### Change Data Files
Change data files are stored in a directory named `_change_data` at the root of the table, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (e.g. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, such as `UPDATE`, `DELETE`, and `MERGE` operations on a Delta Lake table. Operations that only add new data should not produce separate change files. When change data files are available, change data readers should use them instead of computing changes from the underlying data files.
Member

Operations that only add new data should not produce separate change files.

If an operation adds no change data files, it must only add new data without deleting or updating any existing data.

Collaborator Author

Not clear on what the suggested change is. It appears to be a restatement of the same thing... also shouldn't the causality be thought of in the other direction?

Instead of

If an operation adds no change data files, it must only add new data without deleting or updating any existing data.

Isn't it:

If an operation adds only new data, without deleting or updating any existing data, it should not produce any change data files

Collaborator Author

@zsxwing let me know what you think about this last one! resolved the other comments in the latest commit otherwise.

In addition to the data columns, change data files contain metadata columns that identify the type of change event:

Field Name | Data Type | Description
-|-|-
_change_type|`String`| `insert`, `update_preimage`, `update_postimage`, `delete` __(1)__
_commit_version|`Long`| The Delta log or table version containing the change.
_commit_timestamp|`Timestamp`| The timestamp at which the commit was created.
Member

These two columns are not in data change files.

Collaborator Author

Should we still mention them somewhere since they are inferred by the reader at runtime? It's not clear if this should be an actual requirement of readers...


__(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
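
As a rough illustration of the `_change_type` values above, the sketch below shows how a single updated row could be represented as two change events. It assumes rows are plain Python dicts; `update_change_rows` is a hypothetical helper for illustration and not part of any Delta Lake API. Only `_change_type` is shown.

```python
# Hypothetical helper (not a Delta Lake API): rows are plain dicts.
def update_change_rows(before, after):
    """Return the two change events that describe one updated row."""
    return [
        # Value of the row before the update.
        {**before, "_change_type": "update_preimage"},
        # Value of the row after the update.
        {**after, "_change_type": "update_postimage"},
    ]


rows = update_change_rows({"id": 1, "name": "a"}, {"id": 1, "name": "b"})
```

Under this sketch, an `UPDATE` contributes a preimage/postimage pair per changed row, while an insert or delete would contribute a single row with `insert` or `delete`.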

### Delta Log Entries
Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.

@@ -326,6 +347,31 @@ The following is an example `remove` action.
}
```

### Add CDC File
Contributor

The protocol version information is not here.

Collaborator Author

Added two subsections in this area to clarify writer/reader requirements as you've asked. Let me know if you'd like to move it elsewhere. Change Data Feed is unlike the other features (like Column Mapping) so there's no clear precedent on how to organize it within this doc.

The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta version, they must read the changes for that version exclusively from the `cdc` files, rather than inferring changes from `add` and `remove` actions as they do for other types of operations.
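
The reader rule above could be sketched roughly as follows. This assumes a commit is available as a list of JSON lines, one action per line; `files_for_change_read` is a hypothetical name, and the fallback branch is deliberately simplified (it does not filter on `dataChange`, for example).

```python
import json


def files_for_change_read(commit_lines):
    """Pick which actions a change data reader uses for one commit.

    Hypothetical sketch: `commit_lines` holds one JSON action per line.
    """
    actions = [json.loads(line) for line in commit_lines if line.strip()]
    cdc_paths = [a["cdc"]["path"] for a in actions if "cdc" in a]
    if cdc_paths:
        # At least one `cdc` action exists: read changes only from cdc files.
        return {"source": "cdc", "files": cdc_paths}
    # No `cdc` action: fall back to inferring changes from add/remove.
    return {
        "source": "add_remove",
        "adds": [a["add"]["path"] for a in actions if "add" in a],
        "removes": [a["remove"]["path"] for a in actions if "remove" in a],
    }
```

The point of the sketch is the branch order: the presence of any `cdc` action in a version takes precedence over the `add` and `remove` actions in that same version.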

The schema of the `cdc` action is as follows:

Field Name | Data Type | Description
-|-|-
path| String | A relative path to a change data file from the root of the table or an absolute path to a change data file that should be added to the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path.
partitionValues| Map[String, String] | A map from partition column to value for this file. See also [Partition Value Serialization](#Partition-Value-Serialization)
size| Long | The size of this file in bytes
tags | Map[String, String] | Map containing metadata about this file

The following is an example `cdc` action.

```
{
  "cdc": {
    "path": "_change_data/cdc-00001-c…..snappy.parquet",
    "partitionValues": {},
    "size": 1213,
    "dataChange": false
  }
}
```
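A minimal sketch of how a writer might assemble such an action, assuming the path is percent-encoded with Python's `urllib.parse.quote` and decoded by the reader with `unquote`. The helper `make_cdc_action` and the file name are hypothetical, and the exact encoding rules a real implementation applies may differ from `quote`'s defaults.

```python
import json
from urllib.parse import quote, unquote


def make_cdc_action(relative_path, size, partition_values=None):
    """Build a `cdc` action dict (hypothetical helper, not a Delta Lake API)."""
    return {
        "cdc": {
            # The path field is a URI, so encode characters such as spaces.
            "path": quote(relative_path),
            "partitionValues": partition_values or {},
            "size": size,
            "dataChange": False,
        }
    }


# Hypothetical file name, for illustration only.
original = "_change_data/part1=a b/cdc-example.parquet"
action = make_cdc_action(original, 1213, {"part1": "a b"})
line = json.dumps(action)

# A reader decodes the URI to recover the file path.
decoded = unquote(json.loads(line)["cdc"]["path"])
```

Here `decoded` equals `original`, which is the round trip the `path` description asks readers to perform.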

### Transaction Identifiers
Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write.
Transaction identifiers allow this information to be recorded atomically in the transaction log of a delta table along with the other actions that modify the contents of the table.