From 820666e7780f29a92f17db416e8ffd840e14a77e Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Fri, 29 Jul 2022 14:46:44 -0700 Subject: [PATCH 01/10] Update PROTOCOL to include change data spec --- PROTOCOL.md | 41 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/PROTOCOL.md b/PROTOCOL.md index ce330321382..3c49c19a377 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -77,7 +77,7 @@ The state of a table at a given version is called a _snapshot_ and is defined by - **Set of applications-specific transactions** that have been successfully committed to the table ## File Types -A Delta table is stored within a directory and is composed of four different types of files. +A Delta table is stored within a directory and is composed of five different types of files. Here is an example of a Delta table with three entries in the commit log, stored in the directory `mytable`. ``` @@ -86,6 +86,7 @@ Here is an example of a Delta table with three entries in the commit log, stored /mytable/_delta_log/00000000000000000003.json /mytable/_delta_log/00000000000000000003.checkpoint.parquet /mytable/_delta_log/_last_checkpoint +/mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet /mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet ``` @@ -95,6 +96,19 @@ By default, the reference implementation stores data files in directories that a This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log. +### Change Data Files +Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). 
Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. When available, change data readers should use the change data files instead of computing changes from the underlying data files. + +In addition to the data columns, change data files contain metadata columns that identify the type of change event: + +Field Name | Data Type | Description +-|-|- +_change_type|`String`| `insert`, `update_preimage` , `update_postimage`, `delete` __(1)__ +_commit_version|`Long`| The Delta log or table version containing the change. +_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. + +__(1)__ `preimage` is the value before the update, postimage is the value after the update. + ### Delta Log Entries Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table. @@ -326,6 +340,31 @@ The following is an example `remove` action. } ``` +### Add CDC File +The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. + +The schema of the `cdc` action is as follows: + +Field Name | Data Type | Description +-|-|- +path| String | A relative path to a file from the root of the table or an absolute path to a file that should be removed from the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path. +partitionValues| Map[String, String] | A map from partition column to value for this file. 
See also [Partition Value Serialization](#Partition-Value-Serialization) +size| Long | The size of this file in bytes +dataChange | Boolean | Should always be `false` for change data because it only mirrors the effective changes of the data files + +The following is an example of `cdc` action. + +``` +{ + "cdc": { + "path": "_change_data/cdc-00001-c…..snappy.parquet", + "partitionValues": {}, + "size": 1213, + "dataChange": false + } +} +``` + ### Transaction Identifiers Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write. Transaction identifiers allow this information to be recorded atomically in the transaction log of a delta table along with the other actions that modify the contents of the table. From 9f5058ba3a6beb282086bc2ce364851c8c86c8e2 Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Fri, 29 Jul 2022 15:27:09 -0700 Subject: [PATCH 02/10] suggested changes --- PROTOCOL.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/PROTOCOL.md b/PROTOCOL.md index 3c49c19a377..6c49b30f345 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -107,7 +107,7 @@ _change_type|`String`| `insert`, `update_preimage` , `update_postimage`, `delete _commit_version|`Long`| The Delta log or table version containing the change. _commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. -__(1)__ `preimage` is the value before the update, postimage is the value after the update. +__(1)__ `preimage` is the value before the update, `postimage` is the value after the update. ### Delta Log Entries Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table. @@ -341,16 +341,16 @@ The following is an example `remove` action. 
``` ### Add CDC File -The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. +The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a cdc action in a particular Delta version, they must read from that version exclusively using the cdc files, rather than inferring changes from add and remove actions as they do for the other type of operations. The schema of the `cdc` action is as follows: Field Name | Data Type | Description -|-|- -path| String | A relative path to a file from the root of the table or an absolute path to a file that should be removed from the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path. +path| String | A relative path to a change data file from the root of the table or an absolute path to a change data file that should be added to the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path. partitionValues| Map[String, String] | A map from partition column to value for this file. See also [Partition Value Serialization](#Partition-Value-Serialization) size| Long | The size of this file in bytes -dataChange | Boolean | Should always be `false` for change data because it only mirrors the effective changes of the data files +tags | Map[String, String] | Map containing metadata about this file The following is an example of `cdc` action. From f18e40e402ab2c8892fd9b41ce3f056330957185 Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Fri, 29 Jul 2022 15:35:59 -0700 Subject: [PATCH 03/10] suggested changed contd. 
---
 PROTOCOL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PROTOCOL.md b/PROTOCOL.md
index 6c49b30f345..3b72c82b438 100644
--- a/PROTOCOL.md
+++ b/PROTOCOL.md
@@ -97,7 +97,7 @@ This directory format is only used to follow existing conventions and is not req
 Actual partition values for a file must be read from the transaction log.

 ### Change Data Files
-Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. When available, change data readers should use the change data files instead of computing changes from the underlying data files.
+Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files.

 In addition to the data columns, change data files contain metadata columns that identify the type of change event:

From 0db4b9d2602d455fe9ab707475634e962adf3bfc Mon Sep 17 00:00:00 2001
From: Nick Karpov
Date: Fri, 29 Jul 2022 15:41:31 -0700
Subject: [PATCH 04/10] Update PROTOCOL.md

---
 PROTOCOL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PROTOCOL.md b/PROTOCOL.md
index 3b72c82b438..f2f79b6a856 100644
--- a/PROTOCOL.md
+++ b/PROTOCOL.md
@@ -341,7 +341,7 @@ The following is an example `remove` action.
 ```

 ### Add CDC File
-The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a cdc action in a particular Delta version, they must read from that version exclusively using the cdc files, rather than inferring changes from add and remove actions as they do for the other type of operations.
+The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta version, they must read from that version exclusively using the `cdc` files, rather than inferring changes from add and remove actions as they do for the other type of operations.

The schema of the `cdc` action is as follows: From 23355f156a4f325b0d87edbee1da896e0b3c94df Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Sun, 31 Jul 2022 19:39:50 -0700 Subject: [PATCH 05/10] update TOC w/ doctoc --- PROTOCOL.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/PROTOCOL.md b/PROTOCOL.md index f2f79b6a856..ef750287de5 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -6,19 +6,24 @@ - [Delta Table Specification](#delta-table-specification) - [File Types](#file-types) - [Data Files](#data-files) + - [Change Data Files](#change-data-files) - [Delta Log Entries](#delta-log-entries) - [Checkpoints](#checkpoints) - [Last Checkpoint File](#last-checkpoint-file) - [JSON checksum](#json-checksum) + - [How to URL encode keys and string values](#how-to-url-encode-keys-and-string-values) - [Actions](#actions) - [Change Metadata](#change-metadata) - [Format Specification](#format-specification) - [Add File and Remove File](#add-file-and-remove-file) + - [Add CDC File](#add-cdc-file) - [Transaction Identifiers](#transaction-identifiers) - [Protocol Evolution](#protocol-evolution) - [Commit Provenance Information](#commit-provenance-information) - [Action Reconciliation](#action-reconciliation) - [Column Mapping](#column-mapping) + - [Writer Requirements for Column Mapping](#writer-requirements-for-column-mapping) + - [Reader Requirements for Column Mapping](#reader-requirements-for-column-mapping) - [Requirements for Writers](#requirements-for-writers) - [Creation of New Log Entries](#creation-of-new-log-entries) - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files) @@ -29,7 +34,9 @@ - [Append-only Tables](#append-only-tables) - [Column Invariants](#column-invariants) - [Generated Columns](#generated-columns) + - [Identity Columns](#identity-columns) - [Writer Version Requirements](#writer-version-requirements) +- [Requirements for Readers](#requirements-for-readers) - [Appendix](#appendix) - [Per-file 
Statistics](#per-file-statistics) - [Partition Value Serialization](#partition-value-serialization) From ea25eb8f99590f27e6fe69c706d558b176f1b4e8 Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Mon, 12 Sep 2022 10:44:51 -0700 Subject: [PATCH 06/10] feedback changes --- PROTOCOL.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/PROTOCOL.md b/PROTOCOL.md index ef750287de5..aff7c1bf1aa 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -84,7 +84,7 @@ The state of a table at a given version is called a _snapshot_ and is defined by - **Set of applications-specific transactions** that have been successfully committed to the table ## File Types -A Delta table is stored within a directory and is composed of five different types of files. +A Delta table is stored within a directory and is composed of the following different types of files. Here is an example of a Delta table with three entries in the commit log, stored in the directory `mytable`. ``` @@ -106,7 +106,7 @@ Actual partition values for a file must be read from the transaction log. ### Change Data Files Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files. 
-In addition to the data columns, change data files contain metadata columns that identify the type of change event: +In addition to the data columns, change data files contain additional columns that identify the type of change event: Field Name | Data Type | Description -|-|- @@ -348,7 +348,7 @@ The following is an example `remove` action. ``` ### Add CDC File -The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta version, they must read from that version exclusively using the `cdc` files, rather than inferring changes from add and remove actions as they do for the other type of operations. +The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. The schema of the `cdc` action is as follows: @@ -372,6 +372,14 @@ The following is an example of `cdc` action. } ``` +#### Writer Requirements for AddCDCFile + +As of [Writer Version 4](#Writer-Version-Requirements), all writers must respect the `delta.enableChangeDataFeed` configuration flag in the metadata of the table. Writers must produce the relevant `AddCDCFile`'s for any operation that changes data, as specified in [Change Data Files](#change-data-files) + +#### Reader Requirements for AddCDCFile + +When available, change data readers should use the `AddCDCFile`s in a given table version instead of computing changes from the underlying data files referenced by the `add` and `remove` actions. 
+ ### Transaction Identifiers Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write. Transaction identifiers allow this information to be recorded atomically in the transaction log of a delta table along with the other actions that modify the contents of the table. From d8c88d93271b43185b12abaa3cdfd72f5fe2884d Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Mon, 12 Sep 2022 10:49:27 -0700 Subject: [PATCH 07/10] more explicit description of respecting config flag --- PROTOCOL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/PROTOCOL.md b/PROTOCOL.md index aff7c1bf1aa..7b1ea7f4b83 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -374,7 +374,7 @@ The following is an example of `cdc` action. #### Writer Requirements for AddCDCFile -As of [Writer Version 4](#Writer-Version-Requirements), all writers must respect the `delta.enableChangeDataFeed` configuration flag in the metadata of the table. Writers must produce the relevant `AddCDCFile`'s for any operation that changes data, as specified in [Change Data Files](#change-data-files) +As of [Writer Version 4](#Writer-Version-Requirements), all writers must respect the `delta.enableChangeDataFeed` configuration flag in the metadata of the table. 
When `delta.enableChangeDataFeed` is `true`, writers must produce the relevant `AddCDCFile`'s for any operation that changes data, as specified in [Change Data Files](#change-data-files) #### Reader Requirements for AddCDCFile From 39450daef8ab6fb4dd4ba6d3052c91532dfa0e65 Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Tue, 20 Sep 2022 14:23:33 -0700 Subject: [PATCH 08/10] linked to the writer table, some language modifications, add dataChange field back to the schema --- PROTOCOL.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/PROTOCOL.md b/PROTOCOL.md index 7b1ea7f4b83..9525b87cd2d 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -357,6 +357,7 @@ Field Name | Data Type | Description path| String | A relative path to a change data file from the root of the table or an absolute path to a change data file that should be added to the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the file path. partitionValues| Map[String, String] | A map from partition column to value for this file. See also [Partition Value Serialization](#Partition-Value-Serialization) size| Long | The size of this file in bytes +dataChange | Boolean | Should always be set to `false` for `cdc` actions because they _do not_ change the underlying data of the table tags | Map[String, String] | Map containing metadata about this file The following is an example of `cdc` action. @@ -378,7 +379,10 @@ As of [Writer Version 4](#Writer-Version-Requirements), all writers must respect #### Reader Requirements for AddCDCFile -When available, change data readers should use the `AddCDCFile`s in a given table version instead of computing changes from the underlying data files referenced by the `add` and `remove` actions. 
+When available, change data readers should use the `cdc` actions in a given table version instead of computing changes from the underlying data files referenced by the `add` and `remove` actions. +Specifically, to read the row-level changes made in a version, the following strategy should be used: +1. If there are `cdc` actions in this version, then read only those to get the row-level changes, and skip the remaining `add` and `remove` actions in this version. +2. Otherwise, if there are no `cdc` actions in this version, read and treat all the rows in the `add` and `remove` actions as inserted and deleted rows, respectively. ### Transaction Identifiers Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write. @@ -638,7 +642,7 @@ The requirements of the writers according to the protocol versions are summarize -|- Writer Version 2 | - Support [`delta.appendOnly`](#append-only-tables)
- Support [Column Invariants](#column-invariants) Writer Version 3 | Enforce:
- `delta.checkpoint.writeStatsAsJson`
- `delta.checkpoint.writeStatsAsStruct`
- `CHECK` constraints -Writer Version 4 | - Support Change Data Feed
- Support [Generated Columns](#generated-columns) +Writer Version 4 | - Support [Change Data Feed](#add-cdc-file)
- Support [Generated Columns](#generated-columns) Writer Version 5 | Respect [Column Mapping](#column-mapping) Writer Version 6 | Support [Identity Columns](#identity-columns) From 2a1689f17125e154616cba8ae70cc7a58f852e45 Mon Sep 17 00:00:00 2001 From: Nick Karpov Date: Thu, 29 Sep 2022 09:45:12 -0700 Subject: [PATCH 09/10] incorporating latest feedback --- PROTOCOL.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/PROTOCOL.md b/PROTOCOL.md index 9525b87cd2d..cad5b4132df 100644 --- a/PROTOCOL.md +++ b/PROTOCOL.md @@ -104,15 +104,13 @@ This directory format is only used to follow existing conventions and is not req Actual partition values for a file must be read from the transaction log. ### Change Data Files -Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. Operations that only add new data should not produce separate change files. When available, change data readers should use the change data files instead of computing changes from the underlying data files. +Change data files are stored in a directory at the root of the table named `_change_data`, and represent the changes for the table version they are in. For data with partition values, it is recommended that the change data files are stored within the `_change_data` directory in their respective partitions (i.e. `_change_data/part1=value1/...`). Writers can _optionally_ produce these change data files as a consequence of operations that change underlying data, like `UPDATE`, `DELETE`, and `MERGE` operations to a Delta Lake table. 
If an operation only adds new data or removes existing data without updating any existing rows, a writer can write only data files and commit them in `add` or `remove` actions without duplicating the data into change data files. When available, change data readers should use the change data files instead of computing changes from the underlying data files. In addition to the data columns, change data files contain additional columns that identify the type of change event: Field Name | Data Type | Description -|-|- _change_type|`String`| `insert`, `update_preimage` , `update_postimage`, `delete` __(1)__ -_commit_version|`Long`| The Delta log or table version containing the change. -_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. __(1)__ `preimage` is the value before the update, `postimage` is the value after the update. @@ -348,7 +346,7 @@ The following is an example `remove` action. ``` ### Add CDC File -The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When CDC readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. +The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. 
 The schema of the `cdc` action is as follows:

From 7a5a4baafa842413c4e48ba9f5cba18e7cf36fab Mon Sep 17 00:00:00 2001
From: Nick Karpov
Date: Thu, 29 Sep 2022 11:58:38 -0700
Subject: [PATCH 10/10] document derived fields

---
 PROTOCOL.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/PROTOCOL.md b/PROTOCOL.md
index cad5b4132df..5d01b2b19f5 100644
--- a/PROTOCOL.md
+++ b/PROTOCOL.md
@@ -381,6 +381,12 @@ When available, change data readers should use the `cdc` actions in a given tabl
 Specifically, to read the row-level changes made in a version, the following strategy should be used:
 1. If there are `cdc` actions in this version, then read only those to get the row-level changes, and skip the remaining `add` and `remove` actions in this version.
 2. Otherwise, if there are no `cdc` actions in this version, read and treat all the rows in the `add` and `remove` actions as inserted and deleted rows, respectively.
+3. The following extra columns should also be generated:
+
+Field Name | Data Type | Description
+-|-|-
+_commit_version|`Long`| The table version containing the change. This can be obtained from the name of the Delta log file that contains the actions.
+_commit_timestamp|`Timestamp`| The timestamp at which the commit was created. This can be obtained from the file modification time of the Delta log file that contains the actions.

 ### Transaction Identifiers
 Incremental processing systems (e.g., streaming systems) that track progress using their own application-specific versions need to record what progress has been made, in order to avoid duplicating data in the face of failures and retries during a write.
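
Reviewer note: the reader strategy added in this series (prefer `cdc` actions, otherwise fall back to `add` and `remove`, then derive `_commit_version`) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reference implementation: `ChangeSource`, `plan_change_read`, and `commit_version_from_log_name` are hypothetical names, actions are assumed to be already parsed from the JSON log into dictionaries, and the sketch does not URI-decode paths or read any Parquet data.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChangeSource:
    """One file a change data reader would scan (hypothetical helper type)."""
    path: str
    # None means the file is a change data file and already carries its own
    # _change_type column; otherwise every row gets this change type.
    change_type: Optional[str]


def commit_version_from_log_name(log_file_name: str) -> int:
    """Derive _commit_version from a log file name such as
    '00000000000000000003.json'."""
    return int(log_file_name.split(".", 1)[0])


def plan_change_read(actions: List[Dict[str, Any]]) -> List[ChangeSource]:
    """Pick the files to read for the row-level changes of one table version."""
    cdc_actions = [a["cdc"] for a in actions if "cdc" in a]
    if cdc_actions:
        # Step 1: cdc actions exist, so read only those and skip add/remove.
        return [ChangeSource(c["path"], None) for c in cdc_actions]

    # Step 2: no cdc actions, so rows in added files count as inserts and
    # rows in removed files count as deletes.
    plan: List[ChangeSource] = []
    for action in actions:
        if "add" in action:
            plan.append(ChangeSource(action["add"]["path"], "insert"))
        elif "remove" in action:
            plan.append(ChangeSource(action["remove"]["path"], "delete"))
    return plan
```

For a version whose log entry is `00000000000000000003.json`, a reader following this sketch would call `plan_change_read` on that entry's actions and attach `commit_version_from_log_name("00000000000000000003.json")` as `_commit_version` to every row it emits; `_commit_timestamp` would come from the log file's modification time, which is left out here because it depends on the storage system.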