-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protocol update for Table Features #1450
Conversation
Hey @tdas and @allisonport-db, could you review this PR? |
# Conflicts: # PROTOCOL.md
Hey @scottsand-db and @allisonport-db, for some reason Github removed you two from the reviewer list after I re-requested a review from @tdas. Please add yourself back :) |
This PR implements Table Features proposed in the feature request (#1408) and the PROTOCOL doc (#1450). This PR implements the basic functionality, including - The protocol structure and necessary APIs - Protocol upgrade logic - Append-only feature ported to Table Features - Protocol upgrade path - User-facing APIs, such as allowing referencing features manually - Partial test coverage Not covered by this PR: - Adapt all features - Full test coverage - Make `DESCRIBE TABLE` show referenced features - Enable table clone and time travel paths Table Features support starts from reader protocol version `3` and writer version `7`. When supported, features can be **referenced** by a protocol by placing a `DeltaFeatureDescriptor` into the protocol's `readerFeatures` and/or `writerFeatures`. A feature can be one of two types: writer-only and reader-writer. The first type means that only writers must care about such a feature, while the latter means that in addition to writers, readers must also be aware of the feature to read the data correctly. We now have the following features released: - `appendOnly`: legacy, writer-only - `invariants`: legacy, writer-only - `checkConstriants`: legacy, writer-only - `changeDataFeed`: legacy, writer-only - `generatedColumns`: legacy, writer-only - `columnMapping`: legacy, reader-writer - `identityColumn`: legacy, writer-only - `deletionVector`: native, reader-writer Some examples of the `protocol` action: ```scala // Valid protocol. Both reader and writer versions are capable. Protocol( minReaderVersion = 3, minWriterVersion = 7, readerFeatures = {(columnMapping,enabled), (changeDataFeed,enabled)}, writerFeatures = {(appendOnly,enabled), (columnMapping,enabled), (changeDataFeed,enabled)}) // Valid protocol. Only writer version is capable. "columnMapping" is implicitly enabled by readers. Protocol( minReaderVersion = 2, minWriterVersion = 7, readerFeatures = None, writerFeatures = {(columnMapping,enabled)}) // Invalid protocol. Reader version does enable "columnMapping" implicitly. Protocol( minReaderVersion = 1, minWriterVersion = 7, readerFeatures = None, writerFeatures = {(columnMapping,enabled)}) ``` When reading or writing a table, clients MUST respect all enabled features. Upon table creation, the system assigns the table a minimum protocol that satisfies all features that are **automatically enabled** in the table's metadata. This means the table can be on a "legacy" protocol with both `readerFeatures` and `writerFeatures` set to `None` (if all active features are legacy, which is the current behavior) or be on a Table Features-capable protocol with all active features explicitly referenced in `readerFeatures` and/or `writerFeatures` (if one of the active features is Table Features-native or the user has specified a Table Features-capable protocol version). It's possible to upgrade an existing table to support table features. The update can be either for writers or for both readers and writers. During the upgrade, the system will explicitly reference all legacy features that are implicitly supported by the old protocol. Users can mark a feature to be required by a table by using the following commands: ```sql -- for an existing table ALTER TABLE table_name SET TBLPROPERTIES (delta.feature.featureName = 'enabled') -- for a new table CREATE TABLE table_name ... TBLPROPERTIES (delta.feature.featureName = 'enabled') -- for all new tables SET spark.databricks.delta.properties.defaults.feature.featureName = 'enabled' ``` When some features are set to `enabled` in table properties and some others in Spark sessions, the final table will enable all features defined in two places: ```sql SET spark.databricks.delta.properties.defaults.feature.featureA = 'enabled'; CREATE TABLE table_name ... TBLPROPERTIES (delta.feature.featureB = 'enabled') --- 'table_name' will have 'featureA' and 'featureB' enabled. ``` Closes #1520 GitOrigin-RevId: 2b05f397b24e57f1804761b3242a0f29098a209c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a bunch of nits!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments/questions.
This change addresses feedback in #1450 (comment) to remove all uses of `TableFeatureDescriptor` in the codebase because that `status` field in the descriptor can only be `enabled` which can be omitted. Therefore we could simplify store feature names as `Set[String]` in the protocol and assume all listed features are `enabled`. `Protocol` action before this change: ```json { "protocol":{ "readerVersion":3, "writerVersion":7, "readerFeatures":[ {"name":"columnMapping","status":"enabled"} ], "writerFeatures":[ {"name":"columnMapping","status":"enabled"}, {"name":"identityColumns","status":"enabled"} ] } } ``` `Protocol` action after this change: ```json { "protocol":{ "readerVersion":3, "writerVersion":7, "readerFeatures":["columnMapping"], "writerFeatures":["columnMapping","identityColumns"] } } ``` GitOrigin-RevId: 445601bbaebf39f1cd41f8257ece7aeaff48e101
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Will merge
Description
This PR proposes a change to the Delta Protocol to accommodate Table Features discussed in [Feature Request] Table Features to specify the features needed to read/write to a table #1408.
TOC is updated using
doctoc
.How was this patch tested?
Not needed.