[Spark] Improve Delta Protocol Transitions #2848
Conversation
One question. Consider the following command sequence:
Should the table get |
Very good point. Command 2 will result in
Asking the user to set
LGTM
True... but it's even worse if they have to do that and add each feature individually. At least this way they can go "ok, my connector supports (x, y), let me set this on the table and then downgrade."
For me it seems the existing behavior you described is better: “For example, consider creating a table with Protocol(3, 7, RemovableReaderWriterFeature, ChangeDataFeed) and then dropping RemovableReaderWriterFeature. The resulting protocol will be Protocol(3, 7, ChangeDataFeed) instead of Protocol(1, 4).” Protocol (3, 7), besides being newer, has fewer requirements. Meaning if you “downgrade” to (1, 4) you are making your table less compatible with clients that don't support all the features from writer v4 but do support v7 + CDC.
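To make the comparison concrete, a rough Scala sketch; the Protocol API is the one used elsewhere in this thread, and ChangeDataFeedTableFeature is assumed to be the feature object's name in Delta's sources:

```scala
import org.apache.spark.sql.delta.actions.Protocol

// Legacy representation: writer version 4 implicitly requires every legacy
// writer feature up to v4 (append-only, invariants, check constraints,
// change data feed, generated columns).
val legacy = Protocol(1, 4)

// Table features representation: only the listed feature is required, so a
// client that supports reader 3 / writer 7 plus CDC can use the table even
// if it does not implement, say, generated columns.
val tableFeatures = Protocol(3, 7).withFeature(ChangeDataFeedTableFeature)
```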
Regarding the user journey: this journey works when the user thinks of this ahead of time. Should we also support the reverse order (sketched below)?
- DROP FEATURE
- Set the protocol versions to (x, y).

I.e., should we "normalize" the protocol versions to a legacy protocol version whenever someone alters the protocol, not just on DROP FEATURE?
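A hedged SQL sketch of the two orderings (table name and feature are illustrative; assumes a SparkSession `spark` with Delta enabled):

```scala
// The journey described above: set the target legacy versions first,
// then drop the feature.
spark.sql("ALTER TABLE demo SET TBLPROPERTIES " +
  "('delta.minReaderVersion' = '1', 'delta.minWriterVersion' = '4')")
spark.sql("ALTER TABLE demo DROP FEATURE deletionVectors")

// Reverse order raised above: drop the feature first, then set the versions.
// The open question is whether this should also normalize to a legacy protocol.
spark.sql("ALTER TABLE demo DROP FEATURE deletionVectors")
spark.sql("ALTER TABLE demo SET TBLPROPERTIES " +
  "('delta.minReaderVersion' = '1', 'delta.minWriterVersion' = '4')")
```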
 assert(log.update().protocol === Protocol(1, 7)
-  .merge(Protocol(1, 2)).withFeature(TestWriterFeature))
+  .withFeature(TestWriterFeature).merge(Protocol(1, 2)))
 table.addFeatureSupport(TestReaderWriterFeature.name)
 assert(
   log.update().protocol === Protocol(3, 7)
-    .merge(Protocol(1, 2))
-    .withFeatures(Seq(TestWriterFeature, TestReaderWriterFeature)))
+    .withFeatures(Seq(TestWriterFeature, TestReaderWriterFeature))
+    .merge(Protocol(1, 2)))
nit: it wasn't clear to me why this change was necessary. It seems a bit error-prone to write the expected result this way. Let's change it to be an explicit Protocol (without merging two protocol objects).
Yes, I agree. I changed it. Unfortunately, we use merge very often in the tests for validating results and it would be a pain to change all of them. The order matters because of a check in withFeatures that requires table feature versions when adding a table feature. merge would now normalize (1, 7) + (1, 2) to (1, 1), so (1, 1).withFeature would produce an error.
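A minimal sketch of that ordering issue, reusing the identifiers from the test snippet above and assuming merge now normalizes (1, 7) + (1, 2) down to (1, 1):

```scala
// merge first: the intermediate result is normalized to the legacy protocol
// (1, 1), which cannot carry table features, so withFeature would throw.
// Protocol(1, 7).merge(Protocol(1, 2)).withFeature(TestWriterFeature)

// withFeature first: the protocol keeps table feature versions, and the
// subsequent merge preserves the explicitly listed feature.
val expected = Protocol(1, 7).withFeature(TestWriterFeature).merge(Protocol(1, 2))
```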
|'$minReaderKey' = '3',
|'$minWriterKey' = '7',
|'${DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION.key}' = 'true'
You shouldn't need this change as part of this PR, right? I initially made the minReaderKey 2 to expose a bug: a655bed
Setting table feature versions without a table feature is now always normalised to (1, 1). I modified the test to capture your initial intention.
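A hedged SQL sketch of that normalisation (table names are illustrative; assumes a SparkSession `spark` with Delta enabled):

```scala
// Table feature versions with no table feature: expected to normalise to (1, 1).
spark.sql("""CREATE TABLE only_versions (id LONG) USING delta
  TBLPROPERTIES ('delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7')""")

// The same versions plus deletion vectors: expected to keep (3, 7, deletionVectors).
spark.sql("""CREATE TABLE versions_plus_dv (id LONG) USING delta
  TBLPROPERTIES (
    'delta.minReaderVersion' = '3',
    'delta.minWriterVersion' = '7',
    'delta.enableDeletionVectors' = 'true')""")
```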
| 'delta.minReaderVersion' = '3',
| 'delta.minWriterVersion' = '7')
Why did we need to update this test? The minReaderVersion and minWriterVersion are "suggestions" and are ignored anyway, since we enable DVs, which will make this table end up with (3, 7) anyway. Are we enforcing users' minReaderVersion and minWriterVersion after this change?
This is changed so the protocol always ends up as (1, 1) in testDroppingColumnMapping, independently of whether we enable DVs or not. As a result, the validation at the end of verifyDropFeatureTruncateHistory does not need to change depending on whether this is a table features protocol or not.
With the new semantics there are two cases (sketched below):
- Enabling columnMapping on a legacy protocol results in (2, 5).
- Enabling columnMapping on a table features protocol results in (2, 7, ColumnMapping).
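A hedged sketch of the two cases (table names are illustrative; assumes a SparkSession `spark` with Delta enabled):

```scala
// Case 1: table currently on a legacy protocol, e.g. (1, 2). Enabling column
// mapping is expected to result in the legacy protocol (2, 5).
spark.sql("ALTER TABLE legacy_table SET TBLPROPERTIES " +
  "('delta.columnMapping.mode' = 'name')")

// Case 2: table already on a table features protocol. Enabling column mapping
// is expected to result in (2, 7, ColumnMapping).
spark.sql("ALTER TABLE feature_table SET TBLPROPERTIES " +
  "('delta.columnMapping.mode' = 'name')")
```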
}

test("protocol upgrade compatibility") {
  assert(Protocol(1, 1).canUpgradeTo(Protocol(1, 1)))
  assert(Protocol(1, 1).canUpgradeTo(Protocol(2, 1)))
  assert(!Protocol(1, 2).canUpgradeTo(Protocol(1, 1)))
  assert(!Protocol(2, 2).canUpgradeTo(Protocol(2, 1)))
  assert(
I'm a bit concerned that this needed to be removed in this PR. Intuitively, (1, 2) -> (1, 1) is not an "upgrade", so as a developer I expect this to return false. Have we changed the definition of canUpgradeTo to mean canTransitionTo?
This change here is orthogonal and is not needed by anything else done in this PR. The idea came up in early iterations of this PR with @bart-samwel to simplify canUpgradeTo.
@@ -25,6 +25,7 @@ import java.util.Locale
 import scala.sys.process.Process

+// scalastyle:off import.ordering.noEmptyLine
 // scalastyle:off line.size.limit
This seems unnecessary when only deleting stuff?
The issue with that one is that it was contained in the piece of code I deleted, and it turns out it is needed by the rest of the code below... I think what happened is that there is a bug below where the style is not turned back on. So the style was not enforced when follow-up code was added.
Ah, but then, don't add it at the top, but just scope it where it's needed, please? Or is it just sooooo many places that that would be crazy?
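For reference, the scoped form would wrap just the offending imports instead of sitting at the top of the file, roughly:

```scala
// scalastyle:off import.ordering.noEmptyLine
import java.util.Locale

import scala.sys.process.Process
// scalastyle:on import.ordering.noEmptyLine
```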
I am afraid there are 38 errors :(.
That are all non-consecutive?
I am afraid they are not.
There are Spark master test failures: https://github.com/delta-io/delta/actions/runs/9891350186/job/27321586977?pr=2848
The PR is not ready to be merged. The particular failure is fixed by #3356.
Which Delta project/connector is this regarding?
Description
Currently, protocol transitions can be hard to manage. A few examples:
This PR proposes several protocol transition improvements in order to simplify user journeys. The high-level proposal is the following:
Two protocol representations with singular operational semantics. This means that we have two ways to represent a protocol: a) the legacy representation and b) the table features representation. The latter representation is more powerful than the former, i.e. the table features representation can represent all legacy protocols but the opposite is not true. This is followed by three simple rules:
The PR introduces the following behavioural changes:
- … (3, 7, AppendOnly, Invariants, DeletionVectors).
- … (1, 3) and deletion vectors results in protocol (3, 7, AppendOnly, Invariants, CheckConstraints, DeletionVectors) (see the sketch after this list).
- Setting protocol versions (3, 7) and no table features is now normalised to (1, 1).
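A hedged sketch of the deletion vectors transition from the list above (table name is illustrative; assumes a SparkSession `spark` with Delta enabled):

```scala
// Start from the legacy protocol (1, 3).
spark.sql("""CREATE TABLE dv_demo (id LONG) USING delta
  TBLPROPERTIES ('delta.minReaderVersion' = '1', 'delta.minWriterVersion' = '3')""")

// Enable deletion vectors; under the new behaviour the expected resulting
// protocol is (3, 7, AppendOnly, Invariants, CheckConstraints, DeletionVectors).
spark.sql("""ALTER TABLE dv_demo
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')""")
```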
How was this patch tested?
Added DeltaProtocolTransitionsSuite. Also modified existing tests in DeltaProtocolVersionSuite.
.Does this PR introduce any user-facing changes?
Yes.