[SPARK-51834][SQL] Support end-to-end table constraint management #50631
Conversation
cc @aokolnychyi
val constraint = tableConstraint.toV2Constraint(isCreateTable = false)
val validatedTableVersion = table match {
  case t: ResolvedTable if constraint.enforced() =>
    t.table.currentVersion()
Created a follow-up https://issues.apache.org/jira/browse/SPARK-51835 for testing the table version
"message" : [ | ||
"The check constraint `<checkCondition>` is non-deterministic. Check constraints must only contain deterministic expressions." | ||
], | ||
"sqlState" : "42621" |
The error code seems consistent with DB2 and what we use for generated columns, +1.
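For illustration, here is the kind of statement this error condition guards against — a sketch only; the catalog, table, and constraint names are made up, and it assumes a DSV2 connector that accepts CHECK constraints:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// rand() is non-deterministic, so analysis is expected to reject this CHECK
// constraint with the error message above (SQLSTATE 42621).
spark.sql(
  """CREATE TABLE testcat.ns.events (
    |  id INT,
    |  CONSTRAINT c CHECK (rand() > 0.5)
    |)""".stripMargin)
```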
@@ -18,11 +18,12 @@
package org.apache.spark.sql.catalyst.analysis

import org.apache.spark.SparkThrowable
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.catalyst.expressions._
What is the agreement in the community on wildcard imports? Are they permitted after a given number of elements are imported directly?
As per https://github.com/databricks/scala-style-guide?tab=readme-ov-file#imports,
"Avoid using wildcard imports, unless you are importing more than 6 entities"
    Some(LocalRelation(attributeList))
  }

  private def analyzeConstraints(
Is there any other way to do this? Can we restructure the plan so that the analyzer naturally resolves these expressions? I like that we pivoted to DefaultValueExpression for default values, rather than using a custom analyzer.
I can't think of any other way.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala#L335
This is similar to what we did for column default values.
Let me think a bit.
Can we do something similar to what @cloud-fan did for OverwriteByExpression in SPARK-33412?
My worry is that we added DefaultValueExpression to eventually get rid of the custom analyzer and optimizer for default values. It would be great not to add more dependencies on it.
There are some differences here. In a CREATE TABLE statement:
- a column default value or option CANNOT reference columns
- a constraint CAN reference columns
Using a default analyzer with a dummy relation is simple, and it can include all other analysis batches apart from the main Resolution batch.
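A minimal sketch of that idea — the column names and types are hypothetical; only the trick of resolving against a dummy LocalRelation comes from the diff above:

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Build a dummy relation from the column definitions of the CREATE TABLE statement.
val dummyRelation = LocalRelation(
  AttributeReference("id", IntegerType)(),
  AttributeReference("name", StringType)())

// Parse the CHECK condition as an unresolved expression.
val condition = CatalystSqlParser.parseExpression("id > 0")

// Wrapping the condition in a Filter over the dummy relation lets a default
// analyzer resolve its column references, without a dedicated rule in the
// main Resolution batch.
val planToAnalyze = Filter(condition, dummyRelation)
```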
val validateStatus = if (isCreateTable) {
  Constraint.ValidationStatus.UNVALIDATED
} else {
  Constraint.ValidationStatus.VALID
Is the idea here that we always validate existing data in ALTER?
Yes, for CHECK constraints.
@@ -112,6 +117,27 @@ case class CheckConstraint(
    with TableConstraint {
  // scalastyle:on line.size.limit

  def toV2Constraint(isCreateTable: Boolean): Constraint = {
I wonder if the input param should be related to the validation status, rather than to whether it is create or alter. For instance, we can make validation optional in ALTER.
OK, how about we make all the validation statuses UNVALIDATED in this PR? Once we support enforcing CHECK constraints, we can have more discussion on this one.
Makes sense to me.
Updated. PTAL. I am keeping the parameter here since the enforcement will be added soon.
  case _ =>
    null
}
Seq(TableChange.addConstraint(constraint, validatedTableVersion))
CHECK constraints must optionally validate existing data in ALTER. Am I right that this PR doesn't cover this? What would be our plan?
"must optionally validate"
Makes sense. Do you mean CHECK ... NOT ENFORCED?
ENFORCED/NOT ENFORCED impacts subsequent writes. I was referring to ALTER TABLE ... ADD CONSTRAINT that must scan the existing data.
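For context, this is the kind of statement being discussed — a sketch with made-up catalog, table, and constraint names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Adding a CHECK constraint to a table that already contains rows: the engine
// has to scan the existing data to verify every row satisfies the condition
// before the constraint can be considered validated.
spark.sql("ALTER TABLE testcat.ns.events ADD CONSTRAINT id_positive CHECK (id > 0)")
```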
Created a follow-up: https://issues.apache.org/jira/browse/SPARK-51905
Just for my understanding: Anton's comment was about how to validate the existing data in ALTER TABLE ... ADD CONSTRAINT. Is that addressed in this PR, @gengliangwang ?
The above follow-up JIRA (SPARK-51905) is not about that, is it?
SPARK-51905 Disallow NOT ENFORCED CHECK constraint
Yeah, I think we need one more JIRA to add the scan capability to ALTER TABLE ... ADD CONSTRAINT.
Actually, @gengliangwang already created it: SPARK-51903.
Got it. Thank you, @aokolnychyi . Ya, SPARK-51903 is what I expected.
  // Convert to a data source v2 constraint
  def toV2Constraint(isCreateTable: Boolean): Constraint
For an API doc, could you clarify what isCreateTable is used for? Maybe add a @param for it. It seems it is not even used in this PR.
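Something along these lines is presumably what is being asked for — a sketch only; the wording, the surrounding trait name, and the import path are assumptions, with just the method signature taken from the diff:

```scala
import org.apache.spark.sql.connector.catalog.constraints.Constraint

trait TableConstraintLike {
  /**
   * Converts this constraint definition into a data source v2 [[Constraint]].
   *
   * @param isCreateTable whether the constraint is defined as part of CREATE/REPLACE TABLE
   *                      (as opposed to being added via ALTER TABLE); intended to drive
   *                      the initial validation status once enforcement is supported.
   */
  def toV2Constraint(isCreateTable: Boolean): Constraint
}
```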
Also the doc comment style is not consistent with other methods.
See comment in #50631 (comment)
To avoid confusion, I removed the parameter
  override def nullable: Boolean = true

  override def dataType: DataType =
    throw new SparkUnsupportedOperationException("CONSTRAINT_DOES_NOT_HAVE_DATA_TYPE")
Hmm, by default dataType is not supported, but nullable has a default? Does that make sense?
Can you explain what the issue is here?
I'm not sure. Is there a case where you would use nullable without dataType? Should nullable be unsupported by default too?
Not a big deal, though; just wondering about the reason.
Either way should be OK. I changed it to throw an exception.
      nameParts: Seq[String],
      allowTemp: Boolean,
      columns: Seq[ColumnDefinition]): ResolvedIdentifier = {
    val columnOutput = columns.map { col =>
@cloud-fan I made this new change to bypass 38c6ef4#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR286
Does this look good to you?
I think the new approach that doesn't involve invoking the custom analyzer is a much better option!
@aokolnychyi @cloud-fan @viirya @dongjoon-hyun Thanks for the review. Merging this one to master.
What changes were proposed in this pull request?
Support end-to-end table constraint management:
- Create a DSV2 table with constraints
- Replace a DSV2 table with constraints
- ALTER a DSV2 table to add a new constraint
- ALTER a DSV2 table to drop a constraint
Why are the changes needed?
Allow users to define and modify table constraints in connectors that support them.
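A rough illustration of the user-facing flow — a sketch only: the catalog testcat and the table/constraint names are made up, and it assumes a DSV2 connector that supports constraints:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Create a DSV2 table with a CHECK constraint.
spark.sql(
  """CREATE TABLE testcat.ns.t (
    |  id INT,
    |  name STRING,
    |  CONSTRAINT id_positive CHECK (id > 0)
    |)""".stripMargin)

// Add another constraint to the existing table.
spark.sql("ALTER TABLE testcat.ns.t ADD CONSTRAINT name_not_null CHECK (name IS NOT NULL)")

// Drop a constraint.
spark.sql("ALTER TABLE testcat.ns.t DROP CONSTRAINT id_positive")
```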
Does this PR introduce any user-facing change?
No, it is for the DSV2 framework.
How was this patch tested?
New UT
Was this patch authored or co-authored using generative AI tooling?
No