-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50405][SQL] Handle collation type coercion of complex data types properly #48936
[SPARK-50405][SQL] Handle collation type coercion of complex data types properly #48936
Conversation
@cloud-fan please take a look when you get the chance. You raised some valid concerns in the last PR about structs and maps so I think I was able to figure out a nice way to have it work for all data types :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevomitric @uros-db @ivanjevtic-db @vladanvasi-db Please, review this PR.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/collation/CollationTypePrecedenceSuite.scala
Outdated
Show resolved
Hide resolved
…onTypePrecedenceSuite.scala Co-authored-by: Uros Bojanic <uros.bojanic@databricks.com>
@MaxGekk can we go ahead and merge the change? |
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLRegexpSuite.scala
Outdated
Show resolved
Hide resolved
Some(leftType) | ||
|
||
case (ArrayType(leftElemType, nullable), ArrayType(rightElemType, _)) => | ||
mergeStructurally(leftElemType, rightElemType)( baseCase).map(ArrayType(_, nullable)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mergeStructurally(leftElemType, rightElemType)( baseCase).map(ArrayType(_, nullable)) | |
mergeStructurally(leftElemType, rightElemType)(baseCase).map(ArrayType(_, nullable)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify merging nullable
, so, if the left array type has nullable
is false
, but the right one has true
, you propose to set it to false
? I would expect something like leftNullable || rightNullable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say you have an expression e
with children c1, c2, c3
. You will find the least common type for them t = LCT(c1, c2, c3)
. Then you should apply that type to all children (bear in mind that some of these children can be complex while others can be primitive types).
When applying the new type t
we should not change the nullability of the type that the children had originally, and that's why I am doing it this way. Not sure if I could have named the parameters better, but the left
and right
type are not equal, we are trying to merge the collations of the right one into the left and keep everything else the same.
It wouldn't make sense for one of the children to suddenly become nullable after coercion or the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
Maybe name the parameters original
and update
: Implies right contains updates applied to left.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Show resolved
Hide resolved
…pSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
…ysis/CollationTypeCoercion.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
+1, LGTM. Merging to master. |
### What changes were proposed in this pull request? Add proper support and tests for finding the collation context of cast expressions, especially on complex types. ### Why are the changes needed? #48936 introduced a way for handling collation type coercion in complex types, however I didn't think about casting, which should be special cased when searching for the collation context of an expression. This means that currently collation context could return the type which is not the target type of the cast, and that is just incorrect. Also we should be able to use casts to change collation of complex types (arrays, maps, structs), and since we can't just use `collate` clause on complex types we should assign IMPLICIT strength to user defined casts on complex types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #49075 from stefankandic/fixComplexTypesCast. Authored-by: Stefan Kandic <stefan.kandic@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
This pull request generalizes collation type coercion to support not just casting all children to a single string type, but also handling complex data types such as structs, maps, and arrays (arrays partially worked already).
The core idea is to recursively analyze the entire data type of an expression, annotating each StringType within it with the highest-priority collation and its strength. This annotation propagates upward through the expression tree. Once the root of the expression is reached, the annotations are removed, and the expression is cast to the desired data type.
For the root expression
e
, the collation data type context is computed by first calculating the context for all its children and then merging those results into the data type ofe
.Why are the changes needed?
In #48663, a new approach to calculating collation precedence was introduced. This approach recursively examines the children of an expression and propagates the collation with the highest priority upward.
However, the current implementation of collation coercion is limited to determining the StringType that all children should be cast to. This approach falls short when dealing with complex types like structs, maps, and arrays, which can also contain collations. To address this limitation, we need a more general mechanism that allows coercion of any data type, not just simple strings.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
With new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.