Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50405][SQL] Handle collation type coercion of complex data types properly #48936

Closed

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Nov 22, 2024

What changes were proposed in this pull request?

This pull request generalizes collation type coercion to support not just casting all children to a single string type, but also handling complex data types such as structs, maps, and arrays (arrays partially worked already).

The core idea is to recursively analyze the entire data type of an expression, annotating each StringType within it with the highest-priority collation and its strength. This annotation propagates upward through the expression tree. Once the root of the expression is reached, the annotations are removed, and the expression is cast to the desired data type.

For the root expression e, the collation data type context is computed by first calculating the context for all its children and then merging those results into the data type of e.

Why are the changes needed?

In #48663, a new approach to calculating collation precedence was introduced. This approach recursively examines the children of an expression and propagates the collation with the highest priority upward.

However, the current implementation of collation coercion is limited to determining the StringType that all children should be cast to. This approach falls short when dealing with complex types like structs, maps, and arrays, which can also contain collations. To address this limitation, we need a more general mechanism that allows coercion of any data type, not just simple strings.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

With new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 22, 2024
@stefankandic stefankandic changed the title [DRAFT] Collation type coercion of complex types [SPARK-50405] Handle collation type coercion of complex data types properly Nov 25, 2024
@stefankandic stefankandic marked this pull request as ready for review November 25, 2024 01:35
@stefankandic
Copy link
Contributor Author

@cloud-fan please take a look when you get the chance.

You raised some valid concerns in the last PR about structs and maps so I think I was able to figure out a nice way to have it work for all data types :)

@HyukjinKwon HyukjinKwon changed the title [SPARK-50405] Handle collation type coercion of complex data types properly [SPARK-50405][SQL] Handle collation type coercion of complex data types properly Nov 25, 2024
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stefankandic and others added 2 commits December 2, 2024 10:52
…onTypePrecedenceSuite.scala

Co-authored-by: Uros Bojanic <uros.bojanic@databricks.com>
@stefankandic
Copy link
Contributor Author

@MaxGekk can we go ahead and merge the change?

Some(leftType)

case (ArrayType(leftElemType, nullable), ArrayType(rightElemType, _)) =>
mergeStructurally(leftElemType, rightElemType)( baseCase).map(ArrayType(_, nullable))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mergeStructurally(leftElemType, rightElemType)( baseCase).map(ArrayType(_, nullable))
mergeStructurally(leftElemType, rightElemType)(baseCase).map(ArrayType(_, nullable))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify merging nullable, so, if the left array type has nullable is false, but the right one has true, you propose to set it to false? I would expect something like leftNullable || rightNullable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say you have an expression e with children c1, c2, c3. You will find the least common type for them t = LCT(c1, c2, c3). Then you should apply that type to all children (bear in mind that some of these children can be complex while others can be primitive types).

When applying the new type t we should not change the nullability of the type that the children had originally, and that's why I am doing it this way. Not sure if I could have named the parameters better, but the left and right type are not equal, we are trying to merge the collations of the right one into the left and keep everything else the same.

It wouldn't make sense for one of the children to suddenly become nullable after coercion or the other way around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.
Maybe name the parameters original and update: Implies right contains updates applied to left.

stefankandic and others added 3 commits December 3, 2024 20:14
…pSuite.scala

Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
…ysis/CollationTypeCoercion.scala

Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
@MaxGekk
Copy link
Member

MaxGekk commented Dec 4, 2024

+1, LGTM. Merging to master.
Thank you, @stefankandic and @uros-db @stevomitric @dejankrak-db @vladanvasi-db for review.

MaxGekk pushed a commit that referenced this pull request Dec 9, 2024
### What changes were proposed in this pull request?
Add proper support and tests for finding the collation context of cast expressions, especially on complex types.

### Why are the changes needed?
#48936 introduced a way for handling collation type coercion in complex types, however I didn't think about casting, which should be special cased when searching for the collation context of an expression. This means that currently collation context could return the type which is not the target type of the cast, and that is just incorrect.

Also we should be able to use casts to change collation of complex types (arrays, maps, structs), and since we can't just use `collate` clause on complex types we should assign IMPLICIT strength to user defined casts on complex types.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49075 from stefankandic/fixComplexTypesCast.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants