forked from apache/spark
Commit
[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE

### What changes were proposed in this pull request?

Change the serialization and deserialization of collated strings so that the collation information is stored in the metadata of the enclosing struct field and read back from there during parsing.

The serialized format will look something like this:

```json
{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "colName": "UNICODE"
        }
      }
    }
  ]
}
```

For a map, the suffixes `.key` and `.value` are added to the entries in the metadata:

```json
{
  "type": "struct",
  "fields": [
    {
      "name": "mapField",
      "type": {
        "type": "map",
        "keyType": "string",
        "valueType": "string",
        "valueContainsNull": true
      },
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "mapField.key": "UNICODE",
          "mapField.value": "UNICODE"
        }
      }
    }
  ]
}
```

Arrays work similarly, using the `.element` suffix. Deeply nested data types can produce multiple suffixes (for example `Map[String, Array[Array[String]]]`; see the tests for this case).

### Why are the changes needed?

Putting collation info in field metadata is the only way to avoid breaking old clients that read new tables with collations. `CharVarcharUtils` does a similar thing, but this approach is much less hacky and friendlier to all third-party clients, which is especially important since Delta also uses Spark for schema ser/de.

It also removes the need for the additional logic introduced in apache#46083 to strip collations before writing to HMS, because tables written this way are fully HMS compatible.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

With unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46280 from stefankandic/newDeltaSchema.

Lead-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Co-authored-by: Stefan Kandic <154237371+stefankandic@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
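
The suffix scheme described in the commit message can be illustrated with a small, self-contained Python sketch. This is not Spark's implementation and does not use PySpark: the helpers `collated`, `strip_collations`, `restore_collations`, `field_to_json`, and `field_from_json` are hypothetical names, and the in-memory marker for a collated string (a dict with a `collation` key) is an assumption made only for this example. Only the `__COLLATIONS` key and the `.key`, `.value`, and `.element` suffixes come from the commit message. Nested structs are left out to keep the sketch short.

```python
import json

COLLATIONS_KEY = "__COLLATIONS"  # metadata key named in the commit message


def collated(name):
    # Hypothetical in-memory marker for a string with a non-default collation.
    return {"type": "string", "collation": name}


def strip_collations(dtype, path, found):
    """Return dtype as plain JSON, recording each collation in `found` by path."""
    if isinstance(dtype, str):  # primitive such as "string" or "integer"
        return dtype
    kind = dtype["type"]
    if kind == "string":
        found[path] = dtype["collation"]
        return "string"
    if kind == "array":
        return {"type": "array",
                "elementType": strip_collations(dtype["elementType"], path + ".element", found),
                "containsNull": dtype["containsNull"]}
    if kind == "map":
        return {"type": "map",
                "keyType": strip_collations(dtype["keyType"], path + ".key", found),
                "valueType": strip_collations(dtype["valueType"], path + ".value", found),
                "valueContainsNull": dtype["valueContainsNull"]}
    raise ValueError(f"unsupported type in this sketch: {kind}")


def restore_collations(plain, path, collations):
    """Inverse of strip_collations: re-attach collations found under `path`."""
    if isinstance(plain, str):
        if plain == "string" and path in collations:
            return collated(collations[path])
        return plain
    kind = plain["type"]
    if kind == "array":
        return {"type": "array",
                "elementType": restore_collations(plain["elementType"], path + ".element", collations),
                "containsNull": plain["containsNull"]}
    if kind == "map":
        return {"type": "map",
                "keyType": restore_collations(plain["keyType"], path + ".key", collations),
                "valueType": restore_collations(plain["valueType"], path + ".value", collations),
                "valueContainsNull": plain["valueContainsNull"]}
    raise ValueError(f"unsupported type in this sketch: {kind}")


def field_to_json(name, dtype, nullable=True):
    # The type is written without collations; they go into the field metadata.
    found = {}
    plain = strip_collations(dtype, name, found)
    metadata = {COLLATIONS_KEY: found} if found else {}
    return {"name": name, "type": plain, "nullable": nullable, "metadata": metadata}


def field_from_json(field):
    collations = field.get("metadata", {}).get(COLLATIONS_KEY, {})
    return restore_collations(field["type"], field["name"], collations)


# Map[String, Array[Array[String]]] with a collated innermost string.
nested = {"type": "map", "keyType": "string",
          "valueType": {"type": "array",
                        "elementType": {"type": "array",
                                        "elementType": collated("UNICODE"),
                                        "containsNull": True},
                        "containsNull": True},
          "valueContainsNull": True}

field = field_to_json("mapField", nested)
print(json.dumps(field, indent=2))
assert field_from_json(field) == nested  # round trip preserves the collation
```

Because the serialized `type` contains only plain `string` entries, a reader that ignores the `__COLLATIONS` metadata still sees a valid schema, which is the compatibility property the commit message is after.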
Showing 13 changed files with 1,004 additions and 59 deletions.