[SPARK-43333][SQL] Allow Avro to convert union type to SQL with field name stable with type #41263
Conversation
Thanks for working on this. I left a few comments, would appreciate it if you could take a look.
@@ -144,11 +147,28 @@ object SchemaConverters {
case _ =>
// Convert complex unions to struct types where field names are member0, member1, etc.
// This is consistent with the behavior when converting between Avro and Parquet.
val use_stable_id = SQLConf.get.getConf(SQLConf.AVRO_STABLE_ID_FOR_UNION_TYPE)
nit: Could you use useStableId?
// could be "a" and "A" and we need to distinguish them.
var temp_name = s"member_${s.getName.toLowerCase(Locale.ROOT)}"
while (fieldNameSet.contains(temp_name)) {
  temp_name = s"${temp_name}_$i"
I was wondering whether we could simply throw an error when this case happens; the reason is that the identifiers might not be stable anymore. We could explain that stable identifiers can only be used if the types are unique; if you have more than one type with the same name, please use sequential numeric ids instead.
Alternatively, we could just store them as is, without converting to upper or lower case - that could be an option.
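The first suggestion could be sketched roughly like this (a hypothetical illustration, not the PR's actual code; the function name and signature are assumed):

```scala
import java.util.Locale
import scala.collection.mutable

// Hypothetical sketch: derive stable field names from union member type
// names, failing fast when lowercasing causes a collision instead of
// appending numeric suffixes.
def stableFieldNames(memberTypeNames: Seq[String]): Seq[String] = {
  val seen = mutable.Set.empty[String]
  memberTypeNames.map { name =>
    val fieldName = s"member_${name.toLowerCase(Locale.ROOT)}"
    if (!seen.add(fieldName)) {
      throw new IllegalArgumentException(
        s"Cannot generate stable identifiers: duplicate field name $fieldName. " +
          "Use sequential numeric ids (member0, member1, ...) instead.")
    }
    fieldName
  }
}
```

Under this sketch, a union containing both `a` and `A` raises the exception rather than silently producing `member_a_1`.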
@@ -144,11 +147,28 @@ object SchemaConverters {
case _ =>
// Convert complex unions to struct types where field names are member0, member1, etc.
// This is consistent with the behavior when converting between Avro and Parquet.
val use_stable_id = SQLConf.get.getConf(SQLConf.AVRO_STABLE_ID_FOR_UNION_TYPE)

var fieldNameSet : Set[String] = Set()
This would copy the set every time you add an element. We can change it to a mutable set (https://www.scala-lang.org/api/2.13.6/scala/collection/mutable/Set$.html) or just java.util.HashSet.
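A small sketch of the reviewer's point (illustrative names, not from the PR): a `var` holding an immutable `Set` allocates a new set on every addition, while a mutable `Set` updates in place.

```scala
import scala.collection.mutable

// A `var` bound to an immutable Set: `+=` builds a fresh Set and rebinds the var.
var immutableNames: Set[String] = Set()
immutableNames += "member_int"

// A mutable Set: `+=` modifies the existing collection, no copying.
val mutableNames = mutable.Set.empty[String]
mutableNames += "member_int"
```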
@@ -144,11 +147,28 @@ object SchemaConverters {
case _ =>
// Convert complex unions to struct types where field names are member0, member1, etc.
// This is consistent with the behavior when converting between Avro and Parquet.
val use_stable_id = SQLConf.get.getConf(SQLConf.AVRO_STABLE_ID_FOR_UNION_TYPE)

var fieldNameSet : Set[String] = Set()
val fields = avroSchema.getTypes.asScala.zipWithIndex.map {
  case (s, i) =>
    val schemaType = toSqlTypeHelper(s, existingRecordNames)
    // All fields are nullable because only one of them is set at a time
Could you move this comment to L171? It was referring to the nullable flag.
@@ -98,6 +98,44 @@ abstract class AvroSuite
}, new GenericDatumReader[Any]()).getSchema.toString(false)
}

def checkUnionStableId(
types: List[Schema],
nit: 4 space indentation for method parameters. Could you also make it private?
"member_myrecord2: struct<field: float>>",
Seq())

// Two array or map is not allowed in union.
nit: Could we change the comment to this: Two array types or two map types are not allowed in union.
def checkUnionStableId(
    types: List[Schema],
    expectedSchema: String,
    fieldsAndRow : Seq[(Any, Row)]): Unit = {
nit: no space before `:`. Also, could you explain the type? I think you can just pass the expected DataFrame that would contain the schema and the data.
I copied the way the SQL schema is generated from the test "Complex Union Type". I feel it is easier to write unit tests this way, and if possible I will keep it. I will add comments to explain the parameters.
dataFileWriter.append(avroRec2)
dataFileWriter.flush()
dataFileWriter.close()
test("union stable id") {
Almost forgot to mention, could you change this to SPARK-43333: union type with stable ids? Thanks.
Thank you for pinging me, @sadikovi.
"spark.sql.avro.enableStableIdentifiersForUnionType")
.doc("When Avro is desrialized to SQL schema, the union type is converted to structure in a " +
"way that field names of the structure are stable with the type, in most cases.")
.version("3.4.0")
3.5.0?
val AVRO_STABLE_ID_FOR_UNION_TYPE = buildConf(
"spark.sql.avro.enableStableIdentifiersForUnionType")
.doc("When Avro is desrialized to SQL schema, the union type is converted to structure in a " +
"way that field names of the structure are stable with the type, in most cases.")
The description seems to need revisions to clarify the difference between true and false.
Especially, please mention the case-sensitivity cases.
val fieldName = if (useSchemaId) {
  // Avro's field name may be case sensitive, so field names for two named type
  // could be "a" and "A" and we need to distinguish them. In this case, we throw
  // an option.
nit: ... we throw an exception.
@@ -3413,6 +3413,17 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val AVRO_STABLE_ID_FOR_UNION_TYPE = buildConf(
"spark.sql.avro.enableStableIdentifiersForUnionType")
.doc("If it is set to true, then Avro is desrialized to SQL schema, the union type is " +
Let's rephrase the doc like this:
If it is set to true, Avro schema is deserialized into Spark SQL schema, and the Avro Union type is transformed into a structure where the field names remain consistent with their respective types. The resulting field names are converted to lowercase, e.g. member_int or member_string. If two user-defined type names are identical regardless of case, an exception will be raised. However, in other cases, the field names can be uniquely identified.
@@ -98,6 +98,52 @@ abstract class AvroSuite
}, new GenericDatumReader[Any]()).getSchema.toString(false)
}

/* Check whether an Avro schema of union type is converted to SQL in an expected way, when the
Is this supposed to be Javadoc? If so, then it should look like this:
/**
* <your text>
*/
If not, you can just use the inline comments.
Thanks for making the changes. LGTM.
@dongjoon-hyun Could you take another look when you have time? Thank you.
@@ -144,11 +148,31 @@ object SchemaConverters {
case _ =>
// Convert complex unions to struct types where field names are member0, member1, etc.
// This is consistent with the behavior when converting between Avro and Parquet.
val useSchemaId = SQLConf.get.getConf(SQLConf.AVRO_STABLE_ID_FOR_UNION_TYPE)
Normally these configs are provided as options for functions (e.g. for from_avro()). For the file source, it should be an option for the source. Let's not use a Spark conf.
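For illustration, the option-based usage the reviewer has in mind might look like this (a sketch only; `spark` is an existing SparkSession and the path is a placeholder, and the option name follows the PR):

```scala
// Illustrative only: passing the flag as a data source option rather than a
// Spark conf, so it is scoped to this read instead of the whole session.
val df = spark.read
  .format("avro")
  .option("enableStableIdentifiersForUnionType", "true")
  .load("/path/to/avro/files")
```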
dataFileWriter.append(avroRec2)
dataFileWriter.flush()
dataFileWriter.close()
test("SPARK-43333: union stable id") {
We can remove the SPARK Jira id here.
Can we also include a user-defined Avro struct in addition to primitive types? Say 'CustomerInfo'.
Tests need to include Spark Jira ids unless the test suite is new.
Does it mean we need to read the Spark Jira to understand the test? I would be surprised if there were such a policy. Do you have a link?
It is a test for a new feature. Ideally it should be understandable by itself, without having to go to the Jira ticket. I have added many new tests without adding a Jira id.
I am ok with including it here; I just don't see much use in doing so.
Yes, Jira number needs to be included, however, the test name should be descriptive enough to understand what the test does. Jira number is added for the reference, if the test breaks, it is much easier to track down the original change and understand the motivation behind it.
You can find a note on this in https://spark.apache.org/contributing.html (Pull request section).
Thanks for the link. Sure.
@@ -3413,6 +3413,18 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val AVRO_STABLE_ID_FOR_UNION_TYPE = buildConf(
Commented above. I think it should be an option for Avro functions and the Avro source, not a Spark conf.
dataFileWriter.append(avroRec2)
dataFileWriter.flush()
dataFileWriter.close()
test("SPARK-43333: union stable id") {
Could you update the test name, e.g. Stable field names when converting Union type or Union type: stable field ids/names? So other contributors could understand what is being tested here.
…roOptions, as SchemaConverters is public and AvroOptions is a private class.
@dongjoon-hyun I addressed the comments and the CI appears to pass now. Can you help take a look?
Overall LGTM. Made a few suggestions.
@@ -154,4 +157,5 @@ private[sql] object AvroOptions extends DataSourceOptions {
// datasource similarly to the SQL config `spark.sql.avro.datetimeRebaseModeInRead`,
// and can be set to the same values: `EXCEPTION`, `LEGACY` or `CORRECTED`.
val DATETIME_REBASE_MODE = newOption("datetimeRebaseMode")
val STABLE_ID_FOR_UNION_TYPE = newOption("enableStableIdentifiersForUnionType")
Can we add documentation for this? I think the Spark conf version had a long doc comment. We can reuse that here.
// types where field names are member0, member1, etc. This is consistent with the
// behavior when converting between Avro and Parquet.
What is the Parquet connection here? Should this say "consistent with the default behavior before adding support for stable names"?
This is the existing comment. It just ended up on different lines after I added "When avroOptions.useStableIdForUnionType is false" at the beginning. I don't know what the Parquet connection is, and I have no reason to believe the comment is wrong.
toSqlType(avroSchema, options).
dataType.
asInstanceOf[StructType]
Code style: the `.` should move to the start of the line, e.g.:
val sparkSchema = SchemaConverters
.toSqlType(avroSchema, options)
.dataType
.asInstanceOf[StructType]
Thanks for catching it. I don't know why it became like that.
}
}

test("SPARK-27858 Union type: More than one non-null type") {
Could you add a short description of the test in a comment at the top? This helps in understanding the test.
This is not a new test. It's an existing test. I just added the scenario of stable ID.
<tr>
<td><code>enableStableIdentifiersForUnionType</code></td>
<td>false</td>
<td>If it is set to true, Avro schema is deserialized into Spark SQL schema, and the Avro Union type is transformed into a structure where the field names remain consistent with their respective types. The resulting field names are converted to lowercase, e.g. member_int or member_string. If two user-defined type names or a user-defined type name and a built-in type name are identical regardless of case, an exception will be raised. However, in other cases, the field names can be uniquely identified.</td>
Please copy this description to AvroOptions.scala as well.
+CC @shardulm94
@dongjoon-hyun the tests all pass now. Can you help take a look?
@dongjoon-hyun do you plan to take a look?
LGTM, thanks for the work
Closes apache#41263 from siying/avro_stable_union. Authored-by: Siying Dong <siying.dong@databricks.com>. Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
Introduce the Avro option "enableStableIdentifiersForUnionType". If it is set to true (the default remains false), an Avro union is converted to a SQL schema by naming each field "member_" + type name. This keeps field names stable with respect to the type name.
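As an assumed illustration of the naming scheme (consistent with the PR description, but not code from the PR):

```scala
import java.util.Locale

// Field names for a union of [int, string, MyRecord] under each naming mode.
val memberTypes = Seq("int", "string", "MyRecord")

// Default positional naming: member0, member1, member2
val positional = memberTypes.indices.map(i => s"member$i")

// Stable naming with the option enabled: member_<lowercased type name>
val stable = memberTypes.map(t => s"member_${t.toLowerCase(Locale.ROOT)}")

println(positional.mkString(", ")) // member0, member1, member2
println(stable.mkString(", "))     // member_int, member_string, member_myrecord
```

With positional naming, inserting a new type into the middle of the union shifts every later field name; with stable naming, the other fields are unaffected.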
Why are the changes needed?
The purpose of this is twofold:
To allow adding or removing types to the union without affecting the record names of other member types. If the new or removed type is not ordered last, then existing queries referencing "member2" may need to be rewritten to reference "member1" or "member3".
Referencing the type name in the query is more readable than referencing "member0".
For example, our system produces an Avro schema from a Java type structure where subtyping maps to union types whose members are ordered lexicographically. Adding a subtype can therefore easily result in all references to "member2" needing to be updated to "member3".
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add a unit test that covers all types supported in union, as well as some potential name conflict cases.