Skip to content

Conversation

@BramBoog
Copy link

@BramBoog BramBoog commented May 2, 2023

What changes were proposed in this pull request?

When converting a StructType instance containing a nested StructType column which in turn contains a column for which nullable = false to a DDL string using .toDDL, the resulting DDL string does not include this non-nullability. For example:

val testSchema = StructType(List(
  StructField("key", IntegerType, false),
  StructField("value", StringType, true),
  StructField("nestedCols", StructType(List(
    StructField("nestedKey", IntegerType, false),
    StructField("nestedValue", StringType, true)
  )), false)
))

println(testSchema.toDDL)
println(StructType.fromDDL(testschema.toDDL))

gives

key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT, nestedValue: STRING> NOT NULL

StructType(
  StructField(key,IntegerType,false),
  StructField(value,StringType,true),
  StructField(nestedCols,StructType(
    StructField(nestedKey,IntegerType,true),
    StructField(nestedValue,StringType,true)
  ),false)
)

This is due to the fact that StructType.toDDL calls StructField.toDDL for its fields, which in turn calls .sql for its dataType. If dataType is a StructType, the call to .sql in turn calls .sql for all the nested fields, and this last method does not include the nullability of the field in its output. The proposed solution is therefore to have StructField.toDDL call dataType.toDDL for a StructType, since this will include information about nullability of nested columns.

To work around the different DDL formats of nested and non-nested structs (the former is wrapped in "STRUCT ...>" and uses colName: dataType for its fields instead of colName dataType), package-private nested-specific versions of .toDDL have been added for StructType and StructField.

Why are the changes needed?

Currently, converting a StructType schema to a DDL string does not pass information about nullability of nested columns. This leads to a loss of information, and means converting to DDL and then back could alter the StructType schema.

Does this PR introduce any user-facing change?

Yes, given the example above, the output will become:

key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT NOT NULL, nestedValue: STRING> NOT NULL

StructType(
  StructField(key,IntegerType,false),
  StructField(value,StringType,true),
  StructField(nestedCols,StructType(
    StructField(nestedKey,IntegerType,false),
    StructField(nestedValue,StringType,true)
  ),false)
)

How was this patch tested?

In StructTypeSuite, the nestedStruct testing value has been modified to include a non-nullable nested column. The relevant unit tests have been changed accordingly.

@github-actions github-actions bot added the SQL label May 2, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-43341-][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column May 8, 2023
@HyukjinKwon
Copy link
Member

cc @MaxGekk FYI

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BramBoog Could you rebase on the recent master, please.

@BramBoog BramBoog changed the base branch from branch-3.4 to master July 3, 2023 13:02
@BramBoog
Copy link
Author

BramBoog commented Jul 3, 2023

Hmm yeah that was strange, I should have noticed that immediately :/. Seems GitHub had added a whole list of commits which already were on master to the diff. Changing the base to a different branch and back to master has fixed it though.

@zaza
Copy link

zaza commented Sep 19, 2023

Hey guys, any chance to have the PR merged? Although not critical it would simplify our schema tests a lot.

@BramBoog
Copy link
Author

@zaza Yeah the way I see it the PR is ready, I've just been waiting for a review

@HyukjinKwon
Copy link
Member

@BramBoog it has a conflicts against the lastest master branch. You would need to resolve the conflicts by git fetch upstream & git rebase upstream/master

@BramBoog
Copy link
Author

BramBoog commented Nov 8, 2023

@HyukjinKwon Apologies, it took a while for me to find the time to get back to this. Resolved the conflicts, PR is up to date again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants