
Conversation

@singhpk234 (Contributor) commented Jul 2, 2022

About the changes

Addresses #5094 (comment)
Spark was writing a three-level list rather than the two-level list expected by the UT.

On debugging this further, I found that the schema was passed via spark.read().schema(sparkSchema).json, and as of Spark 3.3, Spark no longer respects the nullability of a schema passed this way by default (ref. this).

Now that the nullability is not respected by default (fields are treated as nullable), the Parquet writer writes a three-level list even though writeLegacyParquetFormat is true. CodePointer
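
For context, here is a minimal standalone sketch (not code from this repo) of the two Parquet list layouts in question, parsed with parquet-column's MessageTypeParser. The message and field names below are illustrative only; the exact group/field names Spark emits depend on its writer settings.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetListLayouts {
  public static void main(String[] args) {
    // Legacy two-level list: the element is written directly as a repeated
    // field, which only works when the element is non-nullable.
    MessageType twoLevel = MessageTypeParser.parseMessageType(
        "message example {"
            + "  optional group vals (LIST) {"
            + "    repeated int32 array;"
            + "  }"
            + "}");

    // Standard three-level list: a repeated wrapper group plus an optional
    // element field, which is what a nullable element forces the writer into.
    MessageType threeLevel = MessageTypeParser.parseMessageType(
        "message example {"
            + "  optional group vals (LIST) {"
            + "    repeated group list {"
            + "      optional int32 element;"
            + "    }"
            + "  }"
            + "}");

    System.out.println(twoLevel);
    System.out.println(threeLevel);
  }
}
```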

This PR adds the conf to respect the provided nullability, which preserves the existing behaviour.
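
For reference, a rough standalone sketch (not the actual test diff) of configuring a session so that the user-supplied nullability is kept and the legacy list layout is written. The schema and data below are made up; the conf names come from the Spark docs: spark.sql.parquet.writeLegacyFormat, plus spark.sql.legacy.respectNullabilityInTextDatasetConversion, which the Spark 3.3 migration guide lists as the switch for this behaviour change.

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class TwoLevelListRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("two-level-list-repro")
        // Write the legacy Parquet list format (two-level for non-nullable elements).
        .config("spark.sql.parquet.writeLegacyFormat", "true")
        // Restore the pre-3.3 behaviour: honour the nullability of a
        // user-supplied schema in schema(...).json(Dataset<String>) reads.
        .config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
        .getOrCreate();

    // Array column whose elements are declared non-nullable.
    StructType schema = new StructType()
        .add("vals", DataTypes.createArrayType(DataTypes.IntegerType, false), true);

    Dataset<String> json = spark.createDataset(
        Collections.singletonList("{\"vals\": [1, 2, 3]}"), Encoders.STRING());

    // Without the legacy conf, Spark 3.3 relaxes the element nullability here,
    // and the Parquet writer then falls back to the three-level list layout.
    Dataset<Row> df = spark.read().schema(schema).json(json);
    df.write().mode("overwrite").parquet("/tmp/two-level-list");

    spark.stop();
  }
}
```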

P.S.: A good long-term fix would be to stop specifying the schema this way in our tests / test utils; that can be picked up in a follow-up.


Testing Done

Re-enabled the UT, which was ignored during the version upgrade.

cc @rdblue

@github-actions bot added the spark label on Jul 2, 2022
@singhpk234 changed the title from "Spark 3.3: Re-Enable TwoLevel List in Parquet UT" to "Spark 3.3: Re-Enable TwoLevel Parquet List UT" on Jul 2, 2022
@singhpk234 force-pushed the fix/re-enable-two-level-list-ut branch from 27ed9cb to 8fdb98f on July 2, 2022 04:24
@rdblue (Contributor) commented Jul 3, 2022

Thanks, @singhpk234! Looks great.

@rdblue merged commit 36d0b91 into apache:master on Jul 3, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request on Jul 10, 2022
Co-authored-by: Prashant Singh <psinghvk@amazon.com>