[SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type #38628
@kazuyukitanimura Thanks for working on this! I took a look at how Iceberg handles FLBA. For iceberg type |
sunchao
left a comment
LGTM. Even though Spark itself won't write binary using FLBA, this is good for compatibility with Parquet files written by third-party tools. It is similar to other cases such as SPARK-34816.
Committed to master, thanks @kazuyukitanimura!
Thank you @huaxingao @sunchao @LuciferYang |
viirya
left a comment
Looks good to me.
Just found a previous PR, #35902. The change is the same, but it includes some Avro tests that we could consider adding as a follow-up.
Thanks @viirya. I also came across PR #35902, along with #20826 and #1737, after I submitted this PR. The Avro compatibility tests would be nice to have. I wonder whether the previous authors are still interested in working on them. @ghost @aws-awinstan @nicolaslrveiga
### What changes were proposed in this pull request?
Parquet supports FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, Spark Parquet reader currently cannot handle FLBA. This PR proposes to read FLBA column as BinaryType data in Spark.

### Why are the changes needed?
Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between Spark and Iceberg Parquet reader.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test added

Closes apache#38628 from kazuyukitanimura/SPARK-41096.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
What changes were proposed in this pull request?
Parquet supports the FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, the Spark Parquet reader currently cannot handle FLBA.
This PR proposes reading FLBA columns as BinaryType data in Spark.
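To illustrate why BinaryType is a natural target for FLBA, here is a minimal, standalone Python sketch of how the two Parquet physical types are laid out under PLAIN encoding. It is not Spark's reader code: the function names `decode_plain_flba` and `decode_plain_byte_array` are hypothetical, and the sketch ignores pages, definition levels, and compression. It only assumes the PLAIN layouts from the Parquet format: BYTE_ARRAY values carry a 4-byte little-endian length prefix, while FLBA values are stored back to back with the width taken from the schema.

```python
import struct
from typing import List


def decode_plain_flba(buf: bytes, type_length: int) -> List[bytes]:
    """Split a PLAIN-encoded FIXED_LEN_BYTE_ARRAY buffer into raw values.

    Hypothetical helper: type_length is the fixed width declared in the schema.
    """
    if type_length <= 0:
        raise ValueError("type_length must be positive")
    if len(buf) % type_length != 0:
        raise ValueError("buffer size is not a multiple of type_length")
    # FLBA values have no per-value length prefix; slice at fixed offsets.
    return [buf[i:i + type_length] for i in range(0, len(buf), type_length)]


def decode_plain_byte_array(buf: bytes) -> List[bytes]:
    """Split a PLAIN-encoded BYTE_ARRAY buffer into raw values.

    Each value is a 4-byte little-endian length followed by that many bytes.
    """
    values, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if pos + n > len(buf):
            raise ValueError("truncated BYTE_ARRAY value")
        values.append(buf[pos:pos + n])
        pos += n
    return values


# The same logical values in both layouts decode to identical byte strings.
flba_buf = b"abcd" + b"efgh"
ba_buf = struct.pack("<I", 4) + b"abcd" + struct.pack("<I", 4) + b"efgh"
same = decode_plain_flba(flba_buf, 4) == decode_plain_byte_array(ba_buf)
```

Both decoders yield plain byte strings, so from the reader's point of view an FLBA column can be surfaced the same way as a variable-length binary column.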
Why are the changes needed?
The Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between the Spark and Iceberg Parquet readers.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test added