Skip to content

Conversation

@kazuyukitanimura
Copy link
Contributor

What changes were proposed in this pull request?

Parquet supports FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, Spark Parquet reader currently cannot handle FLBA.
This PR proposes to read FLBA column as BinaryType data in Spark.

Why are the changes needed?

Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between Spark and Iceberg Parquet reader.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test added

@github-actions github-actions bot added the SQL label Nov 11, 2022
@kazuyukitanimura
Copy link
Contributor Author

cc @huaxingao @sunchao @viirya

@huaxingao
Copy link
Contributor

@kazuyukitanimura Thanks for working on this!

I took a look at how Iceberg handles FLBA. For iceberg type Types.FixedType, the underneath Parquet type is fixed_len_byte_array. In Scan.readSchema, Iceberg map Types.FixedType to Spark BinaryType. Since FLBA is a valid parquet data type, it makes sense to me for Spark support this type and map it to BinaryType.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Even though Spark itself won't write binary using FLBA, it's good for compatibility against Parquet files written by 3rd party tools. This is similar to the other cases like SPARK-34816.

@sunchao sunchao closed this in 928f518 Nov 14, 2022
@sunchao
Copy link
Member

sunchao commented Nov 14, 2022

Committed to master, thanks @kazuyukitanimura !

@kazuyukitanimura
Copy link
Contributor Author

Thank you @huaxingao @sunchao @LuciferYang

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@viirya
Copy link
Member

viirya commented Nov 14, 2022

Just found a previous PR #35902. The change is the same, but there are some avro test stuff that we can consider to add as a followup too.

@kazuyukitanimura
Copy link
Contributor Author

Thanks @viirya I also realized PR #35902 along with #20826 and #1737 after I submit this PR.
I believe all attempts were not merged due to missing Parquet test as pointed here #20826 (comment)
This PR has a dedicated Parquet test and merged.

The avro compatibility tests are nice to have. Wondering if the previous authors are still interested to work on. @ghost @aws-awinstan @nicolaslrveiga

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?
Parquet supports FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, Spark Parquet reader currently cannot handle FLBA.
This PR proposes to read FLBA column as BinaryType data in Spark.

### Why are the changes needed?
Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between Spark and Iceberg Parquet reader.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test added

Closes apache#38628 from kazuyukitanimura/SPARK-41096.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request?
Parquet supports FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, Spark Parquet reader currently cannot handle FLBA.
This PR proposes to read FLBA column as BinaryType data in Spark.

### Why are the changes needed?
Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between Spark and Iceberg Parquet reader.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test added

Closes apache#38628 from kazuyukitanimura/SPARK-41096.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants