[SPARK-2489][SQL] Support Parquet's optional fixed_len_byte_array #20826
Conversation
    * Default constructor. Note that this does not initialize fields
    * to their default values from the schema. If that is desired then
    * one should use <code>newBuilder()</code>.
@aws-awinstan Can we remove the unrelated changes? They make the diff hard to follow.
@HyukjinKwon Done! I debated whether to include these changes when submitting the PR, but decided to keep the newly generated files since they had some minor improvements (e.g. to the JavaDocs). Ideally these Avro files would be generated during the build process rather than checked in, but that's a separate issue.
…uired for the fixed_column to simplify the PR.
Just pinging on this for a review. Let me know if there are any questions or concerns.

ok to test

cc @liancheng and @marmbrus too.

Test build #88501 has finished for PR 20826 at commit

retest this please

Test build #88513 has finished for PR 20826 at commit

@HyukjinKwon @liancheng @marmbrus - Any comments on this PR? Can we get this merged?

Hope we can see this in the next release of Spark...

Pinging on this once again. Is there anything more you'd like to see before this is merged?

Pinging on this once again. It's been almost a year since this PR was opened.

Can this be merged in?

cc @rdblue for review since it's a Parquet one.

Pinging on this. Could you please have a look at this PR?

+1 This looks good to me.

I am not sure about the comment in #1737 (comment) mentioned before. It seems to raise a concern about using BinaryType for fixed_len_byte_array. What do you think about that argument?
    }

    logParquetSchema(path)
Could we revert this newline?
          column.putNull(rowId + i);
        }
      }
    } else if (column.dataType() == DataTypes.BinaryType) {
Looks like this change doesn't have an associated test. Should we add one?
    logParquetSchema(path)

    checkAnswer(spark.read.parquet(path), (0 until 10).map { i =>
I think this only tests with spark.sql.parquet.enableVectorizedReader set to true (the default); should we test the non-vectorized reader too?
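To cover both paths, a minimal sketch of what such a test could look like, assuming the suite's existing helpers (withSQLConf, checkAnswer, the path variable) and the generateFixedLengthByteArray helper from this diff; the single-column expected Row is a simplification:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.internal.SQLConf

    // Run the same read-side assertion with the vectorized reader on and off.
    Seq("true", "false").foreach { vectorized =>
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
        checkAnswer(
          spark.read.parquet(path),
          (0 until 10).map(i => Row(generateFixedLengthByteArray(i))))
      }
    }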
@rdblue @viirya Thanks for checking it out.
I bisected for cc @liancheng
I think it is fine to use BinaryType to read fixed_len_byte_array. That's the best mapping, even though the type is technically wider because it includes shorter and longer binary sequences. This only allows reading that data, not writing values with a fixed length. All writes would use BinaryType regardless.
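To illustrate the asymmetry described above (paths and the column name here are hypothetical): reading a file that contains a fixed_len_byte_array column yields BinaryType in the Spark schema, while writing that DataFrame back out produces an ordinary variable-length binary column rather than a fixed-width one.

    // Hypothetical paths and column name, for illustration only.
    val df = spark.read.parquet("/data/has_fixed_len_byte_array")
    println(df.schema("fixed_col").dataType)  // BinaryType

    // On write, the column comes out as plain binary, not fixed_len_byte_array(8).
    df.write.parquet("/tmp/roundtrip")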
Thanks for clarifying. Although that comment was concerned about a reading problem, it looks like

The idea is fine. Spark reads Hive varchar as string type, although it's not an accurate mapping.

ok to test
    private def generateFixedLengthByteArray(i : Int): Array[Byte] = {
      val fixedLengthByteArray = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0)
      val fixedLengthByteArrayComponent = s"val_$i".getBytes(StandardCharsets.UTF_8)
nit: the s seems not needed
Yea, let's have a Parquet-dedicated test (I guess), as already pointed out.
HyukjinKwon left a comment
I am okay with supporting this - I don't particularly mind. But I would like to note:
Lines 138 to 140 in 27ac6af
    case UINT_8 => typeNotSupported()
    case UINT_16 => typeNotSupported()
    case UINT_32 => typeNotSupported()
We don't support unsigned types, although other types like long could contain them (see #9646 - wow, it's 4 years ago). cc @liancheng
Test build #109071 has finished for PR 20826 at commit

ok to test

Test build #110691 has finished for PR 20826 at commit
          column.putNull(rowId + i);
        }
      }
    } else if (column.dataType() == DataTypes.BinaryType) {
We can merge it into the previous if:
    else if (DecimalType.isByteArrayDecimalType(column.dataType()) || column.dataType() == DataTypes.BinaryType)
The added tests only cover Parquet-Avro compatibility. We should have a test for plain Parquet as well.
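A hedged sketch of what a Parquet-only test could look like, writing a FIXED_LEN_BYTE_ARRAY column directly with parquet-mr's example API instead of going through parquet-avro. It assumes Spark's test helpers (withTempPath, checkAnswer) and parquet-mr's ExampleParquetWriter; it is not the test the reviewers asked for, just one possible shape of it.

    import java.nio.charset.StandardCharsets

    import org.apache.hadoop.fs.Path
    import org.apache.parquet.example.data.simple.SimpleGroup
    import org.apache.parquet.hadoop.example.ExampleParquetWriter
    import org.apache.parquet.io.api.Binary
    import org.apache.parquet.schema.MessageTypeParser
    import org.apache.spark.sql.Row

    // "val_$i" encoded as UTF-8 and zero-padded to 8 bytes.
    def fixedBytes(i: Int): Array[Byte] = {
      val out = new Array[Byte](8)
      val v = s"val_$i".getBytes(StandardCharsets.UTF_8)
      System.arraycopy(v, 0, out, 0, math.min(v.length, out.length))
      out
    }

    val schema = MessageTypeParser.parseMessageType(
      "message spark_schema { optional fixed_len_byte_array(8) fixed_col; }")

    withTempPath { dir =>
      // Write ten rows with parquet-mr directly, bypassing parquet-avro.
      val writer = ExampleParquetWriter
        .builder(new Path(dir.getCanonicalPath, "part-00000.parquet"))
        .withType(schema)
        .build()
      (0 until 10).foreach { i =>
        val group = new SimpleGroup(schema)
        group.add("fixed_col", Binary.fromConstantByteArray(fixedBytes(i)))
        writer.write(group)
      }
      writer.close()

      // Spark should read the fixed-width column back as BinaryType values.
      checkAnswer(
        spark.read.parquet(dir.getCanonicalPath),
        (0 until 10).map(i => Row(fixedBytes(i))))
    }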
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Is anyone interested in this PR and willing to take it over?
If it helps, I have a fairly complex Parquet file with a few nested fields stored as FIXED_LEN_BYTE_ARRAY, so this bug is a show stopper for Spark on this dataset. I tried to fix it by cloning the repo with this PR (https://github.com/aws-awinstan/spark.git) to my local machine and compiling. I did the same for the master Spark repo, which worked fine with a few of the columns (to test without parsing the FIXED_LEN_BYTE_ARRAY columns). However, the aws-awinstan repo fails on the same test columns with:

[Stage 0:> (0 + 1) / 1]20/01/17 12:37:13 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.42.107, executor 0): java.io.StreamCorruptedException: invalid stream header: 0000000F

I'm using the following in my client environment, with HDFS and Spark remote in a VM, standalone with a single worker.

$ pip freeze | grep spark

I'm surprised this bug was allowed to languish for as long as it has. It's not possible for us to change how the upstream data is serialized, so we either need this feature or have to move on...

Edit: further troubleshooting showed that it was the toPandas() call that is failing.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Can we reopen this issue or proceed with the work in #35902? |
What changes were proposed in this pull request?
This PR adds support for reading Parquet FIXED_LEN_BYTE_ARRAY columns as a Binary column when no OriginalType is specified. parquet-avro writes the Avro fixed type as a Parquet FIXED_LEN_BYTE_ARRAY type. Currently, when trying to load Parquet files with a column of this type, Spark SQL throws an exception similar to the following:
After this change Spark SQL is able to correctly load the Parquet files. There was a PR to fix this 3 years ago (#1737); however, it was ultimately rejected because the contributor went down the path of adding a new SQL type specifically for FIXED_LEN_BYTE_ARRAYs, and the maintainers believed this was too intrusive a change. This PR simply defaults to Binary if no OriginalType is specified. A few updates were required in the VectorizedColumnReader to support Binary FIXED_LEN_BYTE_ARRAYs.
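As a rough, self-contained sketch of the mapping described above (this mirrors the behaviour the PR describes, not Spark's actual ParquetSchemaConverter code): a FIXED_LEN_BYTE_ARRAY with a DECIMAL annotation keeps its decimal mapping, while one with no OriginalType falls back to BinaryType instead of throwing.

    import org.apache.parquet.schema.{OriginalType, PrimitiveType}
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY
    import org.apache.spark.sql.types._

    def convertFixedLenByteArray(field: PrimitiveType): DataType = {
      require(field.getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY)
      field.getOriginalType match {
        case OriginalType.DECIMAL =>
          // Annotated fixed-width decimals keep their existing mapping.
          val meta = field.getDecimalMetadata
          DecimalType(meta.getPrecision, meta.getScale)
        case null =>
          // No OriginalType: read the raw fixed-width bytes as a binary column.
          BinaryType
        case other =>
          throw new UnsupportedOperationException(s"Unsupported annotation: $other")
      }
    }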
Note: all the changes to the gen-java/* files were generated by avro-tools-1.8.1, and the (mostly documentation) updates appear to come from changes in the template avro-tools uses.
How was this patch tested?
I added a fixed attribute to the AvroPrimitives and AvroOptionalPrimitives record types, which are used by the ParquetAvroCompatibilitySuite. These values were populated by taking the same value as the other types ("val_$i"), padding it to 8 bytes (the chosen fixed length), and storing it as the fixed type. I verified that before my fix the "required primitives" and "optional primitives" tests failed with the same exception we're seeing in our clusters. After my change the tests succeed with the expected results.
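For reference, a small sketch (based on the description above, not copied from the PR) of how such an 8-byte value could be built: "val_$i" encoded as UTF-8 and copied into a zero-filled 8-byte array.

    import java.nio.charset.StandardCharsets

    def generateFixedLengthByteArray(i: Int): Array[Byte] = {
      // Zero-filled fixed-length buffer with the UTF-8 bytes of "val_$i" at the front.
      val fixedLengthByteArray = new Array[Byte](8)
      val component = s"val_$i".getBytes(StandardCharsets.UTF_8)
      System.arraycopy(component, 0, fixedLengthByteArray, 0, math.min(component.length, 8))
      fixedLengthByteArray
    }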