
Conversation

@aws-awinstan

What changes were proposed in this pull request?

This PR adds support for reading Parquet FIXED_LEN_BYTE_ARRAY columns as Binary when no OriginalType is specified. parquet-avro writes the Avro fixed type as a Parquet FIXED_LEN_BYTE_ARRAY. Currently, when trying to load Parquet files with a column of this type, Spark SQL throws an exception similar to the following:

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY;
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:108)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:177)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:72)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:66)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetToSparkSchemaConverter$$convert(ParquetSchemaConverter.scala:66)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readSchemaFromFooter$2.apply(ParquetFileFormat.scala:642)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readSchemaFromFooter$2.apply(ParquetFileFormat.scala:642)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:642)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:599)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:581)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

After this change, Spark SQL is able to load such Parquet files correctly. There was a PR to fix this 3 years ago (#1737); however, it was ultimately rejected because it went down the path of adding a new SQL type specifically for FIXED_LEN_BYTE_ARRAY, which the maintainers considered too intrusive a change. This PR simply defaults to Binary when no OriginalType is specified. A few updates to the VectorizedColumnReader were also required to support reading FIXED_LEN_BYTE_ARRAY values into a Binary column.
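
For illustration, the schema-converter fallback amounts to something like the sketch below. This is hand-written, not the exact diff; the helper name and the error branch are placeholders for the surrounding converter code.

import org.apache.parquet.schema.OriginalType
import org.apache.spark.sql.types.{BinaryType, DataType}

// Sketch only: a FIXED_LEN_BYTE_ARRAY column with no OriginalType annotation
// is mapped to Spark's BinaryType instead of raising "Illegal Parquet type".
def fixedLenByteArrayToCatalyst(originalType: OriginalType): DataType = originalType match {
  case null => BinaryType // no annotation: treat the fixed-width bytes as plain binary
  case other => throw new IllegalArgumentException(
    s"OriginalType $other (e.g. DECIMAL) is handled by the existing converter code and omitted from this sketch")
}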

Note: all the changes to the gen-java/* files were generated by avro-tools-1.8.1; the changes, which are mostly documentation updates, appear to come from changes in the template avro-tools uses.

How was this patch tested?

I added a fixed attribute to the AvroPrimitives and AvroOptionalPrimitives record types, which are used by the ParquetAvroCompatibilitySuite. These values were populated by taking the same value as the other types ("val_$i"), padding it to 8 bytes (the chosen fixed length), and storing it as the fixed type. I verified that before my fix the "required primitives" and "optional primitives" tests failed with the same exception we're seeing in our clusters. After my change, the tests succeed with the expected results.
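
Roughly, the padded value can be built as in the sketch below (illustrative only; the actual helper added to the suite may differ in detail):

import java.nio.charset.StandardCharsets

// Sketch: zero-pad the UTF-8 bytes of "val_$i" into an 8-byte buffer,
// matching the fixed(8) Avro field added to the test records.
def generateFixedLengthByteArray(i: Int): Array[Byte] = {
  val fixedLengthByteArray = Array.fill[Byte](8)(0)
  val component = s"val_$i".getBytes(StandardCharsets.UTF_8)
  component.copyToArray(fixedLengthByteArray) // copies at most 8 bytes, leaving the remainder zero
  fixedLengthByteArray
}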

* Default constructor. Note that this does not initialize fields
* to their default values from the schema. If that is desired then
* one should use <code>newBuilder()</code>.
@HyukjinKwon
Member

@aws-awinstan Can we remove the unrelated changes? They make the actual changes hard to follow.

@aws-awinstan
Author

@HyukjinKwon Done! I debated whether to include these changes when submitting the PR, but decided to keep the newly generated files since they had some minor improvements (e.g. to the JavaDocs). Ideally these Avro files would be generated during the build process rather than checked in, but that's a separate issue.

@aws-awinstan
Author

Just pinging on this for a review. Let me know if there are any questions or concerns.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

cc @liancheng and @marmbrus too.

@SparkQA

SparkQA commented Mar 22, 2018

Test build #88501 has finished for PR 20826 at commit 27ac6af.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 22, 2018

Test build #88513 has finished for PR 20826 at commit 27ac6af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aws-awinstan
Author

@HyukjinKwon @liancheng @marmbrus - Any comments on this PR? Can we get this merged?

@praetp

praetp commented Apr 24, 2018

Hope we can see this in the next release of Spark...

@aws-awinstan
Author

Pinging on this once again. Is there anything more you'd like to see before this is merged?

@aws-awinstan
Author

Pinging on this once again. It's been almost a year since this PR was opened.

@praetp

praetp commented Feb 21, 2019

Can this be merged in ?

@HyukjinKwon
Member

cc @rdblue for review since it's a Parquet one.

@ozars

ozars commented Aug 12, 2019

Pinging this. Could you please have a look at this PR?

@rdblue
Contributor

rdblue commented Aug 12, 2019

+1

This looks good to me.

@viirya
Member

viirya commented Aug 12, 2019

I am not sure about what the earlier comment in #1737 (comment) said. It seems to raise a concern about using BinaryType for fixed_len_byte_array. What do you think about that argument?

}

logParquetSchema(path)

Member

Could we revert this newline?

column.putNull(rowId + i);
}
}
} else if (column.dataType() == DataTypes.BinaryType) {
@viirya
Member

viirya commented Aug 12, 2019

Looks like this change doesn't have an associated test. Should we add one?


logParquetSchema(path)

checkAnswer(spark.read.parquet(path), (0 until 10).map { i =>
Member

I think this only tests with spark.sql.parquet.enableVectorizedReader set to true (the default); should we test the non-vectorized reader too?
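
A minimal sketch of covering both paths, assuming the suite mixes in Spark's SQLTestUtils (so withSQLConf and checkAnswer are in scope) and that path and the expected rows come from the surrounding test:

import org.apache.spark.sql.internal.SQLConf

// Run the same read-back check with the vectorized Parquet reader enabled and disabled.
Seq("true", "false").foreach { vectorized =>
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
    checkAnswer(spark.read.parquet(path), expectedRows) // expectedRows: placeholder for the expected Rows
  }
}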

@ozars

ozars commented Aug 12, 2019

@rdblue @viirya Thanks for checking it out.

I am not sure about what the earlier comment in #1737 (comment) said. It seems to raise a concern about using BinaryType for fixed_len_byte_array. What do you think about that argument?

I bisected for convertFromAttributes and found that the code referred to by that message (or at least some portion of it) was refactored in 02149ff in 2015. I'm not very familiar with the internal machinery of Spark. Could someone familiar with the codebase confirm whether this is still a concern?

cc @liancheng

@rdblue
Contributor

rdblue commented Aug 12, 2019

I think it is fine to use BinaryType to read fixed_len_byte_array. That's the best mapping, even though the type is technically wider because it includes shorter and longer binary sequences. This only allows reading that data, not writing values with a fixed length. All writes would use BinaryType regardless.

@viirya
Member

viirya commented Aug 13, 2019

Thanks for clarifying. Although that comment raised a concern about a reading problem, it looks like convertFromAttributes was for writing. This PR only reads fixed_len_byte_array as Spark's BinaryType. I think it should be fine. cc @cloud-fan too

@cloud-fan
Contributor

The idea is fine. Spark reads Hive varchar as string type, although it's not an accurate mapping.

@HyukjinKwon
Member

ok to test

@HyukjinKwon HyukjinKwon changed the title [Spark-2489][SQL] Unsupported parquet datatype optional fixed_len_byte_array [SPARK-2489][SQL] Support Parquet's optional fixed_len_byte_array Aug 14, 2019

private def generateFixedLengthByteArray(i : Int): Array[Byte] = {
val fixedLengthByteArray = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0)
val fixedLengthByteArrayComponent = s"val_$i".getBytes(StandardCharsets.UTF_8)
Member

nit: s seems not needed

@HyukjinKwon
Member

Yeah, let's have a Parquet-dedicated test (I guess), as already pointed out.

@HyukjinKwon
Member

HyukjinKwon left a comment

I am okay with supporting this; I don't particularly mind.

But I would like to note:

case UINT_8 => typeNotSupported()
case UINT_16 => typeNotSupported()
case UINT_32 => typeNotSupported()

We don't support unsigned types, although wider types like long could contain them (see #9646 - wow, it's 4 years ago). cc @liancheng

@SparkQA

SparkQA commented Aug 14, 2019

Test build #109071 has finished for PR 20826 at commit 27ac6af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110691 has finished for PR 20826 at commit 27ac6af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

column.putNull(rowId + i);
}
}
} else if (column.dataType() == DataTypes.BinaryType) {
Contributor

We can merge it into the previous if:

else if (DecimalType.isByteArrayDecimalType(column.dataType()) || column.dataType() == DataTypes.BinaryType)

@cloud-fan
Contributor

The added tests only test parquet-avro compatibility. We should have a test for parquet.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 13, 2020
@cloud-fan cloud-fan removed the Stale label Jan 13, 2020
@cloud-fan
Contributor

Is anyone interested in this PR and willing to take it over?

@jonbelanger-ns

jonbelanger-ns commented Jan 17, 2020

If it helps, I have a fairly complex Parquet file with a few nested fields as FIXED_LEN_BYTE_ARRAY, so this bug is a show-stopper for Spark on this dataset.

I tried to fix this by cloning the repo with the PR (https://github.com/aws-awinstan/spark.git) to my local machine and compiling it.

I did the same for the master Spark repo, which worked fine with a few of the columns (to test without parsing the FIXED_LEN_BYTE_ARRAY columns).

However, the aws-awinstan repo fails on the same test columns:

[Stage 0:> (0 + 1) / 1]20/01/17 12:37:13 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.42.107, executor 0): java.io.StreamCorruptedException: invalid stream header: 0000000F
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:866)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:126)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:113)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

I'm using the following in my client environment, with HDFS and Spark remote in a VM, running standalone with a single worker.

$ pip freeze | grep spark
pyspark==2.4.4
spark==0.2.1

I'm surprised this bug has been allowed to languish for as long as it has. It's not possible for us to change how the upstream data is serialized, so we either need this feature or have to move on...

Edit: further troubleshooting showed that it was the toPandas() call that was failing.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@nicolaslrveiga

Can we reopen this issue or proceed with the work in #35902?

