[SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet #31921
Conversation
cc @HyukjinKwon @cloud-fan @dongjoon-hyun thanks for reviewing
```diff
- case UINT_8 => typeNotSupported()
- case UINT_16 => typeNotSupported()
- case UINT_32 => typeNotSupported()
+ case UINT_32 => LongType
```
These were explicitly made unsupported in #9646, per @liancheng's advice (he's also a Parquet committer). So I'm less sure whether this is something we should support.
But it's very old. Almost 6 years ago lol. @liancheng do you have a different thought now?
Thanks, @HyukjinKwon.
Yea, I have checked that PR too. There is also a suggestion there that we support them.
Recently, Wenchen created https://issues.apache.org/jira/browse/SPARK-34786 for reading uint64. Since the other unsigned types are not supported either, and they are a bit more clear-cut than uint64 (which needs a decimal), I raised this PR to collect more opinions.
IMO it is worthwhile for Spark to support more storage-layer features without breaking our own rules.
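For context on why uint64 is the harder case mentioned above: its full range does not fit in any of Spark's signed integral types, so it needs a decimal. A quick arithmetic check in Scala:

```scala
// The max uint64 value needs 20 decimal digits, one more than Long.MaxValue,
// which is why SPARK-34786 proposes reading UINT_64 as Decimal(20, 0).
println(BigInt(2).pow(64) - 1) // 18446744073709551615 (20 digits)
println(Long.MaxValue)         // 9223372036854775807  (2^63 - 1, 19 digits)
```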
My hunch is that Spark SQL didn't support unsigned integral types at all back then. As long as we support that now, it's OK to have.
It's mostly about compatibility. Spark won't have unsigned types, but Spark should be able to read existing Parquet files written by other systems that support unsigned types.
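To make the intended behavior concrete, here is a minimal sketch of the widening this PR implements; the helper name is illustrative, not Spark's actual converter API:

```scala
import org.apache.spark.sql.types._

// Each unsigned Parquet type maps to the smallest signed Spark type that
// covers its full range (helper name is hypothetical, for illustration only).
def widenUnsigned(bitWidth: Int): DataType = bitWidth match {
  case 8  => ShortType   // UINT_8:  0 to 255
  case 16 => IntegerType // UINT_16: 0 to 65,535
  case 32 => LongType    // UINT_32: 0 to 4,294,967,295
  case w  => throw new IllegalArgumentException(s"unsupported unsigned width: $w")
}
```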
```java
    num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
} else if (column.dataType() == DataTypes.LongType) {
  // We use LongType to handle UINT32
  defColumn.readIntegersAsUnsigned(
```
nit: readUnsignedIntegers
can we follow 38fbe56 and check whether dictionary encoding also needs an update?
OK, checking~
Looks irrelevant to me
I have added the dictionary decoding code path and changed the Parquet data generator a bit to produce correctly encoded/plain data.
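As a rough illustration of what the dictionary path adds (a sketch, not Spark's actual vectorized reader internals): the per-row dictionary IDs are looked up first, and the unsigned zero-extension happens after the lookup, exactly as in the plain-encoded path.

```scala
// Sketch: decode dictionary-encoded UINT_32 values into longs.
// `ids` are the per-row dictionary ids; `dictionary` holds the raw 32-bit values.
def decodeUnsignedInts(ids: Array[Int], dictionary: Array[Int]): Array[Long] =
  ids.map(id => Integer.toUnsignedLong(dictionary(id)))
```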
```scala
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Write support class for nested groups: ParquetWriter initializes GroupWriteSupport
```
We don't need this anymore; ExampleParquetWriter meets our needs.
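For reference, a hedged sketch of generating unsigned test data with ExampleParquetWriter (the file path is hypothetical, and the builder options used in the PR's actual test may differ):

```scala
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.simple.SimpleGroupFactory
import org.apache.parquet.hadoop.example.ExampleParquetWriter
import org.apache.parquet.schema.MessageTypeParser

// Write a single UINT_32 column; 0xFFFFFFFE stores the unsigned value 4294967294.
val schema = MessageTypeParser.parseMessageType(
  "message test { required int32 a (UINT_32); }")
val writer = ExampleParquetWriter.builder(new Path("/tmp/uint32.parquet")) // hypothetical path
  .withType(schema)
  .build()
writer.write(new SimpleGroupFactory(schema).newGroup().append("a", 0xFFFFFFFE))
writer.close()
```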
@cloud-fan @liancheng @HyukjinKwon @maropu please take another look
```java
      column.putInt(i, dictionary.decodeToInt(dictionaryIds.getDictId(i)));
    }
  }
} else if (column.dataType() == DataTypes.LongType) {
```
When will we hit this branch? It's `case INT32`, not unsigned.
On the Parquet side, signed and unsigned int types (<= 32 bits) share the same primitive type, INT32; the unsigned ones are just logical type annotations.
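A small demonstration of that point, using parquet-mr's schema API: all three unsigned annotations sit on the same INT32 physical type.

```scala
import org.apache.parquet.schema.MessageTypeParser

val schema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  required int32 a (UINT_8);
    |  required int32 b (UINT_16);
    |  required int32 c (UINT_32);
    |}""".stripMargin)
// Prints physical=INT32 for all three fields, with different logical annotations.
schema.getFields.forEach { f =>
  val p = f.asPrimitiveType()
  println(s"${f.getName}: physical=${p.getPrimitiveTypeName} logical=${p.getOriginalType}")
}
```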
```java
    canReadAsIntDecimal(column.dataType())) {
  defColumn.readIntegers(
    num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
} else if (column.dataType() == DataTypes.LongType) {
```
shall we add an extra check to make sure we are reading unsigned values?
This is deterministic and controlled by our own code, so an extra check seems unnecessary. See https://github.com/apache/spark/pull/31921/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R137
```java
int requiredBytes = total * 4;
ByteBuffer buffer = getBuffer(requiredBytes);
for (int i = 0; i < total; i += 1) {
  c.putLong(rowId + i, Integer.toUnsignedLong(buffer.getInt()));
```
Maybe we can improve this by converting buffer.array() to unsigned values in bulk, but I am not sure that would be faster, or how to do it, right now.
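For clarity on what the conversion above does: Integer.toUnsignedLong zero-extends the 32 raw bits instead of sign-extending them, which a plain widening cast would do.

```scala
val raw = 0xFFFFFFFE                 // stored bits of the unsigned value 4294967294
println(raw.toLong)                  // -2: sign extension, wrong for uint32
println(Integer.toUnsignedLong(raw)) // 4294967294: zero extension, correct
assert(Integer.toUnsignedLong(raw) == (raw & 0xFFFFFFFFL)) // equivalent bit mask
```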
thanks, merging to master!
A later commit that referenced this pull request:

…ger types

### What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-43427

Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, `from_protobuf` currently transforms them to the Spark types of:

```
uint32 => IntegerType
uint64 => LongType
```

IntegerType and LongType are [signed](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value is above 2^63, their representation in binary will contain a 1 in the highest bit, which when interpreted as a signed integer will be negative (i.e. overflow). No information is lost, as `IntegerType` and `LongType` contain 32 and 64 bits respectively; however, their representation can be confusing.

In this PR, we add an option (`upcast.unsigned.ints`) to allow upcasting unsigned integer types into a larger integer type that can represent them natively, i.e.

```
uint32 => LongType
uint64 => Decimal(20, 0)
```

I added an option so that it doesn't break any existing clients.

**Example of current behavior**

Consider a protobuf message like:

```
syntax = "proto3";

message Test {
  uint64 val = 1;
}
```

If we compile the above and then generate a message with a value for `val` above 2^63:

```
import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809  # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)
```

This generates the binary representation:
b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserializing this using `from_protobuf`, we can see that it is represented as a negative number. I did this in a notebook so it's easier to see, but could reproduce it in a Scala test as well:

[Screenshot: notebook output showing the uint64 value deserialized as a negative number]

**Precedent**

I believe that unsigned integer types in Parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation natively fits. See https://issues.apache.org/jira/browse/SPARK-34817 and #31921. So an option to get similar behavior would be useful.

### Why are the changes needed?

Improve unsigned integer deserialization behavior.

### Does this PR introduce any user-facing change?

Yes, adds a new option.

### How was this patch tested?

Unit testing.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43773 from justaparth/parth/43427-add-option-to-expand-unsigned-integers.

Authored-by: Parth Upadhyay <parth.upadhyay@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Unsigned types may be used to produce smaller in-memory representations of the data. These types are used by frameworks (e.g. Hive, Pig) that build on Parquet, and Parquet maps them to its base types.

See more:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

In this PR, we add support for reading UINT_8 as ShortType, UINT_16 as IntegerType, and UINT_32 as LongType, so that each type's full range fits. Support for UINT_64 will come in another PR.

### Why are the changes needed?

Better Parquet support.

### Does this PR introduce any user-facing change?

Yes, we can read uint[8/16/32] from Parquet files.

### How was this patch tested?

New tests.
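A hedged end-to-end sketch of the user-visible effect (the file path and column names below are hypothetical): a Parquet file written elsewhere with unsigned columns now reads back with the widened Spark types instead of failing.

```scala
val df = spark.read.parquet("/tmp/unsigned.parquet") // hypothetical file
df.printSchema()
// Illustrative output for columns written as UINT_8 / UINT_16 / UINT_32:
// root
//  |-- u8: short (nullable = true)
//  |-- u16: integer (nullable = true)
//  |-- u32: long (nullable = true)
```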