[SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet #31921
Conversation
cc @HyukjinKwon @cloud-fan @dongjoon-hyun thanks for reviewing
```diff
- case UINT_8 => typeNotSupported()
- case UINT_16 => typeNotSupported()
- case UINT_32 => typeNotSupported()
+ case UINT_32 => LongType
```
These were explicitly made unsupported in #9646, per @liancheng's advice (he's also a Parquet committer). So I'm less sure whether this is something we should support.
But it's very old. Almost 6 years ago lol. @liancheng do you have a different thought now?
Thanks, @HyukjinKwon.
Yea, I have checked that PR too. There is also a suggestion there that we support them.
Recently, Wenchen created https://issues.apache.org/jira/browse/SPARK-34786 for reading uint64. Since the other unsigned types are not supported either, and they are a bit more clear-cut than uint64 (which needs a decimal), I raised this PR to collect more opinions.
IMO it is worthwhile for Spark to support more storage-layer features without breaking our own rules.
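For context on why uint64 is the harder case mentioned above: its full range does not fit in any of Spark's signed integral types, so it needs a decimal. A quick arithmetic check in Scala:

```scala
// The max uint64 value needs 20 decimal digits, one more than Long.MaxValue,
// which is why SPARK-34786 proposes reading UINT_64 as Decimal(20, 0).
println(BigInt(2).pow(64) - 1) // 18446744073709551615 (20 digits)
println(Long.MaxValue)         // 9223372036854775807  (2^63 - 1, 19 digits)
```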
My hunch is that Spark SQL didn't support unsigned integral types at all back then. As long as we support that now, it's OK to have.
It's mostly about compatibility. Spark won't have unsigned types, but Spark should be able to read existing Parquet files written by other systems that support unsigned types.
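To make the intended behavior concrete, here is a minimal sketch of the widening this PR implements; the helper name is illustrative, not Spark's actual converter API:

```scala
import org.apache.spark.sql.types._

// Each unsigned Parquet type maps to the smallest signed Spark type that
// covers its full range (helper name is hypothetical, for illustration only).
def widenUnsigned(bitWidth: Int): DataType = bitWidth match {
  case 8  => ShortType   // UINT_8:  0 to 255
  case 16 => IntegerType // UINT_16: 0 to 65,535
  case 32 => LongType    // UINT_32: 0 to 4,294,967,295
  case w  => throw new IllegalArgumentException(s"unsupported unsigned width: $w")
}
```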
```java
    num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
} else if (column.dataType() == DataTypes.LongType) {
  // We use LongType to handle UINT32
  defColumn.readIntegersAsUnsigned(
```
nit: readUnsignedIntegers
can we follow 38fbe56 and check whether dictionary encoding also needs an update?
OK, checking~
Looks irrelevant to me
I have added the dictionary decoding code path and changed the Parquet data generator a bit to produce correctly encoded/plain data.
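As a rough illustration of what the dictionary path adds (a sketch, not Spark's actual vectorized reader internals): the per-row dictionary IDs are looked up first, and the unsigned zero-extension happens after the lookup, exactly as in the plain-encoded path.

```scala
// Sketch: decode dictionary-encoded UINT_32 values into longs.
// `ids` are the per-row dictionary ids; `dictionary` holds the raw 32-bit values.
def decodeUnsignedInts(ids: Array[Int], dictionary: Array[Int]): Array[Long] =
  ids.map(id => Integer.toUnsignedLong(dictionary(id)))
```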
```scala
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Write support class for nested groups: ParquetWriter initializes GroupWriteSupport
```
We don't need this anymore; ExampleParquetWriter meets our needs.
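For reference, a hedged sketch of generating unsigned test data with ExampleParquetWriter (the file path is hypothetical, and the builder options used in the PR's actual test may differ):

```scala
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.simple.SimpleGroupFactory
import org.apache.parquet.hadoop.example.ExampleParquetWriter
import org.apache.parquet.schema.MessageTypeParser

// Write a single UINT_32 column; 0xFFFFFFFE stores the unsigned value 4294967294.
val schema = MessageTypeParser.parseMessageType(
  "message test { required int32 a (UINT_32); }")
val writer = ExampleParquetWriter.builder(new Path("/tmp/uint32.parquet")) // hypothetical path
  .withType(schema)
  .build()
writer.write(new SimpleGroupFactory(schema).newGroup().append("a", 0xFFFFFFFE))
writer.close()
```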
@cloud-fan @liancheng @HyukjinKwon @maropu please take another look
```java
      column.putInt(i, dictionary.decodeToInt(dictionaryIds.getDictId(i)));
    }
  }
} else if (column.dataType() == DataTypes.LongType) {
```
When will we hit this branch? It's `case INT32`, not unsigned.
On the Parquet side, signed and unsigned int types (<= 32 bits) share the same primitive type, INT32; the unsigned ones are just logical type annotations.
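A small demonstration of that point, using parquet-mr's schema API: all three unsigned annotations sit on the same INT32 physical type.

```scala
import org.apache.parquet.schema.MessageTypeParser

val schema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  required int32 a (UINT_8);
    |  required int32 b (UINT_16);
    |  required int32 c (UINT_32);
    |}""".stripMargin)
// Prints physical=INT32 for all three fields, with different logical annotations.
schema.getFields.forEach { f =>
  val p = f.asPrimitiveType()
  println(s"${f.getName}: physical=${p.getPrimitiveTypeName} logical=${p.getOriginalType}")
}
```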
```java
    canReadAsIntDecimal(column.dataType())) {
  defColumn.readIntegers(
    num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
} else if (column.dataType() == DataTypes.LongType) {
```
shall we add an extra check to make sure we are reading unsigned values?
This is deterministic and controlled by our own code, so an extra check seems unnecessary. See https://github.com/apache/spark/pull/31921/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R137
```java
int requiredBytes = total * 4;
ByteBuffer buffer = getBuffer(requiredBytes);
for (int i = 0; i < total; i += 1) {
  c.putLong(rowId + i, Integer.toUnsignedLong(buffer.getInt()));
```
Maybe we can improve this by converting buffer.array() to unsigned values in bulk, but I am not sure that would be faster, or how to do it, right now.
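For clarity on what the conversion above does: Integer.toUnsignedLong zero-extends the 32 raw bits instead of sign-extending them, which a plain widening cast would do.

```scala
val raw = 0xFFFFFFFE                 // stored bits of the unsigned value 4294967294
println(raw.toLong)                  // -2: sign extension, wrong for uint32
println(Integer.toUnsignedLong(raw)) // 4294967294: zero extension, correct
assert(Integer.toUnsignedLong(raw) == (raw & 0xFFFFFFFFL)) // equivalent bit mask
```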
thanks, merging to master!
A later commit that referenced this pull request:

…ger types

### What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-43427

Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, `from_protobuf` currently transforms them to the Spark types of:

```
uint32 => IntegerType
uint64 => LongType
```

IntegerType and LongType are [signed](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value is above 2^63, their representation in binary will contain a 1 in the highest bit, which when interpreted as a signed integer will be negative (i.e. overflow). No information is lost, as `IntegerType` and `LongType` contain 32 and 64 bits respectively; however, their representation can be confusing.

In this PR, we add an option (`upcast.unsigned.ints`) to allow upcasting unsigned integer types into a larger integer type that can represent them natively, i.e.

```
uint32 => LongType
uint64 => Decimal(20, 0)
```

I added an option so that it doesn't break any existing clients.

**Example of current behavior**

Consider a protobuf message like:

```
syntax = "proto3";

message Test {
  uint64 val = 1;
}
```

If we compile the above and then generate a message with a value for `val` above 2^63:

```
import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809  # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)
```

This generates the binary representation:
b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserializing this using `from_protobuf`, we can see that it is represented as a negative number. I did this in a notebook so it's easier to see, but could reproduce it in a Scala test as well:

[Screenshot: notebook output showing the uint64 value deserialized as a negative number]

**Precedent**

I believe that unsigned integer types in Parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation natively fits. See https://issues.apache.org/jira/browse/SPARK-34817 and #31921. So an option to get similar behavior would be useful.

### Why are the changes needed?

Improve unsigned integer deserialization behavior.

### Does this PR introduce any user-facing change?

Yes, adds a new option.

### How was this patch tested?

Unit testing.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43773 from justaparth/parth/43427-add-option-to-expand-unsigned-integers.

Authored-by: Parth Upadhyay <parth.upadhyay@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Unsigned types may be used to produce smaller in-memory representations of the data. These types are used by frameworks (e.g. Hive, Pig) that build on Parquet, and Parquet maps them to its base types.

See more:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

In this PR, we add support for reading UINT_8 as ShortType, UINT_16 as IntegerType, and UINT_32 as LongType, so that each type's full range fits. Support for UINT_64 will come in another PR.

### Why are the changes needed?

Better Parquet support.

### Does this PR introduce any user-facing change?

Yes, we can read uint[8/16/32] from Parquet files.

### How was this patch tested?

New tests.
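A hedged end-to-end sketch of the user-visible effect (the file path and column names below are hypothetical): a Parquet file written elsewhere with unsigned columns now reads back with the widened Spark types instead of failing.

```scala
val df = spark.read.parquet("/tmp/unsigned.parquet") // hypothetical file
df.printSchema()
// Illustrative output for columns written as UINT_8 / UINT_16 / UINT_32:
// root
//  |-- u8: short (nullable = true)
//  |-- u16: integer (nullable = true)
//  |-- u32: long (nullable = true)
```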