[SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types #41108
Conversation
cc @rangadi I've made a draft implementation here but just wanted to get your thoughts quickly on:
thanks 🙏
using MIN_VALUE as it's all 1s in binary and is the largest possible number if it's interpreted as "unsigned"
Also cc @HyukjinKwon as you reviewed #31921 and have reviewed many proto changes too 😅
Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an `int`.
- `uint32` => `Long` instead of `Integer`
- `uint64` => `Decimal(20, 0)` instead of `Long`
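To make the overflow concrete, here is a minimal sketch (not from the PR; the value is illustrative) of how a uint32 above 2^31 looks when the generated Java `int` is read as a signed Integer versus widened to a Long:

```scala
object UnsignedIntDemo extends App {
  // uint32 max (4294967295) does not fit in a signed 32-bit int; its bit pattern is all 1s.
  val raw: Int = java.lang.Integer.parseUnsignedInt("4294967295")
  println(raw)                                    // -1 under the current IntegerType mapping
  println(java.lang.Integer.toUnsignedLong(raw))  // 4294967295 under the proposed LongType mapping
}
```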
What if you have a UDF that converts this to BigDecimal? Will you get the value back?
Yeah, there is no information loss, so you can get the right value back the way I did in this PR (`Integer.toUnsignedLong`, `Long.toUnsignedString`). I think, though, it's useful if the

However, one additional piece of information is that for unsigned types in Parquet, the default behavior is to represent them in larger types. I put this in the PR description, but see this ticket https://issues.apache.org/jira/browse/SPARK-34817, implemented in this PR: #31921. Or the existing code today, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L243-L247, which shows that by default Parquet unsigned values are expanded to larger types in Spark.

So, since this same problem/solution exists in another storage format, I think it's useful to implement this behavior here as well. I also think it makes sense to do it by default, as Parquet already does this. However, I'm also open to doing this transformation behind an option so that no existing usages are broken. Mainly, I want to make sure we do the most correct and broadly consistent thing (and I'm not really sure exactly what that is, and would love some other inputs).

cc @HyukjinKwon as well here since you reviewed the original PR doing this for Parquet!
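To illustrate the question above, a minimal sketch (plain JDK calls, not the PR's code path) of recovering the unsigned value from the signed Long that the current mapping would hand to a UDF:

```scala
object UnsignedLongRecovery extends App {
  // uint64 max, which a signed Long can only hold as -1.
  val signed: Long = java.lang.Long.parseUnsignedLong("18446744073709551615")
  println(signed)                                     // -1 through the current LongType mapping
  val recovered = new java.math.BigDecimal(java.lang.Long.toUnsignedString(signed))
  println(recovered)                                  // 18446744073709551615, which fits Decimal(20, 0)
}
```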
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

https://issues.apache.org/jira/browse/SPARK-43427
What changes were proposed in this pull request?
Explanation
Protobuf supports unsigned integer types, including `uint32` and `uint64`. When deserializing protobuf values with fields of these types, the `from_protobuf` library currently transforms them to the Spark types:
- `uint32` => `IntegerType`
- `uint64` => `LongType`

`IntegerType` and `LongType` are signed integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31, or a uint64 value is above 2^63, its binary representation will contain a 1 in the highest bit, which, when interpreted as a signed integer, comes out as negative (i.e. overflow).

I propose that we deserialize unsigned integer types into a type that can contain them correctly, e.g.:
- `uint32` => `LongType`
- `uint64` => `Decimal(20, 0)`
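As a rough illustration of the proposed change (the field names and schemas below are mine, not code from the PR), the Catalyst schema produced by `from_protobuf` for a message with one `uint32` and one `uint64` field would shift from the current mapping to the widened one:

```scala
import org.apache.spark.sql.types._

// Hypothetical message for illustration: uint32 u32 = 1; uint64 u64 = 2;
val currentSchema = StructType(Seq(
  StructField("u32", IntegerType),        // wraps to negative for values >= 2^31
  StructField("u64", LongType)            // wraps to negative for values >= 2^63
))

val proposedSchema = StructType(Seq(
  StructField("u32", LongType),           // full uint32 range fits
  StructField("u64", DecimalType(20, 0))  // full uint64 range fits
))
```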
Precedent
I believe that unsigned integer types in Parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation natively fits. See https://issues.apache.org/jira/browse/SPARK-34817 and #31921.
**Example to reproduce**
Consider a protobuf message like:
Generate a protobuf with a value above 2^63. I did this in python with a small script like:
This generates the binary representation:
Then, deserialize this using `from_protobuf`. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:
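As a hedged, self-contained Scala sketch of this reproduction (the message, descriptor path, and payload are assumptions rather than the original snippets: a message `SimpleMessage` with a single `uint64 value = 1` field, compiled into a descriptor file, serialized with the value 2^63):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

object UnsignedReproSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("uint64-repro").getOrCreate()
  import spark.implicits._

  // Hand-encoded payload for the assumed message `SimpleMessage { uint64 value = 1; }`
  // with value = 2^63: tag byte 0x08 (field 1, varint) followed by the varint for 2^63.
  val payload: Array[Byte] =
    Array(0x08, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x01).map(_.toByte)

  val df = Seq(payload).toDF("raw")
  df.select(from_protobuf(col("raw"), "SimpleMessage", "/path/to/descriptor.desc").as("msg"))
    .select("msg.value")
    .show(truncate = false)  // with the current LongType mapping this prints -9223372036854775808

  spark.stop()
}
```

Under the proposed mapping, the same query would instead return 9223372036854775808 as a `Decimal(20, 0)`.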
Backwards Compatibility / Default Behavior
Should we maintain backwards compatibility and add an option that allows deserializing these types differently, or should we change the default behavior (with an option to go back to the old way)? Would love some thoughts here!
I think by default it makes more sense to deserialize them as the larger types so that it's semantically more correct. However, there may be existing users of this library that would be affected by this behavior change. Though, maybe we can justify the change since the function is tagged as `Experimental` (and Spark 3.4.0 was only released very recently).
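If the option-based route is preferred, one possible shape (sketch only; the option key below is hypothetical and does not exist today) would be to piggyback on the options map that `from_protobuf` already accepts:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Hypothetical option key, shown only to illustrate what opting in/out could look like.
val options = new java.util.HashMap[String, String]()
options.put("upcast.unsigned.integers", "true")

// "SimpleMessage" and the descriptor path are placeholders, as in the sketch above.
val parsed = from_protobuf(col("raw"), "SimpleMessage", "/path/to/descriptor.desc", options)
```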
Why are the changes needed?
Improve unsigned integer deserialization behavior.
Does this PR introduce any user-facing change?
Yes, as written it would change the deserialization behavior of unsigned integer field types. However, please see the above section about whether we should or should not maintain backwards compatibility.
How was this patch tested?
Unit Testing