@justaparth (Contributor) commented May 9, 2023

https://issues.apache.org/jira/browse/SPARK-43427

What changes were proposed in this pull request?

Explanation
Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, the from_protobuf library currently transforms them to the spark type of:

uint32 => IntegerType
uint64 => LongType

IntegerType and LongType are signed integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is at or above 2^31, or a uint64 value is at or above 2^63, its binary representation has a 1 in the highest bit, so it comes out as a negative number when interpreted as a signed integer (i.e. it overflows).

I propose that we deserialize unsigned integer types into a type that can contain them correctly, e.g.
uint32 => LongType
uint64 => Decimal(20, 0)
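
To make the proposal concrete, here is a minimal sketch of the widening this implies, assuming the raw values arrive as the signed int/long that protobuf-java hands back (the helper names are just for illustration, not the actual code in this PR):

// uint32: widen the signed Int so that values at or above 2^31 stay positive
def uint32ToCatalyst(v: Int): Long = Integer.toUnsignedLong(v)

// uint64: reinterpret the 64-bit pattern as unsigned; Decimal(20, 0) is wide
// enough to hold the full uint64 range (up to 2^64 - 1)
def uint64ToCatalyst(v: Long): java.math.BigDecimal =
  new java.math.BigDecimal(java.lang.Long.toUnsignedString(v))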

Precedent
I believe that unsigned integer types in Parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation fits natively. See https://issues.apache.org/jira/browse/SPARK-34817 and #31921.

Example to reproduce

Consider a protobuf message like:

syntax = "proto3";

message Test {
  uint64 val = 1;
}

Generate a protobuf with a value above 2^63. I did this in Python with a small script like:

import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)

This generates the binary representation:

b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserialize this using from_protobuf. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:

[image: notebook output showing from_protobuf returning a negative value for the uint64 field]
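
Roughly what that notebook cell does, sketched as Scala (the descriptor path is hypothetical; test.desc would be generated with protoc --descriptor_set_out):

import org.apache.spark.sql.protobuf.functions.from_protobuf
import spark.implicits._

// the 11 bytes produced by the Python script above
val bytes = Array(0x08, 0x81, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x01).map(_.toByte)

val df = Seq(bytes).toDF("raw")
df.select(from_protobuf($"raw", "Test", "/tmp/test.desc").as("msg")).show(false)
// under the current uint64 => LongType mapping, val comes back as -9223372036854775807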

Backwards Compatibility / Default Behavior
Should we maintain backwards compatibility and add an option that allows deserializing these types differently? Or should we change the default behavior (with an option to go back to the old way)? Would love some thoughts here!

I think it makes more sense to deserialize them as the larger types by default, since that's semantically more correct. However, there may be existing users of this library who would be affected by this behavior change. Then again, maybe we can justify the change since the function is tagged as Experimental (and Spark 3.4.0 was only released very recently).

Why are the changes needed?

Improve unsigned integer deserialization behavior.

Does this PR introduce any user-facing change?

Yes, as written it would change the deserialization behavior of unsigned integer field types. However, please see the section above about whether we should or should not maintain backwards compatibility.

How was this patch tested?

Unit Testing

@justaparth (Contributor Author)

cc @rangadi, I've made a draft implementation here but just wanted to get your thoughts quickly on:

  1. does the problem / solution make sense?
  2. should we make this behavior the default or should we add an option to turn it on?

thanks 🙏

@justaparth force-pushed the parth/uint64-as-bigdecimal branch from 11e449f to 05b2d74 on May 10, 2023
@justaparth (Contributor Author)


Using MIN_VALUE, as it's all 1s in binary and is the largest possible number when interpreted as "unsigned".

@justaparth (Contributor Author)

Also cc @HyukjinKwon, as you reviewed #31921 and have reviewed many proto changes too 😅

@justaparth justaparth changed the title [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types May 11, 2023
@rangadi commented May 15, 2023

Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an int, not long.

@justaparth (Contributor Author) commented May 21, 2023

Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an int, not long.

Sorry, I didn't get a chance to reply to this until now. There is no information loss, technically, since uint32 is 4 bytes and uint64 is 8 bytes, the same as int and long respectively. However, there is overflow in the representation.

Here's an example:

Consider a protobuf message like:

syntax = "proto3";

message Test {
  uint64 val = 1;
}

Generate a protobuf with a value above 2^63. I did this in Python with a small script like:

import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)

This generates the binary representation:

b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserialize this using from_protobuf. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:

[image: notebook output showing from_protobuf returning a negative value for the uint64 field]

This is exactly what we'd expect: when you take a 64-bit number whose highest bit is 1 and interpret it as a signed number (long), it comes out negative.
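
You can see the same reinterpretation without Spark at all; for example, in Scala:

val unsigned = BigInt("9223372036854775809") // 2^63 + 1
val asSignedLong = unsigned.toLong           // keeps only the low 64 bits
println(asSignedLong)                        // prints -9223372036854775807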

So this PR proposes some changes to the deserialization behavior. However, I don't know if it's right to change the default or to have an option to allow unpacking as a larger type.

uint32 => Long instead of Integer
uint64 => Decimal(20, 0) instead of Long
@justaparth force-pushed the parth/uint64-as-bigdecimal branch from 05b2d74 to 8f542a1 on May 22, 2023
@rangadi commented May 22, 2023

So this PR proposes some changes to the deserialization behavior. However, I don't know if it's right to change the default or to have an option to allow unpacking as a larger type.

What if you have a UDF that converts this to BigDecimal? Will you get the value back?
I guess that is the intention behind why protobuf-java casts unsigned to signed in its Java methods.
I think it is simpler to go this way. Given these kinds of issues, I guess it is not a good practice to use unsigned in protobuf. It can be interpreted correctly at the application level when they are in fact used this way.

@justaparth (Contributor Author)

What if you have a UDF that converts this to BigDecimal? Will you get the value back? I guess that is the intention behind why protobuf-java casts unsigned to signed in its Java methods. I think it is simpler to go this way.

Yeah, there is no information loss, so you can get the right value back the way I did in this PR (Integer.toUnsignedLong, Long.toUnsignedString). I think, though, it's useful if the spark-protobuf library can do this; asking users to take the resulting struct and apply this transformation themselves is cumbersome, in my opinion.
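
For illustration, this is roughly the kind of UDF someone has to write today to recover the unsigned values (the column name in the comment is hypothetical, following the example above):

import org.apache.spark.sql.functions.{col, udf}

// widen the signed Int that from_protobuf currently produces for a uint32 field
val uint32AsUnsigned = udf((v: Int) => Integer.toUnsignedLong(v))

// reinterpret the signed Long produced for a uint64 field as its unsigned value
val uint64AsUnsigned = udf((v: Long) => BigDecimal(java.lang.Long.toUnsignedString(v)))

// e.g. parsedDf.select(uint64AsUnsigned(col("msg.val")))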

However, one additional piece of information is that for unsigned types in Parquet, the default behavior is to represent them as larger types. I put this in the PR description, but see the ticket https://issues.apache.org/jira/browse/SPARK-34817, implemented in #31921, or the existing code today at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L243-L247, which shows that Parquet unsigned values are expanded to larger types in Spark by default.

So, since this same problem/solution exists in another storage format, I think it's useful to implement this behavior here as well. I also think it makes sense to do it by default, as Parquet already does. However, I'm also open to putting this transformation behind an option so that no existing usages are broken. Mainly, I want to make sure we do whatever is the most correct and broadly consistent thing (and I'm not really sure exactly what that is, so I'd love some other inputs). cc @HyukjinKwon as well here, since you reviewed the original PR doing this for Parquet!

@justaparth justaparth changed the title [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types [SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types Jun 8, 2023
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 17, 2023
@github-actions github-actions bot closed this Sep 18, 2023