@justaparth (Contributor) commented May 9, 2023

https://issues.apache.org/jira/browse/SPARK-43427

What changes were proposed in this pull request?

Explanation
Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, the from_protobuf library currently transforms them to the spark type of:

uint32 => IntegerType
uint64 => LongType

IntegerType and LongType are signed integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is at or above 2^31, or a uint64 value is at or above 2^63, its binary representation has a 1 in the highest bit, so it comes out as a negative number when interpreted as a signed integer (i.e. it overflows).

I propose that we deserialize unsigned integer types into a type that can contain them correctly, e.g.
uint32 => LongType
uint64 => Decimal(20, 0)
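
To make the proposal concrete, here is a minimal sketch of the widening this implies, assuming the raw values arrive as the signed int/long that protobuf-java hands back (the helper names are just for illustration, not the actual code in this PR):

// uint32: widen the signed Int so that values at or above 2^31 stay positive
def uint32ToCatalyst(v: Int): Long = Integer.toUnsignedLong(v)

// uint64: reinterpret the 64-bit pattern as unsigned; Decimal(20, 0) is wide
// enough to hold the full uint64 range (up to 2^64 - 1)
def uint64ToCatalyst(v: Long): java.math.BigDecimal =
  new java.math.BigDecimal(java.lang.Long.toUnsignedString(v))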

Precedent
I believe that unsigned integer types in Parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation fits natively. See https://issues.apache.org/jira/browse/SPARK-34817 and #31921.

Example to reproduce

Consider a protobuf message like:

syntax = "proto3";

message Test {
  uint64 val = 1;
}

Generate a protobuf with a value above 2^63. I did this in Python with a small script like:

import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)

This generates the binary representation:

b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserialize this using from_protobuf. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:

[image: notebook output showing from_protobuf returning a negative value for the uint64 field]
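
Roughly what that notebook cell does, sketched as Scala (the descriptor path is hypothetical; test.desc would be generated with protoc --descriptor_set_out):

import org.apache.spark.sql.protobuf.functions.from_protobuf
import spark.implicits._

// the 11 bytes produced by the Python script above
val bytes = Array(0x08, 0x81, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x01).map(_.toByte)

val df = Seq(bytes).toDF("raw")
df.select(from_protobuf($"raw", "Test", "/tmp/test.desc").as("msg")).show(false)
// under the current uint64 => LongType mapping, val comes back as -9223372036854775807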

Backwards Compatibility / Default Behavior
Should we maintain backwards compatibility and add an option that allows deserializing these types differently? Or should we change the default behavior (with an option to go back to the old way)? Would love some thoughts here!

I think it makes more sense to deserialize them as the larger types by default, since that's semantically more correct. However, there may be existing users of this library who would be affected by this behavior change. Then again, maybe we can justify the change since the function is tagged as Experimental (and Spark 3.4.0 was only released very recently).

Why are the changes needed?

Improve unsigned integer deserialization behavior.

Does this PR introduce any user-facing change?

Yes, as written it would change the deserialization behavior of unsigned integer field types. However, please see the section above about whether we should or should not maintain backwards compatibility.

How was this patch tested?

Unit Testing

@justaparth (Contributor Author)

cc @rangadi, I've made a draft implementation here but just wanted to get your thoughts quickly on:

  1. does the problem / solution make sense?
  2. should we make this behavior the default or should we add an option to turn it on?

thanks 🙏

@justaparth force-pushed the parth/uint64-as-bigdecimal branch from 11e449f to 05b2d74 on May 10, 2023
@justaparth (Contributor Author)


Using MIN_VALUE, as it's all 1s in binary and is the largest possible number when interpreted as "unsigned".

@justaparth (Contributor Author)

Also cc @HyukjinKwon, as you reviewed #31921 and have reviewed many proto changes too 😅

@justaparth justaparth changed the title [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types May 11, 2023
@rangadi commented May 15, 2023

Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an int, not long.

@justaparth (Contributor Author) commented May 21, 2023

Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an int, not long.

Sorry, I didn't get a chance to reply to this until now. There is no information loss, technically, since uint32 is 4 bytes and uint64 is 8 bytes, the same as int and long respectively. However, there is overflow in the representation.

Here's an example:

Consider a protobuf message like:

syntax = "proto3";

message Test {
  uint64 val = 1;
}

Generate a protobuf with a value above 2^63. I did this in Python with a small script like:

import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)

This generates the binary representation:

b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserialize this using from_protobuf. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:

[image: notebook output showing from_protobuf returning a negative value for the uint64 field]

This is exactly what we'd expect: when you take a 64-bit number whose highest bit is 1 and interpret it as a signed number (long), it comes out negative.
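
You can see the same reinterpretation without Spark at all; for example, in Scala:

val unsigned = BigInt("9223372036854775809") // 2^63 + 1
val asSignedLong = unsigned.toLong           // keeps only the low 64 bits
println(asSignedLong)                        // prints -9223372036854775807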

So this PR proposes some changes to the deserialization behavior. However, I don't know if it's right to change the default or to have an option to allow unpacking as a larger type.

uint32 => Long instead of Integer
uint64 => Decimal(20, 0) instead of Long
@justaparth force-pushed the parth/uint64-as-bigdecimal branch from 05b2d74 to 8f542a1 on May 22, 2023
@rangadi commented May 22, 2023

So this PR proposes some changes to the deserialization behavior. However, I don't know if it's right to change the default or to have an option to allow unpacking as a larger type.

What if you have a UDF that converts this to BigDecimal? Will you get the value back?
I guess that is the intention behind why protobuf-java casts unsigned to signed in its Java methods.
I think it is simpler to go this way. Given these kinds of issues, I guess it is not a good practice to use unsigned in protobuf. It can be interpreted correctly at the application level when they are in fact used this way.

@justaparth (Contributor Author)

What if you have a UDF that converts this to BigDecimal? Will you get the value back? I guess that is the intention behind why protobuf-java casts unsigned to signed in its Java methods. I think it is simpler to go this way.

Yeah, there is no information loss, so you can get the right value back the way I did in this PR (Integer.toUnsignedLong, Long.toUnsignedString). I think, though, it's useful if the spark-protobuf library can do this; asking users to take the resulting struct and apply this transformation themselves is cumbersome, in my opinion.
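
For illustration, this is roughly the kind of UDF someone has to write today to recover the unsigned values (the column name in the comment is hypothetical, following the example above):

import org.apache.spark.sql.functions.{col, udf}

// widen the signed Int that from_protobuf currently produces for a uint32 field
val uint32AsUnsigned = udf((v: Int) => Integer.toUnsignedLong(v))

// reinterpret the signed Long produced for a uint64 field as its unsigned value
val uint64AsUnsigned = udf((v: Long) => BigDecimal(java.lang.Long.toUnsignedString(v)))

// e.g. parsedDf.select(uint64AsUnsigned(col("msg.val")))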

However, one additional piece of information is that for unsigned types in Parquet, the default behavior is to represent them as larger types. I put this in the PR description, but see the ticket https://issues.apache.org/jira/browse/SPARK-34817, implemented in #31921, or the existing code today at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L243-L247, which shows that Parquet unsigned values are expanded to larger types in Spark by default.

So, since this same problem/solution exists in another storage format, I think it's useful to implement this behavior here as well. I also think it makes sense to do it by default, as Parquet already does. However, I'm also open to putting this transformation behind an option so that no existing usages are broken. Mainly, I want to make sure we do whatever is the most correct and broadly consistent thing (and I'm not really sure exactly what that is, so I'd love some other inputs). cc @HyukjinKwon as well here, since you reviewed the original PR doing this for Parquet!

@justaparth justaparth changed the title [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types [SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types Jun 8, 2023
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 17, 2023
@github-actions github-actions bot closed this Sep 18, 2023