-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-44001][PROTOBUF] Add option to allow unwrapping protobuf well known wrapper types #43767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44001][PROTOBUF] Add option to allow unwrapping protobuf well known wrapper types #43767
Conversation
8b7f72f to
512d6ac
Compare
|
@rangadi do you mind taking a look? i've modified this to make it an option so that the default behavior doesn't change |
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add test where int32_val is not set. What should the Spark struct contain:
- When
emit.default.valuesis false (default) - When
emit.default.valuesis true
Please comment on the expected behavior.
Another similar test with int32_val.value set to 0 (default value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, great suggestion. I added that test!
basically the internal value field should operate the same as any primitive, and when unwrapping it should unwrap that value correctly. I added a test that confirms every configuration of emit.default.values and unwrap.primitive.wrapper.types
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala
Outdated
Show resolved
Hide resolved
512d6ac to
38d6df3
Compare
38d6df3 to
4db3454
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @justaparth, I had these comments written quite some time back, but forgot to 'Submit' now.
PTAL. lets close this PR soon. It is pretty close to be ready. Please ping me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename this to int_val ('val' is confusing for Scala code).
Also use a value of 5 or 10 in the example below (just to make it different from field value in message definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a short description of what the following is doing? It will save time for future readers. The above above gives the goal for the test. The description here would help with how it tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine. Just to make sure, this is different from primitive int, right?
In the case of primitive int, value would be 0, not null when emit.default.values is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! exactly
f468688 to
bf11b16
Compare
|
thanks for the review @rangadi , and sorry for the dealy in getting back to this. i've just updated and i think i've responded to all comments, do you mind taking another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update this to clarify that emit.default.values does not apply here? I.e. an Int32Value value field would be null if unset, even if emit.default.values is set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! just added a comment with an example
bf11b16 to
e623746
Compare
|
hey @rangadi just wanted to bump this! i think i've responded to everything |
Signed-off-by: Parth Upadhyay <parth.upadhyay@gmail.com>
e623746 to
0483642
Compare
|
hey @rangadi just wanted to bump this again in case you get a chance 🙏 |
rangadi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @justaparth. Sorry about the delay.
cc: @HyukjinKwon please merge when you can.
|
Merged to master. |
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-44001
Under com.google.protobuf, there are some well known primitive wrapper types, useful for distinguishing between absence of primitive fields and their default values, as well as for use within
google.protobuf.Anytypes. These types are:Currently, when we deserialize these from a serialized protobuf into a spark struct, we expand them as if they were normal messages. Concretely, if we have
And a message like
Then the behavior today is to deserialize this as.
This is quite difficult to work with and not in the spirit of the wrapper type, so it would be nice to have an option to unwrap them when deserializing, i.e.:
This is also the default behavior by other popular deserialization libraries, including java protobuf util Jsonformat and golangs jsonpb.
So for consistency with other libraries and improved usability, I add a deserialization option,
unwrap.protobuf.wktto enable this behavior. I add an option to avoid breaking any existing clients.Why are the changes needed?
Improved usability and consistency with other deserialization libraries.
Does this PR introduce any user-facing change?
Yes, deserialization of well known types will change from the struct format to the inline format.
How was this patch tested?
Added a unit test testing every well known type deserialization explicitly, as well as testing roundtrip.
Was this patch authored or co-authored using generative AI tooling?
No