Skip to content

Conversation

@justaparth
Copy link
Contributor

@justaparth justaparth commented Nov 11, 2023

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-44001

Under com.google.protobuf, there are some well known primitive wrapper types, useful for distinguishing between absence of primitive fields and their default values, as well as for use within google.protobuf.Any types. These types are:

DoubleValue
FloatValue
Int64Value
Uint64Value
Int32Value
Uint32Value
BoolValue
StringValue
BytesValue

Currently, when we deserialize these from a serialized protobuf into a spark struct, we expand them as if they were normal messages. Concretely, if we have

syntax = "proto3";

import "google/protobuf/wrappers.proto"

message WktExample {
  google.protobuf.BoolValue bool_val = 1;
  google.protobuf.Int32Value int32_val = 2;
}

And a message like

WktExample(true, 100)

Then the behavior today is to deserialize this as.

{"bool_val": {"value": true}, "int32_val": {"value": 100}}

This is quite difficult to work with and not in the spirit of the wrapper type, so it would be nice to have an option to unwrap them when deserializing, i.e.:

{"bool_val": true, "int32_val": 100}

This is also the default behavior by other popular deserialization libraries, including java protobuf util Jsonformat and golangs jsonpb.

So for consistency with other libraries and improved usability, I add a deserialization option, unwrap.protobuf.wkt to enable this behavior. I add an option to avoid breaking any existing clients.

Why are the changes needed?

Improved usability and consistency with other deserialization libraries.

Does this PR introduce any user-facing change?

Yes, deserialization of well known types will change from the struct format to the inline format.

How was this patch tested?

Added a unit test testing every well known type deserialization explicitly, as well as testing roundtrip.

Was this patch authored or co-authored using generative AI tooling?

No

@justaparth justaparth changed the title test [SPARK-44001] Add option to allow unwrapping protobuf well known wrapper types Nov 11, 2023
@justaparth justaparth force-pushed the parth/44001-option-to-unwrap-wkt branch 2 times, most recently from 8b7f72f to 512d6ac Compare November 12, 2023 22:59
@justaparth justaparth changed the title [SPARK-44001] Add option to allow unwrapping protobuf well known wrapper types [SPARK-44001][PROTOBUF] Add option to allow unwrapping protobuf well known wrapper types Nov 12, 2023
@justaparth
Copy link
Contributor Author

@rangadi do you mind taking a look? i've modified this to make it an option so that the default behavior doesn't change

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add test where int32_val is not set. What should the Spark struct contain:

  • When emit.default.values is false (default)
  • When emit.default.values is true
    Please comment on the expected behavior.

Another similar test with int32_val.value set to 0 (default value).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, great suggestion. I added that test!

basically the internal value field should operate the same as any primitive, and when unwrapping it should unwrap that value correctly. I added a test that confirms every configuration of emit.default.values and unwrap.primitive.wrapper.types

@justaparth justaparth force-pushed the parth/44001-option-to-unwrap-wkt branch from 512d6ac to 38d6df3 Compare November 21, 2023 15:58
@justaparth justaparth requested a review from rangadi November 21, 2023 16:00
@justaparth
Copy link
Contributor Author

thanks for the review @rangadi ! i updated based on your comments, lmk what you think. Also it could be good to check #43773 first, since that could also affect this PR 🤔

@justaparth justaparth force-pushed the parth/44001-option-to-unwrap-wkt branch from 38d6df3 to 4db3454 Compare November 21, 2023 22:49
Copy link

@rangadi rangadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @justaparth, I had these comments written quite some time back, but forgot to 'Submit' now.
PTAL. lets close this PR soon. It is pretty close to be ready. Please ping me.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename this to int_val ('val' is confusing for Scala code).
Also use a value of 5 or 10 in the example below (just to make it different from field value in message definition.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove =

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

Comment on lines +1676 to +1724
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the comment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a short description of what the following is doing? It will save time for future readers. The above above gives the goal for the test. The description here would help with how it tests.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. Just to make sure, this is different from primitive int, right?
In the case of primitive int, value would be 0, not null when emit.default.values is true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! exactly

@justaparth justaparth force-pushed the parth/44001-option-to-unwrap-wkt branch 2 times, most recently from f468688 to bf11b16 Compare December 18, 2023 20:28
@justaparth
Copy link
Contributor Author

thanks for the review @rangadi , and sorry for the dealy in getting back to this. i've just updated and i think i've responded to all comments, do you mind taking another look?

@justaparth justaparth requested a review from rangadi December 18, 2023 20:28
Comment on lines +186 to +187
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update this to clarify that emit.default.values does not apply here? I.e. an Int32Value value field would be null if unset, even if emit.default.values is set to true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! just added a comment with an example

@justaparth justaparth force-pushed the parth/44001-option-to-unwrap-wkt branch from bf11b16 to e623746 Compare December 18, 2023 22:00
@justaparth justaparth requested a review from rangadi December 18, 2023 23:03
@justaparth
Copy link
Contributor Author

hey @rangadi just wanted to bump this! i think i've responded to everything

Signed-off-by: Parth Upadhyay <parth.upadhyay@gmail.com>
@justaparth justaparth force-pushed the parth/44001-option-to-unwrap-wkt branch from e623746 to 0483642 Compare January 2, 2024 20:25
@justaparth
Copy link
Contributor Author

hey @rangadi just wanted to bump this again in case you get a chance 🙏

Copy link

@rangadi rangadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @justaparth. Sorry about the delay.
cc: @HyukjinKwon please merge when you can.

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants