proto3 and unknown fields #272
Comments
I too am wondering about this. I am looking into migrating what is essentially a messaging system to gRPC (where proto3 seems to be recommended). In my case, clients send messages (text plus rendering information) to each other via a server, and the server needs to understand the text and certain parts of the rendering info. I want to allow client developers to experiment with new (pre-release) features without having to deploy server code for every change. Essentially, it's a case where I want a shared proto definition between the client(s) and server, but don't want to require the server's proto definition to be the latest in order to process requests. |
I'd like to hear about the explanation, too. The behavior of proto2 makes sense to me. |
I have a lot of concerns about silently deleting data upon deserialization, to the point that even though we have internally been using proto3 for several months, I am considering changing things back to proto2. This change would be a lot easier to stomach if there was a message option to allow serialization and deserialization of unknown fields instead of discarding them. |
Being unable to add unknown fields that persist is also unacceptable for us. Reading the code, it's pretty clear the decision to omit unknown fields happens at compile time rather than at runtime (based on the generated code), so it seems proto3 is a no-go. Personally, I very much liked most of the changes to the new version except this one. Changing the default behavior alone might have been ok, especially given that the new behavior is well-documented, but doing so without a way to restore old behavior seems like a misstep. Supporting a plugin that reverts that behavior seems too expensive relative to the cost of just using proto2 with restrictions (optional only, etc). |
Still no answers to this? This is a fundamental issue which is seriously hindering our adoption of protobuf in many areas. |
+1 proto2 is a permanent fixture for us. Changing default behavior is one thing but changing it in a way that doesn't let the user even control it is a strict loss in my opinion. What I foresee moving forward is a huge fragmentation in the client ecosystem. Maintaining support for both proto2 and proto3 semantics is too much to chew for most developers, and I'm already seeing some client libraries do this awkward dance where they have some proto2 properties and some proto3 properties. The easiest example of this causing a problem in history is the move from Python2 to Python3. One possible solution might be a file level option that informs the protobuf compiler not to strip unknown fields. |
The proto3 spec doesn't forbid preserving unknown fields. Instead, it allows each implementation to choose whether to preserve unknowns. The current C++/Java implementations chose to drop the unknowns, though. We are currently looking into the issue and will keep this thread posted. |
Thanks @pherl for providing the update. FWIW, I think it is worth considering how the behavior might be standardized, for the same reason people argue against undefined behavior in C or C++. Undefined behavior, where it exists, should really be due to a lack of foresight, but for something like this, we might as well come up with an actual solution since we're already aware of the problem. |
Thanks for keeping this issue alive. I'd just like to add that we are interested in support for Go, but that might need to be addressed in golang/protobuf. |
@pherl Any progress on this front? |
+1 for preserving unknown fields. I accept that you cannot trivially maintain compatibility with the JSON format (at least as long as you want to marshal fields with their names), but I think a lot of shops would be happy to pay this price for not having to release their low-level infrastructure in lock step with their newest clients. In fact Kenton seems to wonder himself (https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html): "Apparently, version 3 of Protocol Buffers, aka “proto3”, removes this feature. I honestly don’t know what they’re thinking. This feature has been absolutely essential in many of Google’s internal systems." In my opinion the right approach would be to make this an option of the proto compiler when compiling the proto: this way everybody can decide for themselves whether the benefits outweigh the downsides. For now I have overridden the PreserveUnknownFields function in both cpp_helpers.h and java_helpers.h in the compiler code to always return true and this seems to work, but I would appreciate it if someone from Google could confirm. |
Some updates: we tried to gather data to prove "unknown fields are essential for Google systems", but the result is not so convincing (the experiment was done in a Google sub-system, not the whole of Google). For those of you who are interested in adding unknown fields back to proto3, could you describe your use case in more detail and explain why unknown fields are required (e.g., can the same use case be supported using some other proto3 feature)? We need to prove unknown fields are needed in some common use cases in order to add them back. |
Here is a use case I developed internally that makes heavy usage of unknown In addition to the message itself, we often annotate the message before Generalizing this use case, any protobuf message that is derived from the
Jeremy Ong |
Hi, We have a use case with a mixture of data validation/data transformation and storage. In general, any component that needs only a partial understanding of the message could benefit from preserving unknown fields, especially where the bits the component does care about do not change often, but the rest of the schema does. I can think of routing, storage, certain types of data transformation, etc. I would be interested in knowing how you managed to solve these use cases (which I'm sure you have internally at Google) without preserving unknown fields. |
We need unknown fields, because they are one of the ways we know on the server side that our proto definition is out of date and needs to be re-synchronized. Without unknown fields, we would have to resort to polling or some other less authoritative way of detecting when the client has added fields. Also, while I understand trying to reduce feature surface area, unknown fields don't exactly cause a problem, do they? Dropping them has more negatives than positives; please add them back to proto3. |
If the proto3 way was to set some option, like Best of both worlds. :) |
I would absolutely want the ability to preserve or strip unknown fields at runtime. There are levels of our system which get deployed regularly, are kept up to date, and should be validating the well-known schema (and stripping unknown fields), but there are other internal layers which get deployed far less frequently and are not directly exposed to clients or potentially malicious actors, where preserving unknown fields is highly desirable so we don't have to do full and extensive deploys for every little change. |
Hey guys, We would love to have this feature, too :) During my relatively long time at Google, I was aware of many services that relied on this behavior from proto2. Essentially, think of any set of three or more services where A talks to C via B, and we don't want to redeploy B when a proto that is being passed between A and C gets a new field added to it. (I also posted this as a question on stackoverflow.) Would be great to have an update for supporting this feature and/or an alternative mechanism that you believe can solve this problem for us. Thanks, |
Still no word on what the original justification was too. |
The use case we have is the following: We use stream processors, namely kafka-streams, that rearrange protobuf messages. For example, we have two streams of protobuf messages that we join with each other. The join just outputs a joined message that has the two others as fields. Sometimes we also aggregate streams into lists of messages from previous streams. The stream processors only know about the fields relevant to them (join fields, group-by fields, ...); all the other fields are carried along as unknown fields. This allows the stream processor to continue working even when upstream schema changes happen: we do not need to redeploy our stream processing application, and the new fields end up in the output for free. To add some drama: I think losing the unknown fields will force us to move to Avro |
This is a bit of a deal breaker for us too. We have the same use case where A sends data to B which reads some fields and forwards the message to C. We don't want to have to constantly update B when the schema changes even though it doesn't read any of the new fields. The current behaviour is quite dangerous since C can't tell if one of the new fields was set to the default value or if B is just out of date and lost data. |
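To make the A → B → C scenario above concrete, here is a minimal sketch of a hand-rolled wire-format reader in Python. It is not the protobuf library; every function name (`encode_field`, `parse`, `serialize`) and the field numbers are hypothetical, and it only handles varint and length-delimited fields. It illustrates why C cannot tell a dropped field from a default value when B discards what it does not recognize.

```python
# Minimal wire-format sketch (varint and length-delimited fields only).
# All names are hypothetical; this is NOT the protobuf library API.

def encode_varint(n):
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def decode_varint(buf, pos):
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def encode_field(number, value):
    if isinstance(value, int):  # wire type 0: varint
        return encode_varint((number << 3) | 0) + encode_varint(value)
    data = value.encode("utf-8")  # wire type 2: length-delimited
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data

def parse(buf, known):
    """Return (known fields as a dict, raw bytes of every unknown field)."""
    fields, unknown, pos = {}, b"", 0
    while pos < len(buf):
        start = pos
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:
            fields[number] = value.decode("utf-8") if isinstance(value, bytes) else value
        else:
            unknown += buf[start:pos]  # keep the raw bytes untouched
    return fields, unknown

def serialize(fields, unknown=b""):
    body = b"".join(encode_field(n, v) for n, v in sorted(fields.items()))
    return body + unknown

# A knows fields 1 and 2; B was built before field 2 existed.
from_a = serialize({1: "hello", 2: 7})
b_fields, b_unknown = parse(from_a, known={1})

forwarded_preserving = serialize(b_fields, b_unknown)  # proto2-style B
forwarded_dropping = serialize(b_fields)               # B that drops unknowns

c_view_preserved, _ = parse(forwarded_preserving, known={1, 2})
c_view_dropped, _ = parse(forwarded_dropping, known={1, 2})
print(c_view_preserved)
print(c_view_dropped)
```

In the dropping case, C sees no field 2 at all, which in proto3 is indistinguishable from A having sent the default value.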
Would really appreciate an update on the feedback here. Whether Proto3 is going to ever support unknown fields can impact decisions being made even for folks still on Proto2, because if it isn't, we may need to invent other ways of solving our problems in order to avoid rearchitecting things when/if we move to proto3. |
I have two use cases, both of which have sub-optimal workarounds:
|
One thing to keep in mind is that proto2 is not going away. We are still actively improving it and plan to keep doing so indefinitely, so proto2 is still a good choice if you have a use case that depends on unknown fields. The one main drawback is that a few languages (such as C# and Ruby) are currently proto3-only, but if you're not using those languages then that's not a problem. @chmod007, have you thought about using proto2 for your two use cases? Is that possible, or do your schemas have to be proto3 for another reason? |
I'll add a few usecases.
re: proto2 vs. proto3, it's kind of annoying to mix and match. It's pretty counterintuitive to only use proto2 to maintain unknown fields, but have proto3 definitions for gRPC servers. I agree with most of the design choices in proto3 (e.g. removing optional/required fields, map types), but not this. I'd actually been unaware proto3 removed unknown field support until I expected it to maintain an unknown field and it didn't (and came to report it as an issue). I'd touted unknown field support as a huge selling point for protobufs when we'd first implemented them. The protobuf website originally recommended that new projects use proto3, which is why we'd adopted it, but this is a pretty huge issue for us. We'll likely be forking the compiler similarly to @gfecher as the proto3 ship has long since sailed and this behavior is very important to helping us produce robust infrastructure. |
@pherl @xfxyjwf Do you have suggestions for how to work around this with proto3? If this was removed, what techniques were used to avoid requiring this pattern within Google? As far as I see it, this was the chief benefit of protobuf:
Producer and Consumer could be updated with new fields, while intermediate can remain on the same version. If intermediate is a proxy of sorts, then this is important. |
@stevvooe We've been continuing to use proto2 for the intermediate proxy type thing since they are binary compatible. Throughout our codebase, we've been propagating proto2 everywhere since it's really annoying to maintain two different semantics for the proto definitions themselves but if you wanted, producer and consumer could use proto3. I do have some plans eventually to do a separate C++ compiler entirely that consumes proto3 syntax but retains the API of the unknown fields unless someone else gets to it first. I want to do other changes like using more STL containers (vectors and maps) as the backing in-memory storage and fix the oddities with the arenas we've been seeing. |
@jeremyherbert This is the same issue as proto3 currently doesn't preserve unknown fields. It should be addressed in the next couple releases. Note that adding new fields into oneofs is risky. Even with unknown fields preserved, the new field will not be visible in the oneof of the old binary. Instead of seeing an unrecognized type, the old message will treat the oneof as not set. You would have to dig into the unknown fields to distinguish between an unset oneof vs an unrecognized oneof. |
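The oneof caveat above can be illustrated with a small sketch. It models a message as a list of already-decoded `(field_number, payload)` pairs rather than using the protobuf library, and all names are hypothetical. In particular, `candidate_numbers` is an assumption: an old binary has no built-in way to know which unknown field numbers belong to the oneof, so the caller has to supply that knowledge out of band (for example, a number range agreed on by convention).

```python
# Hypothetical model: a message is a list of (field_number, payload) pairs,
# as they would appear after tag decoding. Not the protobuf library API.

# Schema compiled into the OLD binary: the oneof has members 1 and 2.
OLD_ONEOF_MEMBERS = {1: "text", 2: "image"}
# A newer schema adds field 3 ("video") to the same oneof.

def read_oneof(wire_fields, oneof_members):
    """Return (case, value, unknown) the way the old binary would see it."""
    case, value, unknown = None, None, []
    for number, payload in wire_fields:
        if number in oneof_members:
            case, value = oneof_members[number], payload  # last member wins
        else:
            unknown.append((number, payload))  # preserved, but not in the oneof
    return case, value, unknown

def oneof_state(wire_fields, oneof_members, candidate_numbers):
    """Distinguish 'unset' from 'set to a member this binary does not know'.

    candidate_numbers is an ASSUMPTION supplied by the caller: the field
    numbers that might belong to the oneof in newer schemas.
    """
    case, _, unknown = read_oneof(wire_fields, oneof_members)
    if case is not None:
        return "set:" + case
    if any(number in candidate_numbers for number, _ in unknown):
        return "unrecognized"
    return "unset"

new_message = [(3, "clip.mp4")]  # produced by a binary with the newer schema
print(read_oneof(new_message, OLD_ONEOF_MEMBERS))
print(oneof_state(new_message, OLD_ONEOF_MEMBERS, candidate_numbers={3}))
print(oneof_state([], OLD_ONEOF_MEMBERS, candidate_numbers={3}))
```

Without the unknown list, the first and third cases would look identical to the old binary, which is the risk described above.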
@pherl, the pattern "save unknown fields and then discard them" seems excessive to me. Isn't it better to pass a flag to the parsing function telling it whether to save unknown fields while parsing? That would save memory and CPU when you don't need these fields, while retaining all the desired benefits. In our workflows, most of the fields in a message are sometimes unknown, and I'm afraid that parsing them will degrade our performance. Actually, I would like to have such a flag in proto2 too. |
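As a rough illustration of the parse-time flag proposed above, the sketch below contrasts storing unknown fields with skipping them during the parse itself. It is a toy model over already-decoded `(field_number, payload)` pairs, not any real protobuf API; `parse_fields`, `keep_unknown`, and the schema are hypothetical names.

```python
# Toy model of a parse-time switch. Hypothetical names; not a protobuf API.

KNOWN = {1: "text"}  # field numbers this binary understands

def parse_fields(wire_fields, known, keep_unknown=True):
    """Consume (field_number, payload) pairs from an iterator.

    With keep_unknown=False, unrecognized fields are skipped as they are
    read, so they are never stored and never need a later discard pass.
    """
    message = {"fields": {}, "unknown": []}
    for number, payload in wire_fields:
        if number in known:
            message["fields"][known[number]] = payload
        elif keep_unknown:
            message["unknown"].append((number, payload))
        # else: skipped during parsing, nothing is retained
    return message

def incoming():
    # One known field followed by a large field this binary does not know.
    yield 1, "hi"
    yield 9, "x" * 1000

kept = parse_fields(incoming(), KNOWN)
skipped = parse_fields(incoming(), KNOWN, keep_unknown=False)
print(len(kept["unknown"]), len(skipped["unknown"]))
```

The alternative discussed in the thread, parsing everything and then calling a discard method, produces the same end state but holds the unknown payloads in memory in between.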
@vozbu what language are you using? We do have an API to skip unknown fields in Java. Other languages chose to have a discard-unknown-fields API that runs after parsing is finished, mostly to reduce implementation complexity. |
@pherl, I'm talking about C++. I haven't seen the implementation to judge about it. I speak my thoughts as a user. |
@pherl, the doc you shared states "3.4 release (ETA: Q3 2017): Google protobuf implementation for each language will provide APIs to explicitly drop or preserve unknowns for proto3. A temporary flag will be introduced for the default parsing behavior - default to drop unknowns." 3.4 is released. Did that actually make it in? I'm using Java and I see the flag for retaining unknowns, explicitDiscardUnknownFields in CodedInputStream, but the parsing code I see is using:
final boolean shouldDiscardUnknownFieldsProto3() { return explicitDiscardUnknownFields ? true : proto3DiscardUnknownFieldsDefault; }
So even if you don't set that flag you get proto3DiscardUnknownFieldsDefault, which defaults to false and appears not to have any way for external users to change. |
The plan is only to provide APIs to explicitly drop unknowns, for those who depend on that behavior. The default is for testing only. In 3.5 we will flip the default. |
All languages will be fixed in 3.5.x releases. |
@liujisi Now that the direction has changed and support for preserving unknown fields has been added to some implementations, will this recommendation in the official proto3 documentation be changing?
Ref: https://developers.google.com/protocol-buffers/docs/proto3#unknowns |
@leighmcculloch Good catch, I'll update that documentation to say that unknown fields are now preserved for proto3 messages as of version 3.5. |
Is there a public method to detect if a deserialized message has unknown fields? This would be useful to check a message which is coming from an untrusted source. I'm about to replace protobuf with JWT for this :( |
There are methods to get a list of unknown fields. But: In Go the parameter name suggests it should not be used ("XXX_unrecognized").
|
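For the question above about checking a message from an untrusted source: library accessors for unknown fields differ by language and version, so the sketch below sidesteps them and scans the serialized bytes directly. It assumes you still have the raw bytes and know the set of field numbers in your schema; `unknown_field_numbers` is a hypothetical helper, it only inspects the top level (nested messages are not descended into), and it rejects the deprecated group wire types.

```python
# Hypothetical helper that scans raw protobuf bytes for unrecognized tags.
# Top-level fields only; not part of any protobuf library.

def decode_varint(buf, pos):
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def unknown_field_numbers(buf, known):
    """Return the sorted top-level field numbers in buf that are not in known."""
    found, pos = set(), 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:      # varint
            _, pos = decode_varint(buf, pos)
        elif wire_type == 1:    # 64-bit
            pos += 8
        elif wire_type == 2:    # length-delimited
            length, pos = decode_varint(buf, pos)
            pos += length
        elif wire_type == 5:    # 32-bit
            pos += 4
        else:                   # groups (3, 4) and invalid types
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(buf):
            raise ValueError("truncated message")
        if number not in known:
            found.add(number)
    return sorted(found)

# Field 1 = "hello" (length-delimited), field 2 = 7 (varint).
payload = b"\x0a\x05hello\x10\x07"
print(unknown_field_numbers(payload, known={1}))
print(unknown_field_numbers(payload, known={1, 2}))
```

A server could reject the message when the result is non-empty, or treat it as a hint that its schema is out of date.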
In Go, there is not currently a reliable way to programmatically interact with unknown fields. At best, you can use Furthermore, not all unknown fields are stored in |
I'm coming to this party rather late... I've just upgraded a C# application that uses protocol buffers from version 3.4.0 to 3.6.1. The application relies on unknown fields not being preserved. Now by default they ARE preserved and I've seen a significant and unacceptable increase in memory consumption. (The ratio of known to unknown fields is about 1:5.) There is mention here of APIs being available to explicitly discard the unknown fields, but it's not clear to me whether these were temporary and have now been removed or still exist. What is the current situation? Do these APIs still exist in the version 3.6.1 C# distribution? If so, where can I find details? |
From my understanding (though I don't work on protobufs, I've just been a part of this thread for a long time), these APIs are here to stay -- you will be able to keep or discard unknown fields depending on your use case. protobuf/csharp/src/Google.Protobuf/MessageParser.cs Lines 333 to 340 in e479410
|
Thanks for the reply. Found it, tried it, code now works again. |
I know that unknown fields have been removed from proto3, but I am trying to get an explanation about why this change was made and if there is any way to replicate that behavior in proto3.
Thanks so much.
referred from golang/protobuf#25