Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove canonical option bytes and split up option interpretation into two phases #261

Merged
merged 8 commits into from
Mar 27, 2024

Conversation

jhump
Copy link
Member

@jhump jhump commented Mar 23, 2024

There were several Editions-specific issues with the "canonical bytes" feature, in cases where the custom option value being interpreted is in the same file that defines the custom option. In order to know how to encode "delimited" message fields, we have to have already interpreted the options for that field, but when they are defined in the same file, we may have not have done so yet. There was also a pre-existing issue that was related: knowing whether or not a repeated field is packed also relies on interpreting options. Finally, there was a pre-existing issue that was unrelated: message literals that used the expanded Any form were completely absent from the canonical bytes representation 🤦.

In addition to the "canonical bytes" encoding issues described above, there was a related issue with the (non-canonical) encoding of Any messages in an option value: if an expanded Any literal refers to a packed field, it could previously have been encoded in non-packed form since we may not yet know if it's packed (because we're still interpreting options).

This issue of not yet knowing how to encode or handle a type because we're still interpreting options also came up in #260, where we needed to know if an enum was open or closed. But in an Editions file, this requires interpreting the enum's options. In that PR, we deferred the decision until the end. So we could accumulate the set of enums that needed to be open, and validate that they are open after all options have been interpreted. I looked into solving the other issues in the same way: defer checks until all options have been interpreted. But the real problem with this approach is that the text format (which is what is used in message literals) is influenced by whether a field is a group or not. So if we don't yet know if the field is a group (because, in an editions file, we have to interpret the features.message_encoding option first), we can't even process message literals. One possible solution I thought of would be to interpret options in dependency order (so go interpret options for the element if we need to). This would be a bit tricky to implement, and might incur a performance hit because we'd need more book-keeping to track what's been interpreted and what hasn't. But even this approach couldn't handle cases where this is a dependency cycle: where interpreting the options for an element depend on that element already having its options interpreted 🤯 (see comment below for such an example).

This PR does two things.

  1. The main solution to the "needing options in the same file to already be interpreted" problem is to interpret the non-custom options first. That way features and other relevant options get interpreted first, so we can then assume, when interpreting custom options, that we know enough to correctly interpret and serialize everything. This is also the strategy used by protoc and is adequate for now (not necessarily bullet-proof, but will work as long as Google doesn't make certain kinds of changes to descriptor.proto).
  2. The bigger change in here, that accounts for the vast majority of lines of code changed/deleted, is removal of the "canonical option bytes" feature. This was a feature of protocompile to generate the same output as protoc, byte-for-byte. Special handling was needed to do so because the protobuf-go library serializes the option messages differently than the way they are serialized by the C++ libprotoc. So we basically re-implemented serialization in the option sub-package with the interpreted* types. Since this was a source of bugs and Editions-related issues, as described in the first paragraph above, and the feature is basically unused (it was never actually incorporated into the buf CLI), it will be a quality of life improvement to not maintain this complicated code anymore.

jhump added 5 commits March 22, 2024 13:21
…tion message encoding or repeated field encoding
…s what protoc does and works around certain kinds of dependency cycles (like just added to test_editions.proto test file)
Comment on lines +8 to +18
Foo foo = 2 [
(any) = {
[type.googleapis.com/Foo]: {
Foo: {
name: "abc",
Foo: { name: "xyz" }
}
}
},
features.message_encoding = DELIMITED
];
Copy link
Member Author

@jhump jhump Mar 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the dependency cycle mentioned in the PR description: we have to interpret the options for this field (to see this feature.message_encoding option) before we can know how to interpret the other options for the field (since the above Any option refers to this very field in the message literal).

options/options.go Outdated Show resolved Hide resolved
options/options.go Outdated Show resolved Hide resolved
@@ -459,35 +459,19 @@ func TestLinkerValidation(t *testing.T) {
"foo.proto": "message Foo { option message_set_wire_format = true; extensions 1 to max; } extend Foo { optional Foo bar = 536870912; }",
},
},
"failure_message_set_wire_format_scalar2": {
input: map[string]string{
"foo.proto": "message Foo { option message_set_wire_format = true; extensions 1 to 100; } extend Foo { optional int32 bar = 1; }",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! This was a duplicate of the test above (line 442)

},
"success_message_set_wire_format2": {
input: map[string]string{
"foo.proto": "message Foo { option message_set_wire_format = true; extensions 1 to 100; } extend Foo { optional Foo bar = 1; }",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, too (duplicate of line 448).

"failure_message_set_wire_format_repeated": {
input: map[string]string{
"foo.proto": "message Foo { option message_set_wire_format = true; extensions 1 to 100; } extend Foo { repeated Foo bar = 1; }",
},
expectedErr: "foo.proto:1:90: messages with message-set wire format cannot contain repeated extensions, only optional",
},
"success_large_extension_message_set_wire_format": {
input: map[string]string{
"foo.proto": "message Foo { option message_set_wire_format = true; extensions 1 to max; } extend Foo { optional Foo bar = 536870912; }",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one, too (duplicate of line 459).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reviewing group-related test cases and fixing up the expected error messages, I decided to audit this list a bit. I found some redundant tests to remove (oops, copy+pasta) and also tried to clarify some of the name to better match the test's intent (and I added some comments in a couple of places, where it seemed particularly useful). So many of these changes are orthogonal to this PR and are extra "clean up" while I was in this file.

This was never actually used in the buf CLI. And it's rather complicated
and kind of a lot to maintain for an effectively unused feature. It also
had several bugs, so instead of merging fixes for those bugs, we're just
removing the whole feature.
@jhump jhump changed the title Fix canonical option bytes and Any message encoding in custom options to work with Editions Remove canonical option bytes and split up option interpretation into two phases Mar 27, 2024
internal/prototest/util.go Outdated Show resolved Hide resolved
Comment on lines +149 to +155
// We have to do this in two phases. First we interpret non-custom options.
// This allows us to handle standard options and features that may needed to
// correctly reference the custom options in the second phase.
if err := interp.interpretFileOptions(file, false); err != nil {
return nil, err
}
// Now we can do custom options.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two passes end up reading well. And with the constraints you note in the description, it does seem likely that any other compiler would do the same, so it looks like a good approach going forward.

if err != nil {
return err
}
}
return nil
}

// interpretedOption represents the result of interpreting an option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a whole lot of lines removed. I see your point.

Co-authored-by: Timo Stamm <ts@timostamm.de>
@jhump jhump enabled auto-merge (squash) March 27, 2024 18:19
@jhump jhump merged commit 869ef58 into main Mar 27, 2024
8 checks passed
@jhump jhump deleted the jh/fix-canonical-option-bytes-w-delimited-encoding branch March 27, 2024 18:28
kralicky added a commit to kralicky/protocompile that referenced this pull request Apr 1, 2024
… two phases (bufbuild#261)

There were several Editions-specific issues with the "canonical bytes"
feature, in cases where the custom option value being interpreted is in
the same file that defines the custom option. In order to know how to
encode "delimited" message fields, we have to have already interpreted
the options for that field, but when they are defined in the same file,
we may have not have done so yet. There was also a pre-existing issue
that was related: knowing whether or not a repeated field is packed also
relies on interpreting options. Finally, there was a pre-existing issue
that was unrelated: message literals that used the expanded `Any` form
were completely absent from the canonical bytes representation 🤦.

In addition to the "canonical bytes" encoding issues described above,
there was a related issue with the (non-canonical) encoding of `Any`
messages in an option value: if an expanded `Any` literal refers to a
packed field, it could previously have been encoded in non-packed form
since we may not yet know if it's packed (because we're still
interpreting options).

This issue of not yet knowing how to encode or handle a type because
we're still interpreting options also came up in bufbuild#260, where we needed
to know if an enum was open or closed. But in an Editions file, this
requires interpreting the enum's options. In that PR, we deferred the
decision until the end. So we could accumulate the set of enums that
needed to be open, and validate that they are open after all options
have been interpreted. I looked into solving the other issues in the
same way: defer checks until all options have been interpreted. But the
real problem with this approach is that the text format (which is what
is used in message literals) is influenced by whether a field is a group
or not. So if we don't yet know if the field is a group (because, in an
editions file, we have to interpret the `features.message_encoding`
option first), we can't even process message literals. One possible
solution I thought of would be to interpret options in dependency order
(so go interpret options for the element if we need to). This would be a
bit tricky to implement, and might incur a performance hit because we'd
need more book-keeping to track what's been interpreted and what hasn't.
But even this approach couldn't handle cases where this is a
**dependency cycle**: where interpreting the options for an element
depend on that element already having its options interpreted 🤯 (see
comment below for such an example).

This PR does two things.

1. The main solution to the "needing options in the same file to already
be interpreted" problem is to interpret the **non**-custom options
first. That way features and other relevant options get interpreted
first, so we can then assume, when interpreting **custom** options, that
we know enough to correctly interpret and serialize everything. This is
also [the strategy used by
`protoc`](https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.cc#L5899-L5943)
and is adequate for now (not necessarily bullet-proof, but will work as
long as Google doesn't make certain kinds of changes to
`descriptor.proto`).

2. The bigger change in here, that accounts for the vast majority of
lines of code changed/deleted, is removal of the "canonical option
bytes" feature. This was a feature of protocompile to generate the same
output as `protoc`, byte-for-byte. Special handling was needed to do so
because the protobuf-go library serializes the option messages
differently than the way they are serialized by the C++ `libprotoc`. So
we basically re-implemented serialization in the `option` sub-package
with the `interpreted*` types. Since this was a source of bugs and
Editions-related issues, as described in the first paragraph above, and
the feature is basically unused (it was never actually incorporated into
the buf CLI), it will be a quality of life improvement to not maintain
this complicated code anymore.

(cherry picked from commit 869ef58)
kralicky pushed a commit to kralicky/protocompile that referenced this pull request Apr 12, 2024
… two phases (bufbuild#261)

There were several Editions-specific issues with the "canonical bytes"
feature, in cases where the custom option value being interpreted is in
the same file that defines the custom option. In order to know how to
encode "delimited" message fields, we have to have already interpreted
the options for that field, but when they are defined in the same file,
we may have not have done so yet. There was also a pre-existing issue
that was related: knowing whether or not a repeated field is packed also
relies on interpreting options. Finally, there was a pre-existing issue
that was unrelated: message literals that used the expanded `Any` form
were completely absent from the canonical bytes representation 🤦.

In addition to the "canonical bytes" encoding issues described above,
there was a related issue with the (non-canonical) encoding of `Any`
messages in an option value: if an expanded `Any` literal refers to a
packed field, it could previously have been encoded in non-packed form
since we may not yet know if it's packed (because we're still
interpreting options).

This issue of not yet knowing how to encode or handle a type because
we're still interpreting options also came up in bufbuild#260, where we needed
to know if an enum was open or closed. But in an Editions file, this
requires interpreting the enum's options. In that PR, we deferred the
decision until the end. So we could accumulate the set of enums that
needed to be open, and validate that they are open after all options
have been interpreted. I looked into solving the other issues in the
same way: defer checks until all options have been interpreted. But the
real problem with this approach is that the text format (which is what
is used in message literals) is influenced by whether a field is a group
or not. So if we don't yet know if the field is a group (because, in an
editions file, we have to interpret the `features.message_encoding`
option first), we can't even process message literals. One possible
solution I thought of would be to interpret options in dependency order
(so go interpret options for the element if we need to). This would be a
bit tricky to implement, and might incur a performance hit because we'd
need more book-keeping to track what's been interpreted and what hasn't.
But even this approach couldn't handle cases where this is a
**dependency cycle**: where interpreting the options for an element
depend on that element already having its options interpreted 🤯 (see
comment below for such an example).

This PR does two things.

1. The main solution to the "needing options in the same file to already
be interpreted" problem is to interpret the **non**-custom options
first. That way features and other relevant options get interpreted
first, so we can then assume, when interpreting **custom** options, that
we know enough to correctly interpret and serialize everything. This is
also [the strategy used by
`protoc`](https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.cc#L5899-L5943)
and is adequate for now (not necessarily bullet-proof, but will work as
long as Google doesn't make certain kinds of changes to
`descriptor.proto`).

2. The bigger change in here, that accounts for the vast majority of
lines of code changed/deleted, is removal of the "canonical option
bytes" feature. This was a feature of protocompile to generate the same
output as `protoc`, byte-for-byte. Special handling was needed to do so
because the protobuf-go library serializes the option messages
differently than the way they are serialized by the C++ `libprotoc`. So
we basically re-implemented serialization in the `option` sub-package
with the `interpreted*` types. Since this was a source of bugs and
Editions-related issues, as described in the first paragraph above, and
the feature is basically unused (it was never actually incorporated into
the buf CLI), it will be a quality of life improvement to not maintain
this complicated code anymore.

(cherry picked from commit 869ef58)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants