Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update descriptor implementations to work with editions #260

Merged
merged 10 commits into from
Mar 21, 2024

Conversation

jhump
Copy link
Member

@jhump jhump commented Mar 19, 2024

This updates this library to use the latest protobuf-go, which provides support for representing Editions source files via protoreflect.FileDescriptor. It also updates the implementations of protoreflect.Descriptor in this module to properly utilize editions features to implement the semantic accessors FieldDescriptor.HasPresence, FieldDescriptor.IsPacked, and the newly-added EnumDescriptor.IsClosed.

This supercedes #258.

@jhump jhump force-pushed the jh/feature-resolution-helpers branch 2 times, most recently from 0c081bf to 8282283 Compare March 19, 2024 22:05
@jhump jhump requested a review from timostamm March 19, 2024 22:12
…emantic accessors for FieldDescriptor and EnumDescriptor impls
@jhump jhump force-pushed the jh/feature-resolution-helpers branch from 8282283 to a09fba7 Compare March 19, 2024 22:22
linker/editions.go Outdated Show resolved Hide resolved
linker/descriptors.go Outdated Show resolved Hide resolved
linker/editions.go Outdated Show resolved Hide resolved
linker/descriptors.go Outdated Show resolved Hide resolved
linker/editions.go Show resolved Hide resolved
@jhump
Copy link
Member Author

jhump commented Mar 20, 2024

@emcfarlane, @timostamm, I just found some more fixes that were necessary. I went through all of the checks in the parser package (where we create and then validate descriptor protos from an AST) and in the linker (where we do validations that require the descriptors be linked and options interpreted), looking for references to "proto2" or "proto3". Doing so revealed some other checks that needed updates to work with editions.

Unfortunately, applying editions features means we need options interpreted (since features are configured via options). That means we had to defer some of the checks to the end. Previously, we could validate that proto3 enums had an initial value of zero after creating the descriptor (before linking). But now we need to interpret options in order to access the enum type feature to make this check. Similarly, we previously could validate that proto3 fields didn't refer to proto2 enum types during linking. But now we need options interpreted to access the enum type and field presence features.

In addition to the above two checks, I also changed the JSON name checks -- so we decide whether to fail or warn for certain kinds of invalid JSON names based on the json format feature.

These changes are all in 1bd6edc. Please take a look.

linker/descriptors.go Outdated Show resolved Hide resolved
Copy link
Member

@timostamm timostamm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are all in 1bd6edc. Please take a look.

They look good to me.

I believe that feature validation should to be extended to constrain features to certain field types, see comment below.

linker/linker_test.go Show resolved Hide resolved
Copy link
Member

@timostamm timostamm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While searching for other uses of syntax, I caught another one other that needed to be updated to use features.

Thanks for the explanation. Makes sense to me.

linker/descriptors.go Outdated Show resolved Hide resolved
@jhump jhump merged commit 3024677 into main Mar 21, 2024
8 checks passed
@jhump jhump deleted the jh/feature-resolution-helpers branch March 21, 2024 14:46
jhump added a commit that referenced this pull request Mar 27, 2024
… two phases (#261)

There were several Editions-specific issues with the "canonical bytes"
feature, in cases where the custom option value being interpreted is in
the same file that defines the custom option. In order to know how to
encode "delimited" message fields, we have to have already interpreted
the options for that field, but when they are defined in the same file,
we may have not have done so yet. There was also a pre-existing issue
that was related: knowing whether or not a repeated field is packed also
relies on interpreting options. Finally, there was a pre-existing issue
that was unrelated: message literals that used the expanded `Any` form
were completely absent from the canonical bytes representation 🤦.

In addition to the "canonical bytes" encoding issues described above,
there was a related issue with the (non-canonical) encoding of `Any`
messages in an option value: if an expanded `Any` literal refers to a
packed field, it could previously have been encoded in non-packed form
since we may not yet know if it's packed (because we're still
interpreting options).

This issue of not yet knowing how to encode or handle a type because
we're still interpreting options also came up in #260, where we needed
to know if an enum was open or closed. But in an Editions file, this
requires interpreting the enum's options. In that PR, we deferred the
decision until the end. So we could accumulate the set of enums that
needed to be open, and validate that they are open after all options
have been interpreted. I looked into solving the other issues in the
same way: defer checks until all options have been interpreted. But the
real problem with this approach is that the text format (which is what
is used in message literals) is influenced by whether a field is a group
or not. So if we don't yet know if the field is a group (because, in an
editions file, we have to interpret the `features.message_encoding`
option first), we can't even process message literals. One possible
solution I thought of would be to interpret options in dependency order
(so go interpret options for the element if we need to). This would be a
bit tricky to implement, and might incur a performance hit because we'd
need more book-keeping to track what's been interpreted and what hasn't.
But even this approach couldn't handle cases where this is a
**dependency cycle**: where interpreting the options for an element
depend on that element already having its options interpreted 🤯 (see
comment below for such an example).

This PR does two things.

1. The main solution to the "needing options in the same file to already
be interpreted" problem is to interpret the **non**-custom options
first. That way features and other relevant options get interpreted
first, so we can then assume, when interpreting **custom** options, that
we know enough to correctly interpret and serialize everything. This is
also [the strategy used by
`protoc`](https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.cc#L5899-L5943)
and is adequate for now (not necessarily bullet-proof, but will work as
long as Google doesn't make certain kinds of changes to
`descriptor.proto`).

2. The bigger change in here, that accounts for the vast majority of
lines of code changed/deleted, is removal of the "canonical option
bytes" feature. This was a feature of protocompile to generate the same
output as `protoc`, byte-for-byte. Special handling was needed to do so
because the protobuf-go library serializes the option messages
differently than the way they are serialized by the C++ `libprotoc`. So
we basically re-implemented serialization in the `option` sub-package
with the `interpreted*` types. Since this was a source of bugs and
Editions-related issues, as described in the first paragraph above, and
the feature is basically unused (it was never actually incorporated into
the buf CLI), it will be a quality of life improvement to not maintain
this complicated code anymore.
kralicky added a commit to kralicky/protocompile that referenced this pull request Apr 1, 2024
This updates this library to use the latest protobuf-go, which provides
support for representing Editions source files via
`protoreflect.FileDescriptor`. It also updates the implementations of
`protoreflect.Descriptor` in this module to properly utilize editions
features to implement the semantic accessors
`FieldDescriptor.HasPresence`, `FieldDescriptor.IsPacked`, and the
newly-added `EnumDescriptor.IsClosed`. All other conditionals
that previously examined syntax have now been updated to be
editions-aware.

(cherry picked from commit 3024677)
kralicky added a commit to kralicky/protocompile that referenced this pull request Apr 1, 2024
… two phases (bufbuild#261)

There were several Editions-specific issues with the "canonical bytes"
feature, in cases where the custom option value being interpreted is in
the same file that defines the custom option. In order to know how to
encode "delimited" message fields, we have to have already interpreted
the options for that field, but when they are defined in the same file,
we may have not have done so yet. There was also a pre-existing issue
that was related: knowing whether or not a repeated field is packed also
relies on interpreting options. Finally, there was a pre-existing issue
that was unrelated: message literals that used the expanded `Any` form
were completely absent from the canonical bytes representation 🤦.

In addition to the "canonical bytes" encoding issues described above,
there was a related issue with the (non-canonical) encoding of `Any`
messages in an option value: if an expanded `Any` literal refers to a
packed field, it could previously have been encoded in non-packed form
since we may not yet know if it's packed (because we're still
interpreting options).

This issue of not yet knowing how to encode or handle a type because
we're still interpreting options also came up in bufbuild#260, where we needed
to know if an enum was open or closed. But in an Editions file, this
requires interpreting the enum's options. In that PR, we deferred the
decision until the end. So we could accumulate the set of enums that
needed to be open, and validate that they are open after all options
have been interpreted. I looked into solving the other issues in the
same way: defer checks until all options have been interpreted. But the
real problem with this approach is that the text format (which is what
is used in message literals) is influenced by whether a field is a group
or not. So if we don't yet know if the field is a group (because, in an
editions file, we have to interpret the `features.message_encoding`
option first), we can't even process message literals. One possible
solution I thought of would be to interpret options in dependency order
(so go interpret options for the element if we need to). This would be a
bit tricky to implement, and might incur a performance hit because we'd
need more book-keeping to track what's been interpreted and what hasn't.
But even this approach couldn't handle cases where this is a
**dependency cycle**: where interpreting the options for an element
depend on that element already having its options interpreted 🤯 (see
comment below for such an example).

This PR does two things.

1. The main solution to the "needing options in the same file to already
be interpreted" problem is to interpret the **non**-custom options
first. That way features and other relevant options get interpreted
first, so we can then assume, when interpreting **custom** options, that
we know enough to correctly interpret and serialize everything. This is
also [the strategy used by
`protoc`](https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.cc#L5899-L5943)
and is adequate for now (not necessarily bullet-proof, but will work as
long as Google doesn't make certain kinds of changes to
`descriptor.proto`).

2. The bigger change in here, that accounts for the vast majority of
lines of code changed/deleted, is removal of the "canonical option
bytes" feature. This was a feature of protocompile to generate the same
output as `protoc`, byte-for-byte. Special handling was needed to do so
because the protobuf-go library serializes the option messages
differently than the way they are serialized by the C++ `libprotoc`. So
we basically re-implemented serialization in the `option` sub-package
with the `interpreted*` types. Since this was a source of bugs and
Editions-related issues, as described in the first paragraph above, and
the feature is basically unused (it was never actually incorporated into
the buf CLI), it will be a quality of life improvement to not maintain
this complicated code anymore.

(cherry picked from commit 869ef58)
kralicky pushed a commit to kralicky/protocompile that referenced this pull request Apr 12, 2024
This updates this library to use the latest protobuf-go, which provides
support for representing Editions source files via
`protoreflect.FileDescriptor`. It also updates the implementations of
`protoreflect.Descriptor` in this module to properly utilize editions
features to implement the semantic accessors
`FieldDescriptor.HasPresence`, `FieldDescriptor.IsPacked`, and the
newly-added `EnumDescriptor.IsClosed`. All other conditionals
that previously examined syntax have now been updated to be
editions-aware.

(cherry picked from commit 3024677)
kralicky pushed a commit to kralicky/protocompile that referenced this pull request Apr 12, 2024
… two phases (bufbuild#261)

There were several Editions-specific issues with the "canonical bytes"
feature, in cases where the custom option value being interpreted is in
the same file that defines the custom option. In order to know how to
encode "delimited" message fields, we have to have already interpreted
the options for that field, but when they are defined in the same file,
we may have not have done so yet. There was also a pre-existing issue
that was related: knowing whether or not a repeated field is packed also
relies on interpreting options. Finally, there was a pre-existing issue
that was unrelated: message literals that used the expanded `Any` form
were completely absent from the canonical bytes representation 🤦.

In addition to the "canonical bytes" encoding issues described above,
there was a related issue with the (non-canonical) encoding of `Any`
messages in an option value: if an expanded `Any` literal refers to a
packed field, it could previously have been encoded in non-packed form
since we may not yet know if it's packed (because we're still
interpreting options).

This issue of not yet knowing how to encode or handle a type because
we're still interpreting options also came up in bufbuild#260, where we needed
to know if an enum was open or closed. But in an Editions file, this
requires interpreting the enum's options. In that PR, we deferred the
decision until the end. So we could accumulate the set of enums that
needed to be open, and validate that they are open after all options
have been interpreted. I looked into solving the other issues in the
same way: defer checks until all options have been interpreted. But the
real problem with this approach is that the text format (which is what
is used in message literals) is influenced by whether a field is a group
or not. So if we don't yet know if the field is a group (because, in an
editions file, we have to interpret the `features.message_encoding`
option first), we can't even process message literals. One possible
solution I thought of would be to interpret options in dependency order
(so go interpret options for the element if we need to). This would be a
bit tricky to implement, and might incur a performance hit because we'd
need more book-keeping to track what's been interpreted and what hasn't.
But even this approach couldn't handle cases where this is a
**dependency cycle**: where interpreting the options for an element
depend on that element already having its options interpreted 🤯 (see
comment below for such an example).

This PR does two things.

1. The main solution to the "needing options in the same file to already
be interpreted" problem is to interpret the **non**-custom options
first. That way features and other relevant options get interpreted
first, so we can then assume, when interpreting **custom** options, that
we know enough to correctly interpret and serialize everything. This is
also [the strategy used by
`protoc`](https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.cc#L5899-L5943)
and is adequate for now (not necessarily bullet-proof, but will work as
long as Google doesn't make certain kinds of changes to
`descriptor.proto`).

2. The bigger change in here, that accounts for the vast majority of
lines of code changed/deleted, is removal of the "canonical option
bytes" feature. This was a feature of protocompile to generate the same
output as `protoc`, byte-for-byte. Special handling was needed to do so
because the protobuf-go library serializes the option messages
differently than the way they are serialized by the C++ `libprotoc`. So
we basically re-implemented serialization in the `option` sub-package
with the `interpreted*` types. Since this was a source of bugs and
Editions-related issues, as described in the first paragraph above, and
the feature is basically unused (it was never actually incorporated into
the buf CLI), it will be a quality of life improvement to not maintain
this complicated code anymore.

(cherry picked from commit 869ef58)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants