Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BEAM-13939: Restructure Protos to fix namespace conflicts #16961

Merged
merged 8 commits into from
Apr 7, 2022

Conversation

thempatel
Copy link
Contributor

@thempatel thempatel commented Feb 26, 2022

Generated protobuf files contain additional information about the messages and services they were compiled from such as the file path to the original source proto file. The protobuf runtime for Golang maintains a global registry of all protobufs being used by registering the descriptors using the file path to the source proto file. If multiple descriptors with the same source file path are registered to the global registry, then the initialization code will prevent startup by panic-ing and printing a message like this to stdout:

panic: proto: file "metrics.proto" is already registered
	previously from: "github.com/apache/beam/sdks/v2/go/pkg/beam/model/pipeline_v1"
	currently from:  "..."
See https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

This behavior in the Go Protocol Buffer SDK is unlikely to go away:

reflect/protoregistry: restore conflicting file names check

There are inconsistencies between implementations on what happens when a single program contains generated code for multiple files with the same source path. At least one canonical implementation (C++) will fail at link time. Others print warnings. Some silently resolve the registry conflict in favor of one file or the other. The protobuf maintainers agree, however, that the desired behavior is for this condition to be an error.

This change aims to bring the protobuf imports in Beam to follow the guidance from a comment in this issue filed with the protobuf repo:

To avoid conflicts, every source .proto that is part of a program must have a unique name. Practically speaking, this means that any .proto that is part of a public package should be placed in a directory that uniquely identifies the owner of the proto. For example, google/protobuf/descriptor.proto and k8s.io/apimachinery/pkg/types both have prefixes which specify the creating organization and a specific package within that organization.

As such, I've elected to place each protobuf package in a directory org/apache/beam/model relative to its respective module root and have updated the build system where necessary.

P.S. This is still a work in progress, but opened the PR to socialize since this will affect all of Beam.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link

codecov bot commented Feb 28, 2022

Codecov Report

Merging #16961 (bbf3a61) into master (e596abf) will decrease coverage by 0.02%.
The diff coverage is 98.43%.

@@            Coverage Diff             @@
##           master   #16961      +/-   ##
==========================================
- Coverage   74.12%   74.09%   -0.03%     
==========================================
  Files         677      681       +4     
  Lines       89069    89209     +140     
==========================================
+ Hits        66019    66098      +79     
- Misses      21899    21960      +61     
  Partials     1151     1151              
Flag Coverage Δ
go 50.08% <ø> (ø)
python 83.61% <98.43%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...eam/runners/portability/fn_api_runner/execution.py 92.25% <ø> (ø)
...nners/portability/fn_api_runner/worker_handlers.py 79.34% <0.00%> (ø)
sdks/python/apache_beam/io/gcp/bigquery.py 63.63% <100.00%> (ø)
sdks/python/apache_beam/portability/common_urns.py 100.00% <100.00%> (ø)
...eam/runners/interactive/caching/streaming_cache.py 96.66% <100.00%> (+0.01%) ⬆️
...am/runners/interactive/options/capture_limiters.py 90.76% <100.00%> (-0.14%) ⬇️
...ache_beam/runners/interactive/recording_manager.py 96.58% <100.00%> (ø)
.../runners/interactive/testing/test_cache_manager.py 92.75% <100.00%> (-0.11%) ⬇️
...ks/python/apache_beam/runners/interactive/utils.py 95.41% <100.00%> (ø)
sdks/python/apache_beam/testing/test_stream.py 91.02% <100.00%> (-0.06%) ⬇️
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e596abf...bbf3a61. Read the comment docs.

@thempatel thempatel force-pushed the milan/fix-proto-import-paths branch from 8e08f62 to c75cee8 Compare February 28, 2022 03:49
@lostluck lostluck self-requested a review February 28, 2022 03:59
@thempatel thempatel force-pushed the milan/fix-proto-import-paths branch from c75cee8 to 0a60f2a Compare February 28, 2022 03:59
@thempatel
Copy link
Contributor Author

For python, thinking it might actually be better to make the structure of apache_beam.portability.api opaque, and find some way to hoist the modules within so we don't have to change imports in the rest of the SDK, hoping to find time to investigate

Copy link
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the proto package name be used as the basis for the directory name?
(e.g. org.apache.beam.model.job_management.v1 -> org.apache/beam/model/job_management/v1 and org.apache.beam.model.fn_execution.v1 -> org/apache/beam/model/fn_execution/v1)

@lukecwik
Copy link
Member

R: @lukecwik

@thempatel
Copy link
Contributor Author

Is there an easy way to run these CI tasks locally? I'm trying some gradle tasks and some work, some don't and some that used to work don't work 😬 . Want to speed up my iteration speed, if there's a doc/readme I can read somewhere, that would be great too!

@thempatel
Copy link
Contributor Author

I will start here

@@ -97,7 +97,7 @@ RUN rm /opt/apache/beam/third_party_licenses/golang/LICENSE

COPY target/license_scripts /tmp/license_scripts/
RUN if [ "$pull_licenses" = "true" ] ; then \
pip install 'pip-licenses<3.0.0' pyyaml tenacity && \
pip install 'pip-licenses>=3.5.3' pyyaml tenacity && \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thempatel
Copy link
Contributor Author

retest this please

1 similar comment
@thempatel
Copy link
Contributor Author

retest this please

@thempatel
Copy link
Contributor Author

@lostluck @lukecwik I think (🤞🏽 ) this is ready for a formal review

@thempatel
Copy link
Contributor Author

retest this please

@lukecwik
Copy link
Member

Run Java PreCommit

@lukecwik
Copy link
Member

Run GoPortable PreCommit

@lukecwik
Copy link
Member

Looks good for me for the proto file changes.
R: @tvalentyn Can you review the python changes?
R: @lostluck Can you review the go changes?

@lostluck
Copy link
Contributor

lostluck commented Apr 1, 2022

So the vast majority of Python changes seem to be unrelated to this (mostly go specific, already very large) change. Is there a way we could break them out and review them separately? The one change I see is that we (now) have to do renaming of the proto files in apache_beam/portability/api back to a flat structure (and fix their internal imports).

Milan can confirm, but since the changes are rooted in changing how the proto files import each other, I don't know how separable each change at a PR level.

Might be able to organize the changes as commits with: "proto file changes" "generator changes" "go generated code" "python things."... granularity though.

I'm ambivalent, since the go side is finished review at this point, and I don't understand the nature of the python side of the change.

@thempatel
Copy link
Contributor Author

So the vast majority of Python changes seem to be unrelated to this (mostly go specific, already very large) change. Is there a way we could break them out and review them separately?

I sympathize with this, it's usually good advice to break large changes into their logical components and land them separately. I think in this case, that will be a challenge; I see the change-set here as one logical component. The nature of building a polyglot SDK on top of proto is that if you change the proto, you have to change all the languages and there's really no way around this without some long migration path. To that end, I think the per-language changes here are isolated enough where you can likely hide the changes from other files and review just the python files, a similar experience to what it would be like if we housed them in their own PR. Maybe a question to ask is what we do if something inexplicably goes wrong with merging this PR: is it easier to recover quickly if we have this change set spread over multiple commits, or a single commit? A revert of 1 commit is easy, a revert of multiple commits with others interspersed is much harder. If you feel really strongly that the python changes should be in their own change set, I am happy to oblige.

The one change I see is that we (now) have to do renaming of the proto files in apache_beam/portability/api back to a flat structure (and fix their internal imports).

Let's chat about this, I'm not sure I understand what is being said here. Why would we have to change this back to a flat structure? I think that will be impossible, hence the extensive changes in the generation tooling.

Also, for auto-generated files, I think adding them to RAT is better than modifying these files--they're (clearly marked) machine output.

It looks like the generated go files are already in the RAT, though I do agree with Robert that maintaining the ASF license header is a better avenue, plus it reduces the diff on the PR

@thempatel thempatel force-pushed the milan/fix-proto-import-paths branch 6 times, most recently from 2954739 to 81100c1 Compare April 1, 2022 13:55
@robertwb
Copy link
Contributor

robertwb commented Apr 1, 2022

Just to confirm, imports of the form from apache_beam.portability.api.beam_runner_api_pb2 import TestStreamPayload are no longer supported, and that's why the diff is so huge?

@thempatel
Copy link
Contributor Author

thempatel commented Apr 1, 2022

Just to confirm, imports of the form from apache_beam.portability.api.beam_runner_api_pb2 import TestStreamPayload are no longer supported, and that's why the diff is so huge?

correct, this PR changes the the proto structure from a flat one to a hierarchical one, like so:

model/
  beam_fn_api.proto
  beam_provision_api.proto
  beam_interactive_api.proto
  beam_artifact_api.proto
  ...

apache_beam/portability/api/
  beam_fn_api_pb2.py
  beam_provision_api_pb2.py
  beam_interactive_api_pb2.py
  beam_artifact_api_pb2.py
  ...

to

model/org/apache/beam/model/
    beam_fn_api.proto
    beam_provision_api.proto
    beam_interactive_api.proto
    beam_artifact_api.proto
    ...

apache_beam/portability/api/org/apache/beam/model/
    beam_fn_api_pb2.py
    beam_provision_api_pb2.py
    beam_interactive_api_pb2.py
    beam_artifact_api_pb2.py
    ...

so the reason you can no longer do from apache_beam.portability.api.beam_runner_api_pb2 import TestStreamPayload is because there is no module there! The fully qualified import is now

from apache_beam.portability.api.org.apache.beam.model.pipeline.beam_runner_api_pb2 import TestStreamPayload

In order to make this easier to work with, I updated the proto generator to also generate module bindings in the __init__ file of apache_beam.portability.api, like so

from .org.apache.beam.model import pipeline
# ...

external_transforms_pb2 = pipeline.external_transforms_pb2
# ...

so that you don't need to provide the fully qualified path. If we didn't do this, this PR would be even more huge, since we'd have to update every import of the generated bindings in the SDK

sdks/go/pkg/beam/transforms/stats/util_gen.tmpl Outdated Show resolved Hide resolved
sdks/python/.yapfignore Show resolved Hide resolved
LOG = logging.getLogger()
LOG.setLevel(logging.INFO)

LICENSE_HEADER = """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just drop these two headers and skip the checks over adding logic to prepend them to the auto-generated files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're not strongly opposed, I suggest we keep these. My argument is less related to the build system, and more related to messaging to consumers of the SDK: if they are exploring the SDK, it is a useful notice to have so that they know this is Beam's posture

sdks/python/gen_protos.py Outdated Show resolved Hide resolved
sdks/python/gen_protos.py Outdated Show resolved Hide resolved
sdks/python/gen_protos.py Outdated Show resolved Hide resolved
@thempatel thempatel force-pushed the milan/fix-proto-import-paths branch 3 times, most recently from 23ce537 to 35f99b3 Compare April 2, 2022 21:45
Copy link
Contributor

@lostluck lostluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Unless there are any further objections, I can merge this in later this week.

@lostluck
Copy link
Contributor

lostluck commented Apr 5, 2022

I guess @lukecwik needs to validate/approve his requested changes first.

…structure. This is so that the proto files require usage of a org/apache/beam/model namespace in their imports and so that the generated files also include this namespace in their source file metadata.
…enerate all proto files for the go sdk. This new tool will add any necessary options to the proto compiler and generate the proto files relative to the go sdk root to ensure that the generated files have a namespaced file path in their metadata. If you want to generate a proto file in the go sdk, simply use this script in the go:generate directive, the rest will be taken care of by the script.
…te proto bindings. Updates the README for how to generate the model proto bindings into the SDK
…w namespaced structure of the Beam model. It does this by supporting arbitrary directory structures of proto files by calculating and replacing the generated imports with relative imports with the generated source. Additionally, it will generate bindings that allow for imports of the form `from apache_beam.portability.api import beam_runner_api_pb2` so that the SDK is not dependent on the potentially changing structure of the generated bindings within `api`. Imports of the form `from apache_beam.portability.api.org.apache.beam.model import beam_runner_api_pb2` are still supported. setup.py now attempts to generate the proto bindings on invocation since the package structure must exist before the wheel can be created.
…rder to support the new python output structure
@thempatel thempatel force-pushed the milan/fix-proto-import-paths branch from 35f99b3 to bbf3a61 Compare April 7, 2022 02:12
@lukecwik lukecwik merged commit e44d8a7 into apache:master Apr 7, 2022
@thempatel
Copy link
Contributor Author

thanks @lostluck @lukecwik @tvalentyn @robertwb for the thoughtful reviews, i had fun on this one! 🙏🏽

@thempatel thempatel deleted the milan/fix-proto-import-paths branch April 8, 2022 12:56
@asf-ci
Copy link

asf-ci commented Apr 8, 2022

Can one of the admins verify this patch?

1 similar comment
@asf-ci
Copy link

asf-ci commented Apr 8, 2022

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants